ESXi Server Freeze.
msalem
87 Posts
January 13, 2019, 10:33 am
Hello Admin,
We have two different ESXi clusters that are using PetaSAN as storage; however, it is not in use, only mounted.
Last week we lost two servers in the cluster, and after a VMware investigation they stated that there was an iSCSI LUN that was not mounted properly and ESXi got overwhelmed trying to connect to it. After a day we lost another host, and when we disconnected the iSCSI from PetaSAN all hosts were back online.
I have checked the logs in PetaSAN and they are filled with errors.
Have you encountered an issue like this, or what is the solution for this bug in iSCSI?
Thanks
admin
2,930 Posts
January 13, 2019, 11:06 am
"there was an iSCSI LUN that was not mounted properly and ESXi got overwhelmed trying to connect to it"
This is either a hardware error, or the PetaSAN nodes are overloaded to the point that they do not respond to connection login requests on time: some SAN vendors recommend you increase the ESXi iSCSI parameters for login timeout and login retries. It is better, however, to correctly load the cluster and assign enough hardware resources on the PetaSAN nodes to match your expected load.
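For reference, a minimal sketch of raising the iSCSI login timeout on an ESXi host, assuming SSH access to the host and that the software iSCSI adapter is vmhba64 (a hypothetical name; verify with "esxcli iscsi adapter list"); whether a larger timeout is appropriate depends on your environment:

```python
# Sketch: raise the iSCSI LoginTimeout on an ESXi host over SSH.
# Assumptions: SSH is enabled on the ESXi host, the software iSCSI
# adapter is "vmhba64" (hypothetical; verify with
# "esxcli iscsi adapter list"), and a 60 s login timeout is acceptable.
import subprocess

ESXI_HOST = "root@esxi01.example.com"   # hypothetical host
ADAPTER = "vmhba64"                     # hypothetical adapter name

def esxcli(args: str) -> str:
    """Run an esxcli command on the ESXi host via SSH and return stdout."""
    result = subprocess.run(["ssh", ESXI_HOST, f"esxcli {args}"],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Show the current adapter parameters, then raise the login timeout to 60 s.
print(esxcli(f"iscsi adapter param get --adapter={ADAPTER}"))
esxcli(f"iscsi adapter param set --adapter={ADAPTER} --key=LoginTimeout --value=60")
```

A similar param set call can be used for the login retry parameter if your ESXi build exposes one; confirm the exact key name in the output of the param get call above before changing it.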
Last edited on January 13, 2019, 11:06 am by admin · #2
msalem
87 Posts
January 13, 2019, 4:55 pm
Hello admin,
The strange part is that these LUNs are empty; we have not put any load on them. And we have set up the iSCSI based on the recommendations in your docs. It started off with the cluster going from clean to 36 OSD failures, and while it was recovering the ESXi stopped responding.
When we reported this issue to VMware, they checked the logs and stated what I mentioned before.
What would cause such a break without any load, maybe an iSCSI bug?
Thanks
admin
2,930 Posts
January 13, 2019, 6:38 pm
"The strange part is that these LUNs are empty; we have not put any load on them"
The login failure does not depend on whether the LUN is empty or not, but rather on either a hardware/network failure or the PetaSAN nodes being overloaded.
"we have set up the iSCSI based on the recommendations in your docs"
We have docs on how to connect iSCSI, but that does not mean your hardware will match your required load. We added a lot of features to enable you to measure your hardware and view resource utilization, but it is up to you to make sure your hardware will meet your load demands in terms of IOPS and throughput.
"It started off with the cluster going from clean to 36 OSD failures, and while it was recovering the ESXi stopped responding"
This sounds like either a hardware issue or a load issue. Also, the 36 OSDs going down does not point to a problem in the iSCSI layer but to something at the lower layers. And when OSDs go down, for whatever reason, the recovery process will kick in, and this will put further load on the cluster (see the sketch after this post).
"maybe an iSCSI bug"
If there was no load or hardware issue, Ceph is up, and we can reproduce this, then yes.
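To limit the extra load that recovery adds while you investigate, here is a minimal sketch of temporarily throttling Ceph backfill and recovery, assuming it runs on a node that has the Ceph admin keyring; osd_max_backfills and osd_recovery_max_active are standard Ceph options, and the values of 1 are illustrative, not a recommendation:

```python
# Sketch: temporarily throttle Ceph recovery/backfill so a degraded
# cluster is not further crushed by recovery traffic while being debugged.
# Assumptions: the node has the Ceph admin keyring; the values of 1 are
# illustrative and should be reverted once the investigation is done.
import subprocess

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], capture_output=True,
                          text=True, check=True).stdout

# Inject conservative recovery settings into all running OSDs.
print(ceph("tell", "osd.*", "injectargs",
           "--osd_max_backfills=1 --osd_recovery_max_active=1"))

# Watch the effect on cluster status.
print(ceph("-s"))
```

Settings injected this way do not persist across OSD restarts, so they can simply be raised again (or the daemons restarted) once the underlying failure is understood.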
Last edited on January 13, 2019, 6:51 pm by admin · #4
msalem
87 Posts
January 14, 2019, 4:32 pm
I can upload the logs if you really need to see them, but as I mentioned before, there was no load on the servers and only 10% of the memory was used. In addition, we have checked the network logs and IPMI and they were clean.
Thanks
admin
2,930 Posts
January 15, 2019, 3:10 pm
I would recommend you look into why 36 OSDs failed without any load on them.
OSDs are at a lower layer in the stack than iSCSI; their failure will affect iSCSI and not the other way around. From our experience, OSDs do not fail in bulk unless there is severe overloading (which causes them to flap) or a pure hardware/network issue. As for iSCSI not connecting, it is most probably related to the above issue.
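As a starting point for that, here is a minimal sketch that lists the OSDs Ceph currently reports as down, assuming it runs on a node with the Ceph admin keyring; the per-OSD logs can then be inspected on the node that hosts each OSD (for example with journalctl -u ceph-osd@<id>, the stock unit name, which may differ per deployment):

```python
# Sketch: list OSDs that Ceph currently reports as down, so their host
# logs can be inspected on the node that carries each OSD.
# Assumptions: runs where the Ceph admin keyring is available; the
# "status" field of osd entries in "ceph osd tree --format json" holds
# "up" or "down".
import json
import subprocess

tree = json.loads(subprocess.run(
    ["ceph", "osd", "tree", "--format", "json"],
    capture_output=True, text=True, check=True).stdout)

down = [n for n in tree.get("nodes", []) + tree.get("stray", [])
        if n.get("type") == "osd" and n.get("status") == "down"]

for osd in down:
    print(f'{osd.get("name")} (id {osd.get("id")}) is down')
```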
msalem
87 Posts
February 6, 2019, 10:14 am
Hello admin,
It seems that after some research and testing we found issues in the Consul service: iSCSI now fails to start, and PetaSAN cannot delete any disks from the GUI.
We suspect that this iSCSI drop is what caused the cluster to freeze.
https://ibb.co/gTjWmGz
https://ibb.co/CHFLdFt
One image is from shutting down the server, and the second from trying to create a new LUN.
Please look at the attached images; have you encountered this before?
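As an aside on the Consul issue mentioned above, here is a minimal sketch for checking whether the Consul agents on the PetaSAN nodes still see each other as alive, assuming the consul binary is on the PATH of the node where it runs; since the iSCSI layer relies on Consul, an unhealthy Consul cluster would be consistent with the iSCSI service failing to start:

```python
# Sketch: verify that every Consul member in the cluster reports the
# "alive" state; the iSCSI layer depends on a working Consul cluster.
# Assumptions: the "consul" binary is on PATH and the local agent is
# reachable on its default address.
import subprocess

members = subprocess.run(["consul", "members"],
                         capture_output=True, text=True, check=True).stdout
print(members)

# The Status column is the third field of each row; flag anything not alive.
for line in members.splitlines()[1:]:       # skip the header row
    fields = line.split()
    if len(fields) >= 3 and fields[2] != "alive":
        print(f"WARNING: {fields[0]} is {fields[2]}")
```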
admin
2,930 Posts
February 6, 2019, 11:39 am
Hi,
No, it is not something we see. My recommendation, as per the previous post, is to look at the 36 OSDs and see from the logs why they failed. In our experience, unless there is extreme load, the most likely cause of that many failures is hardware failure. You also had network failures during deployment; I would recommend you get a good idea of why these errors occurred before looking at the iSCSI layer.
msalem
87 Posts
February 6, 2019, 11:53 am
Hello Admin,
As we mentioned before, the servers have no load on them; they are not yet in production. We were aware of the network issue that occurred earlier, and after checking the error we did not find anything on the network layer or ESXi. We have noticed that Ceph started to drop OSDs randomly.
Here are our logs for your review. Thanks again for your help.
https://ufile.io/v55l4
admin
2,930 Posts
February 6, 2019, 2:21 pm
These are today's logs and the Ceph cluster is down; there are inactive and incomplete PGs, and there are many OSD "connection refused" errors.
As stated, this is not an iSCSI service issue: if Ceph is down, your iSCSI layer will naturally fail to function. The dashboard should show you that your cluster is not healthy.
I am not saying you have load on them; I am saying we see large numbers of OSDs failing either due to load or to hardware failures. I really do not know what network issues you had before and whether they are now fixed, but you must find out: when Ceph went down, did some other errors happen, or was all fine and it went down by itself?
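For reference, a minimal sketch for pulling the detailed health report and the list of stuck-inactive placement groups mentioned above, assuming it runs on a node with the Ceph admin keyring:

```python
# Sketch: dump Ceph's detailed health report and the placement groups
# stuck in an inactive state, as a starting point for the inactive and
# incomplete PGs noted above.
# Assumptions: the node has the Ceph admin keyring.
import subprocess

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    return subprocess.run(["ceph", *args], capture_output=True,
                          text=True, check=True).stdout

print(ceph("health", "detail"))              # names the unhealthy PGs and OSDs
print(ceph("pg", "dump_stuck", "inactive"))  # PGs that never became active
```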
Last edited on February 6, 2019, 2:34 pm by admin · #10