
ESXi Server Freeze


Hello Admin,

We have two different ESXi clusters that are using PetaSAN as storage; however, it is not in use, only mounted.

Last week we lost two servers in the cluster, and after a VMware investigation they stated that there was an iSCSI LUN that was not mounted properly and ESXi got overwhelmed trying to connect to it. After a day we lost another host, and when we disconnected the iSCSI from PetaSAN all hosts came back online.

I have checked the logs in PetaSAN and they are filled with errors.

 

Have you encountered an issue like this, or what is the solution for this iSCSI bug?

 

Thanks

there was an iSCSI LUN that was not mounted properly and ESXi got overwhelmed trying to connect to it.

This is either a hardware error, or the PetaSAN nodes are overloaded to the point that they do not respond to connection login requests on time. Some SAN vendors recommend you increase the ESXi iSCSI parameters for login timeout and login retries. It is better, however, to correctly size the cluster and assign enough hardware resources on the PetaSAN nodes to match your expected load.
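If you do decide to adjust those ESXi parameters, it would look roughly like the following on the host (a sketch only; vmhba64 and the values 60/8 are placeholders, and the exact parameter names should be verified against your ESXi version's esxcli reference):

    # show the current iSCSI adapter parameters
    esxcli iscsi adapter param get -A vmhba64
    # raise the login timeout (seconds) and allowed login retries -- placeholder values
    esxcli iscsi adapter param set -A vmhba64 -k LoginTimeout -v 60
    esxcli iscsi adapter param set -A vmhba64 -k LoginRetryMax -v 8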

Hello admin,

The strange part is these LUNs are empty; we have not used them for any load. And we have set up the iSCSI based on the recommendations in your docs. It started off with the cluster going from clean to 36 OSDs failing, and while it was recovering, the ESXi stopped responding.

When we reported this issue to VMware, they checked the logs and stated what I mentioned before.

What would cause such a break without any load? Maybe an iSCSI bug?

 

Thanks

The strange part is these LUNs are empty; we have not used them for any load

The login failure does not depend on whether the LUN is empty or not, but rather on either a hardware/network failure or the PetaSAN nodes being overloaded.

we have set up the iSCSI based on the recommendations in your docs

We have docs on how to connect iSCSI, but that does not mean your hardware will match your required load. We added a lot of features to enable you to measure your hardware and view resource utilization, but it is up to you to make sure your hardware will meet your load demands in terms of IOPS and throughput.
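If you want a rough baseline of what the cluster itself can deliver outside the GUI charts, a quick rados bench run is one option (a sketch only; the pool name rbd is a placeholder, and the test itself puts load on the cluster, so run it outside production hours):

    # 30-second write benchmark, keep the objects so a read test can follow -- pool name is a placeholder
    rados bench -p rbd 30 write --no-cleanup
    # sequential read benchmark against the same objects
    rados bench -p rbd 30 seq
    # remove the benchmark objects when done
    rados -p rbd cleanup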

It started off with the cluster going from clean to 36 OSDs failing, and while it was recovering, the ESXi stopped responding

This sounds like either a hardware issue or a load issue. The 36 OSDs going down does not point to a problem on the iSCSI layer but to something at the lower layers. Also, when OSDs go down for whatever reason, the recovery process will kick in, and this will put further load on the cluster.

maybe an iSCSI bug

If there was no load or hardware issue, Ceph is up, and we can reproduce this, then yes.

 

I can upload the logs if you really need to see them, but as I mentioned before, there was no load on the servers and only 10% of the memory was used. In addition, we have checked the network logs and IPMI and they were clean.

 

Thanks

I would recommend you look into why 36 OSDs failed without any load on them.
OSDs are at a lower layer in the stack than iSCSI; their failure will affect iSCSI and not the other way around. From our experience, OSDs do not fail in bulk unless there is severe overloading (which causes them to flap) or a pure hardware/network issue. The iSCSI not connecting is most probably related to the above issue.
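A few starting points for that investigation (a sketch; the OSD id 12 is a placeholder, and the log path assumes the default Ceph locations):

    # see which OSDs are marked down and on which hosts
    ceph osd tree
    # overall health, including heartbeat / flapping messages
    ceph health detail
    # per-OSD log on the node hosting a failed OSD -- id 12 is a placeholder
    less /var/log/ceph/ceph-osd.12.log
    # kernel-level disk or controller errors on the same node
    dmesg -T | grep -i error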

Hello admin,

 

It seems, after some research and testing, we found issues in the Consul service. iSCSI now fails to start, and PetaSAN cannot delete any disks from the GUI.

We suspect that the iSCSI drop caused the cluster to freeze.

[Attached screenshots: Capture, iscsi]

One image was from shutting down the server, and the second from trying to create a new LUN.

Please look at the images attached; have you encountered this before?

 

Hi

No, it is not something we see. My recommendation, as per the previous post, is to look at the 36 OSDs and see why they failed from the logs. Our experience is that, unless there is extreme load, the most likely cause of failures on that scale is hardware. You also had network failures during deployment; I would recommend you get a good idea of why those errors occurred before looking at the iSCSI layer.
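That said, if you also want to rule out the Consul layer that the iSCSI service depends on, a quick sanity check on each management node would be something like this (a sketch; the consul systemd unit name is an assumption and may differ between PetaSAN versions):

    # all management nodes should show as alive
    consul members
    # raft peers / leader election state
    consul operator raft list-peers
    # recent Consul messages on this node -- unit name is an assumption
    journalctl -u consul --since "1 hour ago"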

Hello Admin,

 

As we mentioned before, the servers have no load on them; they are not yet in production. We were aware of the network issue that occurred earlier, and after checking the error we did not find anything on the network layer or ESXi. We have noticed that Ceph started to drop OSDs randomly.

 

Here are the logs for your review. Thanks again for your help.

 

https://ufile.io/v55l4

 

 

These are today's logs, and the Ceph cluster is down; there are inactive and incomplete PGs. There are many OSD connection refused errors.

As stated, this is not an iSCSI service issue; if Ceph is down, your iSCSI layer will naturally fail to function. The dashboard should show you that your cluster is not healthy.
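To see which PGs and OSDs are involved, commands along these lines are a reasonable starting point (run on any management node):

    # overall cluster state, including down OSDs and stuck PGs
    ceph status
    ceph health detail
    # list PGs stuck inactive and the OSDs they map to
    ceph pg dump_stuck inactive
    # quick count of up/in OSDs
    ceph osd stat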

I am not saying you have load on them; I am saying we see large numbers of OSDs failing either due to load or hardware failures. I really do not know what network issues you had before and whether they are now fixed, but you need to find out: when Ceph went down, did some other errors happen, or was everything fine and it went down by itself?
