Hosts randomly power off spontaneously

Hello,

Sometimes, while my 4-node test cluster is in an OK and healthy state, one or more nodes suddenly go down for no clear reason. In some cases they power off completely and I have to physically push the power button to start them up again. They then start, perform some PG recovery, and work fine until the next similar event. I suspect it is due to the old and perhaps slightly unstable hardware, but is there a place (a log file) where I can look to investigate the cause of this malfunction?

Thanks, Ste.

This is most likely due to fencing. If you switch off fencing in maintenance mode, you can verify whether this is correct. When fencing is on, a node will be killed if it fails to report heartbeats back to the Consul cluster within 15 sec. This could be due to network problems or to the node being overloaded (by client I/O or background scrubbing by Ceph). It is best to fix the root cause (fix the network or increase the node's resources) rather than switch off fencing, or you will just mask the issue for a while.
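
To illustrate the idea of heartbeat-based fencing, here is a minimal sketch in Python. It is not the actual PetaSAN or Consul implementation; the class and method names (HeartbeatMonitor, record_heartbeat, nodes_to_fence) and the node names are hypothetical, and only the 15-second timeout comes from the description above. Timestamps are passed in explicitly so the example is deterministic.

```python
import time
from dataclasses import dataclass, field

# Timeout taken from the description above; everything else is illustrative.
HEARTBEAT_TIMEOUT_S = 15.0


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker for the last heartbeat seen from each node."""
    timeout_s: float = HEARTBEAT_TIMEOUT_S
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, node, now=None):
        # Store the time of the most recent heartbeat for this node.
        self.last_seen[node] = time.monotonic() if now is None else now

    def nodes_to_fence(self, now=None):
        # A node becomes a fencing candidate once it has been silent
        # for longer than the timeout.
        now = time.monotonic() if now is None else now
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()

# All four nodes report at t=0.
for name in ("node1", "node2", "node3", "node4"):
    monitor.record_heartbeat(name, now=0.0)

# Nodes 1-3 report again at t=10; node4 stays silent
# (for example because of a network fault or overload).
for name in ("node1", "node2", "node3"):
    monitor.record_heartbeat(name, now=10.0)

print(monitor.nodes_to_fence(now=16.0))  # ['node4']
```

The sketch shows why an overloaded but otherwise healthy node can be fenced: the decision depends only on whether a heartbeat arrived in time, not on why it was late.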

Yes, correct, you got it!

I disabled fencing and discovered that the issue originates from host #4, which experiences network malfunctions and suddenly becomes unreachable. A recovery process then starts on the remaining 3 nodes, and this overloads them so much that it sometimes causes a shutdown. Now that fencing is disabled, recovery takes a long time, but at least it has the chance to complete, leaving the 3 surviving nodes up and running.

Thanks, S.