Hosts randomly power off spontaneously
Ste
125 Posts
May 15, 2018, 9:02 am
Hello,
sometimes, while my 4-node test cluster is in an OK and Healthy state, one or more nodes suddenly go down without a clear reason; in some cases they power off completely and I have to physically push the power button to make them start up again. They then start, perform some PG recovery and work fine until the next similar event. I suspect it is due to the old and perhaps slightly unstable hardware, but is there a place (a log file) where I can look to investigate the cause of this malfunction?
Thanks, Ste.
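For reference, when a node powers off unexpectedly the last messages it managed to flush to disk are usually the best clue. Below is a minimal sketch of one way to pull them, assuming the nodes run systemd with a persistent journal (otherwise /var/log/syslog and /var/log/kern.log are the usual fallback); this is a generic suggestion, not something specific to PetaSAN:

import subprocess

def previous_boot_tail(lines=200):
    # "-b -1" selects the boot before the current one; OOM kills, kernel
    # panics or watchdog resets often show up in its final messages,
    # if anything was written out before the power loss.
    cmd = ["journalctl", "-b", "-1", "-n", str(lines), "--no-pager"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(previous_boot_tail())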
admin
2,930 Posts
May 15, 2018, 2:53 pm
This is most likely due to fencing. If you switch off fencing in maintenance mode you can verify whether this is the case. When fencing is on, a node will be killed if it fails to report heartbeats back to the Consul cluster within 15 seconds. This could be due to network problems or the node being overloaded (by client IO or background scrubbing by Ceph). It is best to fix the root cause (fix the network or increase the node resources) rather than switch off fencing, or you will just mask the issue for a while.
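To see whether a node is struggling to report to Consul before it gets fenced, one option is to poll the local agent's health view and watch for slow or failing responses. The sketch below assumes the standard Consul HTTP API on 127.0.0.1:8500 and that the Consul node name matches the hostname; it does not reproduce PetaSAN's fencing logic, it only helps spot heartbeat/network trouble:

import json
import socket
import time
import urllib.request

CONSUL = "http://127.0.0.1:8500"
NODE = socket.gethostname()   # assumption: Consul node name == hostname
TIMEOUT = 15                  # the fencing window mentioned above, in seconds

def node_health():
    # /v1/health/node/<node> returns the health checks Consul holds for this node.
    url = f"{CONSUL}/v1/health/node/{NODE}"
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
        checks = json.load(resp)
    elapsed = time.monotonic() - start
    failing = [c["CheckID"] for c in checks if c["Status"] != "passing"]
    return elapsed, failing

if __name__ == "__main__":
    while True:
        try:
            elapsed, failing = node_health()
            print(f"{time.strftime('%H:%M:%S')} ok in {elapsed:.2f}s, failing: {failing}")
        except Exception as exc:  # a slow or unreachable agent is the interesting case
            print(f"{time.strftime('%H:%M:%S')} consul query failed: {exc}")
        time.sleep(5)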
Ste
125 Posts
May 18, 2018, 8:24 am
Yes, correct, you got the point!
I disabled fencing and discovered that the issue originates from host #4, which experiences a network malfunction and suddenly becomes unreachable. A recovery process then starts on the remaining 3 nodes, and this overloads them so much that it sometimes causes a shutdown. Now that fencing is disabled, recovery takes a long time but at least it has a chance to complete, leaving the 3 surviving nodes up and running.
Thanks, S.
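If the remaining nodes are being overwhelmed by recovery traffic, one common mitigation is to throttle Ceph recovery and backfill so client IO and the OSD hosts get some headroom. This is only a hedged sketch, assuming the ceph CLI with admin credentials is available on a node; the values are illustrative, not tuned recommendations, and settings injected this way do not persist across OSD restarts unless also placed in ceph.conf:

import subprocess

# Lower backfill/recovery concurrency so healing competes less with client IO.
THROTTLE_ARGS = (
    "--osd-max-backfills 1 "
    "--osd-recovery-max-active 1 "
    "--osd-recovery-op-priority 1"
)

def throttle_recovery():
    # "ceph tell osd.* injectargs ..." applies the settings to every running OSD.
    subprocess.run(["ceph", "tell", "osd.*", "injectargs", THROTTLE_ARGS], check=True)

if __name__ == "__main__":
    throttle_recovery()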