Hosts randomly power off spontaneously
Ste
125 Posts
May 15, 2018, 9:02 am
Hello,
sometimes, while my 4-node test cluster is in an OK and Healthy state, one or more nodes suddenly go down without a clear reason; in some cases they power off completely and I have to physically push the power button to make them start up again. They then start, perform some PG recovery and work fine until the next similar event. I suspect it is due to the old and perhaps slightly unstable hardware, but is there a place (a log file) where I can look to investigate the cause of this malfunction?
Thanks, Ste.
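For reference, when a node powers off unexpectedly the last messages it managed to flush to disk are usually the best clue. Below is a minimal sketch of one way to pull them, assuming the nodes run systemd with a persistent journal (otherwise /var/log/syslog and /var/log/kern.log are the usual fallback); this is a generic suggestion, not something specific to PetaSAN:

import subprocess

def previous_boot_tail(lines=200):
    # "-b -1" selects the boot before the current one; OOM kills, kernel
    # panics or watchdog resets often show up in its final messages,
    # if anything was written out before the power loss.
    cmd = ["journalctl", "-b", "-1", "-n", str(lines), "--no-pager"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print(previous_boot_tail())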
admin
2,930 Posts
May 15, 2018, 2:53 pm
This is most likely due to fencing. If you switch off fencing in maintenance mode you can verify whether this is the case. When fencing is on, a node will be killed if it fails to report heartbeats back to the Consul cluster within 15 seconds. This could be due to network problems or the node being overloaded (by client IO or background scrubbing by Ceph). It is best to fix the root cause (fix the network or increase the node resources) rather than switch off fencing, or you will just mask the issue for a while.
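To see whether a node is struggling to report to Consul before it gets fenced, one option is to poll the local agent's health view and watch for slow or failing responses. The sketch below assumes the standard Consul HTTP API on 127.0.0.1:8500 and that the Consul node name matches the hostname; it does not reproduce PetaSAN's fencing logic, it only helps spot heartbeat/network trouble:

import json
import socket
import time
import urllib.request

CONSUL = "http://127.0.0.1:8500"
NODE = socket.gethostname()   # assumption: Consul node name == hostname
TIMEOUT = 15                  # the fencing window mentioned above, in seconds

def node_health():
    # /v1/health/node/<node> returns the health checks Consul holds for this node.
    url = f"{CONSUL}/v1/health/node/{NODE}"
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=TIMEOUT) as resp:
        checks = json.load(resp)
    elapsed = time.monotonic() - start
    failing = [c["CheckID"] for c in checks if c["Status"] != "passing"]
    return elapsed, failing

if __name__ == "__main__":
    while True:
        try:
            elapsed, failing = node_health()
            print(f"{time.strftime('%H:%M:%S')} ok in {elapsed:.2f}s, failing: {failing}")
        except Exception as exc:  # a slow or unreachable agent is the interesting case
            print(f"{time.strftime('%H:%M:%S')} consul query failed: {exc}")
        time.sleep(5)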
Ste
125 Posts
May 18, 2018, 8:24 am
Yes, correct, you got the point!
I disabled fencing and discovered that the issue originates from host #4, which experiences a network malfunction and suddenly becomes unreachable. A recovery process then starts on the remaining 3 nodes, and this overloads them so much that it sometimes causes a shutdown. Now that fencing is disabled, recovery takes a long time but at least it has a chance to complete, leaving the 3 surviving nodes up and running.
Thanks, S.
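If the remaining nodes are being overwhelmed by recovery traffic, one common mitigation is to throttle Ceph recovery and backfill so client IO and the OSD hosts get some headroom. This is only a hedged sketch, assuming the ceph CLI with admin credentials is available on a node; the values are illustrative, not tuned recommendations, and settings injected this way do not persist across OSD restarts unless also placed in ceph.conf:

import subprocess

# Lower backfill/recovery concurrency so healing competes less with client IO.
THROTTLE_ARGS = (
    "--osd-max-backfills 1 "
    "--osd-recovery-max-active 1 "
    "--osd-recovery-op-priority 1"
)

def throttle_recovery():
    # "ceph tell osd.* injectargs ..." applies the settings to every running OSD.
    subprocess.run(["ceph", "tell", "osd.*", "injectargs", THROTTLE_ARGS], check=True)

if __name__ == "__main__":
    throttle_recovery()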