strange shutdowns
therm
121 Posts
June 29, 2017, 1:04 pm
Hi,
This morning we had strange shutdowns of 2 of our 3 PetaSAN servers. At the time the hosts turned off, our network admin was plugging cables and changing some network settings (MTU, Spanning Tree, Flow Control). In the journal log it seems there was first a short outage of the backend links, and afterwards the system was shut down.
My question: Is there something in the cluster that could shut down the servers if there is a connection problem?
Regards,
Dennis
Last edited on June 29, 2017, 1:04 pm · #1
admin
2,930 Posts
June 29, 2017, 1:30 pm
Yes, we do simple software-based fencing. When a node is not able to connect to the cluster, it will clean up any resources it currently serves (IPs / iSCSI paths) so they can be served by other nodes, but the other nodes will also try to kill it before failing over these resources. This happens when the failed node exceeds its timeout for reporting its health-check heartbeat to the cluster (via Consul).
In the future we will allow more advanced hardware-based fencing, such as STONITH/IPMI.
Last edited on June 29, 2017, 1:32 pm · #2
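To make the fencing logic above concrete, here is a minimal sketch of heartbeat-timeout fencing. This is not PetaSAN's actual code or its real Consul integration; the class name, the 15-second timeout, and the node/resource names are all hypothetical, and the real implementation reports heartbeats through Consul health checks rather than an in-process dict.

```python
import time

# Hypothetical timeout value for illustration only, not PetaSAN's real setting.
HEARTBEAT_TIMEOUT = 15.0

class FencingMonitor:
    """Tracks node heartbeats and decides which nodes must be fenced
    before their resources (e.g. iSCSI path IPs) are reassigned."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}   # node -> timestamp of last heartbeat
        self.resources = {}   # node -> list of resources it currently owns

    def heartbeat(self, node, now=None):
        """Record a health-check heartbeat from a node."""
        self.last_seen[node] = time.time() if now is None else now

    def assign(self, node, resource):
        """Record that a node currently serves a resource."""
        self.resources.setdefault(node, []).append(resource)

    def check(self, now=None):
        """Return (nodes_to_fence, resources_to_fail_over).

        Fencing happens first, then failover: the surviving nodes must
        agree the suspect node is down before taking over its paths.
        """
        now = time.time() if now is None else now
        dead = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        failover = [r for n in dead for r in self.resources.pop(n, [])]
        for n in dead:
            del self.last_seen[n]
        return dead, failover
```

Example: if node2 stops heartbeating while node1 keeps reporting, only node2 is fenced and its iSCSI path is queued for failover.

```python
m = FencingMonitor(timeout=15.0)
m.heartbeat("node1", now=0.0)
m.heartbeat("node2", now=0.0)
m.assign("node2", "iscsi-path-1")
m.heartbeat("node1", now=10.0)      # node1 stays healthy; node2 goes silent
dead, failover = m.check(now=20.0)  # node2 exceeded its 15 s timeout
```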
therm
121 Posts
June 30, 2017, 5:39 am
After rebooting a node, it often happens that the other nodes shut this node down again. How can we prevent this?
Regards,
Dennis
admin
2,930 Posts
June 30, 2017, 9:39 am
Just wait a couple of minutes before starting the machine again. This is inherent to fencing: the other nodes cannot be 100% sure whether the suspected node is now OK or is still dying, so fencing will continue until all the other nodes agree on distributing the failed resources among them.