2 Nodes shutting down after Network impact
eazyadm
25 Posts
April 20, 2022, 8:11 am
Hi,
We are testing PetaSAN with iSCSI and S3 in unusual scenarios to make sure we can use it in production.
At the moment our environment has 6 storage nodes and 3 monitoring nodes.
In the last test, we blocked all network traffic on both switches, so no communication between any of the nodes was possible. We left it in this state for about 6 hours.
The result was that two storage nodes (2 and 3) were shut down automatically.
We rebooted the switches, so network traffic was possible again. We then powered on both offline nodes; after they were up, nodes 1 and 3 shut down.
How can we debug this situation?
Ceph health shows the following warnings:
Thanks
Kind Regards
Last edited on April 21, 2022, 1:51 pm by eazyadm · #1
admin
2,930 Posts
April 20, 2022, 11:34 am
The shutdown was probably due to fencing. You can disable fencing from the maintenance page, but it is better to leave it enabled.
In most cases, with a 2-switch setup and assuming you have bonded interfaces for HA, you would want to test shutting down 1 switch at a time to verify that the network setup is highly available and the cluster keeps functioning.
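For the one-switch-at-a-time test, it can help to confirm on each node that the bond really lost only one member. The following is a minimal sketch, assuming a Linux bonding interface whose state is exposed under `/proc/net/bonding/`; the interface names (`bond0`, `eth0`, `eth1`) and the sample text are placeholders, not taken from this cluster.

```python
# Sketch: list the slaves of a Linux bond and their link state.
# SAMPLE is illustrative text in the format of /proc/net/bonding/<bond>;
# the interface names are assumptions, not from the original posts.
SAMPLE = """\
Ethernet Channel Bonding Driver: v5.4.0

Bonding Mode: fault-tolerance (active-backup)
MII Status: up

Slave Interface: eth0
MII Status: up
Permanent HW addr: 00:11:22:33:44:55

Slave Interface: eth1
MII Status: down
Permanent HW addr: 00:11:22:33:44:66
"""


def parse_bond_status(text):
    """Return {slave_name: mii_status} parsed from bonding status text."""
    slaves = {}
    current = None  # name of the slave section we are currently inside
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" line appears before any slave
            # section, so it is skipped while current is still None.
            slaves[current] = value
    return slaves


def read_bond_status(bond="bond0"):
    """Read and parse the live status of a bond (bond name is an assumption)."""
    with open(f"/proc/net/bonding/{bond}") as handle:
        return parse_bond_status(handle.read())


status = parse_bond_status(SAMPLE)
down = [name for name, state in status.items() if state != "up"]
print("slaves:", status)
print("down:", down)
```

With one switch off, you would expect exactly one slave per bond to be reported as down while the node stays in the cluster.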
Shutting down both switches is essentially a cluster shutdown: all nodes lose connection to one another and you have no cluster. For the nodes to re-peer, the easiest thing is to restart all nodes.
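After restarting the nodes, one way to see whether the cluster has re-peered is to look at the Ceph health checks. The sketch below assumes the `ceph` CLI is available on the node and that `ceph status --format json` returns a `health` object with `status` and `checks`; the sample JSON is made up for illustration and is not the output from this thread.

```python
import json
import subprocess

# Illustrative sample only; not the health output of the cluster discussed here.
SAMPLE_STATUS = json.dumps({
    "health": {
        "status": "HEALTH_WARN",
        "checks": {
            "OSD_DOWN": {
                "severity": "HEALTH_WARN",
                "summary": {"message": "2 osds down"},
            }
        },
    }
})


def summarize_health(status_json):
    """Return (overall_status, ["CHECK: message", ...]) from status JSON."""
    data = json.loads(status_json)
    health = data.get("health", {})
    checks = health.get("checks", {})
    lines = [
        f"{name}: {check.get('summary', {}).get('message', '')}"
        for name, check in sorted(checks.items())
    ]
    return health.get("status", "UNKNOWN"), lines


def fetch_status():
    """Run the ceph CLI on this node and return its JSON status as text."""
    result = subprocess.run(
        ["ceph", "status", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


overall, messages = summarize_health(SAMPLE_STATUS)
print(overall)
for message in messages:
    print(" ", message)
```

On a live node, `summarize_health(fetch_status())` would report the current state; the checks should clear as the OSDs rejoin.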
Last edited on April 20, 2022, 11:34 am by admin · #2
eazyadm
25 Posts
April 21, 2022, 6:51 am
Thanks for your answer.
This shouldn't be a normal situation, but we test things like this to see what can happen.
After several restarts of the offline nodes, all of them came back.
We ran the test again with fencing disabled, and no node shut down.
Thanks a lot, and have a nice day.