Switch Failover Test Issues
exitsys
43 Posts
October 16, 2020, 11:19 am
Today we have tested switch failover.
We restarted one of the two 10G switches.
Each node has two dual-port 10G cards and one 4-port copper card.
On each 10G card, one port belongs to the backend bond (balance-alb) and the other carries iSCSI 1 or iSCSI 2:
from card 1: backend bond member to switch A
from card 1: iSCSI 1 to switch A
from card 2: backend bond member to switch B
from card 2: iSCSI 2 to switch B
Management is connected via copper to a third switch that has a path to each of the other two switches.
If we now restart switch A, the management addresses become very hard to reach in the browser, although ping to all management addresses continues without interruption.
Much worse, the iSCSI targets are stopped.
What could be the reason for this, and how do I narrow down the error?
In the meantime, two nodes have shut down again.
Thanks for your help
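For anyone repeating this test, a simple way to record whether any path actually drops while a switch reboots is to leave parallel pings running against the management, backend and iSCSI addresses of each node. A minimal sketch; the IPs below are placeholders, not the addresses from this setup:
for ip in 10.0.1.11 10.0.2.11 10.0.3.11; do        # placeholder addresses
    ping -O -i 1 "$ip" > "ping_$ip.log" 2>&1 &     # -O reports each missed reply
done
# reboot switch A, then stop the pings and check the logs for gaps
kill $(jobs -p)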
Last edited on October 16, 2020, 11:22 am by exitsys · #1
admin
2,930 Posts
October 16, 2020, 12:01 pm
If nodes were shut down, it means the backend connection was lost. Check that the backend bond has been set up correctly, both on the servers and on the switches. If the bond works in a single-switch setup, it is probably an inter-switch configuration issue.
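A quick way to verify the bond on the server side is to watch the kernel's bonding state while the switch is rebooted. A minimal sketch, assuming the backend bond shows up as bond0 and eth2 is one of its slaves (PetaSAN assigns its own names, so check the actual devices first):
ip -br link | grep bond                  # list the bond devices and their state
cat /proc/net/bonding/bond0              # mode, MII status and the state of each slave port
ethtool eth2 | grep -i "speed\|link"     # per-slave link status and negotiated speed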
exitsys
43 Posts
October 16, 2020, 3:39 pm
So I actually had a bug in the wiring. The problem with iSCSI stopping is gone, but I still have the problem that one of the nodes shuts down. I have triple-checked the wiring now and cannot find any error. The whole backend/dashboard hangs so hard it is hardly usable; at that moment the cluster was giving these warnings:
8 osds down
1 host (8 osds) down
Long heartbeat ping times on back interface seen, longest is 11905.478 msec
Long heartbeat ping times on front interface seen, longest is 11905.207 msec
Degraded data redundancy: 18058/54174 objects degraded (33.333%), 534 pgs degraded, 1088 pgs undersized
16 slow ops, oldest one blocked for 98 sec, n03 has slow ops
1/3 mons down, quorum n01,n02
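While such a warning is active, the standard Ceph views are usually enough to see which host and OSDs are involved; nothing PetaSAN-specific is assumed here:
ceph health detail     # expands the warning, naming the OSDs with long heartbeat times and slow ops
ceph osd tree          # shows which host the down OSDs belong to
ceph -s                # overall status, mon quorum and recovery progress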
Last edited on October 16, 2020, 3:41 pm by exitsys · #3
admin
2,930 Posts
October 16, 2020, 5:17 pm
You could switch fencing off to avoid the shutdown, but that only masks the problem and could cause data integrity issues.
Apart from hardware issues, it could be the recovery load being set too high for your hardware. Can you show the output of:
ceph config get osd.* osd_recovery_sleep
ceph config get osd.* osd_recovery_max_active
ceph config get osd.* osd_max_backfills
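If those come back higher than the conservative defaults, they can be dialed back again with ceph config set; the values below are just the usual defaults, shown for illustration:
ceph config set osd osd_recovery_sleep 0.1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_max_backfills 1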
exitsys
43 Posts
October 16, 2020, 5:29 pm
ceph config get osd.* osd_recovery_sleep ---> 0.100000
ceph config get osd.* osd_recovery_max_active ---> 1
ceph config get osd.* osd_max_backfills ---> 1
Hardware is 3 x HPE DL360 Gen 9, each with the following configuration:
2 x Intel Xeon E5-2630L v3 (SR209), 8 cores @ 1.80 GHz
128 GB RAM
2 x dual-port 10G fiber NICs
1 x 120 GB Intel SSD for the OS
8 x PM883 1.92 TB SSDs for OSDs
5 slow ops, oldest one blocked for 2261 sec, n03 has slow ops
Last edited on October 16, 2020, 5:44 pm by exitsys · #5
exitsys
43 Posts
October 16, 2020, 5:49 pm
Could it have something to do with the backend bond being balance-alb? Should I change it to active-backup? Is it enough to change cluster_info.json on all three nodes via SSH and reboot all nodes?
admin
2,930 Posts
October 16, 2020, 6:01 pm
Look at the charts for any load on the node that would cause these ping delays.
exitsys
43 Posts
October 16, 2020, 6:07 pm
I am currently only testing failover; nothing is connected to the cluster yet. No CPU usage, 5% memory used. Absolutely no load on any of the three nodes.
What do you think about the bond change?
admin
2,930 Posts
October 16, 2020, 6:13 pm
You could try the different bond mode. I am not sure about the syntax, whether it requires a primary interface to be defined; to be sure, install a VM test node and look at its cluster_info.json.
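For reference, in plain Linux bonding the active-backup mode works without a primary slave; if none is set the driver simply picks one, and a preferred slave can be pinned at runtime through sysfs. A minimal sketch outside of PetaSAN's own configuration handling; bond and interface names are examples:
cat /sys/class/net/bond0/bonding/mode              # e.g. "active-backup 1"
cat /sys/class/net/bond0/bonding/active_slave      # slave currently carrying the traffic
echo eth2 > /sys/class/net/bond0/bonding/primary   # optional: prefer eth2 whenever its link is up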
exitsys
43 Posts
October 16, 2020, 6:20 pm
I can only strongly advise against balance-alb. I have now switched to active-backup and the cluster has no more problems at all. Both switches have been restarted several times in a row and everything runs without any warning. I think we are now ready and can use PetaSAN productively. In the 4K IOPS benchmark (2 nodes, 256 threads) we get 5-minute values of 60k read and 39k write. I think that is good, isn't it? Thanks for your help.