Replacing management server failed
Vaddon
6 Posts
September 24, 2019, 4:04 pm
Hi,
I've had a failure of one of my management nodes (my fault), so I've re-installed it and given it the same IP. In the deploy UI I chose to replace the management node, and it correctly shows the settings of the node I'm replacing. However, when attempting to finish I get:
Error connecting to second node on backend 1 interface
Error connecting to second node on backend 2 interface
The root user / password gets deployed, and if I SSH to the replacement node I can ping the backend interfaces of the second node in the cluster just fine. The logs don't seem to show anything related to this error (they show successful pings to the other nodes). The interfaces come up fine, the MTU is set fine, etc.
Not really sure where to go from here?
Last edited on September 24, 2019, 4:08 pm by Vaddon · #1
Vaddon
6 Posts
September 24, 2019, 4:31 pm
OK, after some more digging and testing, it appears that my SFP modules (10Gb Twinax direct-attach) take 3-4 seconds to come up after changing the MTU, but the deploy script only seems to wait around 2 seconds before sending the ping requests to check whether the node is up. The result is that the first pings to the second node fail, while pings to the third node work fine. If I then manually ping the second node's IPs, the modules have come up in the meantime, so it works fine with 0 failures.
From the logs:
Sep 24 17:21:52 PETASAN01 kernel: [ 5622.934784] ixgbe 0000:41:00.0 eth2: changing MTU from 9000 to 1500
Sep 24 17:21:53 PETASAN01 kernel: [ 5623.213710] ixgbe 0000:41:00.1 eth3: changing MTU from 9000 to 1500
Sep 24 17:21:53 PETASAN01 kernel: [ 5623.280083] ixgbe 0000:41:00.0 eth2: detected SFP+: 3
Sep 24 17:21:53 PETASAN01 kernel: [ 5623.497916] ixgbe 0000:41:00.0 eth2: changing MTU from 1500 to 9000
Sep 24 17:21:53 PETASAN01 kernel: [ 5623.548266] ixgbe 0000:41:00.1 eth3: detected SFP+: 4
Sep 24 17:21:53 PETASAN01 kernel: [ 5623.770781] ixgbe 0000:41:00.1 eth3: changing MTU from 1500 to 9000
Sep 24 17:21:53 PETASAN01 kernel: [ 5623.832220] ixgbe 0000:41:00.0 eth2: detected SFP+: 3
Sep 24 17:21:54 PETASAN01 kernel: [ 5624.104284] ixgbe 0000:41:00.1 eth3: detected SFP+: 4
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: PING 10.255.99.10 (10.255.99.10) 56(84) bytes of data.
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 64 bytes from 10.255.99.10: icmp_seq=1 ttl=64 time=0.015 ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: --- 10.255.99.10 ping statistics ---
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: rtt min/avg/max/mdev = 0.015/0.015/0.015/0.000 ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: PING 10.255.96.1 (10.255.96.1) 56(84) bytes of data.
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 64 bytes from 10.255.96.1: icmp_seq=1 ttl=64 time=0.011 ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: --- 10.255.96.1 ping statistics ---
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: rtt min/avg/max/mdev = 0.011/0.011/0.011/0.000 ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: PING 10.255.97.1 (10.255.97.1) 56(84) bytes of data.
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 64 bytes from 10.255.97.1: icmp_seq=1 ttl=64 time=0.010 ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: --- 10.255.97.1 ping statistics ---
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: rtt min/avg/max/mdev = 0.010/0.010/0.010/0.000 ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: PING 10.255.99.12 (10.255.99.12) 56(84) bytes of data.
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 64 bytes from 10.255.99.12: icmp_seq=1 ttl=64 time=0.167 ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: --- 10.255.99.12 ping statistics ---
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
Sep 24 17:21:54 PETASAN01 deploy.py[1267]: rtt min/avg/max/mdev = 0.167/0.167/0.167/0.000 ms
Sep 24 17:21:54 PETASAN01 kernel: [ 5624.527542] ixgbe 0000:41:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Sep 24 17:21:54 PETASAN01 kernel: [ 5624.804839] ixgbe 0000:41:00.1 eth3: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Sep 24 17:21:56 PETASAN01 deploy.py[1267]: PING 10.255.96.3 (10.255.96.3) 56(84) bytes of data.
Sep 24 17:21:56 PETASAN01 deploy.py[1267]: --- 10.255.96.3 ping statistics ---
Sep 24 17:21:56 PETASAN01 deploy.py[1267]: 2 packets transmitted, 0 received, 100% packet loss, time 1026ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: PING 10.255.97.3 (10.255.97.3) 56(84) bytes of data.
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: --- 10.255.97.3 ping statistics ---
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: 2 packets transmitted, 0 received, 100% packet loss, time 1008ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: PING 10.255.99.11 (10.255.99.11) 56(84) bytes of data.
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: 64 bytes from 10.255.99.11: icmp_seq=1 ttl=64 time=0.185 ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: --- 10.255.99.11 ping statistics ---
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: rtt min/avg/max/mdev = 0.185/0.185/0.185/0.000 ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: PING 10.255.96.2 (10.255.96.2) 56(84) bytes of data.
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: 64 bytes from 10.255.96.2: icmp_seq=1 ttl=64 time=0.071 ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: --- 10.255.96.2 ping statistics ---
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: rtt min/avg/max/mdev = 0.071/0.071/0.071/0.000 ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: PING 10.255.97.2 (10.255.97.2) 56(84) bytes of data.
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: 64 bytes from 10.255.97.2: icmp_seq=1 ttl=64 time=0.960 ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: --- 10.255.97.2 ping statistics ---
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
Sep 24 17:21:58 PETASAN01 deploy.py[1267]: rtt min/avg/max/mdev = 0.960/0.960/0.960/0.000 ms
Yeah, I know I posted my IPs and that's not best practice, but this is effectively a test cluster at this point.
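If it helps to put a number on that window, something like this rough, untested sketch (run as root on the replacement node) should time how long after the MTU change the first ping to the second node's backend IP gets through. The interface, MTU and peer IP below are from my setup; adjust as needed.

#!/usr/bin/env python3
# Rough sketch: toggle the MTU the way the deploy step does, then time how
# long until the first ping to a backend peer succeeds. Run as root.
import subprocess
import time

IFACE, MTU, PEER = "eth2", "9000", "10.255.96.3"   # adjust for your setup

subprocess.run(["ip", "link", "set", IFACE, "mtu", "1500"], check=True)
subprocess.run(["ip", "link", "set", IFACE, "mtu", MTU], check=True)

start = time.monotonic()
while subprocess.run(["ping", "-c", "1", "-W", "1", PEER],
                     stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL).returncode != 0:
    time.sleep(0.2)  # avoid a tight loop if ping fails immediately
print("First successful ping after %.1f s" % (time.monotonic() - start))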
Last edited on September 24, 2019, 5:27 pm by Vaddon · #2
Vaddon
6 Posts
September 24, 2019, 7:31 pm
OK, so I've been playing around some more (because apparently I have nothing better to do when I'm at home) and edited deploy.py to bypass the checks:
#if not Network().ping(backend1_host):
# status_report.failed_tasks.append(
# 'core_deploy_ping_error_node_{}_backend1'.format(management_nodes.index(node) + 1))
#if not Network().ping(backend2_host):
# status_report.failed_tasks.append(
# 'core_deploy_ping_error_node_{}_backend2'.format(management_nodes.index(node) + 1))
Restarted the service with: systemctl restart petasan-deploy
Then re-ran the setup and it worked just fine. I suppose I could have edited the ping command in network.py to add a sleep or a retry to resolve the issue, but this also worked as a quick and dirty hack. Obviously bypassing sanity checks like this isn't a good idea, but the node is replaced just fine now 🙂
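For reference, the less drastic fix I mentioned would be something along these lines in the ping check: retry for a few seconds instead of failing on the first miss. This is just a rough sketch of the idea, not the actual PetaSAN network.py code; the function name and timings are made up.

import subprocess
import time

def ping_with_retry(ip, retries=5, interval=1.0):
    """Ping once per attempt, retrying for a few seconds so a link that is
    still coming up after an MTU change has time to pass the check."""
    for attempt in range(retries):
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", ip],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            return True
        if attempt < retries - 1:
            time.sleep(interval)
    return False

# e.g. ping_with_retry("10.255.96.3") should pass once the SFPs are up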
admin
2,918 Posts
September 24, 2019, 9:42 pm
Thanks for sharing this. The IPs are set in the deploy wizard on the page where you enter the backend 1 and 2 IPs for the node. This is followed by the page where you set the node role and choose your local disks, and then on the last page we do the actual node-joining action, which includes the ping test you describe. So maybe on your setup it needs a further delay.
It would help to quantify what a reasonable delay should be, and whether adding it in software is justified or whether it is something wrong with your setup. It would be great if you could confirm that it is indeed 3-4 seconds, and not more, using pure networking commands. Note that PetaSAN, like many other highly available systems, depends on IP failover during node failures and expects dynamically set IPs to become active in a relatively short time; if you have too much delay, you may face other issues down the road during IP takeover.
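For example, something like this rough, untested sketch (the interface name is a placeholder, run as root) would show when the kernel reports carrier on the interface again after the MTU change, separately from when pings start to pass:

#!/usr/bin/env python3
# Rough sketch: change the MTU, wait for the link to drop, then time how long
# until the kernel reports carrier again. Not PetaSAN code.
import subprocess
import time

IFACE = "eth2"   # placeholder, use your backend interface

def carrier_up(iface):
    # /sys/class/net/<iface>/carrier reads "1" once the link is up; reading it
    # can raise OSError while the interface is resetting.
    try:
        with open("/sys/class/net/%s/carrier" % iface) as f:
            return f.read().strip() == "1"
    except OSError:
        return False

subprocess.run(["ip", "link", "set", IFACE, "mtu", "1500"], check=True)
subprocess.run(["ip", "link", "set", IFACE, "mtu", "9000"], check=True)

# The reset is asynchronous, so first wait briefly for the carrier to drop.
deadline = time.monotonic() + 5
while carrier_up(IFACE) and time.monotonic() < deadline:
    time.sleep(0.05)

start = time.monotonic()
while not carrier_up(IFACE):
    time.sleep(0.1)
print("Carrier back after %.1f s" % (time.monotonic() - start))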