
Replace management node not working

Hi,

Trying to replace management node 3.

First try of the procedure fails with this error:

Error deploying node.

Error List

Error connecting to first node on backend 1 interface
Error connecting to first node on backend 2 interface
Error connecting to second node on backend 1 interface
Error connecting to second node on backend 2 interface

Petasan.log shows:

10/01/2020 02:46:14 INFO Replace node is starting.
10/01/2020 02:46:22 ERROR Connection ping error.
10/01/2020 02:46:22 ERROR ['core_deploy_ping_error_node_1_backend1', 'core_deploy_ping_error_node_1_backend2', 'core_deploy_ping_error_node_2_backend1',$

After this, I connected to the new node by SSH and saw that the first ping fails, but after that first failure everything is OK (Cisco behavior).

When I try again, the error changes to:

Node interfaces do not match cluster interface settings

Nodes 1 and 2 have these interfaces:

root@CEPH-22:~# ifconfig |grep UP
backend: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9000
backend.74: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
backend.75: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
eth1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eth2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eth3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth5: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
eth6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth7: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
eth4.74: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth6.75: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

Node 3 has these after running the replace management node procedure:

backend: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9000
backend.74: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
backend.75: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
eth1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eth2: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eth3: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
eth4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth5: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
eth6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
eth7: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

As you can see, the eth4.74 and eth6.75 interfaces are missing on the new node.
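(For context, eth4.74 and eth6.75 are plain 802.1q VLAN sub-interfaces, i.e. VLANs 74 and 75 tagged on top of eth4 and eth6. Just to illustrate the naming, and not as a fix, creating one by hand would look roughly like this small Python wrapper around the ip command; interface names and VLAN ids are taken from the listings above.)

#!/usr/bin/env python3
# Illustration only: create a VLAN sub-interface such as eth4.74 by hand (run as root).
import subprocess

subprocess.check_call(['ip', 'link', 'add', 'link', 'eth4',
                       'name', 'eth4.74', 'type', 'vlan', 'id', '74'])
subprocess.check_call(['ip', 'link', 'set', 'eth4.74', 'up'])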

Petasan.log shows:

10/01/2020 02:37:17 INFO Starting replace.
10/01/2020 02:37:17 INFO Successfully copied public keys.
10/01/2020 02:37:17 INFO Successfully copied private keys.
10/01/2020 02:37:18 INFO Starting to copy config file
10/01/2020 02:37:18 INFO Successfully copied config file.
10/01/2020 02:37:18 INFO Successfully joined to cluster SSD1
10/01/2020 02:37:18 INFO password set successfully.

All nodes were initially installed with version 2.3.1 and upgraded to 2.4.0. The new node is being installed with the 2.4.0 ISO.

I found a Python script which checks the eth count; this is its output:

root@CEPH-23:/opt/petasan/scripts/util# ./check_interfaces_match.py
cluster management interface eth0
node management interface eth0
management interface match
cluster eth count 8
node eth count 9
Error: eth count mis-match !!
detected interfaces
eth0
eth1
eth2
eth3
eth4
eth5
eth6
eth7
backend

After this I ran:

ip link delete backend
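For anyone curious, the check seems to boil down to counting the node's non-VLAN interfaces and comparing the count against the cluster settings, so the leftover backend bond from the failed first attempt pushes the node count to 9 against the expected 8. A rough Python sketch of the idea (illustration only, not the actual PetaSAN script):

#!/usr/bin/env python3
# Rough illustration of the concept behind the eth count check, not the real code.
# List the node's interfaces, skipping loopback and VLAN sub-interfaces (names with a dot),
# so a stale bond such as "backend" shows up as an extra interface and breaks the count.
import os

def detected_interfaces():
    return [name for name in os.listdir('/sys/class/net')
            if name != 'lo' and '.' not in name]

names = detected_interfaces()
print('node interface count:', len(names))
for name in names:
    print(name)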

I bypassed the ping checks as another user did in this thread:

https://www.petasan.org/forums/?view=thread&id=529

I retried the replace procedure and it worked.

There seems to be a problem with the ping delay during initial setup, as reported by other users around the forum, and it also affects this replace function.

Thanks,

Thanks for sharing this.

For the ping test: in v2.5 we are adding several retries before reporting a connection error. Note that in your case it failed once within PetaSAN and then a second time when you tried it manually via SSH; ideally it would be good to find out why these failed. The ping tests were added to prevent later connection errors when setting up Ceph and Consul, which can be difficult to debug.

Like other HA systems, PetaSAN relies during operation on dynamic IP failover to switch IP paths from a failed node to the other nodes; this should be a quick process and the IP should become responsive quickly. If there is anything you can check as to why it takes so long (or several attempts?) in your setup for the IP to become functional, and whether there are settings to control this, that would be best, as we want to prevent this during production failover. Note that the other forum topic you link reported the IP took 3-4 seconds to become active; in your case it is longer. Again, we are adding retry attempts to our ping test, but we do not want to be too forgiving, as that could mask other problems. Also note that these same tests passed when you first installed your cluster.
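Conceptually the retry is along the lines of the sketch below; this is only an illustration of the idea, not the actual code we ship (the retry and timeout values are made up):

#!/usr/bin/env python3
# Illustration only: ping an ip a few times before reporting a connection error,
# so a single dropped first packet (e.g. switch ARP learning) does not fail the deploy.
import subprocess
import time

def ping_ok(ip, retries=3, delay=1.0, timeout=2):
    for _ in range(retries):
        rc = subprocess.call(['ping', '-c', '1', '-W', str(timeout), ip],
                             stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if rc == 0:
            return True
        time.sleep(delay)
    return False

print(ping_ok('10.0.74.21'))   # made-up backend address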

For the interface count: yes, currently the nodes need to have the same number of NICs; version 2.5 will not require this.

 

Hi there,

Thanks for your answer.

Just to clarify some points:

"Note that in your case it failed once within PetaSAN and then a second time when you tried it manually via SSH; ideally it would be good to find out why these failed."

This is common Cisco switch behavior related to ARP table learning: the first ping to a not-yet-learned IP always fails.

https://community.cisco.com/t5/routing/first-ping-fails/td-p/1793423

https://learningnetwork.cisco.com/thread/85577

In fact, I saw this issue for the first time when we upgraded our networking devices from HP 1G copper to Cisco 10G fiber.

"Also note that these same tests passed when you first installed your cluster."

In fact, when we deployed for the first time we had a similar issue and had to retry the join process on every node. On the second try, as the IP address had already been pinged once, it worked flawlessly.

I will try to dig a bit more into this Cisco behavior and check if there is any way to speed this up and prevent the first ping failure.
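In the meantime, a workaround I may try (just an idea on my side, not something PetaSAN does) is to pre-ping the backend IPs of the existing nodes from the new node right before starting the replace, so the switch has already learned the addresses by the time the deploy runs its checks. Something like:

#!/usr/bin/env python3
# Idea only: pre-ping the existing nodes' backend ips from the new node before the deploy,
# so the first "real" ping during the replace is not the one that gets dropped.
import subprocess

BACKEND_IPS = ['10.0.74.21', '10.0.74.22', '10.0.75.21', '10.0.75.22']  # made-up, adjust to the real cluster
for ip in BACKEND_IPS:
    subprocess.call(['ping', '-c', '2', '-W', '2', ip])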

Thanks as usual,