Lost network config for a node after upgrade

I have a 4-node cluster with 3 nodes in the management + storage role and 1 node as storage only. Each node is configured with 2 NICs. When upgrading node 3 it updated fine, but upon reboot it lost its network config. On the affected node I added the IP back into /etc/network/interfaces for eth0, which is used for management, but eth1, the backend NIC, is down. How do I fix eth1? I checked the physical link, which is fine. I tried adding the IP with "ip addr add 10.20.30.3/24 dev eth1" and brought the NIC up, which worked, but Ceph was still broken and in the GUI node list node 3 appears as down, even though I can reach node 3 from all other nodes on both eth0 and eth1. I restarted node 3 again and now eth1 is down once more; eth0 is still up and reachable, but Ceph is still broken.
I compared with the other nodes but could not figure out where the eth1 config lives.
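
For reference, roughly what I ran to bring eth1 up by hand (temporary only, none of it survives a reboot):

ip addr add 10.20.30.3/24 dev eth1
ip link set eth1 up
ip addr show eth1      # confirm the address and that the link is UP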

If I run ceph health on node 3, it just hangs with no response.

On nodes 1/2/4, if I run ceph health I get this:

HEALTH_WARN no active mgr; 1/3 mons down

How can I get the working mgr nodes to take over while I try to fix node 3?

BTW, nodes 1 & 2 were upgraded fine from 3.1.0 to 3.3.0. Node 3 is down, while node 4 is still on 3.1.0.

Can you help?

You should not have to configure any IPs yourself. /etc/network/interfaces should only contain the management IP, which is set by the installer; the remaining IP configuration is handled by PetaSAN during its startup. What is the output of the following?

/opt/petasan/scripts/detect-interfaces.sh
ip addr

cat /etc/network/interfaces
cat /opt/petasan/config/cluster_info.json
cat /opt/petasan/config/node_info.json
cat /etc/udev/rules.d/70-persistent-net.rules

/opt/petasan/scripts/detect-interfaces.sh

device=eth0,mac=1c:34:da:53:d9:a0,pci=3b:00.0,model=Mellanox Technologies MT27710 Family [ConnectX-4 Lx],path=ens3f0np0
device=eth1,mac=1c:34:da:53:d9:a1,pci=3b:00.1,model=Mellanox Technologies MT27710 Family [ConnectX-4 Lx],path=ens3f1np1

ip add

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 1c:34:da:53:d9:a0 brd ff:ff:ff:ff:ff:ff
altname enp59s0f0np0
inet 10.8.93.20/24 brd 10.8.93.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::1e34:daff:fe53:d9a0/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:53:d9:a1 brd ff:ff:ff:ff:ff:ff
altname enp59s0f1np1

cat /etc/network/interfaces
auto eth0
iface eth0 inet static
address 10.8.93.20
netmask 255.255.255.0
gateway 10.8.93.250
# up route add -net 10.8.0.0 netmask 255.255.0.0 gw 10.8.93.250
dns-nameservers 10.8.100.75

cat /opt/petasan/config/cluster_info.json
{
"backend_1_base_ip": "",
"backend_1_eth_name": "",
"backend_1_mask": "",
"backend_1_vlan_id": "",
"backend_2_base_ip": "",
"backend_2_eth_name": "",
"backend_2_mask": "",
"backend_2_vlan_id": "",
"bonds": [],
"default_pool": [],
"default_pool_pgs": "",
"default_pool_replicas": "",
"eth_count": 0,
"jf_mtu_size": "",
"jumbo_frames": [],
"management_eth_name": "",
"management_nodes": [],
"name": "a",
"storage_engine": "bluestore",
"writecache_writeback_inflight_mb": ""

 

There is no /opt/petasan/config/node_info.json on this node.

cat /etc/udev/rules.d/70-persistent-net.rules

# ADDED BY PETASAN, DO NOT MODIFY : DEFAULT_NAME=ens3f0np0, ASSIGNED_NAME=eth0
# ADDED BY PETASAN, DO NOT MODIFY : DEFAULT_NAME=ens3f1np1, ASSIGNED_NAME=eth1

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="1c:34:da:53:d9:a0", ATTR{type}=="1", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="1c:34:da:53:d9:a1", ATTR{type}=="1", NAME="eth1"

Looking at another host, I can see /opt/petasan/config/node_info.json is present and cluster_info.json is populated. Should I copy these files to this node and adjust them, or is there a way to auto-generate them?

There should be both a cluster_info.json and a node_info.json. You can copy them from the other nodes and adjust node_info.json with the correct backend IPs for this node. Make sure your /etc/network/interfaces has the correct management IP as set up during installation. Other than that, do not modify the files/IPs yourself.

You need to reboot the node for the files to take effect. The files are not auto-generated at run time; they are needed to bring up the network and connect the node to the cluster. The IPs/interfaces of services are auto-generated, but the backend ones are not, as they are needed to connect to the cluster in the first place.
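
A rough sketch of the copy, run from a working node ("node3" here stands for the affected node's management IP or hostname, so adjust to your setup):

scp /opt/petasan/config/cluster_info.json root@node3:/opt/petasan/config/
scp /opt/petasan/config/node_info.json root@node3:/opt/petasan/config/

Then on node 3, edit /opt/petasan/config/node_info.json so the node-specific values (node name, management IP, backend IPs) are node 3's own, and reboot.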

The root cause of why the files were empty/missing should be looked into. It probably happened before you did the upgrade but showed up after you booted.

Thanks, I copied the node_info.json and cluster_info.json files from a working node and adjusted them on the corrupted node 3. Restarted, and now the node is back in play. It is now repairing some PGs.
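
I am watching the repair with the standard Ceph status commands (nothing PetaSAN-specific):

ceph -s                # overall status and recovery/repair progress
ceph health detail     # lists the PGs that are still degraded or inconsistent
ceph pg stat           # one-line PG state summary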

Also, when node 3 was down, why didn't the other 2 nodes with the management role kick in? The management GUI on the other nodes was not functioning properly, e.g. the homepage status was not loading.

What would have been the way to fix that issue while one of the management nodes is down?

And lastly, excellent help provided in the forums. I will surely recommend this product at my workplace.

One potential cause of 1 node down bringing the system to its knees is if you set your recovery/backfill speed too high (from the UI) and your hardware or disks are very slow; the recovery load itself could lead to this. You should always simulate such failover cases.
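
If you want to check the current throttles from the command line, these are the standard Ceph options that roughly correspond to the recovery/backfill speed (the UI is still the place to change them):

ceph config get osd osd_max_backfills          # concurrent backfills per OSD
ceph config get osd osd_recovery_max_active    # concurrent recovery ops per OSD
ceph config get osd osd_recovery_sleep_hdd     # per-op sleep that throttles recovery on HDDs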

The system was performing fine. It was only the dashboard with graphs and stats that was not showing anything; the rest of the GUI and its options were available. The cluster is built with newer servers and SSDs, so load-wise it wasn't an issue.
So I am not sure why a single node triggered that issue.

You need to test/simulate such a node failure and make sure you can access any management node.

Can you please post the following charts from around the time the unresponsive UI happened:

Cluster Statistics -> PG Status

Node Statistics -> Disk % Utilization (on any storage node)