Lost network config for a node after upgrade

I have a 4-node cluster with 3 nodes in the management + storage role and 1 node as storage only. Each node is configured with 2 NICs. When upgrading node 3 it updated fine, but upon reboot it lost its network config. On the affected node I added the IP back into /etc/network/interfaces for eth0, which is used for management, but eth1, the backend NIC, is down. How do I fix eth1? I checked the physical link, which is fine. I tried adding the IP with "ip addr add 10.20.30.3/24 dev eth1" and brought the NIC up, which worked, but Ceph was still broken and in the GUI node list node 3 appears as down, even though I can reach node 3 from all other nodes on both eth0 and eth1. I restarted node 3 again and now eth1 is down once more; eth0 is still up and reachable, but Ceph is still broken.
I compared with the other nodes but could not figure out where the eth1 config lives.
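
For reference, roughly what I ran to bring eth1 up by hand (temporary only, none of it survives a reboot):

ip addr add 10.20.30.3/24 dev eth1
ip link set eth1 up
ip addr show eth1      # confirm the address and that the link is UP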

If I run ceph health on node 3, it just hangs with no response.

On nodes 1/2/4, if I run ceph health I get this:

HEALTH_WARN no active mgr; 1/3 mons down

How can I get the working mgr nodes to take over while I try to fix node 3?

BTW, nodes 1 & 2 were upgraded fine from 3.1.0 to 3.3.0. Node 3 is down, while node 4 is still on 3.1.0.

Can you help?

You should not have to configure any IPs yourself. /etc/network/interfaces should only contain the management IP, which is set by the installer; the remaining IP configuration is handled by PetaSAN during its startup. What is the output of the following?

/opt/petasan/scripts/detect-interfaces.sh
ip addr

cat /etc/network/interfaces
cat /opt/petasan/config/cluster_info.json
cat /opt/petasan/config/node_info.json
cat /etc/udev/rules.d/70-persistent-net.rules

/opt/petasan/scripts/detect-interfaces.sh

device=eth0,mac=1c:34:da:53:d9:a0,pci=3b:00.0,model=Mellanox Technologies MT27710 Family [ConnectX-4 Lx],path=ens3f0np0
device=eth1,mac=1c:34:da:53:d9:a1,pci=3b:00.1,model=Mellanox Technologies MT27710 Family [ConnectX-4 Lx],path=ens3f1np1

ip add

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 1c:34:da:53:d9:a0 brd ff:ff:ff:ff:ff:ff
altname enp59s0f0np0
inet 10.8.93.20/24 brd 10.8.93.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::1e34:daff:fe53:d9a0/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 1c:34:da:53:d9:a1 brd ff:ff:ff:ff:ff:ff
altname enp59s0f1np1

cat /etc/network/interfaces
auto eth0
iface eth0 inet static
address 10.8.93.20
netmask 255.255.255.0
gateway 10.8.93.250
# up route add -net 10.8.0.0 netmask 255.255.0.0 gw 10.8.93.250
dns-nameservers 10.8.100.75

cat /opt/petasan/config/cluster_info.json
{
"backend_1_base_ip": "",
"backend_1_eth_name": "",
"backend_1_mask": "",
"backend_1_vlan_id": "",
"backend_2_base_ip": "",
"backend_2_eth_name": "",
"backend_2_mask": "",
"backend_2_vlan_id": "",
"bonds": [],
"default_pool": [],
"default_pool_pgs": "",
"default_pool_replicas": "",
"eth_count": 0,
"jf_mtu_size": "",
"jumbo_frames": [],
"management_eth_name": "",
"management_nodes": [],
"name": "a",
"storage_engine": "bluestore",
"writecache_writeback_inflight_mb": ""

 

There is no /opt/petasan/config/node_info.json on this node.

cat /etc/udev/rules.d/70-persistent-net.rules

# ADDED BY PETASAN, DO NOT MODIFY : DEFAULT_NAME=ens3f0np0, ASSIGNED_NAME=eth0
# ADDED BY PETASAN, DO NOT MODIFY : DEFAULT_NAME=ens3f1np1, ASSIGNED_NAME=eth1

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="1c:34:da:53:d9:a0", ATTR{type}=="1", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="1c:34:da:53:d9:a1", ATTR{type}=="1", NAME="eth1"

Looking at another host, I can see /opt/petasan/config/node_info.json is present and cluster_info.json is populated. Should I copy these files to this node and adjust them, or is there a way to auto-generate them?

There should be both a cluster_info.json and a node_info.json. You can copy them from the other nodes and adjust node_info.json with the correct backend IPs for this node. Make sure your /etc/network/interfaces has the correct management IP as set up during installation. Other than that, do not modify the files/IPs yourself.

You need to reboot the node for the files to take effect. The files are not auto-generated at run time; they are needed to bring up the network and connect the node to the cluster. The IPs/interfaces of services are auto-generated, but the backend ones are not, as they are needed to connect to the cluster in the first place.
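
A rough sketch of the copy, run from a working node ("node3" here stands for the affected node's management IP or hostname, so adjust to your setup):

scp /opt/petasan/config/cluster_info.json root@node3:/opt/petasan/config/
scp /opt/petasan/config/node_info.json root@node3:/opt/petasan/config/

Then on node 3, edit /opt/petasan/config/node_info.json so the node-specific values (node name, management IP, backend IPs) are node 3's own, and reboot.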

The root cause of why the files were empty/missing should be looked into. It probably happened before you did the upgrade but showed up after you booted.

Thanks, I copied the node_info.json and cluster_info.json files from a working node and adjusted them on the corrupted node 3. Restarted, and now the node is back in play. It is now repairing some PGs.
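
I am watching the repair with the standard Ceph status commands (nothing PetaSAN-specific):

ceph -s                # overall status and recovery/repair progress
ceph health detail     # lists the PGs that are still degraded or inconsistent
ceph pg stat           # one-line PG state summary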

Also, when node 3 was down, why didn't the other 2 nodes with the management role kick in? The management GUI on the other nodes was not functioning properly, e.g. the homepage status was not loading.

What would have been the way to fix that issue while one of the management nodes is down?

And lastly, excellent help provided in the forums. I will surely recommend this product at my workplace.

One potential cause of 1 node down bringing the system to its knees is if you set your recovery/backfill speed too high (from the UI) and your hardware or disks are very slow; the recovery load itself could lead to this. You should always simulate such failover cases.
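
If you want to check the current throttles from the command line, these are the standard Ceph options that roughly correspond to the recovery/backfill speed (the UI is still the place to change them):

ceph config get osd osd_max_backfills          # concurrent backfills per OSD
ceph config get osd osd_recovery_max_active    # concurrent recovery ops per OSD
ceph config get osd osd_recovery_sleep_hdd     # per-op sleep that throttles recovery on HDDs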

The system was performing fine. It was only the dashboard with graphs and stats that was not showing anything; the rest of the GUI and its options were available. The cluster is built with newer servers and SSDs, so load-wise it wasn't an issue.
So I am not sure why a single node triggered that issue.

You need to test/simulate such a node failure and make sure you can access any management node.

Can you please post the following charts from around the time the unresponsive UI happened:

Cluster Statistics -> PG Status

Node Statistics -> Disk % Utilization (on any storage node)