
NFS issues (closed port, restart ganesha container) after update from 3.0.2 to 3.0.3



Everything worked fine until the update. At the time of the upgrade the cluster status was OK. The cluster has 3 nodes.

After the update, the hosts could not write data to NFS (RO mode).

#Try 1
I removed the NFS export from the panel and added it again.
It didn't work; now the hosts can no longer connect to NFS at all, even in RO mode.

#Try 2
On all hosts I turned the NFS service off and back on - it didn't work.

#Try 3
I restarted the entire cluster - it didn't work.

#Try 4
I removed 3 IP addresses from the NFS configuration and left one - it didn't work.

#Try 5
I disabled the firewall on all hosts - it didn't work.

Logs show no errors.

nmap shows the port is filtered or closed (both from localhost and from another host on the network).
netstat shows that NFS is listening on the correct IPs on the correct interface.
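For reference, the two checks above could be reproduced with something like this (a sketch, not exact output; `ss` stands in for netstat on modern distros, and 172.30.0.172 is one of the NFS public IPs that appears later in the thread - substitute your own):

```shell
# Local listening check (what "netstat shows NFS listening" refers to):
# filter for the standard NFS port 2049.
if command -v ss >/dev/null 2>&1; then
  ss -tln | awk '$4 ~ /:2049$/ {print "listening on", $4}'
fi

# Port probe from this host (repeat from another host on the network);
# the IP here is an example taken from this thread.
if command -v nmap >/dev/null 2>&1; then
  nmap -p 2049 172.30.0.172
fi
```

A port that is "listening" locally but "filtered" from nmap usually points at a firewall or at the service binding an IP that is not actually reachable.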

I have no more ideas.
The logs show that the container is created, then restarted/recreated again after 3 minutes.

I would really appreciate urgent help.

27/05/2022 11:12:59 INFO     LockBase : dropping old sessions
27/05/2022 11:13:09 INFO     LockBase : successfully dropped old sessions
27/05/2022 11:13:09 INFO     Clean all old local resources.
27/05/2022 11:13:11 INFO     NFSServer : sync Consul settings
27/05/2022 11:13:11 INFO     NFSServer : sync Consul settings -> done
27/05/2022 11:13:11 INFO     LockBase : try to acquire the resource = NFS-172-30-0-172
27/05/2022 11:13:12 INFO     LockBase : succeeded on acquiring the resource = NFS-172-30-0-172
27/05/2022 11:13:13 INFO     NFSServer : waiting for the container NFS-172-30-0-172 to be up.
27/05/2022 11:13:13 INFO     Starting NFS Exports Service
27/05/2022 11:13:13 INFO     Container Manager : creating NFS-172-30-0-172 container
27/05/2022 11:13:17 INFO     NFSServer : sync Consul settings
27/05/2022 11:13:17 INFO     NFSServer : sync Consul settings -> done
27/05/2022 11:14:27 INFO     Stopping NFS Exports Service
27/05/2022 11:15:36 INFO     NFSServer : sync Consul settings
27/05/2022 11:15:36 INFO     NFSServer : sync Consul settings -> done
27/05/2022 11:15:37 INFO     LockBase : try to acquire the resource = NFS-172-30-0-172
27/05/2022 11:15:37 INFO     LockBase : succeeded on acquiring the resource = NFS-172-30-0-172
27/05/2022 11:15:38 INFO     NFSServer : waiting for the container NFS-172-30-0-172 to be up.
27/05/2022 11:15:38 INFO     Starting NFS Exports Service
27/05/2022 11:15:38 INFO     Container Manager : creating NFS-172-30-0-172 container
27/05/2022 11:15:42 INFO     NFSServer : sync Consul settings
27/05/2022 11:15:42 INFO     NFSServer : sync Consul settings -> done
[...]

Can you perform another online update now? It will update to 3.1 - see if this solves the issue.

Do you have any access restrictions on your shares via the UI?

Does the NFS status page show that the services are up and serving the public IPs correctly?

Today I tried to run the update, but it said:

You have the newest version (3.0.3)

When I tried just now, it started updating. I will let you know.

After the upgrade to 3.1.0, the ports are still closed/filtered.

I re-enabled the NFS service on every node - without success.

NFS Status:

After a long time, all hosts are "UP."
NFS ports have started to be visible and are open.

Unfortunately, no data is available.

Does deleting a line from "NFS Exports" cause data loss?
In my opinion it shouldn't, because it's just an export of a particular filesystem.

After some time, NFS Status shows the addresses UP,
but the ports are closed/filtered.

Docker recreates the containers one by one in a loop.

In the logs I can see:

May 27 13:59:18 ceph03 nfs_server.py[202292]: arping: bind: Cannot assign requested address
May 27 13:59:21 ceph03 nfs_server.py[202309]: arping: bind: Cannot assign requested address
May 27 13:59:23 ceph03 nfs_server.py[202326]: arping: bind: Cannot assign requested address
May 27 13:59:26 ceph03 nfs_server.py[202349]: arping: bind: Cannot assign requested address

Can you double-check that your interfaces are detected and named correctly, and that their IPs are assigned to them correctly, via:

/opt/petasan/scripts/detect-interfaces.sh
ip addr
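The arping errors above ("Cannot assign requested address") typically mean the source IP is not present on any local interface. A small helper to cross-check this - hypothetical, not part of PetaSAN tooling; the IP list is taken from this thread:

```shell
# Check that each given IP is assigned to some local interface.
check_ips() {
  local rc=0 ip
  for ip in "$@"; do
    if ip -o addr show | grep -qw "$ip"; then
      echo "OK: $ip is assigned locally"
    else
      echo "MISSING: $ip is not on any interface"
      rc=1
    fi
  done
  return $rc
}

# Public NFS IPs as mentioned in this thread:
check_ips 172.30.0.171 172.30.0.172 172.30.0.173 || true
```

Note that the floating NFS IPs move between nodes, so on any single node only a subset is expected to be present at a given moment.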

The main issue seems to be that the database for multi-server NFS has flagged a "grace" period. This should last only 90 seconds unless there are continuous/frequent errors. Are the services all up now with no grace period showing?


Not sure what you mean by "Does deleting a line from 'NFS Exports' cause data loss?"

If you cannot see data from the clients, you should be able to see it from any NFS server; we create a path in

/mnt/filesystem_name/layout_name/nfs/share_name

Or do you mean you deleted the share?
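As a quick check, listing that path from any server node might look like this (a sketch; the path components are the placeholders quoted above, so substitute your actual filesystem, layout, and share names):

```shell
# Placeholder path from the reply above -- replace each component
# with your real names before running.
share_path=/mnt/filesystem_name/layout_name/nfs/share_name
if [ -d "$share_path" ]; then
  ls -la "$share_path"
else
  echo "share path not found: $share_path"
fi
```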

/opt/petasan/scripts/detect-interfaces.sh
device=eth0,mac=78:2b:cb:4f:73:5b,pci=01:00.0,model=Broadcom Inc. and subsidiaries NetXtreme II BCM5716 Gigabit Ethernet (PowerEdge R510 BCM5716 Gigabit Ethernet),path=enp1s0f0
device=eth1,mac=78:2b:cb:4f:73:5c,pci=01:00.1,model=Broadcom Inc. and subsidiaries NetXtreme II BCM5716 Gigabit Ethernet (PowerEdge R510 BCM5716 Gigabit Ethernet),path=enp1s0f1
device=eth2,mac=00:1b:21:ec:ff:fd,pci=03:00.0,model=Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (Ethernet Server Adapter X520-1),path=enp3s0

4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
inet 172.30.0.41/24 scope global eth2
inet 10.0.1.41/24 scope global eth2
inet 172.30.0.172/24 scope global secondary eth2

I had a grace period between my posts above, from 11:00 to 11:35. Now all services are UP.

It is totally random right now: it works for 10 minutes, then doesn't work for 15 minutes.

NMAP:
2049/tcp open  nfs
Nmap scan report for 172.30.0.172

2049/tcp closed nfs
Nmap scan report for 172.30.0.173

2049/tcp open  nfs
Nmap scan report for 172.30.0.171


Data:

I can't see any data in the mentioned path; it's empty.

At the URL http://ceph01/manage_nfs/exports I removed the share and added it again.

I expected it to work like the Linux /etc/exports file - connecting and disconnecting an existing share, not deleting data.

Additionally, I found errors in syslog:


May 27 14:38:12 ceph03 nfs_server.py[243681]: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
May 27 14:38:12 ceph03 nfs_server.py[243681]: @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
May 27 14:38:12 ceph03 nfs_server.py[243681]: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
May 27 14:38:12 ceph03 nfs_server.py[243681]: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
May 27 14:38:12 ceph03 nfs_server.py[243681]: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
May 27 14:38:12 ceph03 nfs_server.py[243681]: It is also possible that a host key has just been changed.
May 27 14:38:12 ceph03 nfs_server.py[243681]: The fingerprint for the ECDSA key sent by the remote host is
May 27 14:38:12 ceph03 nfs_server.py[243681]: SHA256:bPYsBdagaJNbG5UxF9KkvBa1Xbxc0dqNrL4zaVQyp9E.
May 27 14:38:12 ceph03 nfs_server.py[243681]: Please contact your system administrator.
May 27 14:38:12 ceph03 nfs_server.py[243681]: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
May 27 14:38:12 ceph03 nfs_server.py[243681]: Offending ECDSA key in /root/.ssh/known_hosts:8
May 27 14:38:12 ceph03 nfs_server.py[243681]:   remove with:
May 27 14:38:12 ceph03 nfs_server.py[243681]:   ssh-keygen -f "/root/.ssh/known_hosts" -R "ceph02"
May 27 14:38:12 ceph03 nfs_server.py[243681]: Password authentication is disabled to avoid man-in-the-middle attacks.
May 27 14:38:12 ceph03 nfs_server.py[243681]: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.

It looks like the container cannot make an SSH connection to the host.
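If stale host keys are the cause, the cleanup the warning itself suggests could be scripted roughly like this (the `clear_stale_keys` helper and the node names are assumptions based on this thread, not PetaSAN tooling):

```shell
# Hypothetical helper: remove stale host keys for the given hosts
# from a known_hosts file, using the exact ssh-keygen fix shown in
# the log above.
clear_stale_keys() {
  kh=$1; shift
  for host in "$@"; do
    ssh-keygen -f "$kh" -R "$host" >/dev/null 2>&1 || true
  done
}

# Node names taken from this thread; run on each node so every node
# re-learns its peers' current keys on the next connection.
clear_stale_keys /root/.ssh/known_hosts ceph01 ceph02 ceph03
```

The keys likely changed because the 3.0.3/3.1.0 update regenerated host keys, so every node (and container) that cached the old keys now refuses to connect.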

Can you double-check that the IPs are assigned correctly on the interfaces?

When we add a share we create an internal directory at the path specified above; when we delete the share, this directory is deleted.

My understanding is that it is still not running correctly - it is up for 10 min, then down again for a similar time. I understand this is what you see on the NFS Status page. Do you see services going up/down on that page and IPs moving among the servers?

Can you please stop the service on all nodes:

systemctl stop petasan-nfs

then start it on just 1 node:

systemctl start petasan-nfs

Wait 2 min - is its status up and stable, or does it go up and down? If the latter, can you please post the relevant section of /opt/petasan/log/PetaSAN.log from that node?
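The procedure above could be wrapped in a small script like this - a sketch only; the node names and the dry-run wrapper are assumptions for illustration, while the petasan-nfs unit name is the one given in this thread:

```shell
# DRY_RUN=1 just prints the commands; set DRY_RUN=0 to actually run them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# Stop the NFS service on all nodes (node names assumed from the thread):
for node in ceph01 ceph02 ceph03; do
  run ssh "$node" systemctl stop petasan-nfs
done

# Then start it on just one node and watch the status for ~2 minutes:
run systemctl start petasan-nfs
```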

The IP addresses are correctly assigned to the interfaces (as I showed above); the "NFS Status" page shows the same status, and NMAP reflects the status from the page and from the interfaces.

In NFS Service Status, the status is UP, but the IP addresses are floating around the servers (in groups of 2 per server, one per server, etc.).

I'll do the test with the logs in a bit.

FYI:
the service has 2 unit names:
petasan-nfs-server.service
petasan-nfs-exports@NFS-172-xx-0-173.service

I chose the first one.

The service is UP, the port is closed.
The logs are here:
https://pastebin.com/h8yeSNn6

