From the admin web application, add & remove OSD's?
admin
2,930 Posts
June 14, 2018, 4:00 pm
It does seem /etc/hosts got corrupted due to a consul connection failure while joining the node; this is why the physical disk list does not open. To fix the hosts file manually:
# stop auto sync service
systemctl stop petasan-file-sync
# manual fix hosts file
nano /etc/hosts
# sync the hosts file to all nodes:
/opt/petasan/scripts/util/sync_file.py /etc/hosts
# restart the sync service on current node
systemctl start petasan-file-sync
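Before re-syncing, it can be worth sanity-checking the repaired file so a malformed line is not propagated to every node. A minimal sketch; the `validate_hosts` function and the sample file below are illustrative, not part of PetaSAN:

```shell
#!/bin/sh
# Validate that every non-comment line of a hosts file starts with an
# IPv4 address followed by at least one hostname.
validate_hosts() {
    awk '
        /^[[:space:]]*(#|$)/ { next }        # skip comments and blank lines
        $1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { bad++ }
        NF < 2 { bad++ }                     # address with no hostname
        END { exit bad ? 1 : 0 }
    ' "$1"
}

# Example: check a small sample file (hostnames are hypothetical).
cat > /tmp/hosts.sample <<'EOF'
127.0.0.1   localhost
10.0.1.11   petasan-node1
10.0.1.12   petasan-node2
EOF

if validate_hosts /tmp/hosts.sample; then
    echo "hosts file looks OK"
else
    echo "hosts file has malformed lines"
fi
```

On a live node you would run `validate_hosts /etc/hosts` after editing and before calling `sync_file.py`.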
The root cause still needs to be fixed. I would keep an eye on the log file for consul connection errors; if they persist, the system is not stable.
The most likely cause is a flaky network, or the system could be underpowered: under load (client I/O, recovery, scrub) it may slow to the point where it cannot connect to the cluster. Observe your %disk busy, CPU, and RAM during/after the initial failure and see whether those resources were maxed out. Do you have enough RAM?
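A quick way to keep an eye on recurring errors is to grep the log periodically. The path `/opt/petasan/log/PetaSAN.log` is an assumption based on a typical install (adjust to your setup), and the sample log lines below are fabricated for illustration:

```shell
#!/bin/sh
# Count consul connection errors in a log file.
# On a live node you would point this at /opt/petasan/log/PetaSAN.log
# (path assumed; adjust if your install differs).
count_consul_errors() {
    grep -ci 'consul.*connection' "$1" || true
}

# Sample log lines (fabricated for illustration).
cat > /tmp/petasan.log.sample <<'EOF'
14/06/2018 15:58:01 ERROR Consul connection refused
14/06/2018 15:58:05 INFO  Starting node join
14/06/2018 15:58:09 ERROR Consul connection timeout
EOF

count_consul_errors /tmp/petasan.log.sample   # prints 2
```

A count that keeps growing after the node has joined is a sign the underlying network or resource problem is still there.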
Last edited on June 14, 2018, 4:02 pm by admin · #11
Ste
125 Posts
June 15, 2018, 10:18 am
OK, it worked, thanks.
I agree that the system could be underpowered: two nodes have 4 MB of RAM and one node has only 100 Mbit Ethernet ports. In fact, quite often during recovery I get emails from the cluster saying some OSDs or a node are down, while they are actually still working. Anyway, in this test cluster this is not a concern.
Bye, S.
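When the cluster reports OSDs down that are actually alive, heartbeats may simply be timing out on slow hardware; cross-checking with `ceph osd tree` on a node shows what the monitors currently believe. A sketch that extracts the OSDs marked down from that command's output; the `list_down_osds` helper is hypothetical and the sample output below is fabricated, so column positions may need adjusting for your Ceph version:

```shell
#!/bin/sh
# List OSD ids reported "down" in `ceph osd tree` output.
# On a live node: ceph osd tree | list_down_osds
list_down_osds() {
    awk '$4 == "down" && $3 ~ /^osd\./ { print $3 }'
}

# Fabricated sample of `ceph osd tree` output for illustration.
cat > /tmp/osd_tree.sample <<'EOF'
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT
-1 0.05698 root default
-2 0.02849     host node1
 0 0.01425         osd.0  up      1.00000
 1 0.01425         osd.1  down    1.00000
EOF

list_down_osds < /tmp/osd_tree.sample   # prints osd.1
```

If an OSD listed as down is still running (`systemctl status` on its node), the flapping is more likely heartbeat timeouts than real failures.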