From the admin web application, add & remove OSD's?
admin
2,930 Posts
June 14, 2018, 4:00 pm
It does seem /etc/hosts got corrupted due to a consul connection failure while joining the node; this is why the physical disk list does not open. To fix the hosts file manually:
# stop auto sync service
systemctl stop petasan-file-sync
# manual fix hosts file
nano /etc/hosts
# sync the hosts file to all nodes:
/opt/petasan/scripts/util/sync_file.py /etc/hosts
# restart the sync service on current node
systemctl start petasan-file-sync
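Before re-syncing, it can be worth sanity-checking the repaired file so a malformed line is not propagated to every node. A minimal sketch; the `validate_hosts` function and the sample file below are illustrative, not part of PetaSAN:

```shell
#!/bin/sh
# Validate that every non-comment line of a hosts file starts with an
# IPv4 address followed by at least one hostname.
validate_hosts() {
    awk '
        /^[[:space:]]*(#|$)/ { next }        # skip comments and blank lines
        $1 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { bad++ }
        NF < 2 { bad++ }                     # address with no hostname
        END { exit bad ? 1 : 0 }
    ' "$1"
}

# Example: check a small sample file (hostnames are hypothetical).
cat > /tmp/hosts.sample <<'EOF'
127.0.0.1   localhost
10.0.1.11   petasan-node1
10.0.1.12   petasan-node2
EOF

if validate_hosts /tmp/hosts.sample; then
    echo "hosts file looks OK"
else
    echo "hosts file has malformed lines"
fi
```

On a live node you would run `validate_hosts /etc/hosts` after editing and before calling `sync_file.py`.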
The root cause still needs to be fixed. I would keep an eye on the log file for consul connection errors; if they persist, the system is not stable.
The most likely cause is a flaky network, or the system could be underpowered: under load (client I/O, recovery, scrub) it may slow to the point where it cannot connect to the cluster. Observe your %disk busy, CPU, and RAM during/after the initial failure and see whether those resources were maxed out. Do you have enough RAM?
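A quick way to keep an eye on recurring errors is to grep the log periodically. The path `/opt/petasan/log/PetaSAN.log` is an assumption based on a typical install (adjust to your setup), and the sample log lines below are fabricated for illustration:

```shell
#!/bin/sh
# Count consul connection errors in a log file.
# On a live node you would point this at /opt/petasan/log/PetaSAN.log
# (path assumed; adjust if your install differs).
count_consul_errors() {
    grep -ci 'consul.*connection' "$1" || true
}

# Sample log lines (fabricated for illustration).
cat > /tmp/petasan.log.sample <<'EOF'
14/06/2018 15:58:01 ERROR Consul connection refused
14/06/2018 15:58:05 INFO  Starting node join
14/06/2018 15:58:09 ERROR Consul connection timeout
EOF

count_consul_errors /tmp/petasan.log.sample   # prints 2
```

A count that keeps growing after the node has joined is a sign the underlying network or resource problem is still there.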
Last edited on June 14, 2018, 4:02 pm by admin · #11
Ste
125 Posts
June 15, 2018, 10:18 am
OK, it worked, thanks.
I agree that the system could be underpowered: two nodes have 4 MB of RAM and one node has only 100 Mbit Ethernet ports. In fact, quite often during recovery I get emails from the cluster saying some OSDs or a node are down, while they are actually still working. Anyway, in this test cluster this is not a concern.
Bye, S.
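When the cluster reports OSDs down that are actually alive, heartbeats may simply be timing out on slow hardware; cross-checking with `ceph osd tree` on a node shows what the monitors currently believe. A sketch that extracts the OSDs marked down from that command's output; the `list_down_osds` helper is hypothetical and the sample output below is fabricated, so column positions may need adjusting for your Ceph version:

```shell
#!/bin/sh
# List OSD ids reported "down" in `ceph osd tree` output.
# On a live node: ceph osd tree | list_down_osds
list_down_osds() {
    awk '$4 == "down" && $3 ~ /^osd\./ { print $3 }'
}

# Fabricated sample of `ceph osd tree` output for illustration.
cat > /tmp/osd_tree.sample <<'EOF'
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT
-1 0.05698 root default
-2 0.02849     host node1
 0 0.01425         osd.0  up      1.00000
 1 0.01425         osd.1  down    1.00000
EOF

list_down_osds < /tmp/osd_tree.sample   # prints osd.1
```

If an OSD listed as down is still running (`systemctl status` on its node), the flapping is more likely heartbeat timeouts than real failures.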