504 Gateway Time-out on every node, 100% disk usage, I believe on a journal drive
moose999
9 Posts
February 23, 2023, 5:21 pm
Hi,
I am having a big problem. My 3 nodes are inaccessible, with 504 errors in the browser and "Windows cannot access" over SMB.
I can SSH into each node. When I do
systemctl status ceph-mgr@CEPH01
I get
ceph-mgr@CEPH01.service: Failed with result 'exit-code'
on one node, and
Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; disabled; vendor preset: enabled)
Active: inactive (dead)
on the others.
I tried /opt/petasan/scripts/online-updates/update.sh
which worked on 2 nodes, but on the 3rd I got a disk space error.
I have 100% disk usage, I believe on a journal drive:
Filesystem 1K-blocks Used Available Use% Mounted on
udev 65721144 0 65721144 0% /dev
tmpfs 13150028 133264 13016764 2% /run
/dev/sda3 15416264 15399880 0 100% /
tmpfs 65750132 0 65750132 0% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 65750132 0 65750132 0% /sys/fs/cgroup
/dev/sda4 30832548 174804 30641360 1% /var/lib/ceph
/dev/sda5 183126020 1136504 181973132 1% /opt/petasan/config
/dev/sda2 129039 260 128780 1% /boot/efi
tmpfs 65750132 24 65750108 1% /var/lib/ceph/osd/ceph-0
tmpfs 65750132 24 65750108 1% /var/lib/ceph/osd/ceph-1
tmpfs 65750132 24 65750108 1% /var/lib/ceph/osd/ceph-8
Is the disk space problem on one node related to the web UI problems on all the nodes? How can I regain access to the data?
Many thanks!
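To dig into why a ceph-mgr unit failed with 'exit-code', the systemd journal usually records the reason (a minimal sketch; the unit name CEPH01 is taken from the post above):
journalctl -u ceph-mgr@CEPH01 -n 50 --no-pager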
admin
2,930 Posts
February 23, 2023, 5:41 pm
Look at what is taking up space and clean it; it could be log files:
du -hd 1 /var/log
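If large files under /var/log turn out to be the culprit, truncating them in place frees space without disturbing processes that still hold the files open (a sketch, assuming coreutils and systemd-journald; adjust the size threshold as needed):
find /var/log -type f -size +100M -exec truncate -s 0 {} \;
journalctl --vacuum-size=200M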
moose999
9 Posts
February 24, 2023, 5:26 pm
Yes, the disk space problem was log files; by cleaning some out I now have free disk space.
I can now run /opt/petasan/scripts/online-updates/update.sh on every node.
systemctl status ceph-mgr@CEPH01
shows
Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; disabled; vendor preset: enabled)
Active: inactive (dead)
on every node.
However, I still cannot log into the web UI on any node, as I get a 504 error, and the SMB share is also still not working.
Please can you suggest something else I can try?
Many thanks!
admin
2,930 Posts
February 28, 2023, 8:49 pm
What is the status of Ceph?
ceph status
And the status of CIFS?
ctdb status
Do you see any errors in /opt/petasan/log/PetaSAN.log?
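One quick way to surface recent errors in that log (a sketch, using the path quoted above):
grep -iE 'error|fail' /opt/petasan/log/PetaSAN.log | tail -n 50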
moose999
9 Posts
March 1, 2023, 5:13 pm
root@san01:~# ceph status
2023-03-01T17:09:02.922+0000 7f87455c3700 0 monclient(hunting): authenticate timed out after 300
root@san03:~# ctdb status
connect() failed, errno=2
Failed to connect to CTDB daemon (/var/run/ctdb/ctdbd.socket)
Failed to read nodes file "/etc/ctdb/nodes"
Is this node part of CTDB cluster?
And in the log, this error appears repeatedly:
CIFS init shared filesystem not mounted
Many thanks!
admin
2,930 Posts
March 1, 2023, 10:30 pm
Ceph, Samba, and Gluster are all down. Typically, when unrelated systems are down at the same time, it indicates something in your environment, such as network configuration or hardware.
moose999
9 Posts
March 2, 2023, 2:09 pm
Ah, I see.
I have tested networking by ensuring every machine can ping both interfaces on every other machine, and all seems well there.
I have checked the SMART status of every drive, and all self-assessment tests pass.
Are there commands to manually start Ceph, Samba, or Gluster?
If I am forced to reinstall PetaSAN, will I be able to retrieve the data?
Many thanks!
admin
2,930 Posts
March 2, 2023, 5:25 pm
Do not re-install, since you need your data.
I would focus on starting Ceph first; if you fix the cause, the other services will probably also start. Look at why you cannot talk to the Ceph monitors: either the monitors do not start, or they cannot communicate with one another. Try to get the monitors to start and communicate, and look at the monitor logs in /var/log/ceph.
Also, are you sure the nodes can ping each other on the backend interface?
You can contact us for support if you need. Good luck.
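A minimal sketch for checking and starting a monitor on one node (Ceph monitor units are named after the node's hostname; san01 here is an assumption based on the shell prompts earlier in the thread):
systemctl status ceph-mon@san01
systemctl start ceph-mon@san01
tail -n 50 /var/log/ceph/ceph-mon.san01.log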
moose999
9 Posts
March 2, 2023, 8:03 pm
Yes, my backend is 10.0.0.1-3, and all nodes can ping all addresses.
ceph status gives me:
monclient(hunting): authenticate timed out after 300
How close do the clocks need to be? I notice one of my nodes is ~13 seconds behind.
/var/log/ceph/ceph-mon.hostname.log only has entries relating to the cluster creation in early January.
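For reference, Ceph monitors are sensitive to clock skew: the default mon_clock_drift_allowed is 0.05 seconds, so a ~13 second offset is far outside tolerance. A minimal sketch for checking and stepping the clock, assuming chrony is the time daemon (an assumption; the node may run a different NTP client):
timedatectl status
chronyc makestep
chronyc tracking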