
504 Gateway Time-out on every node, 100% disk usage, I believe on a journal drive


Hi,

I am having a big problem. My 3 nodes are inaccessible, with 504 errors in the browser, and "Windows cannot access" over SMB.

I can SSH into each node. When I do

systemctl status ceph-mgr@CEPH01

I get

ceph-mgr@CEPH01.service: Failed with result 'exit-code'

on one node, and

Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; disabled; vendor preset: enabled)
Active: inactive (dead)

on the others.

I tried /opt/petasan/scripts/online-updates/update.sh

which worked on 2 nodes, but on the 3rd I got a disk space error.

I have 100% disk usage, I believe on a journal drive:

Filesystem     1K-blocks     Used Available Use% Mounted on
udev            65721144        0  65721144   0% /dev
tmpfs           13150028   133264  13016764   2% /run
/dev/sda3       15416264 15399880         0 100% /
tmpfs           65750132        0  65750132   0% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs           65750132        0  65750132   0% /sys/fs/cgroup
/dev/sda4       30832548   174804  30641360   1% /var/lib/ceph
/dev/sda5      183126020  1136504 181973132   1% /opt/petasan/config
/dev/sda2         129039      260    128780   1% /boot/efi
tmpfs           65750132       24  65750108   1% /var/lib/ceph/osd/ceph-0
tmpfs           65750132       24  65750108   1% /var/lib/ceph/osd/ceph-1
tmpfs           65750132       24  65750108   1% /var/lib/ceph/osd/ceph-8

Is the disk space problem on one node related to the web UI problems on all the nodes? How can I regain access to the data?

Many thanks!

look at what is taking up space and clean it, it could be log files:

du -hd 1  /var/log
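
If it does turn out to be logs, a minimal cleanup sketch (the paths are only examples, delete only what du actually shows as large):

# find the biggest files on the full root filesystem
du -ahx / 2>/dev/null | sort -rh | head -n 20

# old rotated/compressed logs are safe to remove
find /var/log -name "*.gz" -delete

# empty a log that is still being written to, without breaking the writer
truncate -s 0 /var/log/syslog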
 

Yes, the disk space problem was log files, and by cleaning some out I have disk space free.

I can now run /opt/petasan/scripts/online-updates/update.sh on every node.

systemctl status ceph-mgr@CEPH01

shows

Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; disabled; vendor preset: enabled)
Active: inactive (dead)

on every node.

 

However, I still cannot log into the web UI on any node, as I get a 504 error, and the SMB share is also still not working.

Please can you suggest something else I can try?

Many thanks!

Bump 😊😊

what is the status of ceph

ceph status

status of cifs

ctdb status

do you see any errors in /opt/petasan/log/PetaSAN.log?
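
For example, something like this will surface recent problems (a sketch; the exact messages will differ):

# recent activity and any obvious errors in the PetaSAN log
tail -n 100 /opt/petasan/log/PetaSAN.log
grep -iE "error|fail|exception" /opt/petasan/log/PetaSAN.log | tail -n 50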

root@san01:~# ceph status
2023-03-01T17:09:02.922+0000 7f87455c3700  0 monclient(hunting): authenticate timed out after 300

 

 

root@san03:~# ctdb status
connect() failed, errno=2
Failed to connect to CTDB daemon (/var/run/ctdb/ctdbd.socket)
Failed to read nodes file "/etc/ctdb/nodes"
Is this node part of CTDB cluster?

 

 

And in the log, there is repeatedly the error:

CIFS init shared filesystem not mounted

Many thanks!

Ceph, Samba, and Gluster are all down. Typically, when unrelated systems are down at the same time, it indicates something in your environment, like the network configuration or hardware.

Ah I see.

I have tested networking by ensuring every machine can ping both interfaces on every other machine, and all seems well there.

I have checked the SMART status of every drive, and all self-assessment tests pass.

Are there commands to manually start Ceph, Samba or Gluster?

If I am forced to reinstall petasan, will I be able to retrieve the data?

Many thanks!

Do not re-install since you need your data.

I would focus on starting Ceph first; if you fix the cause, the other services will probably also start. Look at why you cannot talk to the Ceph monitors: either the monitors do not start, or they cannot communicate with one another. Try to get the monitors to start and communicate, and look at the monitor logs in /var/log/ceph.
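
A rough sketch of what to check on each node (the monitor unit and log names assume the monitor is named after the hostname, as with the ceph-mgr@CEPH01 unit earlier in the thread):

# is the monitor running? try to start it and watch its log
systemctl status ceph-mon@CEPH01
systemctl start ceph-mon@CEPH01
tail -f /var/log/ceph/ceph-mon.CEPH01.log

# confirm the monitor is actually listening (msgr2 on 3300, legacy on 6789)
ss -tlnp | grep -E '3300|6789'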

Also, are you sure the nodes can ping each other on the backend interface?

You can contact us for support if you need. Good luck.

yes, my backend is 10.0.0.1-3, and all nodes can ping all addresses
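
Something along these lines, run from every node (a sketch, using the backend addresses above):

for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
    ping -c 2 -W 1 "$ip" >/dev/null && echo "$ip OK" || echo "$ip FAILED"
done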

ceph status gives me:

monclient(hunting): authenticate timed out after 300

How close do the clocks need to be? I notice one of my nodes is ~13 seconds behind.

/var/log/ceph/ceph-mon.hostname.log only has entries relating to the cluster creation in early January.
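
For what it's worth, a quick way to compare the clocks directly from one node (a sketch; the hostnames are assumptions based on the prompts above, and root SSH between the nodes is assumed):

for host in san01 san02 san03; do
    echo -n "$host: "; ssh "$host" date -u +%s
done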
