
504 Gateway Time-out on every node, 100% disk usage, I believe on a journal drive


Hi,

I am having a big problem. My 3 nodes are inaccessible, with 504 errors in the browser, and "Windows cannot access" over SMB.

I can SSH into each node. When I do

systemctl status ceph-mgr@CEPH01

I get

ceph-mgr@CEPH01.service: Failed with result 'exit-code'

on one node, and

Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; disabled; vendor preset: enabled)
Active: inactive (dead)

on the others.

I tried /opt/petasan/scripts/online-updates/update.sh

which worked on 2 nodes, but on the 3rd I got a disk space error.

I have 100% disk usage, I believe on a journal drive:

Filesystem     1K-blocks     Used Available Use% Mounted on
udev            65721144        0  65721144   0% /dev
tmpfs           13150028   133264  13016764   2% /run
/dev/sda3       15416264 15399880         0 100% /
tmpfs           65750132        0  65750132   0% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs           65750132        0  65750132   0% /sys/fs/cgroup
/dev/sda4       30832548   174804  30641360   1% /var/lib/ceph
/dev/sda5      183126020  1136504 181973132   1% /opt/petasan/config
/dev/sda2         129039      260    128780   1% /boot/efi
tmpfs           65750132       24  65750108   1% /var/lib/ceph/osd/ceph-0
tmpfs           65750132       24  65750108   1% /var/lib/ceph/osd/ceph-1
tmpfs           65750132       24  65750108   1% /var/lib/ceph/osd/ceph-8

Is the disk space problem on one node related to the web UI problems on all the nodes? How can I regain access to the data?

Many thanks!

look at what is taking up space and clean it, it could be log files:

du -hd 1  /var/log
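
If it does turn out to be logs, a minimal cleanup sketch (the paths are only examples, delete only what du actually shows as large):

# find the biggest files on the full root filesystem
du -ahx / 2>/dev/null | sort -rh | head -n 20

# old rotated/compressed logs are safe to remove
find /var/log -name "*.gz" -delete

# empty a log that is still being written to, without breaking the writer
truncate -s 0 /var/log/syslog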
 

Yes, the disk space problem was log files, and by cleaning some out I have disk space free.

I can now run /opt/petasan/scripts/online-updates/update.sh on every node.

systemctl status ceph-mgr@CEPH01

shows

Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; disabled; vendor preset: enabled)
Active: inactive (dead)

on every node.

 

However, I still cannot log into the web UI on any node, as I get a 504 error, and the SMB share is also still not working.

Please can you suggest something else I can try?

Many thanks!

Bump 😊😊

what is the status of ceph

ceph status

status of cifs

ctdb status

do you see any errors in /opt/petasan/log/PetaSAN.log?
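
For example, something like this will surface recent problems (a sketch; the exact messages will differ):

# recent activity and any obvious errors in the PetaSAN log
tail -n 100 /opt/petasan/log/PetaSAN.log
grep -iE "error|fail|exception" /opt/petasan/log/PetaSAN.log | tail -n 50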

root@san01:~# ceph status
2023-03-01T17:09:02.922+0000 7f87455c3700  0 monclient(hunting): authenticate timed out after 300

 

 

root@san03:~# ctdb status
connect() failed, errno=2
Failed to connect to CTDB daemon (/var/run/ctdb/ctdbd.socket)
Failed to read nodes file "/etc/ctdb/nodes"
Is this node part of CTDB cluster?

 

 

And in the log, there is repeatedly the error:

CIFS init shared filesystem not mounted

Many thanks!

Ceph, Samba, and Gluster are all down. Typically, when unrelated systems are down at the same time, it indicates something in your environment, like the network configuration or hardware.

Ah I see.

I have tested networking by ensuring every machine can ping both interfaces on every other machine, and all seems well there.

I have checked the SMART status of every drive, and all self-assessment tests pass.

Are there commands to manually start Ceph, Samba or Gluster?

If I am forced to reinstall petasan, will I be able to retrieve the data?

Many thanks!

Do not re-install since you need your data.

I would focus on starting Ceph first; if you fix the cause, the other services will probably also start. Look at why you cannot talk to the Ceph monitors: either the monitors do not start, or they cannot communicate with one another. Try to get the monitors to start and communicate, and look at the monitor logs in /var/log/ceph.
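
A rough sketch of what to check on each node (the monitor unit and log names assume the monitor is named after the hostname, as with the ceph-mgr@CEPH01 unit earlier in the thread):

# is the monitor running? try to start it and watch its log
systemctl status ceph-mon@CEPH01
systemctl start ceph-mon@CEPH01
tail -f /var/log/ceph/ceph-mon.CEPH01.log

# confirm the monitor is actually listening (msgr2 on 3300, legacy on 6789)
ss -tlnp | grep -E '3300|6789'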

Also, are you sure the nodes can ping each other on the backend interface?

You can contact us for support if you need. Good luck.

yes, my backend is 10.0.0.1-3, and all nodes can ping all addresses
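
Something along these lines, run from every node (a sketch, using the backend addresses above):

for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
    ping -c 2 -W 1 "$ip" >/dev/null && echo "$ip OK" || echo "$ip FAILED"
done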

ceph status gives me:

monclient(hunting): authenticate timed out after 300

How close do the clocks need to be? I notice one of my nodes is ~13 seconds behind.

/var/log/ceph/ceph-mon.hostname.log only has entries relating to the cluster creation in early January.
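
For what it's worth, a quick way to compare the clocks directly from one node (a sketch; the hostnames are assumptions based on the prompts above, and root SSH between the nodes is assumed):

for host in san01 san02 san03; do
    echo -n "$host: "; ssh "$host" date -u +%s
done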
