
CEPH cluster down

Hello,
We had to physically move our Ceph cluster. I followed all the procedures (I thought) and shut all the nodes down cleanly at the same time. Upon starting them back up, I am now seeing:

HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 23 osds down; Reduced data availability: 1082 pgs inactive, 341 pgs down; Degraded data redundancy: 5202276/19234662 objects degraded (27.046%), 1223 pgs degraded, 2262 pgs undersized; 6 slow ops, oldest one blocked for 426 sec, mon.nc-san3 has slow ops

What should I do to recover from this? We shut down the 5x SANs all at once, moved them to the new facility, and started them back up. It does not seem to be getting better on its own.
I tried increasing the recovery speed options from the web interface.
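For reference, the state of the cluster can be inspected, and the same recovery-speed settings adjusted, from the command line on any node. The values below are illustrative examples, not recommendations, and the `ceph config set` syntax assumes a release with the centralized config store (Nautilus or later):

```shell
# See exactly which OSDs are down and which daemons have slow ops
ceph health detail

# Show only the down OSDs in the CRUSH tree, grouped by host
ceph osd tree down

# List the PGs that are stuck inactive
ceph pg dump_stuck inactive

# Example: temporarily raise recovery/backfill throughput
# (example values -- revert once the cluster is back to HEALTH_OK)
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 8
```

Raising backfill/recovery limits only helps once the down OSDs are actually back up; it will not fix PGs that are `down` because their OSDs are offline.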

I am still running it on Ubuntu 18.04 with PetaSAN 2.8.1.

It keeps showing varying numbers of OSDs down, PGs down, etc., but it does not seem to recover on its own.

It gets as low as 4 OSDs down and then the count starts climbing again. The PGs down/degraded/etc. keep changing but never trend in a consistently positive direction. Any tips?

I was able to resolve this. It seems a race condition had been created, and I got the cluster to recover simply by restarting the ceph-osd daemons.
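For anyone landing here with the same symptoms, the restart described above can be sketched as follows on each affected node (assuming systemd-managed OSDs, as on a PetaSAN/Ubuntu install; the OSD ID below is just an example):

```shell
# Restart a single flapping OSD daemon (replace 12 with the actual OSD ID)
sudo systemctl restart ceph-osd@12

# Or restart every OSD daemon on this node in one go
sudo systemctl restart ceph-osd.target

# Then watch the PG states drain back toward active+clean
ceph -s
```

Restarting one node's OSDs at a time keeps the impact bounded; wait for the down/degraded counts to settle before moving to the next node.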

This is a good example of the beauty of PetaSAN, and CEPH. We have been using CEPH since 2017 and have had server crashes, power failures, disk failures, and switch failures, and CEPH has never let me down. In all situations, CEPH started complaining and eventually fixed itself. Sure, I had to reboot a node sometimes, but that's it.