CEPH cluster down
evankmd82
11 Posts
August 19, 2024, 4:38 am
Hello,
We had to move our ceph cluster physically. I followed all procedures (I thought) and shut them down at the same time cleanly. Upon starting them up, I am now seeing:
HEALTH_WARN 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 23 osds down; Reduced data availability: 1082 pgs inactive, 341 pgs down; Degraded data redundancy: 5202276/19234662 objects degraded (27.046%), 1223 pgs degraded, 2262 pgs undersized; 6 slow ops, oldest one blocked for 426 sec, mon.nc-san3 has slow ops
What should I do to recover from this? We shut down the 5x SANs all at once, moved to a new facility, and started them back up. It does not seem to be getting better on its own.
I tried increasing the recovery speed options from the web interface.
I am running it on Ubuntu 18 and still using PetaSAN 2.8.1.
It keeps showing varying numbers of OSDs down, PGs down, etc., but it does not seem to be recovering on its own.
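For reference, this is roughly what the recovery throttles look like from the CLI on a Nautilus-era cluster like PetaSAN 2.8.x. It is only a sketch: whether these are the exact options the web interface adjusts is an assumption, and the values are examples, not recommendations.
# Check overall state and the per-warning breakdown first
ceph status
ceph health detail
# Example recovery/backfill throttles (values are illustrative only)
ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_max_active 4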
evankmd82
11 Posts
August 19, 2024, 5:06 am
It gets as low as 4 OSDs down and then the count starts climbing again. The PGs down/degraded/etc. keep changing but never move in a fully positive direction. Any tips?
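A rough sketch of what one might check while the down count is bouncing like this; osd.12 is a placeholder ID, not taken from this cluster.
# Which OSDs are currently marked down, and the up/in summary
ceph osd tree | grep down
ceph osd stat
# On the node that owns the flapping OSD, look at why the daemon dropped
journalctl -u ceph-osd@12 -n 100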
evankmd82
11 Posts
August 19, 2024, 6:15 am
I was able to resolve this. It seems a race condition was created, and I got the cluster to recover simply by restarting the ceph-osd daemons.
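For anyone landing here later, a sketch of what restarting the OSD daemons looks like from the shell on each node; these are not necessarily the exact commands used above, and the OSD ID 12 is a placeholder.
# Restart every OSD daemon on this node
systemctl restart ceph-osd.target
# Or restart a single OSD (placeholder ID)
systemctl restart ceph-osd@12
# Watch OSDs come back up and PGs return to active+clean
ceph -s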
X1M
5 Posts
November 9, 2024, 6:24 am
This is a good example of the beauty of PetaSAN and CEPH. We have been using CEPH since 2017 and have had servers crashing, power failures, disk failures, and switch failures, and CEPH has never let me down. In all situations, CEPH started complaining and eventually fixed itself. Sure, I had to reboot a node sometimes, and that's it.