
Crash after blackout

Hello everyone,

Tonight we had a bad blackout that outlasted the UPS battery, so my PetaSAN cluster went down hard. Now I have a lot of trouble bringing it back up, and the behaviour seems a bit random to me:

  • Random node reboots
  • Random OSDs disappearing
  • Charts randomly showing "No Datapoints"
  • PG Status never reaching 1000/1000

Even after many node reboots I can't bring the cluster back to a normal status. Is there something I can do to fix the problem? Thanks in advance...

Best regards.

Luca

Hi,

What is the output of:

ceph status --cluster CLUSTER_NAME

Do you get random node reboots, or do you get random shutdowns?

Are some of the OSDs always down, or do they go up and down (flap)?
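
To check this quickly, you can look at which OSDs are currently marked up or down and watch the cluster log for OSDs dropping out and coming back; for example (using the same CLUSTER_NAME placeholder as above):

# show the OSD tree with each OSD's up/down status
ceph osd tree --cluster CLUSTER_NAME

# watch cluster events live; flapping OSDs will repeatedly appear going down and coming back
ceph -w --cluster CLUSTER_NAME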

Hi,

  • The output of the ceph status command is:

cluster 92f9db61-fc7b-4327-ba71-1a5fb85ee1ca
     health HEALTH_ERR
            820 pgs are stuck inactive for more than 300 seconds
            180 pgs degraded
            820 pgs down
            820 pgs peering
            180 pgs stale
            820 pgs stuck inactive
            180 pgs stuck unclean
            149 pgs undersized
            3 requests are blocked > 32 sec
            recovery 4110/38066 objects degraded (10.797%)
            recovery 839/38066 objects misplaced (2.204%)
            too many PGs per OSD (606 > max 500)
     monmap e3: 3 mons at {ps-node-01=10.0.1.1:6789/0,ps-node-02=10.0.1.2:6789/0,ps-node-03=10.0.1.3:6789/0}
            election epoch 426, quorum 0,1,2 ps-node-01,ps-node-02,ps-node-03
     osdmap e1231: 6 osds: 3 up, 3 in; 180 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v137921: 1000 pgs, 1 pools, 75983 MB data, 19033 objects
            86908 MB used, 347 GB / 431 GB avail
            4110/38066 objects degraded (10.797%)
            839/38066 objects misplaced (2.204%)
                 820 down+peering
                 145 stale+active+undersized+degraded
                  31 stale+active+degraded
                   4 stale+active+undersized+degraded+remapped

  • I get random reboots and also random shutdowns.
  • The OSDs are not always down; they go up and down, and this varies from node to node.

Thanks again for your support.

Best regards.

Luca

In terms of priority we should:

  1. Fix random reboot issues
  2. Try to bring up all OSDs
  3. Fix PG stuck states

For the reboots, there is nothing in PetaSAN/Ceph itself that will perform a reboot, so this is strange; maybe there is a hardware issue. However, in many cases after an unclean crash the system will be busy trying to recover and check data consistency, and possibly the default Ceph recovery values put too much stress on your hardware. To reduce this load you can add the following configuration to /etc/ceph/CLUSTER_NAME.conf under the [global] section on all 3 nodes:

osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1
osd_client_op_priority = 63
osd_max_scrubs = 1
osd_scrub_during_recovery = false
osd_scrub_priority = 1

 

Then reboot all 3 nodes. Again, this is not a direct fix for the reboots, but it should put less strain on a recovering system. It would be helpful if you could watch the resource usage (CPU / network / disk busy %) by running the atop command. Also check for any kernel messages with:

dmesg | grep -i "error\|warn\|fail"

Also, can you tell me your current resources: RAM, NICs and their speeds?
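
As a side note, if you want the recovery settings above to take effect before the reboot, most of these osd_* values can also be injected into the OSDs that are currently running (the config file change is still needed so they survive the reboot). A minimal sketch, using two of the values from the list above:

# only reaches OSD daemons that are currently up
ceph --cluster CLUSTER_NAME tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'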

 

If the reboots and the possible system load issue get fixed, then we can look at bringing up the OSDs. You need to have at least 5 of the 6 OSDs up to recover data. The slightly good sign is that the OSDs are not always down but are flapping; this is more common under severe system load, where the OSD heartbeats are not getting through. Other indications you mentioned, such as the charts not showing data, also point to a resource issue. It is likely that after solving the reboots and system load the OSDs will come up by themselves; otherwise we will need to start them manually and look at their logs.
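
If it does come to starting them manually, a minimal sketch for one OSD (assuming systemd-managed OSD services and using osd.2 purely as an example id; PetaSAN's own service handling may differ slightly):

# try to start a single OSD daemon
systemctl start ceph-osd@2

# then check its log for the reason it is not staying up
tail -n 100 /var/log/ceph/CLUSTER_NAME-osd.2.log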

After fixing the OSDs, the stuck PGs should improve and in some cases be totally fixed. However, manual intervention may be required to fix consistency issues.
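
When we reach that stage, the stuck PGs can be listed and examined individually; a minimal sketch (the PG id 1.2f is only a placeholder, use the ids reported by the first command):

# list PGs that are stuck inactive
ceph pg dump_stuck inactive --cluster CLUSTER_NAME

# show the detailed state of a single PG
ceph pg 1.2f query --cluster CLUSTER_NAME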

Let me know how you progress and I will try to help as much as possible. Good luck.

Hello,
I've made the cluster config changes you suggested and then rebooted all the nodes. Now I see 2 of the 3 nodes up, and using the atop command on every node I see that disk utilization is very high (101% on the system disk sda on two nodes!) and free RAM is very low. My test specs are very poor: I'm running 3 nodes with only a 1 GbE NIC and 4 GB of RAM each.
Before sending this message I noticed that node 2 has only 2 (out of 3) OSDs up. Thanks again for the support...

Best regards!

Luca

It does look like there is a resource issue; if you can, increase your RAM to at least 8 GB and try again. The busy system disk is most likely caused by the RAM shortage (the system swapping to disk).
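
A quick way to confirm that on each node, using standard tools:

# memory and swap usage
free -h

# sample memory activity for 10 seconds; non-zero si/so columns mean the node is swapping to the system disk
vmstat 1 10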