
1 OSD Down and Health Warn.. proper steps

I am currently showing 1 OSD down and out for more than a day. If I use the interface to delete the disk, I get a warning that says: "Cluster health is not OK, OSD deletion may cause data loss. Are you sure you want to proceed?"

Output of ceph -s (pasted below) shows that the only health warning has to do with PG scrubs.
There are also misplaced objects (around 5%). They have been in that state since yesterday afternoon; the number fluctuates up and down by hundredths of a percent but hasn't dropped below 5% in that entire time.

Is it safe to delete that down disk and add a new one, or should I add a new one first and then delete it?

  cluster:
    id:     1da111ec-ffe8-4029-9834-e0988079925b
    health: HEALTH_WARN
            383 pgs not deep-scrubbed in time
            342 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum petasan3,petasan1,petasan2 (age 2w)
    mgr: petasan1(active, since 2w), standbys: petasan2, petasan3
    mds: cephfs:1 {0=petasan1=up:active} 2 up:standby
    osd: 64 osds: 63 up (since 26h), 63 in (since 46h); 145 remapped pgs

  task status:
    scrub status:
        mds.petasan1: idle

  data:
    pools:   3 pools, 3200 pgs
    objects: 26.29M objects, 99 TiB
    usage:   211 TiB used, 366 TiB / 577 TiB avail
    pgs:     2640099/52570180 objects misplaced (5.022%)
             3055 active+clean
             132  active+remapped+backfill_wait
             13   active+remapped+backfilling

  io:
    client:   80 MiB/s rd, 33 MiB/s wr, 773 op/s rd, 240 op/s wr
    recovery: 223 MiB/s, 55 objects/s
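
Before deleting the disk, you can also ask Ceph directly whether destroying that OSD risks any data loss. A minimal check from the command line, assuming you know the down OSD's id (osd.12 below is only a placeholder, not the real id on this cluster):

# list OSDs and find the one marked down/out
ceph osd tree | grep down

# Ceph answers "safe to destroy" only when every PG stored on this OSD
# already has enough healthy copies on other OSDs
ceph osd safe-to-destroy osd.12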

You can delete it and add a new one; this is better as it will re-use the same OSD id, which results in the same objects being mapped to it. It is also OK to add a new one first and then delete the old one, but that could result in more re-balance traffic, so it is doable but not best.
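
For reference, the delete-then-replace flow maps roughly onto the following plain Ceph commands, which free the old id for re-use; this is only a sketch, the exact steps the PetaSAN UI runs are not shown in this thread, and the osd id and device name are placeholders:

# mark the dead OSD as destroyed while keeping its id and CRUSH position reserved
ceph osd destroy 12 --yes-i-really-mean-it

# prepare the replacement disk and explicitly re-use the freed id
ceph-volume lvm create --osd-id 12 --data /dev/sdX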

Thanks, I removed the old disk and added a new one.

I actually added a new journal SSD device and a new SSD cache device, and when I added the new OSD I picked the new journal/cache disks manually.
However, this node still shows in the UI that there is no journal disk, only cache.
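
If you want to confirm from the command line which journal (DB/WAL) device the new OSD actually picked up, independent of what the UI shows, these standard Ceph commands report it; the osd id is a placeholder:

# run on the OSD node: lists each OSD's data device plus any block.db / block.wal device
ceph-volume lvm list

# run from any node: prints the device names recorded in the OSD's metadata
ceph osd metadata 12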


Should I be concerned about the HEALTH_WARN scrub/deep-scrub messages? Those numbers keep growing; they are now up to 400/342.

Not concerning; it could be affected by the recovery traffic from the failed OSD. When recovery is done, allow a week for a complete scrub cycle. If you still get "not scrubbed in time" warnings after that, you can increase the scrub speed from the Maintenance tab.
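
For anyone who prefers the command line, scrub behaviour is governed by standard Ceph OSD options that can be inspected and raised by hand; this is only a sketch of common knobs and not necessarily what the Maintenance tab changes:

# list exactly which PGs are behind on scrubbing / deep-scrubbing
ceph health detail

# allow more scrubs to run on each OSD at the same time
ceph config set osd osd_max_scrubs 2

# remove any per-chunk sleep so scrub work is throttled less
ceph config set osd osd_scrub_sleep 0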