
1 OSD Down and Health Warn.. proper steps

I am currently showing 1 OSD down and out for more than a day. If I use the interface to delete the disk, I get a warning that says: "Cluster health is not OK, OSD deletion may cause data loss. Are you sure you want to proceed?"

Output of ceph -s (pasted below) shows that the only health warning has to do with PG scrubs.
There are also misplaced objects (around 5%). They have been in that state since yesterday afternoon; the number fluctuates up and down by hundredths of a percent but hasn't dropped below 5% in that entire time.

Is it safe to delete that down disk and add a new one, or should I add a new one first and then delete it?

  cluster:
    id:     1da111ec-ffe8-4029-9834-e0988079925b
    health: HEALTH_WARN
            383 pgs not deep-scrubbed in time
            342 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum petasan3,petasan1,petasan2 (age 2w)
    mgr: petasan1(active, since 2w), standbys: petasan2, petasan3
    mds: cephfs:1 {0=petasan1=up:active} 2 up:standby
    osd: 64 osds: 63 up (since 26h), 63 in (since 46h); 145 remapped pgs

  task status:
    scrub status:
        mds.petasan1: idle

  data:
    pools:   3 pools, 3200 pgs
    objects: 26.29M objects, 99 TiB
    usage:   211 TiB used, 366 TiB / 577 TiB avail
    pgs:     2640099/52570180 objects misplaced (5.022%)
             3055 active+clean
             132  active+remapped+backfill_wait
             13   active+remapped+backfilling

  io:
    client:   80 MiB/s rd, 33 MiB/s wr, 773 op/s rd, 240 op/s wr
    recovery: 223 MiB/s, 55 objects/s
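
Before deleting the disk, you can also ask Ceph directly whether destroying that OSD risks any data loss. A minimal check from the command line, assuming you know the down OSD's id (osd.12 below is only a placeholder, not the real id on this cluster):

# list OSDs and find the one marked down/out
ceph osd tree | grep down

# Ceph answers "safe to destroy" only when every PG stored on this OSD
# already has enough healthy copies on other OSDs
ceph osd safe-to-destroy osd.12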

You can delete it and add a new one; this is better as it will re-use the same OSD id, which results in the same objects being mapped to it. It is also OK to add a new one first and then delete the old one, but that could result in more re-balance traffic, so it is doable but not best.
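
For reference, the delete-then-replace flow maps roughly onto the following plain Ceph commands, which free the old id for re-use; this is only a sketch, the exact steps the PetaSAN UI runs are not shown in this thread, and the osd id and device name are placeholders:

# mark the dead OSD as destroyed while keeping its id and CRUSH position reserved
ceph osd destroy 12 --yes-i-really-mean-it

# prepare the replacement disk and explicitly re-use the freed id
ceph-volume lvm create --osd-id 12 --data /dev/sdX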

Thanks, I removed the old disk and added a new one.

I actually added a new journal SSD device and a new SSD cache device, and when I added the new OSD I picked the new journal/cache disks manually.
However, this node still shows in the UI that there is no journal disk, only cache.
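
If you want to confirm from the command line which journal (DB/WAL) device the new OSD actually picked up, independent of what the UI shows, these standard Ceph commands report it; the osd id is a placeholder:

# run on the OSD node: lists each OSD's data device plus any block.db / block.wal device
ceph-volume lvm list

# run from any node: prints the device names recorded in the OSD's metadata
ceph osd metadata 12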


Should I be concerned about the HEALTH_WARN scrub/deep-scrub messages? Those numbers keep growing; they are now up to 400/342.

Not concerning; it could be affected by the recovery traffic from the failed OSD. When recovery is done, allow a week for a complete scrub cycle. If you still get "not scrubbed in time" warnings after that, you can increase the scrub speed from the Maintenance tab.
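
For anyone who prefers the command line, scrub behaviour is governed by standard Ceph OSD options that can be inspected and raised by hand; this is only a sketch of common knobs and not necessarily what the Maintenance tab changes:

# list exactly which PGs are behind on scrubbing / deep-scrubbing
ceph health detail

# allow more scrubs to run on each OSD at the same time
ceph config set osd osd_max_scrubs 2

# remove any per-chunk sleep so scrub work is throttled less
ceph config set osd osd_scrub_sleep 0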