1 OSD Down and Health Warn.. proper steps
neiltorda
98 Posts
January 17, 2021, 2:19 pm
I am currently showing 1 OSD down and out for more than a day. If I use the interface to delete the disk, I get a warning that says: "Cluster health is not OK, OSD deletion may cause data loss. Are you sure you want to proceed?"
The output of ceph -s (pasted below) shows that the only HEALTH_WARN items have to do with PG scrubs.
There are also misplaced objects (around 5%). It has been in that state since yesterday afternoon; the number fluctuates up and down by hundredths of a percent but hasn't dropped below 5% in all that time.
Is it safe to delete that down disk and add a new one? Or should I add a new one first and then delete it?
  cluster:
    id:     1da111ec-ffe8-4029-9834-e0988079925b
    health: HEALTH_WARN
            383 pgs not deep-scrubbed in time
            342 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum petasan3,petasan1,petasan2 (age 2w)
    mgr: petasan1(active, since 2w), standbys: petasan2, petasan3
    mds: cephfs:1 {0=petasan1=up:active} 2 up:standby
    osd: 64 osds: 63 up (since 26h), 63 in (since 46h); 145 remapped pgs

  task status:
    scrub status:
        mds.petasan1: idle

  data:
    pools:   3 pools, 3200 pgs
    objects: 26.29M objects, 99 TiB
    usage:   211 TiB used, 366 TiB / 577 TiB avail
    pgs:     2640099/52570180 objects misplaced (5.022%)
             3055 active+clean
             132  active+remapped+backfill_wait
             13   active+remapped+backfilling

  io:
    client:   80 MiB/s rd, 33 MiB/s wr, 773 op/s rd, 240 op/s wr
    recovery: 223 MiB/s, 55 objects/s
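For anyone checking the same thing from the CLI, a rough way to see which OSD is down and whether Ceph itself thinks removing it is safe is sketched below; the id 12 is only a placeholder for whatever ceph osd tree actually reports.

# find the down OSD and note its id
ceph osd tree | grep -w down

# ask Ceph whether destroying that OSD could lose data (available on recent releases)
ceph osd safe-to-destroy 12

If safe-to-destroy reports the OSD is safe, all of its PGs have surviving copies elsewhere, which would suggest the UI warning is the generic one rather than a sign of data actually at risk.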
admin
2,930 Posts
January 17, 2021, 4:48 pm
You can delete it and then add a new one. This is the better approach because the replacement re-uses the same OSD id, which results in the same objects being mapped to it. It is also OK to add a new one first and then delete the old, but that could result in more re-balance traffic, so it is doable but not the best option.
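In case it helps to see what that corresponds to outside the PetaSAN UI, the stock Ceph workflow for replacing a disk while keeping the same id looks roughly like the sketch below; the id 12 and /dev/sdX are placeholders, and the UI normally performs the equivalent steps for you.

# mark the dead OSD destroyed but keep its id in the CRUSH map
ceph osd destroy 12 --yes-i-really-mean-it

# create the replacement OSD on the new disk, re-using the freed id
# (run on the node that owns the new disk)
ceph-volume lvm create --data /dev/sdX --osd-id 12

Because the id, and therefore the CRUSH position, stays the same, largely the same PGs map back onto the new disk, so recovery is mostly just backfilling that one OSD rather than re-balancing data across the cluster.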
neiltorda
98 Posts
January 17, 2021, 7:20 pm
Thanks, I removed the old disk and added a new one.
I actually added a new journal SSD device and a new SSD cache device, and when I added the new OSD I picked the new journal/cache disks manually.
This node still shows in the UI that there is no journal disk, only cache.
Should I be concerned about the HEALTH_WARN scrub/deep-scrub messages? Those numbers keep growing; they are now up to 400/342.
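As a side note, one generic (not PetaSAN-specific) way to double-check from the CLI whether the new OSD actually got a journal/DB device on the SSD is to list the OSD layout on that node; this assumes BlueStore OSDs.

# run on the node hosting the new OSD
ceph-volume lvm list

Each OSD in the output shows its data device plus a separate [db] (and possibly [wal]) section if one was configured; if the new OSD has no [db] entry, it really was created without the journal SSD rather than just being displayed wrongly by the UI.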
admin
2,930 Posts
January 17, 2021, 11:05 pm
Not concerning. The scrubs could be delayed by the recovery traffic from the failed OSD. When recovery is done, allow a week for a complete scrub cycle; if you still get "not scrubbed in time" warnings after that, you can increase the scrub speed from the Maintenance tab.
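For reference, the Maintenance tab presumably maps onto standard Ceph scrub options, so the backlog can also be inspected, and scrubbing sped up, from the CLI; the values below are only illustrative, not a recommendation for this particular cluster.

# list exactly which PGs are behind on scrub / deep-scrub
ceph health detail | grep -i scrubbed

# allow more than one concurrent scrub per OSD (default is 1)
ceph config set osd osd_max_scrubs 2

# let scrubs start even when the OSDs are moderately loaded
ceph config set osd osd_scrub_load_threshold 5

Since PetaSAN may manage these same options itself, making the change through the UI as suggested is probably the safer route; the commands are mainly useful for seeing where the backlog stands.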