
OSD in limbo

I had an OSD fail, and after replacing the disk I am unable to bring it back online and into the cluster. I got the failure notification, then ran 'systemctl stop ceph-osd@75', deleted the OSD from the PetaSAN Node Disk List, and removed the physical disk.
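(For reference, and assuming the standard Ceph CLI here, I believe the raw equivalent of the UI delete is roughly the following; the --cluster flag matches the cluster name shown further down:

ceph osd out 75 --cluster BD-Ceph-Cluster1
ceph osd crush remove osd.75 --cluster BD-Ceph-Cluster1
ceph auth del osd.75 --cluster BD-Ceph-Cluster1
ceph osd rm 75 --cluster BD-Ceph-Cluster1

and 'ceph osd tree --cluster BD-Ceph-Cluster1' should confirm whether osd.75 is really gone from the CRUSH map.)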

Installed a new disk and tried to add it using the PetaSAN node disk list. It appears to add the OSD, but it never comes up. Here's the output of 'systemctl status ceph-osd@75':

root@BD-Ceph-SD2:~# systemctl status ceph-osd@75
ceph-osd@75.service - Ceph object storage daemon osd.75
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
  Drop-In: /etc/systemd/system/ceph-osd@.service.d
           └─override.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2018-10-10 12:30:37 EDT; 17s ago
  Process: 3106347 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 3106341 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 3106347 (code=exited, status=1/FAILURE)

Oct 10 12:30:37 BD-Ceph-SD2 systemd[1]: ceph-osd@75.service: Unit entered failed state.
Oct 10 12:30:37 BD-Ceph-SD2 systemd[1]: ceph-osd@75.service: Failed with result 'exit-code'.

I have attempted 'systemctl start ceph-osd@75', but there is no change. Here's the output of 'ceph status --cluster CLUSTERNAME':

root@BD-Ceph-SD2:~# ceph status --cluster BD-Ceph-Cluster1
  cluster:
    id:     2ce79d8c-2fb2-4f0f-8750-55699d3800e0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum BD-Ceph-SD1,BD-Ceph-SD2,BD-Ceph-SD3
    mgr: BD-Ceph-SD3(active), standbys: BD-Ceph-SD1, BD-Ceph-SD2
    osd: 96 osds: 95 up, 95 in

  data:
    pools:   1 pools, 4096 pgs
    objects: 2145k objects, 8582 GB
    usage:   19204 GB used, 242 TB / 261 TB avail
    pgs:     4096 active+clean

  io:
    client: 16072 kB/s rd, 1735 kB/s wr, 503 op/s rd, 402 op/s wr

and the (relevant) output of 'ceph-disk list' for the OSD in question:

/dev/sdz :
/dev/sdz1 ceph data, prepared, cluster BD-Ceph-Cluster1, osd.75, block /dev/sdz2, block.db /dev/nvme0n1p18
/dev/sdz2 ceph block, for /dev/sdz1

The disk is stuck in the prepared state. It shows no status in the PetaSAN node disk list; where there would normally be a green up or red down icon, there is nothing. It does show as in use by OSD 75, though.

Any ideas for getting this to sync back up into the pool?

Thanks!

Tried 'ceph-disk activate /dev/sdz1':

root@BD-Ceph-SD2:~# ceph-disk activate /dev/sdz1
ceph-disk: Error: another BD-Ceph-Cluster1 osd.75 already mounted in position (old/different cluster instance?); unmounting ours.

It seems some relics from the original disk are still lingering. How do I clear them out so this new OSD can activate?

I also tried zeroing the journal partition and the new disk to see if that was the issue, but it did not help. I did notice that when I remove the OSD again from the PetaSAN disk list and try to re-add it, the journal partition number for the new OSD is incremented, and the old journal partition is never deleted. Now I have at least 4 or 5 additional partitions on the journal disk that probably shouldn't be there. Is there any way to clean these up?

Thanks!

Do you have space left on your nvme journal device (/dev/nvme0n1, which holds /dev/nvme0n1p18)? ceph-disk requires a 20 GB partition for wal/db per OSD. One shortcoming of ceph-disk is that it always creates a new partition at the end of the device without trying to re-use old ones, so after failures you end up with empty partition holes on the nvme device.
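A quick way to check is to print the partition table with free space included (assuming parted is installed, which it normally is):

parted /dev/nvme0n1 unit GB print free

This lists every partition plus any unallocated gaps, so you can tell whether a new 20 GB wal/db partition still fits at the end of the device.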

Another situation where this may occur is if you have very little free RAM left.
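A quick 'free -m' on the node will show how much memory is actually available.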

Otherwise, can you please post the ceph-disk log (or, if possible, just the relevant section at the end):

/opt/petasan/log/ceph-disk.log

Just saw your reply after I edited my previous post. It does appear the journal may be out of space, as there are a number of old stale partitions there. Am I safe to remove the partitions that don't show up as in use in ceph-disk list? i.e.:

root@BD-Ceph-SD2:~# ceph-disk list | grep /dev/nvme
/dev/nvme0n1 :
/dev/nvme0n1p1 ceph block.db, for /dev/sdf1
/dev/nvme0n1p10 ceph block.db
/dev/nvme0n1p11 ceph block.db, for /dev/sdr1
/dev/nvme0n1p12 ceph block.db, for /dev/sdt1
/dev/nvme0n1p13 ceph block.db, for /dev/sdv1
/dev/nvme0n1p14 ceph block.db
/dev/nvme0n1p15 ceph block.db, for /dev/sdn1
/dev/nvme0n1p16 ceph block.db, for /dev/sdp1
/dev/nvme0n1p17 ceph block.db
/dev/nvme0n1p18 ceph block.db
/dev/nvme0n1p19 ceph block.db
/dev/nvme0n1p2 ceph block.db, for /dev/sdd1
/dev/nvme0n1p20 ceph block.db
/dev/nvme0n1p3 ceph block.db, for /dev/sdb1
/dev/nvme0n1p4 ceph block.db, for /dev/sda1
/dev/nvme0n1p5 ceph block.db, for /dev/sdj1
/dev/nvme0n1p6 ceph block.db, for /dev/sdh1
/dev/nvme0n1p7 ceph block.db
/dev/nvme0n1p8 ceph block.db
/dev/nvme0n1p9 ceph block.db

As long as they don't say "for /dev/something", am I OK to assume I can remove them?

I would post the output of the disk log, but it is extremely muddled with the actions I've been taking. I'll post it if there's something of use after flushing the stale journal partitions.

Yes, you can remove p17-p20. The only cases where a partition does not show a "for /dev/sdx" are when the OSD was already deleted, or when the OSD is valid but down. To be extra safe, double check that, together with any other nvmes, you account for all local OSDs.
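For reference, one way to remove them (assuming sgdisk from the gdisk package is available; parted can do the same) is by partition number, then re-read the table and confirm:

sgdisk -d 20 /dev/nvme0n1
sgdisk -d 19 /dev/nvme0n1
sgdisk -d 18 /dev/nvme0n1
sgdisk -d 17 /dev/nvme0n1
partprobe /dev/nvme0n1
ceph-disk list | grep nvme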

I recommend you delete the failed OSD, remove the partitions as per above, then make sure you have at least 20 GB left on the nvme before you try to add it via the UI.

If you still have issues, I need to look at the ceph-disk log.

Also, in addition to the above, do physically remove the failed OSD disk after deletion; do not leave it in. Before adding the new disk as an OSD, check there is no leftover mount of the old OSD:

umount /var/lib/ceph/osd/CLUSTER_NAME-OSD_ID
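To see whether anything is still mounted there first, something like:

mount | grep /var/lib/ceph/osd

In this thread's case the path would presumably be /var/lib/ceph/osd/BD-Ceph-Cluster1-75.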

After removing the stale journal partitions, I checked for the mount in /var/lib/ceph/osd and it still existed. I ran the umount, then was able to add the OSD from the PetaSAN UI without issue. I didn't have to physically remove the disk (which is great, because I'm not onsite today!).

Would it make sense to include a check for journal partitions in the PetaSAN UI logic when removing an OSD, so it cleans them up automatically?

Thanks for your help!

Currently we just create a new 20 GB partition on the journal device when we add a new OSD. I believe what you want is for us to first look for any unused partition on the journal device; such an unused partition could be left over from a previously failed OSD that has since been deleted. In your case we got lucky, since we were able to delete partitions at the end. Given that this is probably the most common reason we see for failure to add OSDs, and it also wastes space on the journal device (typically high-end hardware), I will log this as a bug so we can try to support it.

For the mount issue: we do zap/clean a disk both when we delete it and when we add a disk as an OSD, and this cleaning includes first unmounting all its partitions, so it should not be an issue.
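(If you ever need to do that cleanup by hand, ceph-disk has a zap subcommand that wipes a disk's partition table, destroying everything on it:

ceph-disk zap /dev/sdz

but as above, PetaSAN already does this on delete/add.)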

Just one comment: having one nvme as the journal for 12 disks is OK from a performance point of view, but if that single device fails, it will bring down 12 OSDs. That is not a major issue, but it will likely create a lot of recovery traffic. Typically nvmes serve as journals for ssds, and ssds as journals for hdds, at approximately a 1:4 ratio. But again, your setup is doable.