
Cannot add any new OSD after removing an OSD

After cleaning the drive and running the commands to take it out of the ceph osd tree, unmount it, and remove it from the CRUSH map, I'm not sure what else could be causing this issue. The last time I ran into this, my usual command sequence worked until it didn't:

ceph osd out 33
systemctl stop ceph-osd@33
ceph osd crush remove osd.33
ceph osd rm 33
umount /var/lib/ceph/osd/ceph-33
ceph osd crush remove osd.33
rm -rf /var/lib/ceph/osd/ceph-33
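One thing worth noting about the sequence above: it never runs `ceph auth del osd.33`, so the OSD's cephx key stays registered on the monitors, which matches the "entity osd.33 exists but key does not match" error later in this thread. As a hedged sketch (the helper name and dry-run style are mine, not a PetaSAN tool), here is a dry-run version of a commonly documented removal sequence that includes the auth deletion; it only prints the commands so they can be reviewed before running on a real cluster:

```shell
#!/bin/sh
# Hypothetical dry-run helper: print a commonly documented Ceph OSD removal
# sequence for a given OSD id, including the "ceph auth del" step missing
# from the sequence above. Review the output before running it for real.
print_osd_removal() {
    id="$1"
    cat <<EOF
ceph osd out ${id}
systemctl stop ceph-osd@${id}
ceph osd crush remove osd.${id}
ceph auth del osd.${id}
ceph osd rm ${id}
umount /var/lib/ceph/osd/ceph-${id}
EOF
}

print_osd_removal 33
```

Printing first and running second is deliberate here: a pasted one-liner that half-fails can leave the cluster in exactly the stale-entry state this thread describes.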

I am having trouble adding any drive back into the node.

It says "adding" but fails shortly afterward with the error log below:

04/06/2024 08:20:07 INFO CIFSService key change action

04/06/2024 08:20:00 ERROR Error executing : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --dmcrypt --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02

--> RuntimeError: Unable to create a new OSD id

stderr: Error EEXIST: entity osd.33 exists but key does not match

Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 316b6224-a805-43a7-9d46-8d794df5830c

Running command: /usr/bin/ceph-authtool --gen-print-key

04/06/2024 08:20:00 ERROR Running command: /usr/bin/ceph-authtool --gen-print-key

04/06/2024 08:19:59 INFO Starting : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --dmcrypt --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02

04/06/2024 08:19:56 INFO Executing udevadm settle --timeout 30

04/06/2024 08:19:56 INFO Calling udevadm on xvdn device

04/06/2024 08:19:56 INFO Executing partprobe /dev/xvdn

04/06/2024 08:19:56 INFO Calling partprobe on xvdn device

04/06/2024 08:19:55 INFO Creating data partition num 1 size 66560MB on /dev/xvdn

04/06/2024 08:19:54 INFO Start prepare bluestore OSD : xvdn

04/06/2024 08:19:54 INFO User didn't select a cache for disk xvdn.

04/06/2024 08:19:54 INFO User didn't select a journal for disk xvdn, so the journal will be on the same disk.

04/06/2024 08:19:51 INFO Executing : partprobe /dev/xvdn

04/06/2024 08:19:51 INFO Executing : parted -s /dev/xvdn mklabel gpt

04/06/2024 08:19:51 INFO Executing : dd if=/dev/zero of=/dev/xvdn bs=1M count=20 oflag=direct,dsync >/dev/null 2>&1

04/06/2024 08:19:51 INFO Executing : wipefs --all /dev/xvdn

04/06/2024 08:19:50 INFO Start cleaning : xvdn
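The "Error EEXIST: entity osd.33 exists but key does not match" in the log above is the monitor refusing the `osd new` call because a cephx entry for osd.33 still exists with a different key. A quick way to spot such leftovers is to filter `ceph auth ls` output for osd entities. As a runnable sketch (the here-doc stands in for real `ceph auth ls` output, and the key material is a redacted sample; on a live node you would pipe the real command instead):

```shell
#!/bin/sh
# Sketch: extract OSD entity names from "ceph auth ls"-style output, to spot
# entries (like osd.33 here) that survived an incomplete removal.
list_osd_auth_entities() {
    awk '/^osd\./ { print $1 }'
}

# Sample input standing in for: ceph auth ls | list_osd_auth_entities
cat <<'EOF' | list_osd_auth_entities
osd.33
        key: AQ...redacted-sample...
        caps: [mon] allow profile osd
client.admin
        key: AQ...redacted-sample...
EOF
```

If a stale entry shows up for an id you already removed, `ceph auth del osd.<id>` clears it so the id can be reused.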

What was the initial failure that led you to remove/re-add the OSD?

If you try with a non-encrypted drive, do you have the same issue?


I was just testing removal and re-addition of the drive to make sure I understand the whole process, so there was no actual problem with the drive. The issue occurs sometimes when I remove the drive; it seems as if something is left hanging that causes problems when adding it back in.

I tried both encrypted and unencrypted with no difference. Once this issue occurs, rebooting, deleting the disk, and swapping it doesn't seem to do anything. Since this is a VM I can revert to a previous state and everything is back to normal and I can add the disk again, but in production I definitely cannot do that, especially if I need to remove OSDs and then cannot add them back afterward. FYI, this is version 3.3.0.

05/06/2024 04:26:25 ERROR Error executing : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02

--> RuntimeError: Unable to create a new OSD id

stderr: Error EEXIST: entity osd.33 exists but key does not match

Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 1b31de05-06c4-4ecc-bebe-804c280fcaab

05/06/2024 04:26:25 ERROR Running command: /usr/bin/ceph-authtool --gen-print-key

05/06/2024 04:26:25 INFO Starting : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02

05/06/2024 04:26:22 INFO Executing udevadm settle --timeout 30

05/06/2024 04:26:22 INFO Calling udevadm on xvdn device

05/06/2024 04:26:22 INFO Executing partprobe /dev/xvdn

05/06/2024 04:26:22 INFO Calling partprobe on xvdn device

05/06/2024 04:26:20 INFO Creating data partition num 1 size 66560MB on /dev/xvdn

05/06/2024 04:26:20 INFO Start prepare bluestore OSD : xvdn

05/06/2024 04:26:20 INFO User didn't select a cache for disk xvdn.

05/06/2024 04:26:20 INFO User didn't select a journal for disk xvdn, so the journal will be on the same disk.

05/06/2024 04:26:17 INFO Executing : partprobe /dev/xvdn

05/06/2024 04:26:17 INFO Executing : parted -s /dev/xvdn mklabel gpt

05/06/2024 04:26:17 INFO Executing : dd if=/dev/zero of=/dev/xvdn bs=1M count=20 oflag=direct,dsync >/dev/null 2>&1

05/06/2024 04:26:17 INFO Executing : wipefs --all /dev/xvdn

05/06/2024 04:26:17 INFO Executing : dd if=/dev/zero of=/dev/xvdn1 bs=1M count=20 oflag=direct,dsync >/dev/null 2>&1

05/06/2024 04:26:17 INFO Executing : wipefs --all /dev/xvdn1

PetaSAN.core.common.CustomException.CIFSException

raise CIFSException(CIFSException.CIFS_CLUSTER_NOT_UP, '')

File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 369, in rename_shares_top_dir

self.rename_shares_top_dir(settings)

File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 311, in sync_consul_settings

self.server.sync_consul_settings(settings)

File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_service.py", line 38, in key_change_action

self.key_change_action(kv)

File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/watch_base.py", line 62, in run

Traceback (most recent call last):

05/06/2024 04:26:16 ERROR

05/06/2024 04:26:16 ERROR WatchBase Exception :

05/06/2024 04:26:15 INFO CIFSService key change action

05/06/2024 04:26:13 INFO Start cleaning : xvdn


05/06/2024 04:26:10 INFO Start add osd job 29889

05/06/2024 04:26:10 INFO Start add osd job for disk xvdn.

05/06/2024 04:26:10 INFO Running script : /opt/petasan/scripts/admin/node_manage_disks.py add-osd -disk_name xvdn -device_class ssd-rack-A00001N00001-02 -encrypted False

PetaSAN.core.common.CustomException.CIFSException

OK. I tried deleting another OSD, which is not ideal, but adding OSDs started working after that deletion. That makes no sense to me, but it is working again. However, I notice xvdn is now stuck in "Adding" and just hangs there... not sure what would cause that. This could be why it won't let me add anything: even removing the xvdn drive and adding a drive back jumps it straight to "Adding", so the OSD addition never completes.

I also tried the command line:

root@storage-pool00000-node00001:/var/lib/ceph/osd# python /opt/petasan/scripts/admin/node_manage_disks.py add-osd -disk_name xvdn
Setting name!
partNum is 0
The operation has completed successfully.
Cannot add osd for disk xvdn , Exception is : unexpected EOF while parsing (<string>, line 0)
## petasan ##
Error adding OSD.

OK, so my solution is to use node_manage_disks.py for adding and removing drives. It was mostly the removal of the drive that was the issue, and using this script resolves all of it as far as I can tell. I do need to disconnect and reconnect the drive to get wipefs to run; otherwise the drive is busy and the run fails on rare occasions, possibly because the system is still holding onto the drive for some reason.
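For the occasional busy-device wipefs failures mentioned above, physically reseating the drive is one workaround; another is simply retrying after a short delay, since the holder often releases the device within seconds. A generic, hedged sketch (the `retry` helper is mine, not part of PetaSAN):

```shell
#!/bin/sh
# Hypothetical retry wrapper: re-run a command up to N times with a delay
# between attempts. Can help when wipefs fails transiently with a busy
# device, e.g.:  retry 5 2 wipefs --all /dev/xvdn
retry() {
    attempts="$1"; delay="$2"; shift 2
    i=1
    while true; do
        "$@" && return 0                     # success: stop retrying
        [ "$i" -ge "$attempts" ] && return 1 # out of attempts: give up
        i=$((i + 1))
        sleep "$delay"
    done
}
```

If even repeated attempts fail, the real fix is finding and releasing the holder (mounts, LVM, or dm-crypt mappings) rather than retrying harder.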

The issue happens only if you try to re-add the same healthy encrypted drive that you had deleted. It does not happen if you add other drives. In a real-world scenario your replacement drive will be different, so it will not be affected.
The cause is that if you remove an encrypted OSD drive that is still healthy, the dm-crypt volume is left active and prevents you from wiping the drive correctly for re-use. We should handle this case in a future release and correctly remove the dm-crypt volume.

If you have a case where you need to do this (replace a healthy encrypted OSD drive with a new OSD on the same drive):

1. Manually stop the OSD, then remove it from the UI.
2. Find the dm-crypt UUID using:
lsblk
3. Remove the dm-crypt volume (replace with your UUID):
cryptsetup remove /dev/mapper/l4D6ql-Prji-IzH4-dfhF-xzuf-5ETl-jNRcXC
4. Clean the drive (replace sdX):
pvs -o vgname /dev/sdX1 | grep ceph | xargs vgchange -a n
wipefs -a /dev/sdX1
wipefs -a /dev/sdX
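The manual steps above can also be scripted. The sketch below (function names and dry-run style are mine) filters `lsblk -nro NAME,TYPE` output down to active dm-crypt mappings and prints the corresponding cleanup commands instead of executing them; the here-doc is sample lsblk output, and on a real node you would pipe `lsblk -nro NAME,TYPE /dev/sdX` instead:

```shell
#!/bin/sh
# Sketch: turn "lsblk -nro NAME,TYPE" output into dm-crypt cleanup commands.
crypt_names() {
    # Keep only rows whose TYPE column is "crypt"; print the mapping name.
    awk '$2 == "crypt" { print $1 }'
}

print_crypt_cleanup() {
    # $1 is the disk device, e.g. /dev/sdX; reads mapping names on stdin.
    dev="$1"
    while read -r name; do
        echo "cryptsetup remove /dev/mapper/${name}"
    done
    echo "wipefs -a ${dev}1"
    echo "wipefs -a ${dev}"
}

# Sample input standing in for: lsblk -nro NAME,TYPE /dev/sdX
cat <<'EOF' | crypt_names | print_crypt_cleanup /dev/sdX
sdX disk
sdX1 part
l4D6ql-Prji-IzH4-dfhF-xzuf-5ETl-jNRcXC crypt
EOF
```

Printing the commands keeps a human in the loop, which seems prudent given that wipefs against the wrong device is irreversible.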

Thanks, really appreciate it, and thanks for the prompt response. This was perfect.