Cannot add any new OSD after removing an OSD
the.only.chaos.lucifer
31 Posts
June 4, 2024, 8:34 am
After cleaning the drive and making sure to run the commands to take the OSD out of the ceph osd tree, unmount it, and remove it from the CRUSH map, I'm not sure what else is causing this issue. Last time I ran into this, my usual command sequence worked until it didn't:
ceph osd out 33
systemctl stop ceph-osd@33
ceph osd crush remove osd.33
ceph osd rm 33
umount /var/lib/ceph/osd/ceph-33
ceph osd crush remove osd.33
rm -rf /var/lib/ceph/osd/ceph-33
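One step the sequence above does not include is deleting the OSD's cephx key; the "Error EEXIST: entity osd.33 exists but key does not match" message in the log below is consistent with a stale key for id 33 still being registered. A rough sketch of the extra commands (standard Ceph CLI, not part of the original sequence):
ceph auth get osd.33        # check whether a cephx entry for the old id still exists
ceph auth del osd.33        # remove the stale key so the id can be reused
# recent Ceph releases can combine crush remove, auth del and osd rm in one step:
# ceph osd purge 33 --yes-i-really-mean-it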
I am having trouble adding any drive back into the node.
It says "Adding" but fails shortly afterward, with the error log below:
04/06/2024 08:20:07 INFO CIFSService key change action
04/06/2024 08:20:00 ERROR Error executing : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --dmcrypt --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02
--> RuntimeError: Unable to create a new OSD id
stderr: Error EEXIST: entity osd.33 exists but key does not match
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 316b6224-a805-43a7-9d46-8d794df5830c
Running command: /usr/bin/ceph-authtool --gen-print-key
04/06/2024 08:20:00 ERROR Running command: /usr/bin/ceph-authtool --gen-print-key
04/06/2024 08:19:59 INFO Starting : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --dmcrypt --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02
04/06/2024 08:19:56 INFO Executing udevadm settle --timeout 30
04/06/2024 08:19:56 INFO Calling udevadm on xvdn device
04/06/2024 08:19:56 INFO Executing partprobe /dev/xvdn
04/06/2024 08:19:56 INFO Calling partprobe on xvdn device
04/06/2024 08:19:55 INFO Creating data partition num 1 size 66560MB on /dev/xvdn
04/06/2024 08:19:54 INFO Start prepare bluestore OSD : xvdn
04/06/2024 08:19:54 INFO User didn't select a cache for disk xvdn.
04/06/2024 08:19:54 INFO User didn't select a journal for disk xvdn, so the journal will be on the same disk.
04/06/2024 08:19:51 INFO Executing : partprobe /dev/xvdn
04/06/2024 08:19:51 INFO Executing : parted -s /dev/xvdn mklabel gpt
04/06/2024 08:19:51 INFO Executing : dd if=/dev/zero of=/dev/xvdn bs=1M count=20 oflag=direct,dsync >/dev/null 2>&1
04/06/2024 08:19:51 INFO Executing : wipefs --all /dev/xvdn
04/06/2024 08:19:50 INFO Start cleaning : xvdn
admin
2,930 Posts
June 4, 2024, 11:34 am
What was the initial failure that led you to remove/re-add the OSD?
If you try with a non-encrypted drive, do you have the same issue?
the.only.chaos.lucifer
31 Posts
June 5, 2024, 4:34 am
I was just playing around with removing and adding the drive for testing purposes, to make sure I understand the whole process, so there was no issue with the drive itself. The issue occurs sometimes when I remove the drive: it seems as if something is left hanging that causes problems when adding it back in.
I tried both encrypted and unencrypted, and there is no difference. Once this issue occurs, rebooting, deleting the disk, and swapping it don't seem to do anything. Since this is a VM, I can revert to its previous state and everything goes back to normal and I can add the disk again, but in production I definitely cannot do that, especially if I need to remove OSDs and then cannot add them afterward. FYI, version 3.3.0. Error log below:
05/06/2024 04:26:25 ERROR Error executing : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02
--> RuntimeError: Unable to create a new OSD id
stderr: Error EEXIST: entity osd.33 exists but key does not match
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 1b31de05-06c4-4ecc-bebe-804c280fcaab
05/06/2024 04:26:25 ERROR Running command: /usr/bin/ceph-authtool --gen-print-key
05/06/2024 04:26:25 INFO Starting : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --data /dev/xvdn1 --crush-device-class ssd-rack-A00001N00001-02
05/06/2024 04:26:22 INFO Executing udevadm settle --timeout 30
05/06/2024 04:26:22 INFO Calling udevadm on xvdn device
05/06/2024 04:26:22 INFO Executing partprobe /dev/xvdn
05/06/2024 04:26:22 INFO Calling partprobe on xvdn device
05/06/2024 04:26:20 INFO Creating data partition num 1 size 66560MB on /dev/xvdn
05/06/2024 04:26:20 INFO Start prepare bluestore OSD : xvdn
05/06/2024 04:26:20 INFO User didn't select a cache for disk xvdn.
05/06/2024 04:26:20 INFO User didn't select a journal for disk xvdn, so the journal will be on the same disk.
05/06/2024 04:26:17 INFO Executing : partprobe /dev/xvdn
05/06/2024 04:26:17 INFO Executing : parted -s /dev/xvdn mklabel gpt
05/06/2024 04:26:17 INFO Executing : dd if=/dev/zero of=/dev/xvdn bs=1M count=20 oflag=direct,dsync >/dev/null 2>&1
05/06/2024 04:26:17 INFO Executing : wipefs --all /dev/xvdn
05/06/2024 04:26:17 INFO Executing : dd if=/dev/zero of=/dev/xvdn1 bs=1M count=20 oflag=direct,dsync >/dev/null 2>&1
05/06/2024 04:26:17 INFO Executing : wipefs --all /dev/xvdn1
PetaSAN.core.common.CustomException.CIFSException
raise CIFSException(CIFSException.CIFS_CLUSTER_NOT_UP, '')
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 369, in rename_shares_top_dir
self.rename_shares_top_dir(settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 311, in sync_consul_settings
self.server.sync_consul_settings(settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_service.py", line 38, in key_change_action
self.key_change_action(kv)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/watch_base.py", line 62, in run
Traceback (most recent call last):
05/06/2024 04:26:16 ERROR
05/06/2024 04:26:16 ERROR WatchBase Exception :
05/06/2024 04:26:15 INFO CIFSService key change action
05/06/2024 04:26:13 INFO Start cleaning : xvdn
05/06/2024 04:26:10 INFO Start add osd job 29889
05/06/2024 04:26:10 INFO Start add osd job for disk xvdn.
05/06/2024 04:26:10 INFO Running script : /opt/petasan/scripts/admin/node_manage_disks.py add-osd -disk_name xvdn -device_class ssd-rack-A00001N00001-02 -encrypted False
PetaSAN.core.common.CustomException.CIFSException
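A rough sketch for checking whether anything is left behind after such a removal (standard Ceph/LVM tools; osd.33 and xvdn are the id and disk from the posts above):
ceph osd tree | grep osd.33      # the removed id should no longer be listed
ceph auth get osd.33             # "Error ENOENT" means the cephx key is really gone
lsblk /dev/xvdn                  # look for leftover lvm/crypt entries stacked on the disk
dmsetup ls                       # stale ceph-* or crypt device-mapper entries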
Last edited on June 5, 2024, 5:14 am by the.only.chaos.lucifer · #3
the.only.chaos.lucifer
31 Posts
June 5, 2024, 5:11 am
OK, I tried deleting another OSD, which is not ideal, but adding OSDs started working after that deletion. That makes no sense to me, but it is working again. However, I notice xvdn is now stuck in "Adding" and just hangs there... Not sure what would cause that. This could be why it doesn't let me add anything: even removing the xvdn drive and adding a drive back jumps it straight to "Adding", so the OSD addition never works.
I also tried the command line:
root@storage-pool00000-node00001:/var/lib/ceph/osd# python /opt/petasan/scripts/admin/node_manage_disks.py add-osd -disk_name xvdn
Setting name!
partNum is 0
The operation has completed successfully.
Cannot add osd for disk xvdn , Exception is : unexpected EOF while parsing (<string>, line 0)
## petasan ##
Error adding OSD.
the.only.chaos.lucifer
31 Posts
June 5, 2024, 7:37 am
OK, so my solution is to use node_manage_disks.py to do the adding and removing of drives. It was mostly the removal of the drive that was the issue, so using this script resolves all of it as far as I can tell. I do need to disconnect and reconnect the drive to get wipefs to run; otherwise the drive is busy and the run fails on rare occasions, which could be the system still holding onto the drive for some reason.
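For reference, the add invocation as it appears in the logs above; the matching remove subcommand is not shown in this thread, so check the script's own usage output rather than guessing at its flags:
python /opt/petasan/scripts/admin/node_manage_disks.py add-osd -disk_name xvdn -device_class ssd-rack-A00001N00001-02 -encrypted False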
Last edited on June 5, 2024, 7:52 am by the.only.chaos.lucifer · #5
admin
2,930 Posts
June 5, 2024, 7:14 pm
The issue happens if you try to re-add the same healthy encrypted drive that you had deleted. It does not happen if you add other drives. In a real-world scenario your replacement drive will be different, so it will not be affected either.
The issue is that if you remove an encrypted OSD drive that is still healthy, the dm-crypt volume is left active and prevents you from wiping the drive correctly for re-use. We should handle this case in future releases by correctly removing the dm-crypt volume.
If you have a case where you need to do this, i.e. replace a healthy encrypted OSD drive with a new OSD on the same drive (a consolidated sketch follows the steps below):
Manually stop the OSD then remove from UI.
Find dm-crypt UUID using
lsblk
Remove dm-crypt volume (replace with UUID)
cryptsetup remove /dev/mapper/l4D6ql-Prji-IzH4-dfhF-xzuf-5ETl-jNRcXC
Clean drive (replace sdX)
pvs -o vgname /dev/sdX1 | grep ceph | xargs vgchange -a n
wipefs -a /dev/sdX1
wipefs -a /dev/sdX
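Put together, a rough sketch of the whole sequence (sdX is a placeholder; the mapper name comes from the lsblk output):
lsblk -o NAME,TYPE /dev/sdX                                  # the "crypt" entry is the leftover dm-crypt mapping
cryptsetup remove /dev/mapper/<mapper-name>                  # close it ("cryptsetup close" is equivalent)
pvs -o vgname /dev/sdX1 | grep ceph | xargs vgchange -a n    # deactivate the ceph volume group
wipefs -a /dev/sdX1
wipefs -a /dev/sdX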
Last edited on June 5, 2024, 7:17 pm by admin · #6
the.only.chaos.lucifer
31 Posts
June 7, 2024, 4:57 pm
Thanks, I really appreciate it, and thanks for the prompt response. This was perfect.