
Unable to add OSD to cluster

Hello,

We just updated to the latest PetaSAN version, 2.5, and as an additional task we are also adding disks to each node. I noticed that one of the nodes had run out of space on its NVMe journal drive even though the drive is 800 GB. That node has 8 data drives but 12 partitions on the NVMe disk, which leads me to believe the software does not go back and re-use old partitions; doing so would be a great feature improvement in itself.

However, my actual problem is this: I manually removed all the OSDs on that node from the cluster and cleared the partitions on the NVMe drive to reset the entire node. I then followed the normal startup procedure: add the NVMe back in as a journal, then start adding the regular HDDs to the cluster with the NVMe as their journal. The provisioning of the journal partitions does not look the same as before. There are now two partitions on the journal disk, one nearly as big as the whole drive and a second that is slightly less than 1 MB. The UI gives the error "Exception calling add OSD method." when trying to add the 3rd OSD to the cluster with the journal drive; the same OSD adds fine without journaling.

I would greatly appreciate some insight on what should be troubleshot to get these OSDs back online with journaling.
-----------------
IndexError: list index out of range
06/03/2020 07:26:03 ERROR Error while run command.
06/03/2020 07:26:03 INFO script
06/03/2020 07:26:03 INFO /opt/petasan/scripts/admin/node_manage_disks.py add-osd
06/03/2020 07:26:03 INFO params
06/03/2020 07:26:03 INFO -disk_name sdg -journal auto
06/03/2020 07:26:03 INFO Start add osd job for disk sdg.
06/03/2020 07:26:09 INFO Start cleaning disk : sdg
06/03/2020 07:26:11 INFO Executing : wipefs --all /dev/sdg
06/03/2020 07:26:11 INFO Executing : dd if=/dev/zero of=/dev/sdg bs=1M count=20 oflag=direct,dsync >/dev/null 2>&1
06/03/2020 07:26:11 INFO Executing : parted -s /dev/sdg mklabel gpt
06/03/2020 07:26:11 INFO Executing : partprobe /dev/sdg
06/03/2020 07:26:15 INFO Auto select journal for disk sdg.
06/03/2020 07:26:16 ERROR Cannot add osd for disk sdg , Exception is : list index out of range
Traceback (most recent call last):
File "/opt/petasan/scripts/admin/node_manage_disks.py", line 256, in add_osd
journal = ceph_disk_lib.get_valid_journal()
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/ceph_disk_lib.py", line 563, in get_valid_journal
if not is_journal_space_avail(disk.name):
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/ceph_disk_lib.py", line 542, in is_journal_space_avail
free_disk_space = disk_avail_space(disk_name)
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/ceph_disk_lib.py", line 794, in disk_avail_space
free_disk_space = float(free_disk_space[0]) + total_unused_partitions_size
IndexError: list index out of range
06/03/2020 07:27:45 ERROR list index out of range
Traceback (most recent call last):
File "/opt/petasan/scripts/admin/node_manage_disks.py", line 128, in main_catch
func(args)
File "/opt/petasan/scripts/admin/node_manage_disks.py", line 518, in disk_avail_space
disk_free_space = ceph_disk_lib.disk_avail_space(args.disk_name)
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/ceph_disk_lib.py", line 794, in disk_avail_space
free_disk_space = float(free_disk_space[0]) + total_unused_partitions_size
IndexError: list index out of range
06/03/2020 07:27:45 ERROR Error while run command.

----------------
----------------
root@ceph-node6:~# parted /dev/nvme0n1
GNU Parted 3.2
Using /dev/nvme0n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p
Model: INTEL SSDPE2MD800G4 (nvme)
Disk /dev/nvme0n1: 800GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name             Flags
 2      17.4kB  1049kB  1031kB               ceph-journal-db
 1      1049kB  800GB   800GB                ceph-journal-db
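
For what it's worth, the failing line indexes element [0] of a list that is apparently empty; my guess is that the free-space lookup on the journal device finds nothing because partition 1 already covers the whole 800 GB. A rough sketch of that failure mode in plain Python (my own simplification, not the actual PetaSAN code):

# Simplified sketch of the failure mode, NOT the PetaSAN source.
# Assumption: disk_avail_space() reads the "Free Space" rows of a
# "parted ... print free"-style dump; the helper below just mimics that.

def free_space_sizes(parted_dump):
    # Return the size column of every 'Free Space' row (e.g. '1049kB').
    return [line.split()[2]
            for line in parted_dump.splitlines()
            if "Free Space" in line]

def disk_avail_space_naive(parted_dump, total_unused_partitions_size=0.0):
    sizes = free_space_sizes(parted_dump)
    # Shape of the failing line in the traceback: sizes[0] raises
    # IndexError when parted reports no free-space rows at all.
    # (Unit suffixes are simply stripped here for brevity.)
    return float(sizes[0].rstrip("kMGTB")) + total_unused_partitions_size

def disk_avail_space_guarded(parted_dump, total_unused_partitions_size=0.0):
    sizes = free_space_sizes(parted_dump)
    if not sizes:  # no free-space rows: treat the device as full
        return total_unused_partitions_size
    return float(sizes[0].rstrip("kMGTB")) + total_unused_partitions_size

# The nvme layout above: two partitions and no free-space rows at all.
dump = (" 2      17.4kB  1049kB  1031kB               ceph-journal-db\n"
        " 1      1049kB  800GB   800GB                ceph-journal-db\n")

try:
    disk_avail_space_naive(dump)
except IndexError as exc:
    print("naive: %s" % exc)                               # naive: list index out of range
print("guarded: %s" % disk_avail_space_guarded(dump))      # guarded: 0.0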

Re-use of journal space was added in v2.3.0: it tags the journal partitions it adds so they can be re-used, otherwise empty space on the device is used. We cannot re-use just any existing partition. However, I do not know if this feature also supports journals created pre-2.3.0. Was your cluster created before 2.3.0?
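
Conceptually the selection works roughly like the sketch below (a simplification of the idea only, not the PetaSAN source; the tag string and field names are made up): a partition that carries the journal tag and is no longer referenced by any OSD is handed out first, otherwise a new partition is carved from free space.

# Conceptual sketch of journal partition re-use, NOT the PetaSAN source.
JOURNAL_TAG = "ceph-journal"          # hypothetical tag in the GPT partition name

def pick_journal(partitions, free_bytes, db_size_bytes):
    # 1) prefer re-using a tagged partition that no OSD references any more
    for part in partitions:
        if JOURNAL_TAG in part["name"] and not part["in_use"]:
            return part["dev"]
    # 2) otherwise fall back to carving a new partition out of free space
    if free_bytes >= db_size_bytes:
        return "create-new-partition"
    # 3) nothing usable: the caller has to raise an error
    return None

# Made-up example: both partitions carry the tag but are still in use, and
# the device has no free space left, so there is nothing to hand out.
parts = [
    {"dev": "/dev/nvme0n1p1", "name": "ceph-journal-db", "in_use": True},
    {"dev": "/dev/nvme0n1p2", "name": "ceph-journal-db", "in_use": True},
]
print(pick_journal(parts, free_bytes=0, db_size_bytes=40 << 30))   # None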

From the partition info of your NVMe, it does not seem that any OSDs were added. Can you delete it as a journal from the UI, re-add it from the UI, and then see if you can add OSDs?

 

Hello,
Yes, this cluster was created prior to 2.3. I deleted the journal drive, re-added it, and then re-added the first and second data drives fine. The 3rd drive starts throwing the errors shown below in the PetaSAN.log file. I had also zapped all the disks before adding them.

Can we do a WebEx or AnyDesk type session to troubleshoot this more deeply?

06/03/2020 11:24:09 INFO Starting : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --data /dev/sdb1 --block.db /dev/nvme0n1p1
06/03/2020 11:24:10 ERROR Error executing : ceph-volume --log-path /opt/petasan/log lvm prepare --bluestore --data /dev/sdb1 --block.db /dev/nvme0n1p1
06/03/2020 11:47:38 ERROR list index out of range
Traceback (most recent call last):
File "/opt/petasan/scripts/admin/node_manage_disks.py", line 128, in main_catch
func(args)
File "/opt/petasan/scripts/admin/node_manage_disks.py", line 518, in disk_avail_space
disk_free_space = ceph_disk_lib.disk_avail_space(args.disk_name)
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/ceph_disk_lib.py", line 794, in disk_avail_space
free_disk_space = float(free_disk_space[0]) + total_unused_partitions_size
IndexError: list index out of range
06/03/2020 11:47:38 ERROR Error while run command.

We cannot do WebEx sessions for forum support; you can of course buy support if you wish.

Can you post the output of the following commands:

parted /dev/nvme0n1 print # after you added 2 disks successfully
ceph config get osd.* bluestore_block_db_size
cat /etc/ceph/ceph.conf | grep bluestore_block_db_size

Are you using BlueStore or FileStore?

We have no issue with paying for support if we are truly stuck, but we believe this may be an oversight or a bug introduced by the upgrade. See below for the requested command outputs.

We are using BlueStore.
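
(With BlueStore the journal partition is actually used as the OSD's block.db device; that matches the ceph-volume ... --block.db /dev/nvme0n1p1 call in the log above and the ceph-journal-db partition names.)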

root@ceph-node6:~# parted /dev/nvme0n1 print
Model: INTEL SSDPE2MD800G4 (nvme)
Disk /dev/nvme0n1: 800GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name             Flags
 2      17.4kB  1049kB  1031kB               ceph-journal-db
 1      1049kB  800GB   800GB                ceph-journal-db

root@ceph-node6:~# ceph config get osd.* bluestore_block_db_size
42949672960
root@ceph-node6:~# cat /etc/ceph/ceph.conf | grep bluestore_block_db_size
root@ceph-node6:~#

NOTE: the last grep command above returned nothing.
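
For reference, 42949672960 bytes is exactly 40 GiB (40 * 1024^3), so the cluster-wide DB size setting looks sane; the empty grep just means the value is not present in the node's ceph.conf, which is presumably what the journal size lookup stumbles over.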

 

This is indeed a bug. I will follow up.

Can you try this patch, download:

https://drive.google.com/open?id=1oav5oglKbrgunnZOBYLXBndVPdYHIkFA

apply via:

patch -d / -p1 < get_journal_size.patch

Delete the journal and re-add it, then let me know if everything is OK.
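
(For anyone following along: patch -d / -p1 changes to the filesystem root before applying the diff and strips the first path component from the file names inside the patch, so the affected files, presumably under /usr/lib/python2.7/dist-packages/PetaSAN/, get updated in place.)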

 

Hello,

The patch is working. Thanks a lot.

Brian