
Re-Add OSD after Upgrade to 2.0

Hello,

We did the manual upgrade using the upgrade package. Everything went fine,
except we can't re-add one OSD.

In PetaSAN.log you can see this error:

15/02/2018 13:29:04 INFO Start add osd job for disk sda.
15/02/2018 13:29:05 INFO Start cleaning disk sda
15/02/2018 13:29:06 INFO Executing wipefs --all /dev/sda
15/02/2018 13:29:06 INFO Executing dd if=/dev/zero of=/dev/sda bs=1M count=500 oflag=direct,dsync >/dev/null 2>&1
15/02/2018 13:29:12 INFO Executing parted -s /dev/sda mklabel gpt
15/02/2018 13:29:12 INFO Executing partprobe /dev/sda
15/02/2018 13:29:15 INFO Starting ceph-disk zap /dev/sda
15/02/2018 13:29:18 INFO ceph-disk zap done
15/02/2018 13:29:18 INFO Auto select journal for disk sda.
15/02/2018 13:29:19 INFO User selected Auto journal, selected device is nvme0n1 disk for disk sda.
15/02/2018 13:29:19 INFO Start prepare osd sda
15/02/2018 13:29:19 INFO Starting ceph-disk prepare --zap-disk --bluestore --block-dev /dev/sda --block.db /dev/nvme0n1 --cluster HBSC
15/02/2018 13:29:22 ERROR Error executing ceph-disk prepare --zap-disk --bluestore --block-dev /dev/sda --block.db /dev/nvme0n1 --cluster HBSC

We tried to delete the old file system information on the disk with dd, as explained here:

http://www.petasan.org/forums/?view=thread&id=152&part=2#postid-805

...without success.
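
(If it matters: zeroing only the start of the disk with dd does not touch the backup GPT stored at the very end of the disk. A full wipe would be something like the sketch below, although ceph-disk zap should already take care of this:)

# wipefs --all /dev/sda
# sgdisk --zap-all /dev/sda     <-- removes both the main GPT and the backup GPT at the end of the disk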

After the error the disk looks like this (fdisk):

Disk /dev/sda: 7.3 TiB, 8001563222016 bytes, 1953506646 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 99E0DF6D-09A8-4134-A7A6-0AFA6B6E29BA

Device Start End Sectors Size Type
/dev/sda1 256 25855 25600 100M Ceph disk in creation

After the dd the disk looks like this:

Disk /dev/sda: 7.3 TiB, 8001563222016 bytes, 1953506646 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

If we run the steps from PetaSAN.log manually, we get some messages which may not be OK:

# wipefs --all /dev/sda

# dd if=/dev/zero of=/dev/sda bs=1M count=500 oflag=direct,dsync
500+0 records in
500+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 6.38178 s, 82.2 MB/s

# parted -s /dev/sda mklabel gpt

# partprobe /dev/sda

# ceph-disk zap /dev/sda
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.

# ceph-disk prepare --zap-disk --bluestore --block-dev /dev/sda --block.db /dev/nvme0n1 --cluster HBSC
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
Setting name!
partNum is 0
REALLY setting name!
The operation has completed successfully.
Could not create partition 12 from 461375488 to 503318527
Setting name!
partNum is 11
REALLY setting name!
Unable to set partition 12's name to 'ceph block.db'!
Could not change partition 12's type code to 30cd0809-c2b2-499c-8879-2d6b785292be!
Error encountered; not saving changes.
'/sbin/sgdisk --new=12:0:+20480M --change-name=12:ceph block.db --partition-guid=12:5452c660-c1b4-4dbd-b967-0c927c434538 --typecode=12:30cd0809-c2b2-499c-8879-2d6b785292be --mbrtogpt -- /dev/nvme0n1' failed with status code 4

 

Any ideas?

Thank you

The most common cause of such a failure, which is not handled well by us or by ceph-disk, is capacity: each wal/db partition is 20 GB, so make sure your journal disk has enough space if it serves several OSDs.
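
You can actually see the capacity problem in the sgdisk error above: it tries to create partition 12 ending at sector 503318527, which at 512-byte sectors is roughly 257 GB; if that is beyond the size of your NVMe, the partition simply cannot fit. A quick sanity check would be something like this (just a sketch, using the device name from your log):

# blockdev --getsz /dev/nvme0n1                   <-- total number of 512-byte sectors on the journal disk
# echo $(( 503318527 * 512 / 1000000000 )) GB     <-- where partition 12 would have to end, ~257 GB

If the requested end sector is larger than the total sector count, the partition cannot be created.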

The ceph-disk output you sent from your manual commands is not OK; it shows a partition creation error.

Also, did it work on some OSDs and not others, or does it fail all the time? For testing, can you create an OSD without specifying an external NVMe? Would it work with a different external wal/db device? Note that if you add an OSD for testing, you can always remove it using the same method described in the upgrade doc for removing filestore OSDs.

If you do not find a fix, please send me the detailed ceph-disk logs (they have the verbose flag); they are stored in /opt/petasan/logs/ceph-disk.log.
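
For example, to grab the relevant part of the log from the node where the add failed (just the usual tail/grep, nothing PetaSAN specific):

# tail -n 200 /opt/petasan/logs/ceph-disk.log
# grep -i error /opt/petasan/logs/ceph-disk.log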

 

Hmm, it looks like it is a capacity problem.
We have 3 nodes for testing, each with 4 disks.

  1. No, we only have the problem with the last disk on the node. The 3 disks before went through the upgrade process without any problems.
  2. Yes, we can successfully add the last disk without using a journal.

But why is the journal SSD full? OK, we did some testing with e.g. more/other disks.
The SSD looks like this:

# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 223.6 GiB, 240057409536 bytes, 468862128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 7C6C6580-6DE5-47C5-8E8C-1988A2DEDA4C

Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 41945087 41943040 20G Ceph Journal
/dev/nvme0n1p2 41945088 83888127 41943040 20G Ceph Journal
/dev/nvme0n1p3 83888128 125831167 41943040 20G Ceph Journal
/dev/nvme0n1p4 125831168 167774207 41943040 20G Ceph Journal
/dev/nvme0n1p5 167774208 209717247 41943040 20G Ceph Journal
/dev/nvme0n1p6 209717248 251660287 41943040 20G Ceph Journal
/dev/nvme0n1p7 251660288 293603327 41943040 20G Ceph Journal
/dev/nvme0n1p8 293603328 335546367 41943040 20G Ceph Journal
/dev/nvme0n1p9 335546368 377489407 41943040 20G unknown
/dev/nvme0n1p10 377489408 419432447 41943040 20G unknown
/dev/nvme0n1p11 419432448 461375487 41943040 20G unknown

Can we clean up the journal partitions?

BTW, this is the journal SSD of the next node, before re-adding the disks:

# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 223.6 GiB, 240057409536 bytes, 468862128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 03359309-A2AF-46EF-9C04-849479FE394B

Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 41945087 41943040 20G Ceph Journal
/dev/nvme0n1p2 41945088 83888127 41943040 20G Ceph Journal
/dev/nvme0n1p3 83888128 125831167 41943040 20G Ceph Journal
/dev/nvme0n1p4 125831168 167774207 41943040 20G Ceph Journal
/dev/nvme0n1p5 167774208 209717247 41943040 20G Ceph Journal
/dev/nvme0n1p6 209717248 251660287 41943040 20G Ceph Journal
/dev/nvme0n1p7 251660288 293603327 41943040 20G Ceph Journal
/dev/nvme0n1p8 293603328 335546367 41943040 20G Ceph Journal

OK. We solved it by removing all OSDs and re-adding them all at once.

The journal SSD is now "clean":

# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 223.6 GiB, 240057409536 bytes, 468862128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: F466D308-7D82-46B2-BDA0-974D432421B4

Device Start End Sectors Size Type
/dev/nvme0n1p1 2048 41945087 41943040 20G unknown
/dev/nvme0n1p2 41945088 83888127 41943040 20G unknown
/dev/nvme0n1p3 83888128 125831167 41943040 20G unknown
/dev/nvme0n1p4 125831168 167774207 41943040 20G unknown

 

But why is the partition type now "unknown"? Before, it had the type "Ceph Journal", as you can see above.

The earlier capacity issue when upgrading one by one is that ceph-disk creates each new 20 GB partition after the existing ones, since there are older filestore OSDs still using the journal disk. In your case the 8 old filestore journal partitions plus the 3 new bluestore ones already take 11 x 20 GiB = 220 GiB of the 223.6 GiB SSD, so a 12th 20 GiB partition cannot fit. When you delete all OSDs in a node, there are no active OSDs using the journal, so we can safely clean it and start creating partitions from the beginning of the disk. Although not recommended by Ceph, removing all OSDs in a single node is not that bad, since no 2 replicas of the same data exist on a single node.

When you remove all OSDs in a node and start adding them with a journal (wal/db), it should start creating partitions from the beginning of the disk. Can you please try this?

Technically, filestore uses a journal, while bluestore uses a wal (which is a journal) + a db for metadata. In PetaSAN we call both "journal", but this is not Ceph terminology. It is OK to see an unknown partition type when the partition is not currently being used by an OSD. Once you have an OSD using the journal, you should see something like:

/dev/sdd :    <-- journal ( wal/db)
/dev/sdd1 ceph block.db, for /dev/sde1
/dev/sde :  <-- osd
/dev/sde1 ceph data, active, cluster demo, osd.6, block /dev/sde2, block.db /dev/sdd1
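
That listing comes from ceph-disk, so once you have re-added the OSDs you can check the wal/db mapping on the node with:

# ceph-disk list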

To summarize this case to help others:

Currently, when we add a new OSD and specify an external journal (wal/db), we prepare the journal as follows:

  • If the journal is not serving any local OSDs, the entire journal disk is cleaned and its partition table deleted.
  • A new partition (20 GB by default, but this can be changed in the conf file) is created on the journal disk for use by the new OSD.

In the majority of cases this is good enough. For OSD conversions from filestore -> bluestore, you need to ensure the journal has enough space, since the existing filestore journal partitions will not be reused for the new bluestore wal/db partitions; this can be a concern if the journal disk is around 200 GB or less (typically a journal serves 4 OSDs). In such cases it may be better to delete all filestore OSDs on a node at once, so the system can wipe the journal disks clean. The drawback of removing all OSDs in a node is more recovery traffic in the cluster, but it should not affect data safety, since no 2 replicas of the same data are stored on a single node. A rough capacity pre-check is shown below.
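
As a rough pre-check before converting a node, you can compare the journal disk size against the number of OSDs it will serve times the partition size, something like this (sketch only; the device name and OSD count are placeholders):

# OSDS=4                                                          <-- OSDs that will share this journal disk
# NEED=$(( OSDS * 20 ))                                           <-- GiB needed at 20 GiB per wal/db partition
# SIZE=$(( $(blockdev --getsize64 /dev/nvme0n1) / 1073741824 ))   <-- journal disk size in GiB
# echo "journal disk: ${SIZE} GiB, wal/db partitions need: ${NEED} GiB"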

We will look into a more clever way of re-using old journal partitions; this will involve changes to ceph-disk itself.