Testing PetaSAN - Drive Failure Procedure
RobertH
27 Posts
July 31, 2020, 2:10 pm
We are doing a lab build of PetaSAN before putting it into production, trying different things in order to build internal documentation / how-tos.
The lab setup is:
6 node cluster
Each Node:
= 4x 10 GbE NICs (2x dual-port cards, all bonded, with interfaces running on VLANs)
= 1x PCIe NVMe (512 GB, journal)
= 1x OS drive (80 GB SATA)
= 8x OSD drives (300 GB 10K SAS) connected to an HBA
We came up with a scenario that I'm not sure how to deal with: the OSD drives in each node are set up to use the NVMe as a journal disk.
What we are trying to work out is the replacement procedure for an OSD should one fail. We tried it in our first pass, before adding the NVMe journal, and it was just a matter of stopping the service, swapping the drive, and re-adding the disk in PetaSAN; with the journal it seems to be more complicated.
Procedure tried so far (a console sketch of these steps follows the list):
- Drive is reported as failed by the controller (SMART, offline, etc.)
- Locate the OSD name in the PetaSAN GUI to get the underlying service (i.e. ceph-osd@##)
- Log into the console of the affected node over SSH
- Stop the Ceph OSD service (systemctl stop ceph-osd@##)
- Wait for the PetaSAN GUI to show the disk as stopped
- Wait for the PetaSAN GUI to show that the data has been rebalanced
- Click the X button to remove the disk
- Once the disk is removed and marked as unused in the GUI, remove the physical disk from the host
- Insert the new physical disk into the host
- Wait for the PetaSAN GUI to show the new drive in the host
- Click the Add button
- Select to journal the drive
- Failure: the journal does not have additional space
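For our internal how-to, this is roughly the console side of the steps above; the OSD id (12) is just a placeholder, and the safe-to-destroy check is an extra Ceph command we added as a sanity check, not something taken from the PetaSAN manual:

# Placeholder OSD id for illustration -- use the id shown in the PetaSAN GUI
OSD_ID=12

# Stop the OSD service on the node that holds the failed drive
systemctl stop ceph-osd@${OSD_ID}

# Watch recovery/rebalance progress; wait until health is back to OK
# (or at least until there are no degraded/undersized PGs left)
ceph -s

# Optional sanity check before pulling the drive: Ceph reports whether the
# OSD can be destroyed without risking data loss
ceph osd safe-to-destroy osd.${OSD_ID}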
From our testing, the only way we can find (short of running console commands that I'm not aware of) to get the replacement drive back onto the NVMe journal is to remove all of the drives in the host from PetaSAN, so that PetaSAN clears the journal drive's partitions, and then add them back one at a time.
So is there documentation (I looked through the admin manual) on the proper procedure to swap out a failing / failed drive that uses a journal on another disk?
Thanks
admin
2,930 Posts
July 31, 2020, 4:22 pm
If the OSD is down, the UI shows the delete button. If it is a working OSD, you can first stop it via systemctl so it can be deleted.
If an OSD is defective or physically removed, it will show up in the Disk List in a separate row by itself, since it no longer has a physical device associated with it; it will still have a delete button so it can be deleted from the cluster and the CRUSH tree.
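For reference, one quick way to confirm from the console which OSDs Ceph currently sees as down before deleting them in the UI (the grep filter is just illustrative):

# Show the CRUSH tree with each OSD's up/down status
ceph osd tree

# Only the OSDs currently marked down
ceph osd tree | grep -w down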
Things are more complex when a journal is involved. First, Ceph expects you to manage the journal partitions yourself. Earlier we used to require a new journal partition each time a new OSD was added; later we added code that tags each journal partition as available or used. When we delete a disk, we tag its journal partition as free so it can be reused when adding a new disk. This does not work in all cases: if the OSD device is not readable, we cannot read its uuid and identify which journal partition needs to be freed. We also cannot assume that any currently unconnected partitions are unused; the OSD could be temporarily down or removed, so it is risky for us to free its partition automatically. In version 2.6 we added scripts, used by our support, that let you tag partitions as free manually once you are sure:
/opt/petasan/scripts/util/make_journal_partition_free.py
/opt/petasan/scripts/util/journal_active_partitions.py
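A rough sketch of how these might be used; the partition argument (/dev/nvme0n1p5) and the need for an explicit interpreter are assumptions for illustration, so read the scripts (or ask support) before running them on a production node:

# List the journal partitions PetaSAN currently considers in use
# (prefix with python3 if the scripts are not directly executable)
/opt/petasan/scripts/util/journal_active_partitions.py

# Inspect the NVMe journal device and its partitions
lsblk /dev/nvme0n1

# Once you are certain a journal partition is orphaned (its OSD is gone for
# good), tag it as free so a new OSD can reuse it. The exact argument the
# script expects is an assumption here.
/opt/petasan/scripts/util/make_journal_partition_free.py /dev/nvme0n1p5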