OSD Down, Process to locate physical slot
rvalkenburg
8 Posts
December 16, 2021, 4:42 pm
Hello, I have an OSD showing as down. The problem is that I can't tell which physical drive it is. In the physical drive list in the web console, the OSD shows no information, just "down". There is no information icon, only the X button to delete the OSD.
I checked the node logs from the web console; there is no mention of the OSD going down.
I connected to the node via SSH, but I cannot figure out how to associate the failed OSD with anything useful, such as the SAS GUID.
What's a good way to find this information?
rvalkenburg
8 Posts
December 16, 2021, 4:58 pm
To add more, I have the block ID from:
root@xxxxxxx:/var/lib/ceph/osd/ceph-142# ls -l
lrwxrwxrwx 1 ceph ceph 93 Nov 22 09:41 block -> /dev/ceph-7b07254c-b904-435a-b57f-d253e733a940/osd-block-23129bcb-f01c-4640-bd9b-5af17db38c3b
When I run ceph-volume lvm list, I am unable to find the drive.
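For readers following along, the symlink target above encodes an LVM volume group and logical volume name. Here is a minimal Python sketch of how those names could be pulled apart and matched against physical volumes. The helper names are hypothetical, and it assumes the text passed to pv_for_vg comes from running `pvs --noheadings -o pv_name,vg_name` on the node. If the disk has dropped off the bus entirely, as turned out to be the case later in this thread, the volume group will simply not appear and the result will be empty.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def split_block_target(target: str) -> tuple[str, str]:
    """Split an OSD block symlink target into (volume group, logical volume)."""
    path = PurePosixPath(target)
    return path.parent.name, path.name


def pv_for_vg(pvs_output: str, vg_name: str) -> list[str]:
    """Return the PV device paths that belong to vg_name.

    Assumes two whitespace-separated columns per line (pv_name, vg_name),
    as printed by `pvs --noheadings -o pv_name,vg_name`.
    """
    devices = []
    for line in pvs_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == vg_name:
            devices.append(fields[0])
    return devices
```

An empty list from pv_for_vg for the OSD's volume group is itself a useful hint: the kernel no longer sees the underlying device at all.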
admin
2,930 Posts
December 16, 2021, 6:46 pm
The best option is to look at the disk vendor / model / serial number. A dead disk will not report this information, so if you do not have it recorded, you can read the serial numbers of the disks that are still up and deduce the failed one by elimination.
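The elimination approach described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the recorded serial list is assumed to come from your own inventory, and the visible serials are assumed to be text captured from `lsblk -d -n -o NAME,SERIAL` on the node.

```python
from __future__ import annotations


def parse_lsblk_serials(lsblk_output: str) -> dict[str, str]:
    """Map serial number -> device name.

    Assumes two whitespace-separated columns per line (NAME, SERIAL),
    as printed by `lsblk -d -n -o NAME,SERIAL`. Lines without a serial
    are skipped.
    """
    serials = {}
    for line in lsblk_output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            name, serial = fields
            serials[serial] = name
    return serials


def missing_serials(recorded, visible) -> list[str]:
    """Return recorded serial numbers that are not currently visible."""
    return sorted(set(recorded) - set(visible))
```

Any serial returned by missing_serials belongs to a disk that was in the inventory but no longer answers, which is the candidate to look for on the chassis labels.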
rvalkenburg
8 Posts
December 16, 2021, 7:56 pm
Good thinking. I just used sas3ircu to look for a SAS GUID that was missing from the known GUIDs, using the physical disk list in the web console as a reference, and I did find one that was not listed. Ironically, my HBA appears to have taken the slot offline completely, so the slot listing reads: Slot 15 (stuff), Slot 16 (stuff), Slot 18 (stuff)...
Slot 17 is now missing completely. This is starting to make sense as to why PetaSAN knows nothing about the drive's physical information.
I've had one other OSD failure before and did not have this issue, so I thought I was crazy. Thank you for your time.
I plan on swapping the drive to see what happens; I suspect I will have to reboot the node to bring that slot back online. I'll update tomorrow for completeness.
For informational purposes, my build is:
Supermicro 36 Bay 4U Box - PIO-648R-E1CR36L+-ST031
X10DRi-T4+
2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
128 GB RAM
SAS3 with expander Backplane
SAS3008 HBA
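Spotting a gap like the missing Slot 17 by eye is easy to get wrong on a 36-bay chassis, so here is a rough Python sketch of the same check. It assumes the text comes from the sas3ircu DISPLAY command and that each device is reported on a line of the form "Slot # : N"; the function name is hypothetical, and the sketch ignores enclosure numbers, so on a system with more than one enclosure the output would need to be split per enclosure first.

```python
from __future__ import annotations

import re

# Assumed line format in the controller listing: "Slot #   : 17"
SLOT_LINE = re.compile(r"Slot\s*#\s*:\s*(\d+)")


def missing_slots(display_output: str) -> list[int]:
    """Return slot numbers absent between the lowest and highest slot seen."""
    seen = {int(m.group(1)) for m in SLOT_LINE.finditer(display_output)}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

Note that this only finds gaps inside the observed range; a vanished first or last slot would need a comparison against the known bay count instead.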
rvalkenburg
8 Posts
December 21, 2021, 1:36 pm
Update - Final:
I pulled the disk and immediately placed it back in the slot. The slot recovered and everything is fine. I talked to some service techs I know, and they are currently seeing trouble with SAS3008- and SAS3108-based cards: the same symptoms, where slots vanish, or 20 slots go offline at once and then come back.
So all is good. Thanks again.
Shiori
86 Posts
January 12, 2022, 7:08 pm
Good to know. I have a SAS3008-based card on the shelf just waiting to be flashed to IT mode.