OSD Down, Process to locate physical slot

Hello, I have an OSD showing as down. The problem is that I can't tell which physical drive it is. In the physical disk list in the web console, the OSD shows no information other than that it is down. There is no information icon, just the X button to delete the OSD.

I checked the node logs from the web console, no mention of the OSD going down.

I connected to the node via SSH and I cannot figure out how to associate the failed OSD with anything useful like the SAS GUID.

What's a good way to find this information?

To add more I have the block ID from:
root@xxxxxxx:/var/lib/ceph/osd/ceph-142# ls -l
lrwxrwxrwx 1 ceph ceph 93 Nov 22 09:41 block -> /dev/ceph-7b07254c-b904-435a-b57f-d253e733a940/osd-block-23129bcb-f01c-4640-bd9b-5af17db38c3b

When I run "ceph-volume lvm list", I am unable to find the drive.
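
The symlink target above already encodes the LVM names. A minimal Python sketch, assuming the usual layout of /dev/<volume group>/<logical volume> with the logical volume named osd-block-<OSD fsid> (the function name is hypothetical), splits it into parts that can be matched against LVM's own listing:

```python
from pathlib import PurePosixPath

LV_PREFIX = "osd-block-"


def split_block_target(target: str) -> dict:
    """Split an OSD 'block' symlink target into its LVM names.

    Assumes the layout /dev/<vg>/<lv>, as in the listing above.
    """
    parts = PurePosixPath(target).parts  # ('/', 'dev', '<vg>', '<lv>')
    if len(parts) != 4 or parts[1] != "dev":
        raise ValueError(f"unexpected block target: {target}")
    vg_name, lv_name = parts[2], parts[3]
    # Assumption: the LV is named 'osd-block-<osd fsid>'.
    osd_fsid = lv_name[len(LV_PREFIX):] if lv_name.startswith(LV_PREFIX) else None
    return {"vg": vg_name, "lv": lv_name, "osd_fsid": osd_fsid}


target = ("/dev/ceph-7b07254c-b904-435a-b57f-d253e733a940/"
          "osd-block-23129bcb-f01c-4640-bd9b-5af17db38c3b")
print(split_block_target(target)["vg"])
```

On the node, "pvs -o pv_name,vg_name" should then show which /dev/sdX device backs that volume group, provided the disk is still visible to the kernel. If the disk has dropped off the bus entirely, the volume group may not be listed at all.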

The best option is to look at the disk vendor / model / serial number. If a disk dies, it will no longer report this info. If you do not have it recorded, you can look at the other disks that are still up and deduce the failed one from their serial numbers.
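
The deduction described here amounts to a set difference. A minimal Python sketch, assuming you keep your own record of serial numbers per slot (the slot labels and serials below are made up for illustration, and the function name is hypothetical):

```python
def missing_disks(recorded: dict, visible_serials: set) -> dict:
    """Return the recorded slot -> serial entries that no longer respond."""
    return {slot: serial for slot, serial in recorded.items()
            if serial not in visible_serials}


# Hypothetical inventory, written down while all disks were healthy.
recorded = {"slot-15": "SER-A", "slot-16": "SER-B", "slot-17": "SER-C"}
# Hypothetical serials still reported by the disks that are up.
visible = {"SER-A", "SER-B"}

print(missing_disks(recorded, visible))  # {'slot-17': 'SER-C'}
```

The serials of the disks that are still up can be collected with a tool such as "lsblk -o NAME,SERIAL" or "smartctl -i" on each device.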

Good thinking. I just checked with sas3ircu, using the physical disk list in the web console as a reference, to see whether any SAS GUID was missing from the known GUIDs. I did find one not listed. Ironically, my HBA appears to have taken the slot offline completely, so the slot data reads: Slot 15 (stuff), Slot 16 (stuff), Slot 18 (stuff)...

Slot 17 is just missing completely now. This is starting to make sense as to why PetaSAN knows nothing about the drive's physical information.
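
Spotting a vanished slot like this can be automated by looking for gaps in the slot numbers. A minimal Python sketch, assuming the controller's display output contains lines of the form "Slot # : N" (the sample text is illustrative, not captured output, and the function name is hypothetical):

```python
import re

# Assumed line format; adjust the pattern to match the real output.
SLOT_LINE = re.compile(r"Slot #\s*:\s*(\d+)")


def missing_slots(display_text: str) -> list:
    """Return slot numbers absent between the lowest and highest slot seen."""
    seen = {int(n) for n in SLOT_LINE.findall(display_text)}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


sample = """
  Slot #                                  : 15
  Slot #                                  : 16
  Slot #                                  : 18
"""
print(missing_slots(sample))  # [17]
```

This only detects gaps between the lowest and highest populated slots. A slot missing at either end of the range, or a bay that was empty to begin with, needs a comparison against the known bay count instead.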
I've had one other OSD failure before and have not had this issue so I thought I was crazy. Thank you for your time.

I plan on swapping the drive to see what happens, I suspect I will have to reboot the node to bring that slot back online. I'll update tomorrow for completeness.


For informational purposes, my build is:

Supermicro 36 Bay 4U Box - PIO-648R-E1CR36L+-ST031
X10DRi-T4+
2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
128 GB RAM
SAS3 with expander Backplane
SAS3008 HBA

 

Update - Final:

So I pulled the disk and immediately placed it back in the slot. The slot recovered and everything is fine. I talked to some service techs I know and they seem to be having trouble right now with SAS3008 and SAS3108 based cards. Same symptoms: slots vanish, or 20 slots go offline at once and then come back.

So all is good. Thanks again.

 

Good to know. I have a SAS3008 based card on the shelf just waiting to be flashed to IT mode.