
Physical Disk Layout doesn't load

Hi,

I have just set up a brand new cluster. We didn't add any OSDs during deployment, as we want to configure the cache devices more specifically.

Once the third node had joined, looking at the Physical Disk Layout on any of the nodes just spins and never loads.

I checked /opt/petasan/scripts/detect-disks.sh and got the following output, and wonder if this is related:

root@gd-san-02-01:~# /opt/petasan/scripts/detect-disks.sh
device=sda,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c50058f35d57
/opt/petasan/scripts/detect-disks.sh: 59: [: Illegal number:
device=sdb,size=234441648,bus=ATA,fixed=Yes,ssd=Yes,vendor=ATA,model=SATA_SSD,serial=Not Detected
/opt/petasan/scripts/detect-disks.sh: 59: [: Illegal number:
device=sdc,size=234441648,bus=ATA,fixed=Yes,ssd=Yes,vendor=ATA,model=SATA_SSD,serial=Not Detected
device=sdd,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c5006383ca3b
device=sde,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c50063795a97
device=sdf,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c50058f299b7
device=sdg,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c50058f2940f
device=sdh,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c5008443d74f
device=sdi,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c5006376b117
device=sdj,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c5006383be3b
device=sdk,size=7814037168,bus=SCSI,fixed=Yes,ssd=No,vendor=SEAGATE,model=ST4000NM0023,serial=5000c50058f2c3ef

Is this error what is causing the disk layout page to fail as well?

It seems to be failing on both of the SSDs. Maybe because the vendor shows as "ATA"?

 

root@gd-san-02-01:~# udevadm info --query=all --name=/dev/sdb | grep -oE 'ID_BUS=.*' | cut -d '=' -f2
ata

root@gd-san-02-01:~# udevadm info --query=all --name=/dev/sdb | grep -oE 'ID_ATA_SATA=.*' | cut -d '=' -f2
root@gd-san-02-01:~# udevadm info --query=all --name=/dev/sda | grep -oE 'ID_ATA_SATA=.*' | cut -d '=' -f2
root@gd-san-02-01:~#
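For what it's worth, I haven't pulled the whole script apart, but I'm guessing line 59 does a numeric test on that ID_ATA_SATA value, something roughly like this (purely my guess at the logic, with $DEV standing in for whichever device the script is looping over):

sata=$(udevadm info --query=all --name=$DEV | grep -oE 'ID_ATA_SATA=.*' | cut -d '=' -f2)
# if the drive doesn't report ID_ATA_SATA, $sata is empty and dash's [ prints "Illegal number:"
if [ "$sata" -eq 1 ]; then
    echo "treat as SATA"   # placeholder for whatever the script actually does here
fi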

Any suggestions welcome

Thanks!

 

Extra info if it helps:

root@gd-san-02-03:~# udevadm info --query=all --name=/dev/sdb
P: /devices/pci0000:80/0000:80:03.0/0000:82:00.0/host0/port-0:0/expander-0:0/port-0:0:1/end_device-0:0:1/target0:0:1/0:0:1:0/block/sdb
N: sdb
L: 0
S: disk/by-id/scsi-1ATA_SATA_SSD_3A0A071A082E00321534
S: disk/by-id/ata-SATA_SSD_3A0A071A082E00321534
S: disk/by-path/pci-0000:82:00.0-sas-exp0x500304801e8162bf-phy1-lun-0
S: disk/by-id/scsi-SATA_SATA_SSD_3A0A071A082E00321534
E: DEVPATH=/devices/pci0000:80/0000:80:03.0/0000:82:00.0/host0/port-0:0/expander-0:0/port-0:0:1/end_device-0:0:1/target0:0:1/0:0:1:0/block/sdb
E: DEVNAME=/dev/sdb
E: DEVTYPE=disk
E: MAJOR=8
E: MINOR=16
E: SUBSYSTEM=block
E: USEC_INITIALIZED=3859244
E: SCSI_TPGS=0
E: SCSI_TYPE=disk
E: SCSI_VENDOR=ATA
E: SCSI_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
E: SCSI_MODEL=SATA_SSD
E: SCSI_MODEL_ENC=SATA\x20SSD\x20\x20\x20\x20\x20\x20\x20\x20
E: SCSI_REVISION=61.5
E: ID_SCSI=1
E: ID_SCSI_INQUIRY=1
E: ID_VENDOR=ATA
E: ID_VENDOR_ENC=ATA\x20\x20\x20\x20\x20
E: ID_MODEL=SATA_SSD
E: ID_MODEL_ENC=SATA\x20SSD\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_REVISION=61.5
E: ID_TYPE=disk
E: SCSI_IDENT_SERIAL=3A0A071A082E00321534
E: SCSI_IDENT_LUN_T10=ATA_SATA_SSD_3A0A071A082E00321534
E: SCSI_IDENT_LUN_ATA=SATA_SSD_3A0A071A082E00321534
E: SCSI_IDENT_PORT_NAA_REG=500304801e816281
E: SCSI_IDENT_PORT_RELATIVE=0
E: ID_BUS=ata
E: ID_ATA=1
E: ID_SERIAL=SATA_SSD_3A0A071A082E00321534
E: MPATH_SBIN_PATH=/sbin
E: DM_MULTIPATH_DEVICE_PATH=0
E: ID_PATH=pci-0000:82:00.0-sas-exp0x500304801e8162bf-phy1-lun-0
E: ID_PATH_TAG=pci-0000_82_00_0-sas-exp0x500304801e8162bf-phy1-lun-0
E: ID_PART_TABLE_UUID=c4a47825-e2b1-4494-9ccd-053125e4d549
E: ID_PART_TABLE_TYPE=gpt
E: DEVLINKS=/dev/disk/by-id/scsi-1ATA_SATA_SSD_3A0A071A082E00321534 /dev/disk/by-id/ata-SATA_SSD_3A0A071A082E00321534 /dev/disk/by-path/pci-0000:82:00.0-sas-exp0x500304801e8162bf-phy1-lun-0 /dev/disk/by-id/scsi-SATA_SATA_SSD_3A0A071A082E00321534
E: TAGS=:systemd:

Modifying line 59 to change the expression as follows fixes the SATA check:

udevadm info --query=all --name=/dev/sdb | grep -oE 'ID_ATA_SATA=.*|ID_ATA=.*' | cut -d '=' -f2
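(Side note: on a drive that reports both properties, that combined pattern could return two lines, which would again confuse a numeric test. A slightly more defensive fallback, just a sketch of what I mean rather than what the script actually does, would be:)

sata=$(udevadm info --query=all --name=/dev/sdb | grep -oE 'ID_ATA_SATA=.*' | cut -d '=' -f2)
if [ -z "$sata" ]; then
    # only fall back to ID_ATA when ID_ATA_SATA is missing
    sata=$(udevadm info --query=all --name=/dev/sdb | grep -oE 'ID_ATA=.*' | cut -d '=' -f2)
fi
# default to 0 so the numeric test never sees an empty value
[ -z "$sata" ] && sata=0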

Unfortunately, fixing that line doesn't make the Physical Disk UI load 🙁

Not sure if I'm barking up the wrong tree or not...

I stepped through to where the UI eventually gets its data, and this could be the cause, as it doesn't return when I run it from the CLI:

root@gd-san-02-03:/opt/petasan/services/web# python /opt/petasan/scripts/admin/node_manage_disks.py disk-list -pid 0
/opt/petasan/scripts/admin/node_manage_disks.py:222: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if journal_partition is not "no_journal":

<hangs for a long time now... still waiting after a few minutes>

I stopped it after 40 minutes; this is the traceback showing where it was hung:

root@gd-san-02-03:/opt/petasan/services/web# python /opt/petasan/scripts/admin/node_manage_disks.py disk-list -pid 0
/opt/petasan/scripts/admin/node_manage_disks.py:222: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if journal_partition is not "no_journal":
^CTraceback (most recent call last):
  File "/opt/petasan/scripts/admin/node_manage_disks.py", line 737, in <module>
    main(sys.argv[1:])
  File "/opt/petasan/scripts/admin/node_manage_disks.py", line 156, in main
    main_catch(args.func, args)
  File "/opt/petasan/scripts/admin/node_manage_disks.py", line 147, in main_catch
    func(args)
  File "/opt/petasan/scripts/admin/node_manage_disks.py", line 161, in node_disk_list_json
    print (json.dumps([o.get_dict() for o in ceph_disk_lib.get_full_disk_list(args.pid)]))
  File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_disk_lib.py", line 443, in get_full_disk_list
    osds = json.loads(OSDUsageCachedData().read())
  File "/usr/lib/python3/dist-packages/PetaSAN/core/common/cached_data.py", line 56, in read
    data = self.refresh_data()
  File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/osd_usage_cached_data.py", line 38, in refresh_data
    osds = [osd.__dict__ for osd in api.get_osds_storage()]
  File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/api.py", line 2356, in get_osds_storage
    ret, out, err = exec_command_ex(cmd)
  File "/usr/lib/python3/dist-packages/PetaSAN/core/common/cmd.py", line 111, in exec_command_ex
    stdout, stderr = p.communicate(timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 1028, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.8/subprocess.py", line 1868, in _communicate
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

So I managed to fix it... I'll post the solution here in case it helps anyone.

I discovered the delay was that "ceph osd df --format json-pretty" was being run and not returning, which led me to check the managers.
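If anyone wants to confirm the same thing, running the command directly and checking the mgr unit on each node shows it quickly enough (adjust the hostname to your node):

ceph osd df --format json-pretty        # hangs while no mgr is active, as above
systemctl status ceph-mgr@gd-san-02-01
journalctl -xeu ceph-mgr@gd-san-02-01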

So it seems the mgr wasn't starting:

Dec 20 21:59:03 gd-san-02-01 systemd[1]: Started Ceph cluster manager daemon.
-- Subject: A start job for unit ceph-mgr@gd-san-02-01.service has finished successfully
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A start job for unit ceph-mgr@gd-san-02-01.service has finished successfully.
--
-- The job identifier is 13165.
Dec 20 21:59:03 gd-san-02-01 ceph-mgr[116124]: 2021-12-20T21:59:03.706+0000 7f85906ad040 -1 auth: unable to find a keyring on /var/lib/ceph/mgr/ceph-gd-san-02-01/keyring: (2) No such file or directory
Dec 20 21:59:03 gd-san-02-01 ceph-mgr[116124]: 2021-12-20T21:59:03.706+0000 7f85906ad040 -1 AuthRegistry(0x5578c806c140) no keyring found at /var/lib/ceph/mgr/ceph-gd-san-02-01/keyring, disabling cephx
Dec 20 21:59:03 gd-san-02-01 ceph-mgr[116124]: 2021-12-20T21:59:03.706+0000 7f85906ad040 -1 auth: unable to find a keyring on /var/lib/ceph/mgr/ceph-gd-san-02-01/keyring: (2) No such file or directory
Dec 20 21:59:03 gd-san-02-01 ceph-mgr[116124]: 2021-12-20T21:59:03.706+0000 7f85906ad040 -1 AuthRegistry(0x7fffda81f0e0) no keyring found at /var/lib/ceph/mgr/ceph-gd-san-02-01/keyring, disabling cephx
Dec 20 21:59:03 gd-san-02-01 ceph-mgr[116124]: failed to fetch mon config (--no-mon-config to skip)
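The missing keyring means the mgr has no cephx credentials, so it can't authenticate to the monitors and fails to fetch its config. For reference, the generic Ceph way of recreating it would be roughly the following (per the standard ceph-mgr setup, hostname matching the node; PetaSAN's create_mgr.py below handles this for you):

mkdir -p /var/lib/ceph/mgr/ceph-gd-san-02-01
ceph auth get-or-create mgr.gd-san-02-01 mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
    -o /var/lib/ceph/mgr/ceph-gd-san-02-01/keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-gd-san-02-01
systemctl restart ceph-mgr@gd-san-02-01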

 

So I had to edit /opt/petasan/config/flags/flags.json to set mgr_installed to false, and then run:

/opt/petasan/scripts/create_mgr.py

I then repeated this on the other nodes, and all managers are now working, as is "ceph osd df --format json-pretty", and in turn the disk display in the UI!
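In short, on each node it was roughly this (the exact flags.json formatting is from memory, so double check it before editing):

# set "mgr_installed": false in the flags file, using whatever editor you like
nano /opt/petasan/config/flags/flags.json
# recreate the mgr for this node
/opt/petasan/scripts/create_mgr.py
# verify the mgr is back and the command no longer hangs
ceph -s
ceph osd df --format json-pretty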

I'm not sure how this happened; there were time differences between the nodes when we first deployed, could that have caused it?

Thank you so much for this valuable info. For the detect disks script, it could be that some drives do not return ID_ATA_SATA; we will update it with your fix.

Glad you were able to find and fix the mgr issue. 🙂

 

I hope it's OK if I necro this. I had a similar problem and Google led me straight here, so I wanted to add what caused my issue in case others find their way here.

In my case I had a virtual CD image mounted via the Supermicro BMC, backed by a file share that was no longer accessible. The "blkid" command being run by the Python script was hanging on /dev/sr0. Unmounting all of the virtual media resolved the problem.
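If anyone suspects the same thing, a quick way to check (nothing PetaSAN-specific, just how I'd verify it) is to look for the virtual CD device and see whether blkid stalls on it:

lsblk -o NAME,TYPE,SIZE,MODEL     # look for the sr0 / virtual CD device
timeout 10 blkid /dev/sr0         # times out instead of returning when the backing share is unreachable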