LUN NAA ID Changes When Path Reassigned

entrigant
9 Posts
July 30, 2023, 5:00 am
Hello,
I'm running into a problem on a cluster running 2.7.2 where moving an iSCSI path can sometimes, but not always, result in the NAA ID changing. This causes ESXi to believe the different paths are separate devices. In the worst case, e.g. a cold start, both paths can change and ESXi considers the LUN a snapshot. I'm struggling to come up with an explanation for this, and I would appreciate any insight. I'm even afraid to upgrade the cluster out of fear these changes will result in downtime as I move paths around. Even if an upgrade fixes the changing NAA ID, I cannot be sure the NAA ID it settles on will be the same as the one currently in use.
As an example, we had a LUN with the ID "naa.6001405dd3bb2da00005000000000000" after initial provisioning. After moving one path, the moved path has the ID "naa.60014050000500000000000000000000". ESXi now sees a failed path on the initial ID and the new ID as a completely separate device.
Is this a known issue on the version we're using? Can I somehow force a static ID? What will happen if we try to upgrade this cluster to 3.0 or 3.1?
Thank you!
Last edited on August 4, 2023, 8:03 am by entrigant · #1

entrigant
9 Posts
July 30, 2023, 5:19 am
Here is the log output of the last set of path reassignments:
20/07/2023 21:40:45 INFO Found pool:rbd_hdd for disk:00003 via consul
20/07/2023 21:40:45 INFO Image image-00003 mapped successfully.
20/07/2023 21:40:48 INFO LIO add_target() disk wwn is 00003
20/07/2023 21:40:48 INFO Path 00003/2 acquired successfully
20/07/2023 21:40:48 INFO Updating path 10.158.208.10 status to 2
20/07/2023 21:40:48 INFO Path 10.158.208.10 status updated to 2
20/07/2023 21:40:55 INFO Found pool:rbd_ssd for disk:00017 via consul
20/07/2023 21:40:55 INFO Image image-00017 mapped successfully.
20/07/2023 21:40:58 INFO LIO add_target() disk wwn is dd3bb2da00017
20/07/2023 21:40:58 INFO Path 00017/2 acquired successfully
20/07/2023 21:40:58 INFO Updating path 10.158.208.100 status to 2
20/07/2023 21:40:58 INFO Path 10.158.208.100 status updated to 2
20/07/2023 21:41:06 INFO Found pool:rbd_hdd for disk:00018 via consul
20/07/2023 21:41:06 INFO Image image-00018 mapped successfully.
20/07/2023 21:41:09 INFO LIO add_target() disk wwn is 00018
20/07/2023 21:41:09 INFO Path 00018/2 acquired successfully
20/07/2023 21:41:09 INFO Updating path 10.158.208.101 status to 2
20/07/2023 21:41:09 INFO Path 10.158.208.101 status updated to 2
20/07/2023 21:41:15 INFO Found pool:rbd_hdd for disk:00019 via consul
20/07/2023 21:41:16 INFO Image image-00019 mapped successfully.
20/07/2023 21:41:19 INFO LIO add_target() disk wwn is 00019
20/07/2023 21:41:19 INFO Path 00019/2 acquired successfully
20/07/2023 21:41:19 INFO Updating path 10.158.208.102 status to 2
20/07/2023 21:41:19 INFO Path 10.158.208.102 status updated to 2
20/07/2023 21:41:26 INFO Found pool:rbd_ssd for disk:00004 via consul
20/07/2023 21:41:26 INFO Image image-00004 mapped successfully.
20/07/2023 21:41:29 INFO LIO add_target() disk wwn is 00004
20/07/2023 21:41:29 INFO Path 00004/2 acquired successfully
20/07/2023 21:41:29 INFO Updating path 10.158.208.11 status to 2
20/07/2023 21:41:29 INFO Path 10.158.208.11 status updated to 2
20/07/2023 21:41:35 INFO Found pool:rbd_hdd for disk:00005 via consul
20/07/2023 21:41:36 INFO Image image-00005 mapped successfully.
20/07/2023 21:41:39 INFO LIO add_target() disk wwn is 00005
20/07/2023 21:41:39 INFO Path 00005/2 acquired successfully
20/07/2023 21:41:39 INFO Updating path 10.158.208.12 status to 2
20/07/2023 21:41:39 INFO Path 10.158.208.12 status updated to 2
20/07/2023 21:41:45 INFO Found pool:rbd_hdd for disk:00008 via consul
20/07/2023 21:41:46 INFO Image image-00008 mapped successfully.
20/07/2023 21:41:49 INFO LIO add_target() disk wwn is dd3bb2da00008
20/07/2023 21:41:49 INFO Path 00008/2 acquired successfully
20/07/2023 21:41:49 INFO Updating path 10.158.208.14 status to 2
20/07/2023 21:41:49 INFO Path 10.158.208.14 status updated to 2
20/07/2023 21:41:55 INFO Found pool:rbd_hdd for disk:00010 via consul
20/07/2023 21:41:56 INFO Image image-00010 mapped successfully.
20/07/2023 21:41:59 INFO LIO add_target() disk wwn is 00010
20/07/2023 21:41:59 INFO Path 00010/2 acquired successfully
20/07/2023 21:41:59 INFO Updating path 10.158.208.15 status to 2
20/07/2023 21:41:59 INFO Path 10.158.208.15 status updated to 2
20/07/2023 21:42:06 INFO Found pool:rbd_hdd for disk:00011 via consul
20/07/2023 21:42:06 INFO Image image-00011 mapped successfully.
20/07/2023 21:42:09 INFO LIO add_target() disk wwn is 00011
20/07/2023 21:42:09 INFO Path 00011/2 acquired successfully
20/07/2023 21:42:09 INFO Updating path 10.158.208.16 status to 2
20/07/2023 21:42:09 INFO Path 10.158.208.16 status updated to 2
20/07/2023 21:42:16 INFO Found pool:rbd_hdd for disk:00012 via consul
20/07/2023 21:42:16 INFO Image image-00012 mapped successfully.
20/07/2023 21:42:19 INFO LIO add_target() disk wwn is dd3bb2da00012
20/07/2023 21:42:19 INFO Path 00012/2 acquired successfully
20/07/2023 21:42:19 INFO Updating path 10.158.208.17 status to 2
20/07/2023 21:42:19 INFO Path 10.158.208.17 status updated to 2
20/07/2023 21:42:25 INFO Found pool:rbd_hdd for disk:00013 via consul
20/07/2023 21:42:26 INFO Image image-00013 mapped successfully.
20/07/2023 21:42:29 INFO LIO add_target() disk wwn is dd3bb2da00013
20/07/2023 21:42:29 INFO Path 00013/2 acquired successfully
20/07/2023 21:42:29 INFO Updating path 10.158.208.18 status to 2
20/07/2023 21:42:29 INFO Path 10.158.208.18 status updated to 2
20/07/2023 21:42:35 INFO Found pool:rbd_hdd for disk:00014 via consul
20/07/2023 21:42:36 INFO Image image-00014 mapped successfully.
20/07/2023 21:42:39 INFO LIO add_target() disk wwn is 00014
20/07/2023 21:42:39 INFO Path 00014/2 acquired successfully
20/07/2023 21:42:39 INFO Updating path 10.158.208.19 status to 2
20/07/2023 21:42:39 INFO Path 10.158.208.19 status updated to 2
You can see here that the assigned WWN sometimes carries the "dd3bb2da" prefix and sometimes does not; I cannot find any pattern to when it does, and it's not consistent. It does appear to be "sticky", however: once a LUN loses that prefix, it never seems to gain it back no matter how often I move it or where I move it.
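For what it's worth, the logged WWN appears to map directly onto the NAA ID from my first post: both IDs look like "naa." plus a fixed "6001405" prefix plus the WWN, zero-padded out to 32 hex digits. Here is a minimal sketch of that decomposition; the split point after "6001405" is only inferred from the two IDs above, not taken from any documentation:
# Rough decomposition of the two observed NAA IDs, assuming the format is
# "naa." + "6001405" (fixed prefix) + WWN, zero-padded out to 32 hex digits.
def vendor_part(naa_id: str) -> str:
    assert naa_id.startswith("naa.6001405")
    return naa_id[len("naa.6001405"):]

print(vendor_part("naa.6001405dd3bb2da00005000000000000"))  # dd3bb2da00005... -> WWN "dd3bb2da00005" plus padding
print(vendor_part("naa.60014050000500000000000000000000"))  # 00005...         -> WWN "00005" plus padding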

entrigant
9 Posts
July 30, 2023, 6:26 am
I've continued digging, and so far I've discovered that the LIO API add_target() method will set the WWN stored in the rbd image metadata if one exists; otherwise, it falls back to the disk ID.
wwn = disk_meta.id
if hasattr(disk_meta, 'wwn') and disk_meta.wwn and len(disk_meta.wwn) > 0:
    wwn = disk_meta.wwn
Investigating the metadata of one of these volumes shows that it does not have a WWN set:
# rbd info rbd_hdd/image-00005 | grep rbd_data
block_name_prefix: rbd_data.6bd23ef936f614
# rados -p rbd_hdd getxattr rbd_header.6bd23ef936f614 petasan-metada | grep wwn
"wwn": ""
I decided to check one of the LUNs whose full WWN survived the path reassignment, and sure enough, it has a WWN set:
# rbd info rbd_hdd/image-00013 | grep rbd_data
block_name_prefix: rbd_data.a30954ade2bdf8
# rados -p rbd_hdd getxattr rbd_header.a30954ade2bdf8 petasan-metada | grep wwn
"wwn": "dd3bb2da00013"
It looks like I could possibly correct the current issue by manually altering this xattr, but I am worried about what side effects that might have. I also cannot explain why some images have one set and why some do not.
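If it helps anyone else, here is a rough sketch that wraps the same "rbd info" and "rados getxattr" commands above to list every image in a pool whose PetaSAN metadata has an empty "wwn". The pool name and the assumption that the xattr holds plain JSON are mine; treat it as read-only inspection only:
#!/usr/bin/env python3
# Sketch: list rbd images whose PetaSAN metadata xattr has an empty "wwn" field.
# Wraps the same commands shown above; the pool name is an assumption, adjust as needed.
import json
import subprocess

POOL = "rbd_hdd"  # assumed pool name

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

for image in run(["rbd", "ls", "-p", POOL]).split():
    info = json.loads(run(["rbd", "info", f"{POOL}/{image}", "--format", "json"]))
    header = "rbd_header." + info["block_name_prefix"].split("rbd_data.")[1]
    # The xattr key really is "petasan-metada", as in the rados commands above.
    meta = json.loads(run(["rados", "-p", POOL, "getxattr", header, "petasan-metada"]))
    if not meta.get("wwn"):
        print(f"{POOL}/{image} has an empty wwn")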

entrigant
9 Posts
July 30, 2023, 7:14 am
I found the log for when this disk was added. Things were looking OK then.
28/07/2021 20:21:49 INFO include_wwn_fsid_tag() is true
28/07/2021 20:21:49 INFO add disk wwn is dd3bb2da00005
28/07/2021 20:21:54 INFO Disk <name-redacted> created
28/07/2021 20:21:55 INFO Successfully created key 00005 for new disk.
28/07/2021 20:21:55 INFO Successfully created key /00005/1 for new disk.
28/07/2021 20:21:55 INFO Successfully created key /00005/2 for new disk.
28/07/2021 20:22:03 INFO Found pool:rbd_hdd for disk:00005 via consul
28/07/2021 20:22:10 INFO Image image-00005 mapped successfully.
28/07/2021 20:22:13 INFO LIO add_target() disk wwn is dd3bb2da00005
28/07/2021 20:22:13 INFO Path 00005/1 acquired successfully

entrigant
9 Posts
July 30, 2023, 8:26 am
I found it! The bug is in the "update_disk" method in "web/admin_controller/disk.py": it doesn't copy the WWN over, so editing a disk strips it. This gets called whenever you edit an iSCSI volume in the web UI. I confirmed via nginx access logs that every single disk that had a POST to "/disk/list/edit/<disk-id>/<pool>" has no WWN in the rbd xattr metadata, and every disk that has not been edited still has a WWN.
There's my cause.
I believe I understand the API well enough to re-add the WWN with a simple Python script. My thought is that I can do that, then move the path once again to bring the LUN back up with the correct NAA ID in ESXi.
However, if anyone out there is able, I'd really appreciate any extra guidance on whether this is a good idea.
Thanks!

admin
2,967 Posts
July 30, 2023, 8:44 am
Thanks a lot for the info. There could have been a bug with the WWN in the edit disk code a long time ago; I believe it was fixed in earlier versions.

admin
2,967 Posts
July 30, 2023, 8:57 am
We also have a script:
/opt/petasan/scripts/util/disk_meta.py
that can help you get and set metadata, so you can edit the WWN if needed. Its syntax is a bit odd, and I believe it changed in recent versions, but there should be a simple command help. I would test it on a test disk first before applying.

entrigant
9 Posts
August 4, 2023, 2:50 am
A final update in case this ends up in any search results.
I know this bug exists in 2.7.2. I know it _does not exist_ in 3.0.1 or 3.0.2. Those are the only versions I have active and can test, so it is fixed in any reasonably newer version. However, if, like me, you find yourself bitten, the proposed solution works. Use the built-in script to dump the metadata to a file:
$ /opt/petasan/scripts/util/disk_meta.py read --pool <pool> --image <image> > metadata.json
Edit the dumped file and correct the WWN field. The field is a combination of the first 4 bytes (8 hex digits) of the Ceph fsid (obtained by running "ceph fsid") and the zero-padded disk ID as shown in the first column of the web GUI iSCSI disk list. It will appear in the NAA ID after the digits "6001405".
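As a sanity check before editing the file, you can compute what the WWN should be for a given disk. A minimal sketch of the rule described above; the 5-digit padding simply matches the disk IDs shown in this thread:
#!/usr/bin/env python3
# Sketch: rebuild the expected WWN as described above:
# first 8 hex digits of "ceph fsid" + the zero-padded disk ID from the iSCSI disk list.
import subprocess

def expected_wwn(disk_id: str) -> str:
    fsid = subprocess.run(["ceph", "fsid"], capture_output=True,
                          text=True, check=True).stdout.strip()
    fsid_tag = fsid.replace("-", "")[:8]  # first 4 bytes = 8 hex digits
    return fsid_tag + disk_id.zfill(5)    # e.g. "dd3bb2da" + "00005"

print(expected_wwn("00005"))  # dd3bb2da00005 for the cluster in this thread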
Then write the altered metadata to the disk:
$ /opt/petasan/scripts/util/disk_meta.py write --pool <pool> --image <image> --file metadata.json
Confirm the operation by reading the metadata with the rbd and rados tools. First, grab the block name prefix:
$ rbd info <pool>/<image> | grep rbd_data
block_name_prefix: rbd_data.4410d5e28557fc
The value we're interested in is the number after "rbd_data". Use that to query the image metadata:
$ rados -p <pool> getxattr rbd_header.4410d5e28557fc petasan-metada
That's not a typo. The key really is "petasan-metada". Check the JSON it prints to make sure it has your changes and that everything else is correct.
Finally, migrate the affected path(s) to a different node to force the change to take effect. You can move it back right after if you want to. Rescan the HBA in VMware, and you should see the correct NAA ID now.
Last edited on August 4, 2023, 8:02 am by entrigant · #8