Unable to Stop iSCSI Disk
am549
4 Posts
December 12, 2019, 11:02 am
Hi
We recently inherited a PetaSAN setup from one of our new customers and have been providing reasonable endeavors support on it. There was an air-conditioning failure over the weekend, and a rather panicky on-site admin hit the power button on 5 of the 6 PetaSAN servers.
They are now up and running however and the cluster health shows warning with the following message:
noscrub,nodeep-scrub flag(s) set
application not enabled on 1 pool(s)
It seems that maintenance is currently turned on.
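For reference, these look like standard Ceph warnings; this is roughly how they show up from the CLI on a management node (I'm assuming the cluster name has to be passed, since it isn't the default ceph here):
# full detail of the current health warnings, including which pool lacks an application tag
ceph health detail --cluster <clustername>
# the "flags" line lists noscrub,nodeep-scrub while maintenance mode is on
ceph osd dump --cluster <clustername> | grep flags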
Unfortunately the former admin of this setup (who has since left, which is how we've picked this install up) put CHAP usernames and passwords on a large chunk of the disks without documenting the passwords anywhere. I needed to reconnect a disk to a server, so I stopped the disk from the web portal; however, it appears to be stuck in the stopping state with the following errors:
12/12/2019 10:48:28 ERROR Cannot unmap image image-00007. error, no mapped images found.
12/12/2019 10:48:28 ERROR LIO error deleting Target None, maybe the iqn is not exists.
12/12/2019 10:48:28 ERROR Could not find ips for image-00007
12/12/2019 10:48:28 INFO Stopping disk 00007
It is a 6-node cluster (3 storage, 3 management). The errors above come from the node logs on one of the management servers (cs-01). If I go to that management server and run rbd showmapped --cluster <clustername>, nothing is shown. If I run the same command on one of the other management servers (cs-03), I can see that image-00007 is mapped there and has a watcher. I found a previous thread (https://www.petasan.org/forums/?view=thread&id=528) which seemed to describe a similar issue. I can also see the image under backstores/rbd using targetcli (which gives me a number of errors when I try to use the exit command, despite making no changes; I had to close the terminal session to get out).
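In case it helps, this is a minimal sketch of the checks I ran on each management node (the rbd pool name below is an assumption about this setup, and <clustername> is a placeholder):
# list the RBD images the kernel has mapped on this node
rbd showmapped --cluster <clustername>
# show the watchers on the image; a watcher usually means some node still has it mapped
rbd status rbd/image-00007 --cluster <clustername>
# inspect the LIO configuration for any leftover backstore or target for the image
targetcli ls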
As a complete newbie to this particular product, I'd appreciate any help on how I can stop this iSCSI disk without losing its data.
admin
2,930 Posts
December 12, 2019, 12:33 pm
The linked topic is probably not related, as it was using the image clone feature, which is not supported.
I understand 5 of 6 nodes were down. If so, I would expect the cluster to have all disks stopped when it rebooted, so either the node that currently has the mapped drive was not restarted, or the disk was added after the reboot. I presume you have no client I/O; if so, I would recommend rebooting all nodes, or at least the node with the mapped drive. This will make sure no iSCSI disks are started.
Before starting any iSCSI disks, I would recommend first making sure the Ceph layer is OK: ceph status is OK, all PGs are active+clean, and all OSDs are up. Also check that the Consul system is up:
consul members
If all of that is OK, then start the iSCSI disks.
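As a rough sketch, the checks could look like this from a root shell on any management node (pass --cluster <clustername> if your cluster is not named the default ceph):
# overall health, monitor quorum, PG states
ceph status --cluster <clustername>
# all OSDs should be up and in
ceph osd stat --cluster <clustername>
# all PGs should report active+clean
ceph pg stat --cluster <clustername>
# every PetaSAN node should be listed as alive
consul members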
If all is OK, as a root SSH user you can retrieve the passwords on the image via a script which reads its metadata. The script syntax has changed between versions, but it has a help param:
/opt/petasan/scripts/util/disk_meta.py read -h
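For example, something along these lines; the help output is authoritative, and the --image/--pool parameters shown here are only illustrative, since the exact arguments differ between PetaSAN versions:
# show the arguments supported by your version
/opt/petasan/scripts/util/disk_meta.py read -h
# illustrative only: read the metadata (including CHAP settings) for one image
/opt/petasan/scripts/util/disk_meta.py read --image image-00007 --pool rbd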
Good luck
Last edited on December 12, 2019, 12:38 pm by admin · #2
am549
4 Posts
December 12, 2019, 2:35 pm
Hi
Currently all of the other iSCSI disks are working without any issue and could be stopped and restarted. It's just this single disk that appears to be the problem at the moment, as it won't stop. We want to take the passwords off the disks, as they add another layer of complexity when reconnecting them to servers when issues arise, which was part of the reason for stopping this disk in the first instance.
We've got a reboot planned in for the node which has the disk currently stuck "stopping" which hopefully will resolve the problem.
We're not seeing any issues with OSDs or Ceph from the statuses in the web console, and consul members shows all nodes as alive.
Will update once the reboot has been completed.
admin
2,930 Posts
December 12, 2019, 3:50 pm
Remember, you can always buy professional support from us.
am549
4 Posts
December 12, 2019, 4:32 pm
Thanks.
Just before this reboot goes ahead: what happens to the images currently assigned to that host when it's rebooted? Will they be reassigned to another host in the cluster, or will they go offline?
admin
2,930 Posts
December 12, 2019, 4:49 pm
Images are stored in the Ceph rbd pool, which is backed by all nodes with OSDs.
iSCSI paths to these images are handled by Consul. In the normal case, when the cluster is up and only 1 or 2 nodes fail, the paths are reassigned to other hosts. If, however, the entire Consul cluster is down and restarted, it will restart without any iSCSI paths mapped to images, i.e. with all disks stopped.
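If you want to see the current state yourself, both the membership and the key/value store holding the assignments can be inspected from any management node (the exact key layout varies by version, so the export below simply dumps everything):
# every node and its Consul agent state
consul members
# dump the entire Consul key/value store as JSON; the iSCSI path assignments are kept somewhere under here
consul kv export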
am549
4 Posts
December 12, 2019, 5:17 pm
Thanks, I had thought as much.
The reboot did the trick: the disk went offline, and I was able to remove the CHAP auth and get the disk back up and connected to the server.
Many thanks for your assistance