Can't stop iSCSI disks
protocol6v
April 6, 2018, 7:39 pm
Just an update for when you get back, and I'll keep updating this post with developments.
I was able to replicate this issue several times, each time on a fresh install. I rotated which nodes I used as #1, 2, 3, and 4, so I know this is not hardware related.
Here's my workflow:
After a fresh install and setup of the initial 3 nodes, add the 4th node. Configure NTP and SMTP settings. Configure iSCSI IP range 1 (172.31.1.200-254) and range 2 (172.31.2.200-254).
Create a 20GB iSCSI disk with 2 auto paths and an access control list of comma-separated IQNs. Wait for it to start, then verify the two paths are assigned to two nodes.
Create an additional 50TB disk with 8 auto paths and the same list of IQNs. Wait for it to start, then check the paths: only two of the 8 are assigned. Then try to stop the 50TB disk, and node1 shuts down. I have not yet tried doing these actions exclusively from a different node's web UI; I will try that next.
After bringing node1 back online, the PGs never resync; they go to an "unknown" state. This particular time the cluster thinks 3/4 of the PGs are unknown:
root@bd-ceph-sd2:~# ceph --cluster BD-Ceph-Cl1
ceph> status
cluster:
id: 93b0c771-30d5-4572-b612-7a95a31c4ec2
health: HEALTH_WARN
Reduced data availability: 795 pgs inactive
Degraded data redundancy: 795 pgs unclean
services:
mon: 3 daemons, quorum bd-ceph-sd1,bd-ceph-sd2,bd-ceph-sd3
mgr: bd-ceph-sd1(active)
osd: 16 osds: 16 up, 16 in
data:
pools: 1 pools, 1024 pgs
objects: 0 objects, 0 bytes
usage: 86065 MB used, 11173 GB / 11257 GB avail
pgs: 77.637% pgs unknown
795 unknown
229 active+clean
In the PetaSAN log there are some errors about paths not existing, which I believe is why this disk is taking so long to stop:
06/04/2018 14:49:23 ERROR LIO error could not create target for disk 00002.
06/04/2018 14:49:23 ERROR Could not create Target in configFS.
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/lio/api.py", line 63, in add_target
target = Target(fabric, disk_meta.iqn)
File "/usr/lib/python2.7/dist-packages/rtslib/target.py", line 1214, in __init__
self._create_in_cfs_ine(mode)
File "/usr/lib/python2.7/dist-packages/rtslib/node.py", line 77, in _create_in_cfs_ine
% self.__class__.__name__)
RTSLibError: Could not create Target in configFS.
06/04/2018 14:49:23 ERROR Error could not acquire path 00002/5
06/04/2018 14:49:23 INFO Stopping disk 00002
06/04/2018 14:49:23 ERROR Could not find ips for image-00002
06/04/2018 14:49:23 INFO LIO deleted backstore image image-00002
06/04/2018 14:49:23 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00002, maybe the iqn is not exists.
06/04/2018 14:49:23 INFO Image image-00002 unmapped successfully.
06/04/2018 14:49:25 INFO PetaSAN Cleaned rbd backstores.
06/04/2018 14:49:25 INFO Stopping disk 00002
06/04/2018 14:49:25 ERROR Could not find ips for image-00002
06/04/2018 14:49:25 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00002, maybe the iqn is not exists.
06/04/2018 14:49:25 ERROR Cannot unmap image image-00002. error.
The PetaSAN log actually has some interesting info, so I'm going to send all three management node logs along to your email.
Will update with anything else I find.
protocol6v
April 6, 2018, 7:55 pm
Saw your post after this and have also emailed you, so to avoid splitting this conversation, I'm going to let you have your weekend and stop here!
admin
April 6, 2018, 8:37 pm
Ceph should really not be failing at all; even if the iSCSI layer or Consul go bad, Ceph is independent of and unaware of them. We need more detail on why 3/4 of the PGs are in an unknown state.
Use either command to identify a PG that is in an unknown state:
ceph health detail --cluster X
ceph pg dump --cluster X
Then get the PG info and please send it via email:
ceph pg PG_NUM query --cluster X
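As a rough sketch (assuming your cluster is named X), you could list the unknown PGs and query one of them like this:
# print the ids of PGs whose state contains "unknown" (the pg id is the first column)
ceph pg dump --cluster X 2>/dev/null | awk '/unknown/ {print $1}'
# query one of them (1.2f below is just a placeholder id) and save the output to send
ceph pg 1.2f query --cluster X > pg-1.2f-query.txt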
After you added all OSDs and before you added any iSCSI disks, was the Ceph status all active+clean? If you shut a node down and restart it, does Ceph come back to active+clean quickly? If you look at the PG Status chart on the dashboard, do the PGs go from clean to unknown suddenly when the node gets shut down?
When you add iSCSI disks, can you see if Ceph is responsive via
rbd ls --cluster X
Now to Consul: can you check whether Consul is up by running
consul members
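For a quick sanity check, something like the following would do (assuming cluster name X; the timeout is only there so the command does not hang if Ceph is unresponsive):
# does Ceph answer within 10 seconds?
timeout 10 rbd ls --cluster X && echo "ceph responsive" || echo "ceph NOT responsive"
# are all nodes still listed as alive members of the Consul cluster?
consul members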
Now on iSCSI: from the steps you describe, the problem happens when you add the second, 50TB disk, correct? It is not related to stopping disks, and there is no client traffic, correct?
Does the problem happen if you do not add ACLs?
If you create just the 50TB image, does it happen?
Last thing: can you please double-check the IP subnet ranges to make sure there is no IP overlap.
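One quick way to check for overlap, just as an illustrative sketch using the two ranges from your post:
# expand both iSCSI ranges into individual addresses
for i in $(seq 200 254); do echo 172.31.1.$i; done > range1.txt
for i in $(seq 200 254); do echo 172.31.2.$i; done > range2.txt
# any address printed here exists in both ranges, i.e. an overlap
sort range1.txt range2.txt | uniq -d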
protocol6v
April 10, 2018, 3:02 pm
I did a ton of testing of this over the weekend and had extremely detailed notes about each time I erased, reinstalled, and tested. I did that at least 8 times. Then I forgot to save the notes, and when I got back... blank. Wonderful.
I'm going to re-test today and follow up. I do know these things for sure:
When two iSCSI disks exist, it doesn't seem to matter in what order you start them: if one has more paths than the other, the cluster freaks out and fences hosts. This is reproducible every time with the settings I'm using. I will try again without IQN ACLs. Usually just one host shuts down, but I had two shut down (a management node and a storage-only node) the last time I tried this. That was where I gave up for the weekend.
I will collect the PG info on the next batch of tests and let you know. I will also try different disk sizes and see if the result is the same.
I did verify the networking; it's all good. The nodes can ping each other on their respective subnets, and there are no duplicate IPs in use.
admin
April 10, 2018, 4:16 pm
Hi,
We do a lot of tests starting hundreds of disks via scripts, so it should work. One thing is not clear: do you still see the Ceph error with only 1/4 of the PGs active? If so, then that is the root problem, and we need to see what caused it and how to recover from it. If it is still happening, it can have side effects such as iSCSI disks not starting, or nodes fencing and stopping one another.
If you see the iSCSI disk issue on a healthy Ceph cluster, then please let me know the steps to reproduce it and we will look into it.
protocol6v
April 10, 2018, 4:45 pm
I haven't gotten far enough to reproduce the unknown PGs issue yet, but the iSCSI issue is definitely easy to reproduce on a fresh, healthy cluster.
OK, starting with a clean install: all 4 nodes online, 4 OSDs and 1 journal per node. Health is clean and active. iSCSI subnets and IPs are set, NTP and SMTP configured.
Added the first 20GB disk, with no IQN ACLs this time and 2 paths auto-configured. The disk came up no problem, but stopping takes a very long time. Here's what's in PetaSAN.log (on node1) while waiting for it to stop:
10/04/2018 12:13:22 INFO Disk BD-Ceph-Cl1_BD-E7k-HV-Cl1-Quorum created
10/04/2018 12:13:22 INFO Successfully created key 00001 for new disk.
10/04/2018 12:13:22 INFO Successfully created key /00001/1 for new disk.
10/04/2018 12:13:22 INFO Successfully created key /00001/2 for new disk.
10/04/2018 12:13:42 INFO Could not lock path 00001/2 with session befd9ded-050a-9d39-1df4-b160e77d3bfa.
10/04/2018 12:14:03 INFO Successfully created key 00001 for new disk.
10/04/2018 12:14:05 INFO Stopping disk 00001
10/04/2018 12:14:05 ERROR Could not find ips for image-00001
10/04/2018 12:14:05 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:14:05 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:14:19 INFO Stopping disk 00001
10/04/2018 12:14:40 ERROR Could not find ips for image-00001
10/04/2018 12:14:40 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:14:40 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:14:54 INFO Stopping disk 00001
10/04/2018 12:14:54 ERROR Could not find ips for image-00001
10/04/2018 12:14:54 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:14:54 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:14:58 INFO Stopping disk 00001
10/04/2018 12:15:11 ERROR Could not find ips for image-00001
10/04/2018 12:15:11 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:15:11 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:15:15 INFO Stopping disk 00001
10/04/2018 12:15:25 ERROR Could not find ips for image-00001
10/04/2018 12:15:25 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:15:25 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:15:42 INFO Stopping disk 00001
10/04/2018 12:15:58 ERROR Could not find ips for image-00001
10/04/2018 12:15:58 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:15:58 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:16:12 INFO Stopping disk 00001
10/04/2018 12:16:12 ERROR Could not find ips for image-00001
10/04/2018 12:16:12 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:16:12 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:16:27 INFO Stopping disk 00001
At this point I'm still waiting for the disk to stop; it's been about 40 minutes so far.
Will update when I have hit the unknown PG issue.
admin
April 10, 2018, 10:28 pm
From the logs, the error looks like an incorrect IQN base prefix in the iSCSI settings:
iqn.2018-04.net.example.internal:bd-ceph-cl1
It should be, for example:
iqn.2016-05.com.petasan
iqn.2018-04.net.example.internal
The IQN format is iqn.yyyy-mm.naming-authority:unique-name. On the client side you enter the unique-name (or use the default set by the OS, which appends the hostname); this defines the unique client endpoint. On the server/PetaSAN side, you define the IQN base prefix for the cluster, which is the part to the left of the ':'; PetaSAN then appends the disk id (for example 00001) to the cluster base IQN to create a unique endpoint for the target disk.
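Just to illustrate (this is not the actual PetaSAN validation code, only a rough assumed pattern), a quick check that a base prefix has the iqn.yyyy-mm.naming-authority form and contains no ':' could look like:
# hypothetical helper: accept iqn.yyyy-mm.naming-authority, reject anything with ':'
check_iqn_base() {
    echo "$1" | grep -Eq '^iqn\.[0-9]{4}-(0[1-9]|1[0-2])\.[a-z0-9.-]+$' \
        && echo "OK:  $1" || echo "BAD: $1"
}
check_iqn_base "iqn.2018-04.net.example.internal"              # OK
check_iqn_base "iqn.2018-04.net.example.internal:bd-ceph-cl1"  # BAD, contains ':'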
We should have caught this in the UI validation; I am sure we use IQN pattern matching, but it seems it accepted this value.
If you remove the :bd-ceph-cl1, things will work fine. As for cleaning up the current settings, the system is not able to clean them up itself since the IQN is not valid, so you will need to do it manually. As an example:
# stop disk 00001 resource from Consul
consul kv delete -recurse PetaSAN/Disks/00001
# remove iqn from LIO target
targetcli /iscsi delete iqn.2018-04.net.example.internal:bd-ceph-cl1:00001
# remove local path ip
ip addr delete 10.0.2.100/24 dev eth0
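Afterwards you can confirm the cleanup took, using the same example names as above:
# should report that no key exists under the deleted disk
consul kv get -recurse PetaSAN/Disks/00001
# the iqn should no longer appear under /iscsi
targetcli /iscsi ls
# the path ip should be gone from the interface
ip addr show dev eth0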
protocol6v
April 10, 2018, 10:45 pm
You beat me to it! I caught that earlier and have been trying to test with the revised IQN, but I got pulled from my desk. Will report back.
protocol6v
April 11, 2018, 11:46 am
The base IQN change solved the iSCSI path and start/stop issues. I will begin simulating failures today to see if I can break anything else.
Thank you for your help!