
Can't stop iSCSI disks


Just an update for when you get back, and I'll keep updating this post with developments.

Was able to replicate this issue several times, each on a fresh install. I rotated which nodes I used for #1, 2, 3, and 4, so I know this is not hardware related.

Here's my workflow:

After a fresh install and setup of the initial 3 nodes, add the 4th node. Configure NTP and SMTP settings. Configure iSCSI IP range 1 (172.31.1.200-254) and range 2 (172.31.2.200-254).

Create a 20GB iSCSI disk with 2 auto paths and an access control list of comma-separated IQNs. Waited for it to show as started, and verified the two paths are assigned to two nodes.

Create an additional 50TB disk with 8 auto paths and the same list of IQNs. Waited for it to start, then checked paths and only two out of 8 are assigned. Then tried to stop the 50TB disk, and Node1 shuts down. I have not yet tried doing these actions exclusively from a different node's web UI; will try that next.

After bringing node1 back online, the PGs never resync. They go to an "unknown" state. This particular time the cluster thinks 3/4 of the PGs are unknown:

root@bd-ceph-sd2:~# ceph --cluster BD-Ceph-Cl1
ceph> status
cluster:
id: 93b0c771-30d5-4572-b612-7a95a31c4ec2
health: HEALTH_WARN
Reduced data availability: 795 pgs inactive
Degraded data redundancy: 795 pgs unclean

services:
mon: 3 daemons, quorum bd-ceph-sd1,bd-ceph-sd2,bd-ceph-sd3
mgr: bd-ceph-sd1(active)
osd: 16 osds: 16 up, 16 in

data:
pools: 1 pools, 1024 pgs
objects: 0 objects, 0 bytes
usage: 86065 MB used, 11173 GB / 11257 GB avail
pgs: 77.637% pgs unknown
795 unknown
229 active+clean

In the PetaSAN log there are some errors about paths not existing, which I believe is why this disk is taking so long to stop...

06/04/2018 14:49:23 ERROR LIO error could not create target for disk 00002.
06/04/2018 14:49:23 ERROR Could not create Target in configFS.
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/lio/api.py", line 63, in add_target
target = Target(fabric, disk_meta.iqn)
File "/usr/lib/python2.7/dist-packages/rtslib/target.py", line 1214, in __init__
self._create_in_cfs_ine(mode)
File "/usr/lib/python2.7/dist-packages/rtslib/node.py", line 77, in _create_in_cfs_ine
% self.__class__.__name__)
RTSLibError: Could not create Target in configFS.
06/04/2018 14:49:23 ERROR Error could not acquire path 00002/5
06/04/2018 14:49:23 INFO Stopping disk 00002
06/04/2018 14:49:23 ERROR Could not find ips for image-00002
06/04/2018 14:49:23 INFO LIO deleted backstore image image-00002
06/04/2018 14:49:23 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00002, maybe the iqn is not exists.
06/04/2018 14:49:23 INFO Image image-00002 unmapped successfully.
06/04/2018 14:49:25 INFO PetaSAN Cleaned rbd backstores.
06/04/2018 14:49:25 INFO Stopping disk 00002
06/04/2018 14:49:25 ERROR Could not find ips for image-00002
06/04/2018 14:49:25 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00002, maybe the iqn is not exists.
06/04/2018 14:49:25 ERROR Cannot unmap image image-00002. error.

The PetaSAN log actually has some interesting info, so I'm going to send all three management node logs along to your email.

Will update with anything else I find.

 

Saw your post after this, and have also emailed you, so to avoid splitting this conversation, I'm going to let you have your weekend and stop here!

Ceph should really not be failing at all; even if the iSCSI layer or Consul go bad, it is independent of and unaware of them. We need to get more detail on why 3/4 of the PGs are in an unknown state.

Use either of these commands to identify a PG that is in an unknown state:

ceph health detail --cluster X
ceph pg dump --cluster X

Now get the PG info and please send it via email:

ceph pg PG_NUM query --cluster X
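
For example, with your cluster name, and a PG id picked from the health detail output (1.1a3 below is just a placeholder):

ceph health detail --cluster BD-Ceph-Cl1 | grep unknown
# replace 1.1a3 with a PG id actually reported as unknown
ceph pg 1.1a3 query --cluster BD-Ceph-Cl1 > pg-query.txt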

After you added all OSDs and before you added any iSCSI disks, was Ceph status all active+clean? And if you shut a node down and restart it, does Ceph come back to active+clean quickly? If you look at the PG Status chart on the dashboard, do the PGs go from clean to unknown suddenly when the node gets shut down?
When you add iSCSI disks, can you see if Ceph is responsive via:
rbd ls --cluster X
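
For example, with your cluster name, and if you also want to watch the PG states change while a node is shut down and restarted:

rbd ls --cluster BD-Ceph-Cl1
# refresh the cluster status every few seconds during the node shutdown/restart
watch -n 5 ceph status --cluster BD-Ceph-Cl1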

Now for Consul, can you check if Consul is up by running:

consul members
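
Run it on one of the management nodes; all three should show a status of alive. The output below is only an illustration (addresses omitted):

consul members
# Node          Address       Status  Type    Build  Protocol  DC
# bd-ceph-sd1   <ip>:8301     alive   server  ...    ...       ...
# bd-ceph-sd2   <ip>:8301     alive   server  ...    ...       ...
# bd-ceph-sd3   <ip>:8301     alive   server  ...    ...       ...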

Now on iSCSI: from the steps you describe, the problem happens when you add the second 50TB disk... correct? It is not related to stopping disks and there is no client traffic... correct?
Does the problem happen if you do not add ACLs?
If you create just the 50TB image, does it happen?

Last thing: can you please double-check the IP subnet ranges just to make sure there is no IP overlap.
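
For example, on each node you can list the addresses actually assigned in the two iSCSI ranges and compare them across nodes (the grep pattern just matches the subnets you mentioned):

# show any 172.31.1.x / 172.31.2.x addresses assigned on this node
ip addr show | grep -E "172\.31\.[12]\."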

I did a ton of testing of this over the weekend and had extremely detailed notes about each time I erased, reinstalled, and tested. Did that at least 8 times. Then forgot to save the notes, and when I got back... blank. Wonderful.

I'm going to re-test again today and follow up. I do know these things for sure:

When two iSCSI disks exist, it doesn't seem to matter in what order you start them: if one has more paths than the other, it causes the cluster to freak out and fence hosts. This is reproducible every time with the settings I'm using. Will try again without IQN ACLs. Usually just one host shuts down, but I had two shut down (a management node and a storage-only node) the last time I tried this. That was where I gave up for the weekend.

Will collect the PG info on the next batch of tests and let you know. Will also try different size disks and see if the result is the same.

I did verify networking, all good. Each node can ping the others on their respective subnets, and there are no duplicate IPs in use.

Hi,

We do a lot of tests starting hundreds of disks via scripts, so it should work. One thing is not clear: do you still see the Ceph error with only 1/4 of PGs active? If so, then this is the root problem, and we need to see what caused it and how to recover from it. If it is still happening, it can have side effects like iSCSI disks not starting, or nodes fencing and stopping one another.

If you see the iSCSI disk issue on a healthy Ceph cluster, then yes, please let me know the steps to reproduce it and we will look into it.

 

I haven't gotten far enough to reproduce the PGs unknown issue yet, but the iSCSI issue is definitely easily reproducible on a fresh, healthy cluster.

OK, starting with a clean install: all 4 nodes online, 4 OSDs and 1 journal per node. Health is active and clean. Set iSCSI subnets and IPs, configured NTP and SMTP.

Added the first 20GB disk, with no IQN ACLs this time and 2 paths auto-configured. The disk came up no problem, but stopping takes a very long time. Here's what's in PetaSAN.log (on node1) while waiting for it to stop:

10/04/2018 12:13:22 INFO Disk BD-Ceph-Cl1_BD-E7k-HV-Cl1-Quorum created
10/04/2018 12:13:22 INFO Successfully created key 00001 for new disk.
10/04/2018 12:13:22 INFO Successfully created key /00001/1 for new disk.
10/04/2018 12:13:22 INFO Successfully created key /00001/2 for new disk.
10/04/2018 12:13:42 INFO Could not lock path 00001/2 with session befd9ded-050a-9d39-1df4-b160e77d3bfa.
10/04/2018 12:14:03 INFO Successfully created key 00001 for new disk.
10/04/2018 12:14:05 INFO Stopping disk 00001
10/04/2018 12:14:05 ERROR Could not find ips for image-00001
10/04/2018 12:14:05 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:14:05 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:14:19 INFO Stopping disk 00001
10/04/2018 12:14:40 ERROR Could not find ips for image-00001
10/04/2018 12:14:40 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:14:40 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:14:54 INFO Stopping disk 00001
10/04/2018 12:14:54 ERROR Could not find ips for image-00001
10/04/2018 12:14:54 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:14:54 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:14:58 INFO Stopping disk 00001
10/04/2018 12:15:11 ERROR Could not find ips for image-00001
10/04/2018 12:15:11 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:15:11 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:15:15 INFO Stopping disk 00001
10/04/2018 12:15:25 ERROR Could not find ips for image-00001
10/04/2018 12:15:25 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:15:25 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:15:42 INFO Stopping disk 00001
10/04/2018 12:15:58 ERROR Could not find ips for image-00001
10/04/2018 12:15:58 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:15:58 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:16:12 INFO Stopping disk 00001
10/04/2018 12:16:12 ERROR Could not find ips for image-00001
10/04/2018 12:16:12 ERROR LIO error deleting Target iqn.2018-04.net.example.internal:bd-ceph-cl1:00001, maybe the iqn is not exists.
10/04/2018 12:16:12 ERROR Cannot unmap image image-00001. error.
10/04/2018 12:16:27 INFO Stopping disk 00001

At this point, I'm still waiting for the disk to stop. It's been about 40 minutes so far.

Will update when I have hit the PG unknown issue.

From the logs, the error looks like an incorrect IQN base prefix in the iSCSI settings:
iqn.2018-04.net.example.internal:bd-ceph-cl1
It should be for example:
iqn.2016-05.com.petasan
iqn.2018-04.net.example.internal

The IQN format is iqn.yyyy-mm.naming-authority:unique-name. On the client side you do enter the unique-name (or use the default set by the OS, which appends the hostname); this defines the unique client endpoint. On the server/PetaSAN side, you define the IQN base prefix for the cluster, which is the part to the left of the ':'; PetaSAN will then append the disk id (for example 00001) to the cluster base IQN to create a unique endpoint for the target disk.
We should have caught this in the UI validation; I am sure we use IQN pattern matching, but it seems it accepted it.
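
To illustrate with the corrected prefix above (just an example of what PetaSAN would then generate for disk 00001):

# cluster base iqn configured in the iSCSI settings (no ':' inside it)
iqn.2018-04.net.example.internal
# full target iqn created by appending the disk id
iqn.2018-04.net.example.internal:00001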

If you remove the :bd-ceph-cl1, things will work fine. As for cleaning up the current settings, the system is not able to clean them itself since the IQN is not valid, so you will need to do it manually. For example:

# stop disk 00001 resource from Consul
consul kv delete -recurse PetaSAN/Disks/00001
# remove iqn from LIO target
targetcli /iscsi delete iqn.2018-04.net.example.internal:bd-ceph-cl1:00001
# remove local path ip
ip addr delete 10.0.2.100/24 dev eth0
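
Afterwards, to verify nothing is left over, something like the following should work (adjust the cluster name to yours):

# confirm the Consul keys, LIO target and rbd mapping are gone
consul kv get -recurse PetaSAN/Disks
targetcli ls /iscsi
rbd showmapped --cluster BD-Ceph-Cl1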

You beat me to it! I caught that earlier and have been trying to test with a revised IQN, but got pulled from my desk. Will report back.

The base IQN change solved the iSCSI path and start/stop issues. Will begin simulating failures today to see if I can break anything else.

 

Thank you for your help!
