
Problems with Auto Assignment of iSCSI Network Pathways

I have a 4-node cluster running on SuperMicro hardware, with only 4 iSCSI disks configured, each with 8 IP addresses. All 4 nodes run iSCSI. I found that the distribution of addresses across the four nodes wasn't very even, so I elected to use the Auto path assignment option. Since the distribution was so bad, many of the addresses were marked for re-assignment, and the move process began, but it was moving very slowly. Something occurred such that I had to shut the whole cluster down: I put it in maintenance mode and issued a "ceph osd pause" command.

After bringing it back up and unpausing it, I found that the process hadn't completed and was attempting to continue (at least it looked the same on the page). But all iSCSI disks were stopped when the cluster came up (not sure if that is supposed to happen), so, I believe, no IP addresses should have been assigned to any node. I attempted to start one iSCSI disk and it hung at Starting. I waited for the iSCSI Path Assignment process to time out or die, but after 4 hours it remained stuck and the SAN wouldn't come up. I attempted to start all iSCSI disks and a few more IPs showed up on the Path Assignment page, but not many, and none moved or failed. Even though all iSCSI disks were in the same Starting state, the Path Assignment page showed only a few IPs on 2 of the 4 servers, and this remained the same for several more hours, without resolution.
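For context, the pause/unpause part of that maintenance step is just the standard Ceph flags, roughly as follows (the noout flag is something I'm including here for completeness as a common precaution, not necessarily part of PetaSAN's maintenance mode):

# before shutting nodes down: keep OSDs from being marked out, then pause client I/O
ceph osd set noout
ceph osd pause

# after bringing the cluster back up
ceph osd unpause
ceph osd unset noout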

I spent a long time looking for any info on how to kill the reassignment process, but found none. The Admin Manual states that one has to detach iSCSI disks for a new iSCSI IP space to take effect, and that is how I got out of this mess. I shut the SAN completely down again and brought it back up, to return the iSCSI disks to the Stopped state. The reassignment page looked the same as before. I changed the iSCSI IP space, and when I started the iSCSI disks they came up immediately without any problems.

So the bug is that the Auto Path Assignment process would neither stop (at least I don't know how it can be stopped) nor complete. The process had trouble with only 32 IP addresses, and it didn't terminate when a single node was shut down, nor when the entire SAN was shut down. Worse, the only way out I could find required reconfiguring all my SAN clients to use different IP addresses/portals, etc., just to get the SAN up.

I believe this whole Path Assignment process needs to be examined. Taking down a single node, or finding that an iSCSI disk is no longer started, should terminate all reassignments, auto or manual, since these events change the entire address-location equation and would otherwise cause the process to try to move IP addresses for disks that no longer have IP addresses active on the cluster, to or from nodes that are no longer in the cluster.

We just need a way out that's clean and less labor intensive, however you might do that.

Great product for sure, great rate of improvement and thanks for your consideration.

Jim

Thank you for this feedback. We are definitely looking to improve things.

Most issues we see with unbalanced path assignments for new disks are primarily due to load on servers. A server with fewer paths has a better weight to acquire new ones, but if it does not respond within a specific weighted delay, other servers will also try to grab the path, again using weights; it is a decentralised system so as to remain highly available. Corrections are done via the Path Assignment page, as you tried to do, either manually or automatically. Most issues/problems here would be errors in the lower layers: is the cluster health OK? No Ceph errors? In fewer cases there could be errors due to Consul (they should be visible in the logs).
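A few generic commands can help confirm that, using standard Ceph/Consul tooling (nothing PetaSAN-specific, so adjust to your setup):

# overall Ceph health and any error details
ceph status
ceph health detail

# confirm all Consul cluster members are alive
consul members

# Consul errors, if the agent happens to be managed by systemd (this varies by setup)
journalctl -u consul --since "1 hour ago"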

You should be able to delete the IP assignment task via:

consul kv delete PetaSAN/Assignment/IPAddress

for example:

consul kv delete PetaSAN/Assignment/10.10.10.10

But you are correct, we should handle lower-level errors in a better way on this page.

Great product for sure, great rate of improvement and thanks for your consideration.

Thank you, this is nice to hear 🙂

Thanks for the prompt response.

The iSCSI reassignment was running when I had a read error on one of the 54 drives in the cluster. This resulted in a single inconsistent PG in the default rbd pool (3 copies, 2 required) that was not remedied with a "repair" command.
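(For the record, the repair attempt was nothing exotic, just the standard Ceph commands, roughly:)

# find the inconsistent PG in the health output
ceph health detail

# ask Ceph to repair it, using the PG id from the output above
ceph pg repair <pg_id>

# optionally, list the objects flagged as inconsistent
rados list-inconsistent-obj <pg_id> --format=json-pretty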

As a side note, I was surprised Ceph didn't just move the PG to another area on the OSD, since it was only 44% full. When I later did a surface test of the drive in question, I found only 8K on a 3000GB enterprise drive that showed any problems.
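(For anyone curious, a surface test like that can be done with standard tools, for example:)

# long SMART self-test, then read the results back
smartctl -t long /dev/sdX
smartctl -a /dev/sdX

# or a read-only surface scan
badblocks -sv /dev/sdX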

So, while the reassignment was going on, I did have one Ceph issue. I pulled the OSD while the PetaSAN node was still up, and bad things started to happen on the iSCSI side. My client cluster (Hyper-V Failover Cluster) started to pause VMs. I couldn't get a handle on what was happening in the SAN, so I put it in maintenance mode, paused it, and shut down and restarted the node from which I pulled the OSD. I unpaused the SAN and removed it from maintenance in steps, but there was no improvement with iSCSI. VMs were pausing, and the clustered volumes were coming and going on different servers at different times (hard to track exactly), so I put it back in maintenance, paused it again and shut the whole SAN down. Bringing it back up rendered it completely dead for over 8 hours.

Aside from the removal of one OSD taking the whole SAN down (bad), my issue is that the reassignment created a form of deadly embrace (much worse). I could not start iSCSI disks because the system would not stop the address reassignment, and the reassignment could not complete because the IPs it was trying to reassign no longer existed. I could not even determine all the addresses the reassignment was trying to reassign, even with all iSCSI disks in the "Starting" state; the page never showed all 32 addresses.

Is there a consul command to enumerate all the reassignments that are pending and moving?  Without this, there would be no way to completely stop the reassignment process.

Given that continuing a reassignment process once a node or iSCSI disk goes down is contrary to the purpose of address reassignment, how hard would it be to programmatically detect this and terminate all moves when a node serving iSCSI leaves the cluster or any iSCSI disk is "Stopped" or "Stopping", letting the decentralized process at least keep the SAN and its iSCSI drives up?

Thanks,

Jim

If the Ceph layer fails and some pools/PGs become inactive, the iSCSI layer will also fail. It is a common mistake for users to try to fix things at the iSCSI layer, probably because most client errors report an iSCSI issue. Trying to move things around at the iSCSI layer, like IP paths, will drive it nuts: not only is the data being served stored in Ceph, but the iSCSI configuration info (iqn/ips..) is also stored in Ceph as rbd image metadata. Best to focus on getting the pools/PGs active; then the iSCSI layer will just work.
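To check whether that is the case, the standard Ceph commands are enough (nothing PetaSAN-specific):

# overall health plus which PGs are not active+clean
ceph health detail
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean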

Failure of 1 OSD should not render the Ceph layer inactive. The cases where this could happen are when the Backfill/Recovery speed is set too high and your hardware/disks are already stressed out, probably in combination with client load and scrubbing that could also be set too fast; this is particularly true for HDDs. From the dashboard charts, look at the PG Status chart for times when PGs were inactive: do they correlate with iSCSI paths going down? Does the OSD Status chart confirm that the 1 down OSD started this? From the Node Stats, was disk % utilization saturated when recovery kicked in? Can you check your Backfill and Scrub speeds from the Maintenance tab and lower them if set too high? These are just questions for you to think about. It is important that you have an idea of what load your cluster can handle and what it cannot; of course this is better done prior to going production: do benchmark tests, including while you simulate recovery.
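For reference, the Ceph-level knobs behind that kind of throttling look roughly like the following; the values are only an illustration of a conservative setting, and the Maintenance tab remains the supported way to change them in PetaSAN:

# inspect current values on one OSD (requires the admin socket on that node)
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active'

# temporarily lower recovery/backfill pressure on all OSDs
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'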

To your other question: to get a list of the current path assignments:

consul kv get -recurse -detailed PetaSAN/Assignment

and to delete them all and stop the movements:

consul kv delete -recurse PetaSAN/Assignment

If the entire cluster is stopped and restarted, the iSCSI disks will all be in a stopped state. This is because, for the iSCSI layer to work, Ceph, Consul and other things need to be up. A PetaSAN cluster is not designed to be switched on/off on a regular basis; if you do restart the cluster, you would need to manually click the "Start All" button on the iSCSI page. We also have a script,

consul kv delete -recurse PetaSAN/Assignment

in case you need to script this.
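As a rough sketch of scripting that cleanup yourself (illustrative only, not the bundled PetaSAN script):

#!/bin/bash
# Clear any pending iSCSI path re-assignments after a full cluster restart.
# Illustrative only; verify the key prefix and Consul address for your setup.
set -e

# record what is about to be removed (ignore the error if nothing is queued)
consul kv get -recurse -detailed PetaSAN/Assignment || true

# delete all pending assignment entries so disks can start cleanly
consul kv delete -recurse PetaSAN/Assignment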

Thanks for the info above. You address a good deal of information that I hadn't found anywhere else. To keep this possible bug report on track, I'll start a new question in a different forum to address the problems I've encountered with regard to SAN stability.

For this possible bug, I continue to believe there is a deadly embrace that is not handled by PetaSAN: if Auto Assignments are underway, it makes no sense to continue that process when a node serving iSCSI goes down, given that the process could easily be attempting to move an assignment to or from the absent node. Since the purpose of the Auto Assignment process is to even out the load/address assignments, after a node exits the process would, at best, have to restart, and it should drop any IP moves from or to the missing PetaSAN node.

Failing to do so gates the start of the iSCSI targets, preventing the iSCSI disks from starting.

Thanks for your consideration.

 

We have not seen this issue stop an iSCSI disk, or an IP path within a disk, from functioning; it may be due to the other issue you mention in your new post. I will check if we can replicate this.