Proper use of maintenance mode?

Hello there,

New to Ceph and PetaSAN, but an old-school storage, network, and virtualization expert. I have been playing with PetaSAN in a VMware cluster lab. Pretty neat! So far I am mostly getting what I would expect, based on testing.

However, one thing that is not clear, and which I ran into an issue with, is attempting to enter "maintenance mode" so I can intentionally take down one of the nodes. Maintenance mode does not appear to be well documented; it is simply a menu with a number of manual options to toggle on or off, and I cannot find any documented guidance on how they should be used. Please forgive me if this is obvious to those more familiar with Ceph. What I was expecting was something akin to putting an ESXi host into maintenance mode in a vCenter cluster, such that the cluster no longer expects anything of that host and its absence, for things such as updates, causes no issues.

In my case, entering maintenance mode did NOT go well. Without any guidance, I set all options in the maintenance mode menu to off. I then shut down node 2. In my lab, only nodes 2 and 3 have OSDs and the iSCSI service. After I shut down node 2, I ended up with an iSCSI all-paths-down issue on my ESXi hosts! The iSCSI path to node 3, which was still running, went down! When I went into the iSCSI paths menu in PetaSAN, it showed no mapped paths at all.

After about 5 minutes of things sitting in this state, I decided to "nudge" things along by going back into the maintenance mode menu and turning everything back on, in hopes that PetaSAN would recover and at least the iSCSI path to node 3 would be restored. Several more minutes passed with no change, so I decided to bring node 2 back online, because at this point my Windows 10 test VM had been sitting without its disk for about 10 minutes.

The behavior in this last part is not clear either. Node 2 was not fully booted yet, but suddenly the iSCSI paths came up, and I found PetaSAN had mapped both paths to node 3. What is not clear: was it just coincidence that node 2 was booting, and nodes 1 and 3 had finally hit some recovery threshold that would have been reached regardless of booting node 2? Or did some services on node 2 start connecting to the cluster and cause things to start working again?

The whole experience leaves me with more questions than answers!

  • Where can I find better documentation on putting a PetaSAN cluster into maintenance mode, to intentionally down a node?
  • What is the expected recovery time for iSCSI paths and other services when a node is powered off?

Caveats:

  • Only using size=2 and min_size=1 for my RBD pool, with 2 storage nodes and 1 node that is just for management and to keep quorum.
  • I understand this is not a recommended production setup. But for a lab test, this should be valid.
  • Using the default "replicated_rule", which should result in 2 copies of every block, 1 on each of the 2 storage nodes (see the quick CLI check below).
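
For reference, this is roughly how I believe that layout can be checked from the Ceph CLI on one of the nodes. I am assuming the pool name; "rbd" below is just a placeholder for whatever pool PetaSAN actually created:

  # Show size, min_size and crush_rule for every pool
  ceph osd pool ls detail

  # Or query a single pool explicitly ("rbd" is a placeholder pool name)
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size
  ceph osd pool get rbd crush_rule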

In comparing this to a traditional SAN, my expectation was that PetaSAN would just keep running and access to iSCSI block storage would not be lost at all, or that there would perhaps be a short convergence time during the failure of a few seconds, maybe 1 minute at most. The iSCSI path to node 2 would go dead, but node 3 would service 100% of requests and everything would keep running. That is not what happened, and that makes me nervous. For the moment, I will assume it is my ignorance of a new technology and not a design flaw. But could you please point me in the right direction here? Much appreciated!

Thank you!

-Greg Curry-

I presume you refer to the maintenance on/off flags in the maintenance menu. In that case I agree they are not well documented, apart from the popup balloon text, which relies on the Ceph documentation and is also not that helpful. I agree that in this area PetaSAN just presents the Ceph flags without trying to filter things as we typically do.

The irony is that if you had just left the flags at their operational values and taken the node offline for maintenance, things would have worked fine. The only issue is that the system would have started to replicate the data somewhere else, since it does not know the node will be coming back; when the node returns, the replication stops and we would just have wasted some bandwidth and maybe loaded the cluster during this time. But the safest option was to leave things as is.

The one flag that could actually be useful is noout: it tells the system that if an OSD is down, it should not be taken out of the CRUSH map to start a data rebalance. The really worst one to use (and it really should be removed) is nodown, which tells the system not to flag the OSD as down, so clients will keep trying to access it! In your case the ESXi hosts would probably time out their iSCSI I/O operations waiting for a down OSD to come back up, so the flag makes no sense at all, maybe to a developer but not to a user. Ceph has many such parameters. I will review this internally and see if we can streamline this page as we always try to do in other parts of the system.
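
For reference, those maintenance toggles correspond to the standard cluster-wide Ceph OSD flags, so the rough command-line equivalent of what that page does would be something like the following (shown only for illustration; in PetaSAN you would normally use the UI):

  # Set noout before planned maintenance: a down OSD is still marked down,
  # but it is not removed from the CRUSH map, so no rebalance starts
  ceph osd set noout

  # Show which cluster-wide flags are currently set
  ceph osd dump | grep flags

  # Clear the flag again once the node is back
  ceph osd unset noout

  # For contrast only: nodown keeps the OSD marked up even when it is not,
  # so clients keep sending I/O to it. Not something you normally want.
  # ceph osd set nodown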

It is not recommended to use size 2, min_size 1. This has little to do with having only 2 replicas; the culprit is having a min size of 1, meaning it is OK to accept I/O when only 1 replica is available. This can cause many issues beyond redundancy, and more with consistency. Imagine a critical cluster state where OSDs are coming up and down, or perhaps a power loss: it is possible that clients write data to several OSDs but each ends up with a different copy of the data, or a different sequence of data changes. This confuses the self-healing process, requires painstaking manual debugging, and could also lead to data integrity issues.

The default/recommended is size 3, min_size 2. If you have 1 node failing, everything keeps working, since you still have 2 replicas of your data and self healing will re-create the missing replicas in the background. If you have 2 nodes going down, you have 1 active replica and are now below the min size, so all your I/O will suspend (iSCSI as well as any Ceph I/O). The self-healing background process will start replicating to re-create the lost replicas, and as soon as a PG has 2 replicas it will accept I/O again and client operations like iSCSI will resume. How long this takes depends on how much data you have and the speed of recovery, so it varies. Note that you can speed up the background recovery from another page in the maintenance UI; on that page we did try to make things user friendly by grouping the required Ceph flags into different speed steps 🙂
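
Just to illustrate what that recommendation looks like at the Ceph level (the pool name "rbd" is a placeholder, and in PetaSAN you would normally change this from the UI rather than the CLI):

  # Recommended replication: 3 copies, and suspend I/O if fewer than 2 remain
  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2

  # Watch recovery/backfill progress while missing replicas are re-created
  ceph -s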

Thanks for the fast response!

So in my case it sounds like I would have only wanted to toggle the noout flag, because I didn't want it to start making new local replicas to re-balance (it would have to be local, because there would only be one surviving OSD node).

I get that the best practice is to use size=3 and min_size=2, which is sound advice if you have the hardware and your needs justify it (as most serious production environments would). In my case this is testing for a basic home lab setup, so if I run the risk of being down to only 1 available copy of a block and then something else "bad happens", so be it. Worst case, there would be backups of the VMs and I restore them.

In this case, since only 2 nodes actually have OSDs and can service I/O for iSCSI, I don't want I/O to stop if only 1 of the nodes is available. In this example, I was looking for node 3 to provide 100% of available blocks and take all I/O while node 2 is down.

Since I am still learning the finer details of Ceph, can you confirm something for me? With the use of size=2, and min_size=1, along with using "replicated_rule", should I expect all of the following to be true?

  • Under normal conditions, with nodes 2 and 3 both on-line, each block should be stored once on each node. That is, 1 block write to node 2 and 1 replica write to node 3 (or vice versa), giving only 1 replica level of redundancy (similar to RAID-1 or RAID-10).
  • When using "noout" for maintenance, the remaining single OSD node would serve 100% of available blocks and take all I/O. However, it would only store 1 copy of any newly written blocks, so if that single OSD node were to also fail, all blocks written during that window could be lost.
  • When bringing the 2nd OSD node back on-line and disabling "noout", the Ceph cluster will automatically start replicating changed blocks, such that eventually size=2 is restored and there is 1 replica level of redundancy again. (I plan to verify this with the status checks sketched below.)
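
Assuming those three points are correct, my plan for the next test is to sanity-check them by watching the cluster state from the CLI, roughly along these lines:

  # With both OSD nodes up: PGs should report active+clean
  ceph status

  # With node 2 down and noout set: PGs should go active+undersized+degraded,
  # while I/O is still served from the single surviving replica on node 3
  ceph health detail
  ceph pg stat

  # After node 2 is back: the degraded object count should drop back to zero
  ceph -w    # stream cluster events while it catches up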

Because for a lab setup with 2 OSD nodes plus 1 management-only (OSD-less) node, that is what I am looking for. If 1 of the OSD nodes goes down, intentionally or otherwise, I don't want the other OSD node to start using up more OSD capacity for local replicas. I would rather wait until the downed node can be restored, and then allow replication. Does this mean I might want to leave "noout" enabled at all times? I realize you would not normally recommend doing so, but given the small size and the fact that this is just a lab setup, I accept the limitations and know what I am getting myself into as far as risk of data loss.
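
If that is the right approach, the planned-maintenance sequence I have in mind is roughly the following (condensed to the underlying Ceph commands; I assume the PetaSAN maintenance page does the equivalent of the set/unset steps):

  # 1. Tell Ceph not to rebalance when an OSD goes down
  ceph osd set noout

  # 2. Shut down the OSD node (node 2 in my case) and do the ESXi maintenance.
  #    iSCSI should keep running from the surviving OSD node, with 1 copy only.

  # 3. Power the node back on and wait for its OSDs to rejoin
  ceph osd tree        # confirm its OSDs show "up" again

  # 4. Remove the flag so normal recovery behavior returns
  ceph osd unset noout

  # 5. Wait for the cluster to report HEALTH_OK / active+clean again
  ceph -s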

This is all just a starting point for me, so I am trying to keep it basic at first. Only 2 nodes have OSDs, and each of those has 2 SSDs provided via RDM (I cannot pass through the controller because I have an HDD array used for an existing datastore). Eventually I want to start playing with EC. I might also replace the 1 management-only node with another OSD node that has a large set of spinning-rust OSDs (HDDs) and get some tiered storage going. But I want to put PetaSAN through its paces with the most basic of setups first, and then work my way up.

I should also point out my reasoning for wanting to shut down 1 of the 2 OSD nodes. As I hinted at, these are running as VMs, and I only have 2 ESXi hosts in my lab. So if I need to take one host down for maintenance, a single OSD node has to be able to service the remaining ESXi host on its own. Right now the 1 management-only node is also running as a VM on that 2-host cluster, which is, yes, dangerous! But long term my plan is to run that OSD-less node on my ZFS server (FreeBSD 13.1 with Bhyve), so at least the 3 nodes would be on separate physical hosts.

I will say that, if this goes well, I would then be looking to explore commercial use for clients. At that point I will need to explore your commercial support offering, as that would be a requirement for my company to carry this in our lineup. But for the moment this is a personal project, and I am just playing with it at home in my lab. 🙂

Thanks again for the response.

-Greg Curry-