
Simple Cluster

Hi,
we are thinking of building a small PetaSAN Ceph cluster.

For that we have 3 nodes, each with 15x 2.5" SATA drives and 5x SSD drives.
The idea is to build 2 pools (SATA and SSD) and connect them via iSCSI to ESXi hosts.

After a lot of research in blogs and other forum threads we got confused about one question: would our PetaSAN cluster be HA?

The question is: can one node fail completely without the VMs on the ESXi hosts behind it noticing, so that they keep running?

Thanks for answering this NOOB question

🙂

Yes, of course it supports HA. Ceph has built-in HA and recovery support, and PetaSAN builds HA at the iSCSI layer via the Consul system. The system can even tolerate more than one node failure by adjusting the size and min_size values of your pools.
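For example, if you want a pool to keep 3 replicas and keep serving io as long as 2 are available (which lets a whole node fail on a 3-node cluster), a rough sketch of the commands would be (the pool name "sata-pool" is just a placeholder):

    # keep 3 copies of each object
    ceph osd pool set sata-pool size 3
    # keep accepting client io as long as 2 copies are reachable
    ceph osd pool set sata-pool min_size 2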

There will be a small pause in io for 20-25 sec before the path is failed over. This is the default in VMware failover and it is better left as is. If you want to lower it, you can lower the NoopOutInterval and NoopOutTimeout parameters in the ESXi iSCSI settings as well as the osd_heartbeat_grace and osd_heartbeat_interval config values in ceph.conf. Again, it is better to leave these at their defaults and not make the failover too sensitive, which could lead to false failovers under heavy load.
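As a rough sketch of where these knobs live (the values shown are the usual defaults, and the adapter name vmhba64 is just an example, so adjust to your setup):

    # ceph.conf on the OSD nodes -- how often OSDs heartbeat each other and
    # how long a missed heartbeat is tolerated before the OSD is reported down
    [osd]
    osd_heartbeat_interval = 6
    osd_heartbeat_grace = 20

    # ESXi side: list and adjust the iSCSI NOP-Out probing parameters
    esxcli iscsi adapter param get -A vmhba64
    esxcli iscsi adapter param set -A vmhba64 -k NoopOutInterval -v 15
    esxcli iscsi adapter param set -A vmhba64 -k NoopOutTimeout -v 10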

To get good performance for your HDD pool, you need an SSD journal (ratio 1:4 SSD:HDD) plus a controller with write-back cache.

 

Thank you for your detailed description.

What I was reading, or maybe misunderstanding: there is a "gap" of a few minutes where failed OSDs are marked as DOWN and Ceph recovers/rebuilds as far as possible.
In that "gap" the VMs still answer PING but nothing else responds (websites are down, SSH connections freeze, ...). The same "gap" would occur if the whole OSD node failed.

But this is not true?

No, this is not true; we test this kind of stuff all the time. The pause will be approx 25 sec with the standard VMware iSCSI settings, the same as with any other SAN solution.

Internally it works like this: there are 2 libraries for Ceph client access, a user-space one and a kernel-space one; we use the latter (internally within our iSCSI layer) as it is a bit faster. Both behave the same way: an io request is sent to a Ceph OSD, but the disk fails before the io completes, so the request is left pending. The Ceph cluster will quickly detect the OSD failure, in approx 25 sec, which can be lowered via the osd_heartbeat_grace and osd_heartbeat_interval settings. This results in a new version (epoch) of the cluster maps being issued, which lists this OSD as down and contains a new mapping of PGs to OSDs excluding the one that went down. The Ceph client that was still waiting receives an event that the cluster maps have changed, updates its internal maps, re-maps the pending object to a new OSD, and re-issues the io operation to it, which then responds normally.
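If you want to watch this detection and map update yourself, a rough sketch (run on any node with the admin keyring) is:

    # stream cluster events; you will see the osdmap epoch bump when the OSD is reported down
    ceph -w

    # or check on demand
    ceph osd tree          # shows the failed OSD as "down"
    ceph osd dump | head   # the osdmap epoch increases with each new map version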

You can also see on the dashboard that the OSD is detected as down quite quickly (note there will be an extra delay in the UI due to its refresh cycle).

The auto recovery is a different story: there will be approx 10 min before it starts to kick in, and this is probably the gap you refer to. Its role is to re-create the downed replicas of all PGs that were served by the failed OSD. It is a parallel process involving many-to-many OSDs, so it is much more efficient than RAID recovery and puts much less load on the disks. Again, the recovery delay can be adjusted, but the default values are good enough to avoid false positives. However, it is important to note that during this recovery delay, and during the recovery itself, client io keeps working.
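Assuming the delay referred to here is the standard Ceph "down to out" timer (I believe the relevant option is mon_osd_down_out_interval, which defaults to 600 seconds, i.e. 10 minutes), a sketch would be:

    # ceph.conf on the monitor nodes -- how long an OSD may stay "down"
    # before it is marked "out" and recovery/backfill of its PGs begins
    [mon]
    mon_osd_down_out_interval = 600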

Lastly, PetaSAN is easy to use; you can test all this yourself quite easily.