
Cluster fail on 2-host install

Hi,

I have a case:

The cluster is operational on five VMs, with 3 monitors placed on 2 hypervisor hosts.

If the host that holds 2 of the monitors crashes, the cluster is down... all right... but...

Is it possible to add a new monitor in order to have four monitors?

Regards,

Ceph monitors need a strict majority to form a quorum, which is why an odd number is used: with 3 monitor nodes you can tolerate 1 monitor failing but not 2; with 5 monitor nodes you can tolerate 2 failing but not 3. However, no such setup can survive on only 2 hypervisor hosts if the host carrying the majority of the monitors fails. You cannot start off with 4 monitors, as this can lead to a split-brain situation where there is no consensus among them, and an even count gains you nothing over 3 in failure tolerance anyway.
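To make the majority arithmetic concrete, here is a small Python sketch (illustration only, not PetaSAN code) that works out how many monitor failures a given monitor count can survive under a strict-majority rule:

```python
# How many monitor failures a Ceph-style majority quorum can survive.
# Quorum requires a strict majority of the configured monitors.

def tolerated_failures(monitors: int) -> int:
    """Number of monitors that can fail while a strict majority remains."""
    majority = monitors // 2 + 1
    return monitors - majority

for n in (3, 4, 5):
    print(f"{n} monitors: quorum needs {n // 2 + 1}, "
          f"tolerates {tolerated_failures(n)} failure(s)")

# Output:
# 3 monitors: quorum needs 2, tolerates 1 failure(s)
# 4 monitors: quorum needs 3, tolerates 1 failure(s)
# 5 monitors: quorum needs 3, tolerates 2 failure(s)
```

As the output shows, 4 monitors still only tolerate 1 failure, so the even count buys you nothing over 3.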

PetaSAN can recreate a monitor node if 1 out of 3 fails (using the "Replace Management Node" option in the first step of the deployment wizard), but it cannot recreate 2 if you only have 1 left: at that stage there is no running cluster. Of course, using manual CLI commands you can re-create the cluster from the 1 surviving monitor, but you cannot do it from the PetaSAN UI as the cluster will be down.

What can be done: place 1 of the monitor/management nodes on a third, low-end machine, without giving it storage or iSCSI roles, so it will not have much to do and will not consume many resources; its role is to arbitrate and provide quorum with the other 2 when needed. The 2 hypervisors, in contrast, host the PetaSAN VMs that carry the actual Ceph storage and iSCSI load.
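As a quick sanity check of that layout (illustration only, with hypothetical host names), this Python sketch shows that with one monitor per hypervisor plus one on the small arbiter box, losing any single machine still leaves a majority:

```python
# Suggested layout: one monitor per hypervisor plus one on a small arbiter.
# Losing any single machine still leaves 2 of 3 monitors, a strict majority.

monitor_hosts = {"hypervisor1": 1, "hypervisor2": 1, "arbiter": 1}
total = sum(monitor_hosts.values())

for failed_host, lost in monitor_hosts.items():
    surviving = total - lost
    in_quorum = surviving > total // 2          # strict majority rule
    print(f"lose {failed_host}: {surviving}/{total} monitors left, "
          f"quorum = {in_quorum}")

# Compare with the original 2-host layout (2 + 1 monitors): losing the host
# that carries 2 monitors leaves only 1/3, which is not a majority.
```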

Hi,

ok, thanks for your response 😉

Is it possible to have a 4-node cluster? I am looking at 4 physical servers and want to provide maximum redundancy.

PetaSAN is a scale-out system: it has a minimum of 3 nodes, but you can add as many additional nodes as you want. The more you add, the more storage capacity you have and the faster your iSCSI disks become, because each iSCSI disk uses all physical disks in the cluster in parallel, like a giant networked RAID.

So adding nodes scales your system for performance and capacity, but not for redundancy. Data redundancy is controlled by how many replicas you keep of your data; in PetaSAN you can specify 2 or 3 replicas. With a replica size of 2, you will lose data if 2 disks in the cluster, each on a different node, fail at the same time; if you have 2 or more disk failures on the same node, you are still OK. With a replica size of 3, you will lose data only if 3 disks fail at the same time, each on a different machine.
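Here is a toy Python sketch of that rule (not how Ceph actually places data, just the node-level reasoning): replicas of an object live on different nodes, so data loss requires failures spanning as many distinct nodes as the replica size.

```python
# Toy model: with `size` replicas kept on different nodes, data loss is only
# possible if some set of `size` failed disks spans `size` distinct nodes.

from itertools import combinations

def data_loss_possible(failed_disks, size):
    """failed_disks: list of (node, disk) tuples."""
    for combo in combinations(failed_disks, size):
        if len({node for node, _ in combo}) == size:
            return True
    return False

# Replica size 2: two failed disks on the SAME node -> still safe.
print(data_loss_possible([("node1", "sdb"), ("node1", "sdc")], size=2))  # False
# Replica size 2: two failed disks on DIFFERENT nodes -> data loss possible.
print(data_loss_possible([("node1", "sdb"), ("node2", "sdc")], size=2))  # True
```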

So if you have a small number of disks, a replica size of 2 should be acceptable, but with a large number of disks the chance of more than one simultaneous failure is higher, and replica 3 is recommended. Some flash and NVMe vendors claim that replica 2 may be enough even with a large number of disks, since a replacement disk recovers so quickly that the chance of another failure within that small time window is low. To sum up: if you have a large number of disks, or simply want the higher redundancy, use replica 3. Note that replica size 3 has a write performance penalty of about 33% compared to replica 2, but read performance is the same.
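One back-of-the-envelope way to read that ~33% figure, assuming writes are limited by the cluster's total disk write bandwidth (an assumption, not a benchmark):

```python
# Every client write is written `size` times, so usable client write
# bandwidth scales roughly with 1/size; reads come from a single replica.

raw_write_bw = 1.0  # normalised cluster-wide raw write bandwidth

for size in (2, 3):
    client_bw = raw_write_bw / size
    print(f"replica size {size}: relative client write bandwidth = {client_bw:.2f}")

# Going from size 2 (0.50) to size 3 (0.33) drops write throughput by
# (0.50 - 0.33) / 0.50 ~= 33%, while read throughput is unchanged.
```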

Lastly, I should mention cluster uptime: irrespective of the number of nodes in the cluster, the first 3 nodes (the management nodes, in PetaSAN terms) are the brain of the cluster (they are the Ceph monitors, Consul HA servers and PetaSAN controllers). You can tolerate 1 server failure out of these 3, but 2 failures among the management nodes will bring the entire cluster down and no I/O can occur, although this by itself will not lead to data loss. Note that we provide a way to quickly build a replacement management node when one goes down.