
Cluster fail on 2-host install

Hi,

I have a case:

The cluster is operational on five VMs, with 3 monitors placed on 2 hypervisor hosts.

If the host that holds 2 of the monitors crashes, the cluster is down... all right... but...

Is it possible to add a new monitor in order to have four monitors?

Regards,

Ceph monitors need a strict majority to form a quorum, which is why an odd number is used: with 3 monitor nodes you can tolerate 1 monitor failing but not 2; with 5 monitor nodes you can tolerate 2 failing but not 3. However, no such setup can survive on only 2 hypervisor hosts if the host carrying the majority of the monitors fails. You cannot start off with 4 monitors, as this can lead to a split-brain situation where there is no consensus among them, and an even count gains you nothing over 3 in failure tolerance anyway.
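To make the majority arithmetic concrete, here is a small Python sketch (illustration only, not PetaSAN code) that works out how many monitor failures a given monitor count can survive under a strict-majority rule:

```python
# How many monitor failures a Ceph-style majority quorum can survive.
# Quorum requires a strict majority of the configured monitors.

def tolerated_failures(monitors: int) -> int:
    """Number of monitors that can fail while a strict majority remains."""
    majority = monitors // 2 + 1
    return monitors - majority

for n in (3, 4, 5):
    print(f"{n} monitors: quorum needs {n // 2 + 1}, "
          f"tolerates {tolerated_failures(n)} failure(s)")

# Output:
# 3 monitors: quorum needs 2, tolerates 1 failure(s)
# 4 monitors: quorum needs 3, tolerates 1 failure(s)
# 5 monitors: quorum needs 3, tolerates 2 failure(s)
```

As the output shows, 4 monitors still only tolerate 1 failure, so the even count buys you nothing over 3.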

PetaSAN can recreate a monitor node if 1 out of 3 fails (using the "Replace Management Node" option in the first step of the deployment wizard), but it cannot recreate 2 if you only have 1 left: at that stage there is no running cluster. Of course, using manual CLI commands you can re-create the cluster from the 1 surviving monitor, but you cannot do it from the PetaSAN UI as the cluster will be down.

What can be done: place 1 of the monitor/management nodes on a third, low-end machine, without giving it storage or iSCSI roles, so it will not have much to do and will not consume many resources; its role is to arbitrate and provide quorum with the other 2 when needed. The 2 hypervisors, in contrast, host the PetaSAN VMs that carry the actual Ceph storage and iSCSI load.
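As a quick sanity check of that layout (illustration only, with hypothetical host names), this Python sketch shows that with one monitor per hypervisor plus one on the small arbiter box, losing any single machine still leaves a majority:

```python
# Suggested layout: one monitor per hypervisor plus one on a small arbiter.
# Losing any single machine still leaves 2 of 3 monitors, a strict majority.

monitor_hosts = {"hypervisor1": 1, "hypervisor2": 1, "arbiter": 1}
total = sum(monitor_hosts.values())

for failed_host, lost in monitor_hosts.items():
    surviving = total - lost
    in_quorum = surviving > total // 2          # strict majority rule
    print(f"lose {failed_host}: {surviving}/{total} monitors left, "
          f"quorum = {in_quorum}")

# Compare with the original 2-host layout (2 + 1 monitors): losing the host
# that carries 2 monitors leaves only 1/3, which is not a majority.
```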

Hi,

ok, thanks for your response 😉

Is it possible to have a 4-node cluster? I am looking at 4 physical servers and want to provide maximum redundancy.

PetaSAN is a scale-out system: it has a minimum of 3 nodes, but you can add as many additional nodes as you want. The more you add, the more storage capacity you have and the faster your iSCSI disks become, because each iSCSI disk uses all physical disks in the cluster in parallel, like a giant networked RAID.

So adding nodes scales your system for performance and capacity, but not for redundancy. Data redundancy is controlled by how many replicas you keep of your data; in PetaSAN you can specify 2 or 3 replicas. With a replica size of 2, you will lose data if 2 disks in the cluster, each on a different node, fail at the same time; if you have 2 or more disk failures on the same node, you are still OK. With a replica size of 3, you will lose data only if 3 disks fail at the same time, each on a different machine.
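Here is a toy Python sketch of that rule (not how Ceph actually places data, just the node-level reasoning): replicas of an object live on different nodes, so data loss requires failures spanning as many distinct nodes as the replica size.

```python
# Toy model: with `size` replicas kept on different nodes, data loss is only
# possible if some set of `size` failed disks spans `size` distinct nodes.

from itertools import combinations

def data_loss_possible(failed_disks, size):
    """failed_disks: list of (node, disk) tuples."""
    for combo in combinations(failed_disks, size):
        if len({node for node, _ in combo}) == size:
            return True
    return False

# Replica size 2: two failed disks on the SAME node -> still safe.
print(data_loss_possible([("node1", "sdb"), ("node1", "sdc")], size=2))  # False
# Replica size 2: two failed disks on DIFFERENT nodes -> data loss possible.
print(data_loss_possible([("node1", "sdb"), ("node2", "sdc")], size=2))  # True
```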

So if you have a small number of disks, a replica size of 2 should be acceptable, but with a large number of disks the chance of more than one simultaneous failure is higher, and replica 3 is recommended. Some flash and NVMe vendors claim that replica 2 may be enough even with a large number of disks, since a replacement disk recovers so quickly that the chance of another failure within that small time window is low. To sum up: if you have a large number of disks, or simply want the higher redundancy, use replica 3. Note that replica size 3 has a write performance penalty of about 33% compared to replica 2, but read performance is the same.
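One back-of-the-envelope way to read that ~33% figure, assuming writes are limited by the cluster's total disk write bandwidth (an assumption, not a benchmark):

```python
# Every client write is written `size` times, so usable client write
# bandwidth scales roughly with 1/size; reads come from a single replica.

raw_write_bw = 1.0  # normalised cluster-wide raw write bandwidth

for size in (2, 3):
    client_bw = raw_write_bw / size
    print(f"replica size {size}: relative client write bandwidth = {client_bw:.2f}")

# Going from size 2 (0.50) to size 3 (0.33) drops write throughput by
# (0.50 - 0.33) / 0.50 ~= 33%, while read throughput is unchanged.
```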

Lastly, I should mention cluster uptime: irrespective of the number of nodes in the cluster, the first 3 nodes (the management nodes, in PetaSAN terms) are the brain of the cluster (they are the Ceph monitors, Consul HA servers and PetaSAN controllers). You can tolerate 1 server failure out of these 3, but 2 failures among the management nodes will bring the entire cluster down and no I/O can occur, although this by itself will not lead to data loss. Note that we provide a way to quickly build a replacement management node when one goes down.