
Problem replacing first node after disk failure

In my test environment, I had a disk failure on my first node. I've replaced the drive and installed PetaSAN on it, using the same IP address this machine had originally.

When I try to join it to the cluster, I get "Alert: Error joining node to cluster" whichever of the remaining monitor nodes' IP addresses I give. Is there something else I should be doing? This node was one of the three monitor nodes.

I understand you had a failure on your system disk. Since this is a management node, you need to choose "Replace Management Node" in the first step of the deployment wizard. You also need to give it the same hostname and IP address when installing via the installer.

Nodes beyond the first 3 can be deleted and added at will: they can be deleted from the node list and added back to the cluster by choosing "Join Existing Cluster". The first 3 nodes, however, need to be "replaced" as soon as possible if they fail. They are the management nodes that contain the brain of the cluster (Ceph monitors, Consul servers, PetaSAN management); the cluster can tolerate the failure of 1 management node but not 2. Replacement nodes need to keep the same hostname/IP, since the Ceph monitors cannot have those changed.
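As a sanity check after the replacement, you can confirm the replaced node is back in the monitor quorum by inspecting the output of `ceph quorum_status --format json` on any surviving management node. A minimal sketch of that check, using a hypothetical trimmed sample of the JSON (real output contains more keys, and the hostnames/addresses here are made up):

```python
import json

# Hypothetical, trimmed sample of `ceph quorum_status --format json` output.
# In practice you would capture this from the ceph CLI on a management node.
sample = '''
{
  "quorum_names": ["node1", "node2", "node3"],
  "monmap": {
    "mons": [
      {"name": "node1", "addr": "10.0.0.1:6789/0"},
      {"name": "node2", "addr": "10.0.0.2:6789/0"},
      {"name": "node3", "addr": "10.0.0.3:6789/0"}
    ]
  }
}
'''

def missing_monitors(quorum_status_json: str) -> list:
    """Return monitor names known to the monmap but currently out of quorum."""
    status = json.loads(quorum_status_json)
    in_quorum = set(status["quorum_names"])
    known = [mon["name"] for mon in status["monmap"]["mons"]]
    return [name for name in known if name not in in_quorum]

# An empty list means all three management-node monitors are in quorum.
print(missing_monitors(sample))
```

If the replaced node's name shows up in the missing list after the wizard completes, the monitor has not rejoined yet and the replacement should be investigated before any further node failure.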

Note that some users prefer to install the system disk on RAID1.

Great support, as always. Somehow I missed the "Replace Management Node" option. When I tried it, it worked flawlessly.