
Random Shutdowns on PetaSAN VMs

Hello,

We have had 3-4 incidents now where our PetaSAN VMs randomly shut down. I thought this might be due to fencing, but I can't see anything in the logs that confirms it.

Output of /opt/petasan/log/PetaSAN.log:

01/02/2018 02:22:10 INFO     ClusterLeader stop action

01/02/2018 02:22:17 WARNING  , retrying in 1 seconds...

01/02/2018 02:22:19 INFO     Cleaned disk path 00001/2.

01/02/2018 02:22:19 INFO     Cleaned disk path 00003/2.

01/02/2018 02:22:19 INFO     PetaSAN cleaned local paths not locked by this node in consul.

01/02/2018 02:22:19 INFO     LIO deleted backstore image image-00001

01/02/2018 02:22:20 INFO     LIO deleted Target iqn.2016-05.com.petasan:00001

01/02/2018 02:22:20 INFO     Image image-00001 unmapped successfully.

01/02/2018 02:22:20 INFO     LIO deleted backstore image image-00003

01/02/2018 02:22:20 INFO     LIO deleted Target iqn.2016-05.com.petasan:00003

01/02/2018 02:22:20 INFO     Image image-00003 unmapped successfully.

01/02/2018 02:22:20 INFO     PetaSAN cleaned iqns.

01/02/2018 02:22:20 INFO     Could not lock path 00001/2 with session e3454fa9-74e3-8bba-047b-53b69af2baef.

01/02/2018 02:22:25 WARNING  , retrying in 1 seconds...

01/02/2018 08:06:15 INFO     Start settings IPs

01/02/2018 08:06:15 INFO     Successfully set default gateway

01/02/2018 08:06:15 INFO     Successfully set node ips

01/02/2018 08:06:16 INFO     GlusterFS mount attempt

01/02/2018 08:06:16 INFO     str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.31.50.222  -retry-join 172.31.50.220 -retry-join 172.31.50.221

01/02/2018 08:06:19 INFO     Starting cluster file sync service

01/02/2018 08:06:19 INFO     Starting iSCSI Service

01/02/2018 08:06:19 INFO     Starting Cluster Management application

01/02/2018 08:06:19 INFO     Starting Node Stats Service

01/02/2018 08:06:20 INFO     Service is starting.

01/02/2018 08:06:20 INFO     Cleaning unused configurations.

01/02/2018 08:06:20 INFO     Cleaning all mapped disks

01/02/2018 08:06:20 INFO     Cleaning unused rbd images.

01/02/2018 08:06:20 INFO     Cleaning unused ips.

01/02/2018 09:45:22 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:22 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:22 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:22 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:23 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:23 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:23 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:23 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:24 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:24 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:25 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:27 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:27 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:27 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:27 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:28 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:28 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:28 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:28 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:28 INFO     Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:33 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

01/02/2018 09:45:34 INFO     The path 00002/1 was locked by ESX-20-VSAN.

01/02/2018 09:45:36 INFO     This node will stop node ESX-20-VSAN/172.31.51.220.

01/02/2018 09:45:45 INFO     ClusterLeader start action

01/02/2018 09:45:57 INFO     Image image-00002 mapped successfully.

01/02/2018 09:46:00 INFO     Path 00002/1 acquired successfully

01/02/2018 15:00:51 WARNING  , retrying in 1 seconds...

18/02/2018 09:06:44 INFO     ClusterLeader stop action

18/02/2018 09:06:46 INFO     Cleaned disk path 00002/1.

18/02/2018 09:06:46 INFO     PetaSAN cleaned local paths not locked by this node in consul.

18/02/2018 09:06:46 INFO     LIO deleted backstore image image-00002

18/02/2018 09:06:46 INFO     LIO deleted Target iqn.2016-05.com.petasan:00002

18/02/2018 09:06:46 INFO     Image image-00002 unmapped successfully.

18/02/2018 09:06:46 INFO     PetaSAN cleaned iqns.

18/02/2018 09:06:47 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:47 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:47 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:47 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:47 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:47 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:52 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:53 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:54 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:54 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:54 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:54 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:54 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:54 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:54 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:55 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:55 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:55 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:55 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:55 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:55 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:55 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:56 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:56 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:56 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:56 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:56 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 09:06:56 INFO     Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.

18/02/2018 13:30:22 INFO     Start settings IPs

18/02/2018 13:30:23 INFO     Successfully set default gateway

18/02/2018 13:30:23 INFO     Successfully set node ips

18/02/2018 13:30:23 INFO     GlusterFS mount attempt

18/02/2018 13:30:23 INFO     str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.31.50.222  -retry-join 172.31.50.220 -retry-join 172.31.50.221

18/02/2018 13:30:26 INFO     Starting cluster file sync service

18/02/2018 13:30:26 INFO     Starting iSCSI Service

18/02/2018 13:30:26 INFO     Starting Cluster Management application

18/02/2018 13:30:26 INFO     Starting Node Stats Service

18/02/2018 13:30:27 INFO     Service is starting.

18/02/2018 13:30:27 INFO     Cleaning unused configurations.

18/02/2018 13:30:27 INFO     Cleaning all mapped disks

18/02/2018 13:30:27 INFO     Cleaning unused rbd images.

18/02/2018 13:30:30 WARNING  , retrying in 1 seconds...

18/02/2018 13:30:32 INFO     Cleaning unused ips.

 

Hi,

Running PetaSAN virtualized is not supported; however, I can give you some pointers:

Fencing may make the issue more visible, but it is not the root cause, and it is probably best to let it happen. What we found in our testing is that you need to give the VMs the resources they need: the errors you see are consistent with nodes failing to respond to cluster heartbeats in a timely manner. Ceph is resource hungry and there is no cutting corners. We have a hardware requirements guide that you should try to stay close to. The likely resource constraints are:

RAM: make sure you have 2 GB per OSD. The iSCSI target is specced at another 16 GB; this can be reduced, but do not reduce the per-OSD RAM requirement. If you do not have much RAM, limit the number of OSDs.
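As a rough illustration of that sizing rule (the OSD count below is just an example; adjust it for your own nodes):

```shell
# Hedged sizing sketch: 2 GB RAM per OSD plus ~16 GB for the iSCSI target layer
osds=4                              # example OSD count per node
required_gb=$((osds * 2 + 16))
echo "A node with ${osds} OSDs should have at least ${required_gb} GB RAM"
```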

Cores: it is best to have one core per OSD. This can be decreased a little, but if you do not have the CPU resources, decrease the OSD count.

Physical storage: any storage solution requires high-performance storage. Ideally, put the disks in PCI passthrough and let the VM control them directly, or use raw device mapping. If you use regular virtual disks, make sure the virtual disk assigned to each OSD sits on a physical disk by itself; having virtual disks share the same physical disk is very bad. If you run the atop command inside the VM and see that your disk and CPU are not busy but the CPU is high in %iowait, it means there is no resource load inside the VM but it is waiting on the hypervisor to fulfil I/O requests. The hypervisor could be underpowered, or (most likely) it is imposing I/O limits on the VM to slow it down so it does not affect other VMs doing disk access. Depending on your hypervisor, you should be able to give the PetaSAN VMs a higher share of disk I/O.
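If atop is not installed in the VM, you can get a rough %iowait reading straight from /proc/stat (a quick sketch of a cumulative since-boot figure; atop or iostat give a proper per-interval view):

```shell
# Rough cumulative iowait percentage since boot, read from /proc/stat.
# First fields on the "cpu" line: user nice system idle iowait irq softirq ...
read -r _label user nice system idle iowait _rest < /proc/stat
total=$((user + nice + system + idle + iowait))
echo "iowait since boot: $((100 * iowait / total))%"
```

A persistently high value here with low in-VM disk activity points at hypervisor-side I/O throttling or contention.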

Lastly, the network is generally not a factor unless the connection is flaky; just make sure the backend networks and their switch connections are robust.

What are the signs of them fencing one another?

 

 

The log file on the killing node will show "This node will stop node xx". You can check with:

grep "This node will stop node" /opt/petasan/log/PetaSAN.log

Note that you can always disable fencing from the web UI under the Cluster Maintenance tab, but this is probably not a good idea.

Fencing happens when a node serving iSCSI disks stops responding to heartbeats from the Consul cluster (the heartbeat interval is 15 seconds) and Consul determines it is out of the cluster. Another cluster node then needs to handle the iSCSI failover; it will fence/kill the first node just in case it is still alive (although the first node should also clean up its own iSCSI disks by itself).

The root cause of not responding to heartbeats is either load issues (most likely) or an unreliable network.

Also, if you see OSDs going up and down randomly (flapping), it means they too cannot respond to heartbeats among themselves.
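A quick way to check for that flapping (standard Ceph commands; the log path below is the usual default and may differ on your install):

```shell
# Current OSD up/in counts; repeated changes in these counts over time
# indicate flapping
ceph osd stat
# Flapping OSDs typically complain that the monitors marked them down
# while they were still alive
grep "wrongly marked me down" /var/log/ceph/ceph.log
```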