Random Shutdowns on PetaSAN VMs
J0C
2 Posts
February 19, 2018, 9:37 am
Hello,
We have had 3-4 incidents now where our PetaSAN VMs randomly shut down. I thought this may be due to fencing, but I can't see anything in the logs that confirms it.
Output of /opt/petasan/log/PetaSAN.log:
01/02/2018 02:22:10 INFO ClusterLeader stop action
01/02/2018 02:22:17 WARNING , retrying in 1 seconds...
01/02/2018 02:22:19 INFO Cleaned disk path 00001/2.
01/02/2018 02:22:19 INFO Cleaned disk path 00003/2.
01/02/2018 02:22:19 INFO PetaSAN cleaned local paths not locked by this node in consul.
01/02/2018 02:22:19 INFO LIO deleted backstore image image-00001
01/02/2018 02:22:20 INFO LIO deleted Target iqn.2016-05.com.petasan:00001
01/02/2018 02:22:20 INFO Image image-00001 unmapped successfully.
01/02/2018 02:22:20 INFO LIO deleted backstore image image-00003
01/02/2018 02:22:20 INFO LIO deleted Target iqn.2016-05.com.petasan:00003
01/02/2018 02:22:20 INFO Image image-00003 unmapped successfully.
01/02/2018 02:22:20 INFO PetaSAN cleaned iqns.
01/02/2018 02:22:20 INFO Could not lock path 00001/2 with session e3454fa9-74e3-8bba-047b-53b69af2baef.
01/02/2018 02:22:25 WARNING , retrying in 1 seconds...
01/02/2018 08:06:15 INFO Start settings IPs
01/02/2018 08:06:15 INFO Successfully set default gateway
01/02/2018 08:06:15 INFO Successfully set node ips
01/02/2018 08:06:16 INFO GlusterFS mount attempt
01/02/2018 08:06:16 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.31.50.222 -retry-join 172.31.50.220 -retry-join 172.31.50.221
01/02/2018 08:06:19 INFO Starting cluster file sync service
01/02/2018 08:06:19 INFO Starting iSCSI Service
01/02/2018 08:06:19 INFO Starting Cluster Management application
01/02/2018 08:06:19 INFO Starting Node Stats Service
01/02/2018 08:06:20 INFO Service is starting.
01/02/2018 08:06:20 INFO Cleaning unused configurations.
01/02/2018 08:06:20 INFO Cleaning all mapped disks
01/02/2018 08:06:20 INFO Cleaning unused rbd images.
01/02/2018 08:06:20 INFO Cleaning unused ips.
01/02/2018 09:45:22 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:22 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:22 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:22 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:23 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:23 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:23 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:23 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:24 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:24 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:25 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:27 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:27 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:27 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:27 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:28 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:28 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:28 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:28 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:28 INFO Could not lock path 00001/2 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:33 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
01/02/2018 09:45:34 INFO The path 00002/1 was locked by ESX-20-VSAN.
01/02/2018 09:45:36 INFO This node will stop node ESX-20-VSAN/172.31.51.220.
01/02/2018 09:45:45 INFO ClusterLeader start action
01/02/2018 09:45:57 INFO Image image-00002 mapped successfully.
01/02/2018 09:46:00 INFO Path 00002/1 acquired successfully
01/02/2018 15:00:51 WARNING , retrying in 1 seconds...
18/02/2018 09:06:44 INFO ClusterLeader stop action
18/02/2018 09:06:46 INFO Cleaned disk path 00002/1.
18/02/2018 09:06:46 INFO PetaSAN cleaned local paths not locked by this node in consul.
18/02/2018 09:06:46 INFO LIO deleted backstore image image-00002
18/02/2018 09:06:46 INFO LIO deleted Target iqn.2016-05.com.petasan:00002
18/02/2018 09:06:46 INFO Image image-00002 unmapped successfully.
18/02/2018 09:06:46 INFO PetaSAN cleaned iqns.
18/02/2018 09:06:47 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:47 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:47 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:47 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:47 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:47 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:52 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:53 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:54 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:54 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:54 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:54 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:54 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:54 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:54 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:55 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:55 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:55 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:55 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:55 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:55 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:55 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:56 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:56 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:56 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:56 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:56 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 09:06:56 INFO Could not lock path 00002/1 with session d9889439-ac93-0986-9246-4078b39fdeb3.
18/02/2018 13:30:22 INFO Start settings IPs
18/02/2018 13:30:23 INFO Successfully set default gateway
18/02/2018 13:30:23 INFO Successfully set node ips
18/02/2018 13:30:23 INFO GlusterFS mount attempt
18/02/2018 13:30:23 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.31.50.222 -retry-join 172.31.50.220 -retry-join 172.31.50.221
18/02/2018 13:30:26 INFO Starting cluster file sync service
18/02/2018 13:30:26 INFO Starting iSCSI Service
18/02/2018 13:30:26 INFO Starting Cluster Management application
18/02/2018 13:30:26 INFO Starting Node Stats Service
18/02/2018 13:30:27 INFO Service is starting.
18/02/2018 13:30:27 INFO Cleaning unused configurations.
18/02/2018 13:30:27 INFO Cleaning all mapped disks
18/02/2018 13:30:27 INFO Cleaning unused rbd images.
18/02/2018 13:30:30 WARNING , retrying in 1 seconds...
18/02/2018 13:30:32 INFO Cleaning unused ips.
admin
2,930 Posts
February 19, 2018, 11:25 am
Hi,
Running PetaSAN virtualized is not supported; however, I will give you some pointers:
Fencing may make the issue more visible, but it is not the root cause, and it is probably best to leave it enabled. What we found in our testing is that you need to give the VMs the resources they need: the errors you see are consistent with nodes being unable to respond to cluster heartbeats in a timely manner. Ceph is resource hungry and there is no cutting corners. We have a hardware requirements guide that you should try to stay close to; the likely resource constraints are:
RAM: make sure you have 2 GB per OSD; the iSCSI target is specced at another 16 GB. The iSCSI figure can be reduced, but do not reduce the per-OSD RAM requirement. If you do not have much RAM, limit the number of OSDs.
Cores: it is better to have one core per OSD. This can be decreased a little, but if you do not have the CPU resources, decrease the OSD count.
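As a purely illustrative sketch of the arithmetic above (2 GB RAM per OSD plus the ~16 GB iSCSI target allowance, and one core per OSD; the function name is made up for this example):

```shell
# Rough resource sizing for a PetaSAN node, following the rules above:
# ~2 GB RAM per OSD plus ~16 GB for the iSCSI target layer, and
# ideally one CPU core per OSD. Illustrative only; check the official
# hardware requirements guide for real deployments.
petasan_vm_size() {
  osds=$1
  ram_gb=$((2 * osds + 16))
  echo "OSDs: $osds, RAM: ${ram_gb} GB, cores: $osds"
}

petasan_vm_size 4   # a node with 4 OSDs needs roughly 24 GB RAM and 4 cores
```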
Physical storage: any storage solution requires high-performance storage. Ideally, give the VM direct control of the disks via PCI passthrough, or use raw device mapping. If you use regular virtual disks, make sure the virtual disk assigned to each OSD sits on a physical disk by itself; having virtual disks share the same physical disk is very bad. If you run the atop command inside the VM and see that your disk and CPU are not busy but CPU time is high in %iowait, it means there is no resource load inside the VM but it is waiting on the hypervisor to fulfill I/O requests. The hypervisor could be underpowered, or, most likely, it is setting I/O limits on the VM to slow it down (so it does not affect other VMs doing disk access). Depending on your hypervisor, you should be able to give the PetaSAN VMs a higher share of disk I/O.
Lastly, the network is generally not a factor unless the connection is flaky; just make sure the backend networks and their switch connections are robust.
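As a rough proxy for what atop shows, cumulative %iowait can be read straight from /proc/stat inside the guest (a sketch; assumes a Linux guest, and `iowait_pct` is just a name for this example):

```shell
# Quick check of cumulative CPU iowait from /proc/stat (Linux guest).
# High %iowait while disks look idle inside the VM suggests the
# hypervisor is throttling or starving the VM's I/O.
iowait_pct() {
  # First line of /proc/stat: "cpu user nice system idle iowait ..."
  read -r _cpu user nice system idle iowait _rest < /proc/stat
  total=$((user + nice + system + idle + iowait))
  echo $((100 * iowait / total))
}

iowait_pct
```

For a live view, `atop` or `iostat -x 1` (from the sysstat package) gives per-interval figures rather than this boot-to-now average.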
Last edited on February 19, 2018, 11:29 am by admin · #2
J0C
2 Posts
February 21, 2018, 10:07 am
What are the signs of them fencing one another?
admin
2,930 Posts
February 21, 2018, 12:51 pm
The log file on the killing node will show "This node will stop node xx". You can check with:
grep "This node will stop node" /opt/petasan/log/PetaSAN.log
Note that you can always disable fencing from the web UI under the cluster maintenance tab, but this is probably not a good idea.
Fencing happens if a node serving iSCSI disks is not responding to heartbeats from the Consul cluster (the heartbeat interval is 15 seconds) and Consul determines it is out of the cluster. Another cluster node needs to handle the iSCSI failover, and it will fence/kill the first node just in case it is still alive (although the first node should clean up its iSCSI disks by itself as well).
The root cause of not responding to heartbeats is either load issues (most likely) or an unreliable network.
Also, if you see OSDs going up and down randomly (flapping), it means they too cannot respond to heartbeats among themselves.
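To get a quick per-day tally of fencing and lock-retry events, the log lines shown earlier in this thread can be summarized with awk (a sketch; `summarize_petasan_log` is a made-up helper name, and it assumes the log format shown above, `DD/MM/YYYY HH:MM:SS LEVEL message`):

```shell
# Count "This node will stop node" (fencing) and "Could not lock path"
# (lock-retry) events per day in a PetaSAN-style log file.
# $1 in awk is the date field at the start of each log line.
summarize_petasan_log() {
  awk '/This node will stop node/ {fence[$1]++}
       /Could not lock path/      {lock[$1]++}
       END {
         for (d in fence) printf "%s fence=%d\n", d, fence[d]
         for (d in lock)  printf "%s lock_retries=%d\n", d, lock[d]
       }' "$1"
}

# Usage: summarize_petasan_log /opt/petasan/log/PetaSAN.log
```

A burst of lock retries just before a fence event on the same day is the pattern to look for: it means the dead node's session was still holding the path locks while the surviving node tried to take over.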
18/02/2018 13:30:22 INFO Start settings IPs
18/02/2018 13:30:23 INFO Successfully set default gateway
18/02/2018 13:30:23 INFO Successfully set node ips
18/02/2018 13:30:23 INFO GlusterFS mount attempt
18/02/2018 13:30:23 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.31.50.222 -retry-join 172.31.50.220 -retry-join 172.31.50.221
18/02/2018 13:30:26 INFO Starting cluster file sync service
18/02/2018 13:30:26 INFO Starting iSCSI Service
18/02/2018 13:30:26 INFO Starting Cluster Management application
18/02/2018 13:30:26 INFO Starting Node Stats Service
18/02/2018 13:30:27 INFO Service is starting.
18/02/2018 13:30:27 INFO Cleaning unused configurations.
18/02/2018 13:30:27 INFO Cleaning all mapped disks
18/02/2018 13:30:27 INFO Cleaning unused rbd images.
18/02/2018 13:30:30 WARNING , retrying in 1 seconds...
18/02/2018 13:30:32 INFO Cleaning unused ips.
admin
2,930 Posts
Quote from admin on February 19, 2018, 11:25 am
Hi,
Running PetaSAN virtualized is not supported; however, I will give you some pointers:
Fencing may make the issue more visible, but it is not the main cause, and it is probably best to let it happen. What we found in our testing is that you need to give the VMs the resources they need; the errors you see are consistent with nodes being unable to respond to cluster heartbeats in a timely manner. Ceph is resource hungry and there is no cutting corners. We have a hardware requirements guide that you should try to stay close to; the likely resource constraints are:
RAM: make sure you have 2 GB per OSD; the iSCSI target is specced at another 16 GB. The target RAM can be reduced, but do not reduce the per-OSD RAM requirement. If you do not have much RAM, limit the number of OSDs.
Cores: it is better to have 1 core per OSD. This can be decreased a little, but if you do not have the CPU resources, decrease the OSD count.
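As a rough sanity check, the RAM and core rules above can be turned into a quick calculation. This is just a sketch; min_ram_gb is our own helper, not a PetaSAN tool, and the OSD count is an example:

```shell
# Rough sizing per the rules above: 2 GB RAM per OSD plus ~16 GB for the
# iSCSI target layer. min_ram_gb is a hypothetical helper, not part of PetaSAN.
min_ram_gb() {
    osds=$1
    echo $(( osds * 2 + 16 ))
}

osds=4   # example OSD count per node
echo "OSDs: $osds -> minimum RAM: $(min_ram_gb $osds) GB, suggested cores: $osds"
# prints: OSDs: 4 -> minimum RAM: 24 GB, suggested cores: 4
```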
Physical storage: any storage solution requires high-performance storage. Ideally, put the disks in PCI passthrough and let the VM control them directly, or use raw device mapping. If you use regular virtual disks, make sure the virtual disk assigned to each OSD is on a physical disk by itself; having virtual disks share the same physical disk is very bad. If you run the atop command inside the VM and see that your disk and CPU are not busy but the CPU is high in %iowait, it means there is no resource load inside the VM, but it is waiting on the hypervisor to fulfill I/O requests. The hypervisor could be underpowered, or, most likely, it is setting I/O limits on the VM to slow it down (so it does not affect other VMs doing disk access); depending on your hypervisor, you should be able to give the PetaSAN VMs a higher share of the disk I/O.
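If atop is not installed in the VM, a minimal sketch of the same %iowait check can be done from /proc/stat (Linux only; iowait_pct is our own helper):

```shell
# Sample system-wide %iowait from /proc/stat over a short interval,
# roughly what atop reports on its cpu line. Linux only; helper is ours.
iowait_pct() {
    interval=${1:-2}
    # /proc/stat "cpu" line fields: user nice system idle iowait irq softirq steal
    set -- $(awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8+$9, $6}' /proc/stat)
    t1=$1; w1=$2
    sleep "$interval"
    set -- $(awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8+$9, $6}' /proc/stat)
    awk -v t=$(($1 - t1)) -v w=$(($2 - w1)) \
        'BEGIN { printf "%.1f\n", (t > 0 ? 100 * w / t : 0) }'
}

iowait_pct 2   # prints a single percentage; a high value while in-VM disks
               # look idle points at the hypervisor throttling I/O
```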
Lastly, the network is generally not a factor unless the connection is flaky; just make sure the backend networks and their switch connections are robust.
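A minimal robustness check of the backend links could look like this. The peer IPs are the ones visible in the consul command in your log; substitute your own, and note loss_of is just our own extraction helper for iputils ping's summary line:

```shell
# Ping each backend peer briefly and report packet loss. loss_of extracts
# the loss field from ping's summary line (iputils output format assumed).
loss_of() {
    awk -F', ' '/packet loss/ { print $3 }'
}

for peer in 172.31.50.220 172.31.50.221; do
    loss=$(ping -c 2 -W 1 -q "$peer" 2>/dev/null | loss_of)
    echo "$peer loss: ${loss:-unreachable}"
done
```

Any non-zero loss on the backend subnet is worth chasing before blaming fencing.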
J0C
2 Posts
Quote from J0C on February 21, 2018, 10:07 am
What are the signs of them fencing one another?
admin
2,930 Posts
Quote from admin on February 21, 2018, 12:51 pm
The log file on the killing node will show "This node will stop node" xx; you can check:
cat /opt/petasan/log/PetaSAN.log | grep "This node will stop node"
Note that you can always disable fencing from the web UI under the cluster Maintenance tab, but this is probably not a good idea.
Fencing will happen if a node serving iSCSI disks does not respond to heartbeats from the Consul cluster (the heartbeat interval is 15 seconds) and Consul determines it is out of the cluster. Another cluster node needs to handle the iSCSI failover, and it will fence/kill the first node just in case it is still alive (although the first node should clean up its iSCSI disks by itself as well).
The root cause of not responding to heartbeats is either load issues (most likely) or an unreliable network.
Also, if you see OSDs going up and down randomly (flapping), it means they too cannot respond to heartbeats among themselves.
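To get a quick overview across these symptoms, you could wrap the grep above into a small summary. petasan_log_summary is our own helper; the message strings are the ones that appear in the log excerpts in this thread:

```shell
# Count fencing and heartbeat-trouble events in a PetaSAN log file.
# Message strings match the PetaSAN.log excerpts quoted in this thread.
petasan_log_summary() {
    log=$1
    printf 'fence events:    %s\n' "$(grep -c 'This node will stop node' "$log")"
    printf 'lock contention: %s\n' "$(grep -c 'Could not lock path' "$log")"
    printf 'consul retries:  %s\n' "$(grep -c 'retrying in 1 seconds' "$log")"
}

petasan_log_summary /opt/petasan/log/PetaSAN.log
```

Rising counts across all three usually point at the same underlying load or network problem rather than at fencing itself.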