Nodes shutting down

I have an issue where a node will shut down by itself.  One thing I noticed is that the node shutting down was the one with the web interface attached.  I have now attached the web interface to the 3rd node to see how that goes, but it has already happened to both node 1 and node 2.  I keep a web browser open on the node all the time to monitor the interface.

 

PetaSAN log:

15/06/2018 08:46:03 WARNING , retrying in 1 seconds...
15/06/2018 08:46:04 WARNING , retrying in 1 seconds...
15/06/2018 08:46:11 WARNING , retrying in 2 seconds...
15/06/2018 08:46:12 ERROR Node Stats exception.
15/06/2018 08:46:12 ERROR Error running echo command :echo 'PetaSAN.NodeStats.PS-Node-1.cpu_all.percent_util 0.4' `date +%s` | nc -q0 172.16.14.32 2003
Traceback (most recent call last):
  File "/opt/petasan/scripts/node_stats.py", line 120, in <module>
    get_stats()
  File "/opt/petasan/scripts/node_stats.py", line 52, in get_stats
    _get_cpu(sar_result)
  File "/opt/petasan/scripts/node_stats.py", line 61, in _get_cpu
    _send_graphite(path_key, val)
  File "/opt/petasan/scripts/node_stats.py", line 48, in _send_graphite
    raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.PS-Node-1.cpu_all.percent_util 0.4' `date +%s` | nc -q0 172.16.14.32 2003
15/06/2018 08:46:13 WARNING , retrying in 2 seconds...
15/06/2018 08:46:19 WARNING , retrying in 1 seconds...
15/06/2018 08:46:20 WARNING , retrying in 1 seconds...
15/06/2018 08:46:20 WARNING , retrying in 4 seconds...
15/06/2018 08:46:22 WARNING , retrying in 4 seconds...
15/06/2018 08:46:28 WARNING , retrying in 2 seconds...
15/06/2018 08:46:28 WARNING , retrying in 2 seconds...
15/06/2018 08:46:39 INFO Cleaned disk path 00001/1.
15/06/2018 08:46:39 INFO PetaSAN cleaned local paths not locked by this node in consul.
15/06/2018 08:46:39 INFO LIO deleted backstore image image-00001
15/06/2018 08:46:39 INFO LIO deleted Target iqn.2018-05.com.tec:00001:00001
15/06/2018 08:46:39 INFO Image image-00001 unmapped successfully.
15/06/2018 08:46:39 INFO PetaSAN cleaned iqns.
15/06/2018 08:46:39 WARNING , retrying in 1 seconds...
15/06/2018 08:46:40 WARNING , retrying in 2 seconds...
15/06/2018 08:46:42 WARNING , retrying in 4 seconds...

The node shutdowns we had in our cluster were caused by network problems in the backend network.

  • Double-check that your NICs are connected correctly. We mixed up the links on one host, so the iSCSI NICs were connected to the backend network and the other way around.
  • Check whether the switches show errors on the network ports, and replace the SFPs if so.
  • A bug in one of the early versions meant that NICs could change names in PetaSAN when a node rebooted. Check that the connections correspond to the correct IPs.

Regards,

Dennis
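For the last check in Dennis's list (interface names versus IPs), here is a minimal Python sketch that could be run on each node after a reboot. It parses the output of the Linux `ip -o -4 addr show` command and compares it with the subnets you expect on each interface. The function names, interface names, and subnets are hypothetical placeholders and not something PetaSAN provides; substitute your own values.

```python
import ipaddress
import subprocess


def parse_ip_addr(output):
    """Map interface name -> list of IPv4 addresses (CIDR form).

    Expects the one-line-per-address format of `ip -o -4 addr show`,
    i.e. "<index>: <iface> inet <addr>/<prefix> ...".
    """
    mapping = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            mapping.setdefault(fields[1], []).append(fields[3])
    return mapping


def find_mismatches(mapping, expected):
    """Return (iface, expected_subnet, actual_addrs) for each interface
    that has no address inside its expected subnet."""
    problems = []
    for iface, subnet in expected.items():
        network = ipaddress.ip_network(subnet)
        addrs = mapping.get(iface, [])
        if not any(ipaddress.ip_interface(a).ip in network for a in addrs):
            problems.append((iface, subnet, addrs))
    return problems


def current_mapping():
    """Read the live interface/IP mapping from the local node (Linux only)."""
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    )
    return parse_ip_addr(result.stdout)
```

For example, `find_mismatches(current_mapping(), {"eth0": "10.0.1.0/24", "eth1": "10.0.2.0/24"})` returns an empty list when both (made-up) interfaces carry an address in the subnet you planned for them, and lists the offending interface otherwise.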

Hi Dennis, appreciate the feedback.  Right now it's in a test configuration (latest version), so no complicated backend is involved, just the 3 nodes with two Ethernet ports each going to a small (unmanaged) switch.  Good information, I will watch for this.

Adding to what Dennis said, the error logs show connection failures for 2 different tasks, and they do not appear to depend on whether the node is being accessed via a browser or not:

  1. The node fails to save its CPU metric stats to the stats database.
  2. The iSCSI service could not connect to the Consul cluster. After several attempts it cleared its LIO assignment for image-00001 path 1, effectively stopping the path, since the node is no longer part of the cluster. Other nodes will get assigned this path, but they will also kill the initial node, since there is a possibility it is in bad shape and did not unmap the path correctly.  This is most likely what led to the shutdown.

These connections happen on the backend 1 network, which is in line with what Dennis stated. You could turn off fencing for a while and see whether the nodes stop shutting down, but this will not solve the root issue.
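To reproduce the first failure by hand, the log shows that the stats script pipes a line of the form `<path> <value> <timestamp>` into `nc` towards 172.16.14.32 port 2003, which matches the Graphite plaintext protocol. The sketch below does the same over a plain TCP socket and reports the socket error instead of a generic exception. It is only a test aid under those assumptions, not how node_stats.py itself is written; the function names are made up, and a throwaway metric path should be used so real charts are not polluted.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Graphite plaintext line: "<path> <value> <unix-ts>\\n"."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(host, port, line, timeout_s=3.0):
    """Try to deliver one line over TCP.

    Returns (True, "") on success or (False, "<error text>") when the
    connection is refused, unreachable, or times out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(line.encode("ascii"))
        return True, ""
    except OSError as exc:  # includes timeouts and "no route to host"
        return False, str(exc)
```

Running `send_metric("172.16.14.32", 2003, format_metric("test.connectivity", 1))` from the affected node during a period of "retrying" warnings should show whether the stats server is reachable over the backend at that moment, and if not, which error the socket reports.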

Wow, strange issue.  I moved the web browser to node 3 and it failed this weekend as well.  I could see this happening to "a" node, but being able to pick the node that fails by attaching a web browser to it is something else; I can't really see all 3 nodes having the same issue.  I'm using a Dell R410 that has two Ethernet ports, and those go to a single unmanaged 1G switch, so everything is on the same switch.  Nothing is using the cluster for storage, so traffic is minimal.  I'll take the web browser off and see how long it can go for now; if it runs, there's something else going on.  Is there anything else I can do?

I will get this tested here. I presume you leave it open on the dashboard page for a day or 2. I will let you know if we find anything.

Just to double-check, you do not see any high resource load in the historic charts at the time of the failure? Let me know if you find any other clues.  I would also recommend you turn off fencing but keep an eye on the logs for any Consul connection failures. If you can also make a script that periodically checks that you can ping on backend 1, that may help too.
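A minimal sketch of such a periodic ping check, in Python. It assumes the Linux iputils `ping` (`-c` for count, `-W` for the reply timeout in seconds) and a list of the other nodes' backend 1 addresses that you supply yourself; the function names are hypothetical. Failures are printed with the same date format as the PetaSAN log so the two can be lined up.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to host gets a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(hosts, interval_s=5, rounds=None, ping=ping_once, sleep=time.sleep):
    """Ping every host once per round and record the failures.

    rounds=None loops until interrupted; ping and sleep can be swapped
    out, which keeps the loop testable without a network.
    Returns a list of (timestamp, host) tuples for failed pings.
    """
    failures = []
    done = 0
    while rounds is None or done < rounds:
        for host in hosts:
            if not ping(host):
                stamp = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
                failures.append((stamp, host))
                print(f"{stamp} ping to {host} FAILED", flush=True)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
    return failures
```

Calling `monitor([...backend 1 IPs of the other nodes...])` on each node with its output redirected to a file gives a record to compare against the "retrying" warnings the next time a node shuts down.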

It has SNMP configured going to an NMS, and nothing strange is going on.  I have replaced the Ethernet cables just in case.  Appreciate it! Thanks

We tested it for 2 days with v2.0 without seeing the problem. What version of PetaSAN do you use? What browser, Chrome or Firefox? Do you have any resource load at the time of the problem?

I'm using your v2 (latest) with Firefox.  I've had it running for a week without a web browser and still had a failure on one of the nodes.  I'll find another switch and try that.  Appreciate you running it on your test system.

Replaced the switch and node 2 turned off.  So we have replaced the switch and the cables and it still drops nodes. I don't know what to do next or what is causing the issue.
