Nodes shutting down
khopkins
96 Posts
June 15, 2018, 6:32 pm
Have an issue where a node will shut down by itself. One thing we noticed is that the node with the web interface attached was the one shutting down. A web browser has now been attached to the 3rd node to see how that goes, but it has happened on both node 1 and node 2. I keep a web browser open on the node at all times to monitor the interface.
PetaSAN log
15/06/2018 08:46:03 WARNING , retrying in 1 seconds...
15/06/2018 08:46:04 WARNING , retrying in 1 seconds...
15/06/2018 08:46:11 WARNING , retrying in 2 seconds...
15/06/2018 08:46:12 ERROR Node Stats exception.
15/06/2018 08:46:12 ERROR Error running echo command :echo 'PetaSAN.NodeStats.PS-Node-1.cpu_all.percent_util 0.4' `date +%s` | nc -q0 172.16.14.32 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 120, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 52, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 61, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 48, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.PS-Node-1.cpu_all.percent_util 0.4' `date +%s` | nc -q0 172.16.14.32 2003
15/06/2018 08:46:13 WARNING , retrying in 2 seconds...
15/06/2018 08:46:19 WARNING , retrying in 1 seconds...
15/06/2018 08:46:20 WARNING , retrying in 1 seconds...
15/06/2018 08:46:20 WARNING , retrying in 4 seconds...
15/06/2018 08:46:22 WARNING , retrying in 4 seconds...
15/06/2018 08:46:28 WARNING , retrying in 2 seconds...
15/06/2018 08:46:28 WARNING , retrying in 2 seconds...
15/06/2018 08:46:39 INFO Cleaned disk path 00001/1.
15/06/2018 08:46:39 INFO PetaSAN cleaned local paths not locked by this node in consul.
15/06/2018 08:46:39 INFO LIO deleted backstore image image-00001
15/06/2018 08:46:39 INFO LIO deleted Target iqn.2018-05.com.tec:00001:00001
15/06/2018 08:46:39 INFO Image image-00001 unmapped successfully.
15/06/2018 08:46:39 INFO PetaSAN cleaned iqns.
15/06/2018 08:46:39 WARNING , retrying in 1 seconds...
15/06/2018 08:46:40 WARNING , retrying in 2 seconds...
15/06/2018 08:46:42 WARNING , retrying in 4 seconds...
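(Note: the failing command in that traceback is simply pushing one metric line to the stats server over TCP port 2003, in the Graphite/carbon plaintext format of one "path value timestamp" line per metric. A rough Python sketch of an equivalent test, useful for checking whether the node can actually reach that port, is below; the address is taken from the log above and the metric name is only a placeholder.)

# Hypothetical connectivity check for the stats (Graphite/carbon) port seen in the log.
# The plaintext protocol is one line per metric: "<path> <value> <unix-timestamp>\n".
import socket
import time

STATS_HOST = "172.16.14.32"   # address taken from the log above
STATS_PORT = 2003             # carbon plaintext port used by the nc command

def send_test_metric():
    # placeholder metric name, only used to exercise the connection
    line = "PetaSAN.NodeStats.PS-Node-1.test_metric 1 %d\n" % int(time.time())
    try:
        with socket.create_connection((STATS_HOST, STATS_PORT), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))
        print("metric sent, port %d reachable" % STATS_PORT)
    except OSError as exc:
        print("cannot reach %s:%d - %s" % (STATS_HOST, STATS_PORT, exc))

if __name__ == "__main__":
    send_test_metric()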
therm
121 Posts
June 15, 2018, 8:08 pm
The node shutdowns we had in our cluster were caused by network problems in the backend network.
- Double check that your NICs are connected correctly. We mixed up the links on one host so that the iSCSI NICs were connected to the backend network and vice versa.
- Check whether the switches show errors on the network ports, and replace the SFPs if so.
- A bug in one of the early versions was that NICs could change names in PetaSAN when a node rebooted. Check that the connections correspond to the correct IPs (see the sketch after this post).
Regards,
Dennis
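(As a rough aid for the NIC/IP check in the last bullet, a sketch like the following, run on each node, prints the current interface-to-address mapping so it can be compared against the intended management, iSCSI, and backend subnets. Nothing PetaSAN-specific is assumed; it just wraps the standard ip command.)

# Rough helper to list interface -> IPv4 address mappings on a node,
# so they can be compared against the intended backend/iSCSI subnets.
import subprocess

def list_ipv4_addresses():
    # "ip -o -4 addr show" prints one line per IPv4 address with its interface name
    output = subprocess.check_output(["ip", "-o", "-4", "addr", "show"], text=True)
    for line in output.splitlines():
        fields = line.split()
        # fields: index, interface name, "inet", address/prefix, ...
        print("%-12s %s" % (fields[1], fields[3]))

if __name__ == "__main__":
    list_ipv4_addresses()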
khopkins
96 Posts
June 15, 2018, 9:22 pm
Hi Dennis, appreciate the feedback. Right now it's in a test configuration (latest version), so there is no complicated backend involved: just the 3 nodes, with two Ethernet ports each, going to a small unmanaged switch. Good information; I will watch for this.
admin
2,930 Posts
June 15, 2018, 9:51 pm
Adding to Dennis: the error logs show connection failures for 2 different tasks, neither of which should depend on whether the node is being accessed via browser or not:
- The node fails to save its CPU metric stats to the stats database.
- The iSCSI service could not connect to the Consul cluster. After several attempts it cleared its LIO assignment to image-00001 path 1, effectively stopping the path, since the node is now not part of the cluster. Other nodes will get assigned this path, but they will also kill the initial node, since there is a possibility it is in bad shape and did not unmap the path correctly. This is most likely what led to the shutdown.
These connections happen on the backend 1 network, which is in line with what Dennis stated. Maybe turn off fencing for a while and see whether the nodes stop shutting down, but this will not solve the root issue.
Last edited on June 15, 2018, 9:52 pm by admin · #4
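(For the Consul side of this, one quick way to watch for the connection failures described above is to ask the local Consul agent whether it currently sees a cluster leader. The sketch below assumes Consul's stock HTTP API on port 8500 of the node itself, which may not match a particular PetaSAN configuration.)

# Minimal check that the local Consul agent currently knows a cluster leader.
# Assumes the agent listens on the default HTTP API port 8500 (an assumption).
import urllib.request

def consul_has_leader(host="127.0.0.1", port=8500):
    url = "http://%s:%d/v1/status/leader" % (host, port)
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            # endpoint returns the leader address as a quoted string, or "" if none
            leader = resp.read().decode("utf-8").strip().strip('"')
            return bool(leader)
    except OSError:
        return False

if __name__ == "__main__":
    print("consul leader present:", consul_has_leader())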
khopkins
96 Posts
June 18, 2018, 2:15 pm
Wow, strange issue. I pointed the web browser at node 3 and it failed this weekend as well. I could understand this happening to "a" node, but being able to pick which node fails by attaching a web browser to it is something else; I can't really see all 3 nodes having the same issue. I'm using a Dell R410 that has two Ethernet ports, and those go to a single unmanaged 1G switch, so everything is on the same switch. Nothing is using the cluster for storage, so traffic is minimal. I'll take the web browser off and see how long it runs for now; if it keeps running, there's something else going on. Is there anything else I can do?
admin
2,930 Posts
June 18, 2018, 8:57 pm
I will get this tested here. I presume you leave it open on the dashboard page for a day or 2; I will let you know if we find anything.
Just to double check: you do not see any high resource load in the historic charts at the time of the failure? Let me know if you find any other clues. I would also recommend you turn off fencing, but keep an eye on the logs for any Consul connection failures. If you can also make a script to periodically check that you can ping on backend 1, it may help as well (a rough sketch follows this post).
Last edited on June 18, 2018, 9:01 pm by admin · #6
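(A rough version of the periodic ping check suggested above might look like the following. The backend addresses are placeholders and need to be replaced with the real backend 1 IPs of the other nodes; it logs a timestamped line whenever a ping fails, so any gaps can later be lined up against the PetaSAN log around a shutdown.)

# Rough periodic reachability check for the backend 1 network.
# BACKEND_IPS is a placeholder list - fill in the real backend addresses of the other nodes.
import subprocess
import time
from datetime import datetime

BACKEND_IPS = ["10.0.2.11", "10.0.2.12"]   # placeholder addresses
INTERVAL_SECONDS = 10

def ping_once(ip):
    # -c 1: single echo request, -W 2: wait at most 2 seconds for a reply
    result = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

if __name__ == "__main__":
    while True:
        for ip in BACKEND_IPS:
            if not ping_once(ip):
                print("%s ping to %s FAILED" % (datetime.now().isoformat(), ip), flush=True)
        time.sleep(INTERVAL_SECONDS)

Running it under nohup with output redirected to a file leaves a record to compare against the log timestamps after the next failure.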
khopkins
96 Posts
June 18, 2018, 10:02 pm
It has SNMP configured, reporting to an NMS, and nothing strange is going on there. I have replaced the Ethernet cables just in case. Appreciate it! Thanks.
Last edited on June 19, 2018, 12:46 pm by khopkins · #7
admin
2,930 Posts
June 21, 2018, 1:20 pm
We tested it for 2 days with v2.0 without seeing the problem. What version of PetaSAN do you use? What browser: Chrome or Firefox? Do you have any resource load at the time of the problem?
khopkins
96 Posts
June 25, 2018, 12:41 pm
I'm using your v2 (latest) and Firefox. I've had it running for a week without the web browser and did have a failure on one of the nodes. I'll find another switch and try that. Appreciate you running it on your test system.
khopkins
96 Posts
June 27, 2018, 12:54 pm
Replaced the switch and node 2 turned off. So we have replaced the switch and the cables and it still drops nodes; I don't know what to do next or what is causing the issue.
Last edited on June 27, 2018, 2:56 pm by khopkins · #10