No node stats after upgrading 3.0.1 to 3.0.2
BeHau
12 Posts
March 23, 2022, 8:03 am
Node stats are gone after upgrading this morning.
CPU and memory are still shown, but neither disk I/O nor network is shown correctly. I restarted node-statistics as mentioned in http://www.petasan.org/forums/?view=thread&id=1022, but still no stats.
See node log and pictures attached.
PetaSAN.NodeStats.TG-2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG-2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 168, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 66, in get_stats
graphite_sender.send(leader_ip)
File "/usr/lib/python3/dist-packages/PetaSAN/core/common/graphite_sender.py", line 59, in send
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 3.1 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -3.17 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 133.12 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
https://cloud.technigroup.de/index.php/s/jG0TUPIWzM5WqUl
https://cloud.technigroup.de/index.php/s/AIcnOasGcisyWf9
Thank you for your help.
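For reference, the traceback shows the exact shell pipeline graphite_sender.py builds: it echoes all the Carbon plaintext metric lines ("path value timestamp") and pipes them to nc on the stats server. The following is only a rough, stand-alone Python sketch of that same pipeline, useful for reproducing the error by hand outside the service; the node name and leader IP are copied from the log above, and 2003 is assumed because it is Carbon's default plaintext port.

#!/usr/bin/env python3
# Rough sketch only: rebuild the same kind of "echo ... | nc" pipeline that
# the traceback shows graphite_sender.py running, so it can be tried by hand.
# Node name, metric paths and leader IP are taken from the log above;
# 2003 is Carbon's default plaintext port.
import subprocess
import time

LEADER_IP = "172.31.52.103"   # stats/leader node from the log
NODE = "TG2"

now = int(time.time())
payload = "\n".join([
    f"PetaSAN.NodeStats.{NODE}.cpu_all.percent_util 3.1 {now}",
    f"PetaSAN.NodeStats.{NODE}.ifaces.percent_util.eth0 0.0 {now}",
])

# Same shape of command as in the log: echo the metric lines, pipe to nc.
cmd = f'echo "{payload}" | nc -v -q0 {LEADER_IP} 2003'
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

print("return code:", result.returncode)
print("stdout:", result.stdout)
print("stderr:", result.stderr)   # nc -v prints its status line here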
admin
2,918 Posts
March 23, 2022, 10:14 am
Can you double check whether, on the same node, some parameters like cpu% really are working while others like disk% are not? Or is it that some nodes report all their parameters while other nodes report none?
# get the stats server ip from
/opt/petasan/scripts/util/get_cluster_leader.py
# this will show current stats server
/opt/petasan/scripts/util/get_cluster_leader.py
based on the logs this should be 172.31.52.103
# on that server, show status of the graphite service
systemctl status carbon-cache
# from TG2 send a fake 3% eth0 % util to this server (adjust the ip at the end if needed)
echo "PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 3.0 `date +%s`" | nc -v -q0 172.31.52.103
Another thing to check: can you make sure the interface names (eth1/2) match the correct physical interfaces? You can check that from the node list page or from the blue node console.
Is the system under high load at times? Does it meet our hardware guide?
Last edited on March 23, 2022, 10:16 am by admin · #2
BeHau
12 Posts
March 23, 2022, 12:12 pm
># get the stats server ip from
>/opt/petasan/scripts/util/get_cluster_leader.py
># this will show current stats server
>/opt/petasan/scripts/util/get_cluster_leader.py
Yes, this is 172.31.52.103.
root@TG3:~# systemctl status carbon-cache
● carbon-cache.service - Graphite Carbon Cache
Loaded: loaded (/lib/systemd/system/carbon-cache.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2022-03-23 06:42:57 CET; 5h 51min ago
Docs: https://graphite.readthedocs.io
Process: 3574775 ExecStart=/usr/bin/carbon-cache --config=/etc/carbon/carbon.conf --pidfile=/var/run/carbon-cache.pid --logdir=/var/log/carbon/ start (code=exited, status=0/SUCCESS)
Main PID: 3574779 (carbon-cache)
Tasks: 3 (limit: 114318)
Memory: 38.1M
CGroup: /system.slice/carbon-cache.service
└─3574779 /usr/bin/python3 /usr/bin/carbon-cache --config=/etc/carbon/carbon.conf --pidfile=/var/run/carbon-cache.pid --logdir=/var/log/carbon/ start
Mar 23 06:42:57 TG3 systemd[1]: Starting Graphite Carbon Cache...
Mar 23 06:42:57 TG3 systemd[1]: Started Graphite Carbon Cache.
The result of "echo "PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 3.0 `date +%s`" | nc -v -q0 172.31.52.103" is: "nc: missing port number"
EDIT: found I need to use port 2003. Result:
Connection to 172.31.52.103 2003 port [tcp/cfinger] succeeded!
Last edited on March 23, 2022, 12:34 pm by BeHau · #3
admin
2,918 Posts
March 23, 2022, 12:47 pm
1) Do you see the fake entry we manually sent for eth0 % util 3% for node TG2 on the dashboard charts?
2) On TG2 what is the status of:
systemctl status petasan-node-stats
3) On TG2, in PetaSAN.log, do you still see errors trying to send stats values (the same as we did manually)? Are the errors recent and repeating, or are they old?
4) On TG1 or TG3
systemctl restart petasan-node-stats
does that make this node's values start to show after a few minutes? (wait 5 min) Also do it for 1 node only so we can debug more if needed.
5) Is the cluster healthy? Any clock skew warnings? Are the browser and servers in the same time zone, or do they at least have correct times?
Last edited on March 23, 2022, 12:48 pm by admin · #4
BeHau
12 Posts
March 23, 2022, 3:09 pm
1) Do you see the fake entry we manually sent for eth0 % util 3% for node TG2 on the dashboard charts?
Yes.
https://cloud.technigroup.de/index.php/s/yfAAj4kfcsZtJVW
2) On TG2 what is the status of:
systemctl status petasan-node-stats
root@TG2:~# systemctl status petasan-node-stats
● petasan-node-stats.service - PetaSAN Node Stats Service
Loaded: loaded (/lib/systemd/system/petasan-node-stats.service; static; vendor preset: enabled)
Active: active (running) since Wed 2022-03-23 06:45:31 CET; 7h ago
Main PID: 2266 (node_stats.py)
Tasks: 1 (limit: 114318)
Memory: 33.2M
CGroup: /system.slice/petasan-node-stats.service
└─2266 /usr/bin/python3 /opt/petasan/scripts/node_stats.py
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
3) On TG2, in PetaSAN.log, do you still see errors trying to send stats values (the same as we did manually)? Are the errors recent and repeating, or are they old?
There are still errors.
root@TG2:/opt/petasan/log# tail -f PetaSAN.log
Exception: Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 1.89 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -45.59 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 174.08 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 30.72 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
23/03/2022 14:48:51 ERROR Node Stats exception.
23/03/2022 14:48:51 ERROR Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 0.51 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -45.48 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 92.16 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 168, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 66, in get_stats
graphite_sender.send(leader_ip)
File "/usr/lib/python3/dist-packages/PetaSAN/core/common/graphite_sender.py", line 59, in send
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 0.51 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -45.48 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 92.16 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
^C
4) On TG1 or TG3
systemctl restart petasan-node-stats
does that make this node's values start to show after a few minutes? (wait 5 min) Also do it for 1 node only so we can debug more if needed.
No, it did not.
5) Is the cluster healthy? Any clock skew warnings? Are the browser and servers in the same time zone, or do they at least have correct times?
root@TG1:/opt/petasan/log# ceph health
HEALTH_OK
No warnings, no clock skew. Well, it's been working until upgrading from 3.0.1 to 3.0.2.
admin
2,918 Posts
March 23, 2022, 5:26 pm
I can confirm that with 3.0.2 some charts are working and others are not. We are working on a fix for this.
admin
2,918 Posts
March 23, 2022, 11:46 pm
Thank you for your help. We identified an issue with the netcat command, which changed in Ubuntu 20.04 to the OpenBSD variant that behaves differently from the previous version used in 18.04.
Can you please test the following patch:
https://drive.google.com/file/d/1_lAyaZOUh3rz9aYg91Mrmi0a_1ZgsiWr/view?usp=sharing
apply with
patch -p1 -d / < netcat-openbsd-ignore-stderr.patch
systemctl restart petasan-node-stats
Please do let us know if this fixes the issue.
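For anyone curious about the underlying behaviour: the OpenBSD netcat prints its "-v" status line ("Connection to ... succeeded!") on stderr even when the send works, and, as the patch name suggests, the stats sender was presumably treating any stderr output from the pipeline as a failure. Below is only a rough Python sketch that makes the effect visible; it assumes nc is the OpenBSD variant shipped with Ubuntu 20.04, that carbon-cache is listening on the stats node, and reuses the IP and metric path from this thread.

#!/usr/bin/env python3
# Rough sketch of the behaviour difference: the OpenBSD nc prints its "-v"
# status line on stderr even on success, so a wrapper that treats any stderr
# output as an error will raise although the metric was actually delivered.
import subprocess
import time

HOST, PORT = "172.31.52.103", 2003   # stats node / Carbon plaintext port
payload = f"PetaSAN.NodeStats.TG2.cpu_all.percent_util 1.0 {int(time.time())}\n"

result = subprocess.run(
    ["nc", "-v", "-q0", HOST, str(PORT)],
    input=payload, capture_output=True, text=True,
)

print("return code:", result.returncode)  # 0: the send itself succeeded
print("stderr:", result.stderr.strip())
# e.g. "Connection to 172.31.52.103 2003 port [tcp/cfinger] succeeded!"
# A check along the lines of "if stderr: raise" misreads this banner as a failure.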
Last edited on March 23, 2022, 11:51 pm by admin · #7
BeHau
12 Posts
March 24, 2022, 7:38 am
Perfect! All node stats are working as expected now.
Thank you very much for tracking down this issue. I highly appreciate your work.