
No node stats after upgrading 3.0.1 to 3.0.2

Node stats are gone after upgrading this morning.

CPU and memory are still shown, but neither disk I/O nor network is displayed correctly. I restarted node-statistics as mentioned in http://www.petasan.org/forums/?view=thread&id=1022, but there are still no stats.

See the node log and the pictures attached.

 

PetaSAN.NodeStats.TG-2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG-2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 168, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 66, in get_stats
graphite_sender.send(leader_ip)
File "/usr/lib/python3/dist-packages/PetaSAN/core/common/graphite_sender.py", line 59, in send
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 3.1 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -3.17 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 133.12 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003

 

https://cloud.technigroup.de/index.php/s/jG0TUPIWzM5WqUl

https://cloud.technigroup.de/index.php/s/AIcnOasGcisyWf9

 

Thank you for your help.

Can you double-check: are some parameters (e.g. CPU %) working while others (e.g. disk %) are not, for the same node? Or do some nodes work for all parameters while other nodes fail for all of theirs?

# get the stats server ip, this will show the current stats server
/opt/petasan/scripts/util/get_cluster_leader.py

Based on the logs this should be 172.31.52.103

# on that server, show the status of the graphite service
systemctl status carbon-cache

# from TG2, send a fake 3% eth0 util value to this server (adjust the ip at the end if needed)
echo "PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 3.0 `date +%s`" | nc -v -q0 172.31.52.103

Another thing to check: can you make sure the interface names (eth1/eth2) match the correct physical interfaces? You can check that from the Node List page or from the blue node console.

Is the system under high load at times? Does it meet our hardware guide?

># get the stats server ip, this will show the current stats server
>/opt/petasan/scripts/util/get_cluster_leader.py

Yes, this is 172.31.52.103.

root@TG3:~# systemctl status carbon-cache
● carbon-cache.service - Graphite Carbon Cache
Loaded: loaded (/lib/systemd/system/carbon-cache.service; disabled; vendor preset: enabled)
Active: active (running) since Wed 2022-03-23 06:42:57 CET; 5h 51min ago
Docs: https://graphite.readthedocs.io
Process: 3574775 ExecStart=/usr/bin/carbon-cache --config=/etc/carbon/carbon.conf --pidfile=/var/run/carbon-cache.pid --logdir=/var/log/carbon/ start (code=exited, status=0/SUCCESS)
Main PID: 3574779 (carbon-cache)
Tasks: 3 (limit: 114318)
Memory: 38.1M
CGroup: /system.slice/carbon-cache.service
└─3574779 /usr/bin/python3 /usr/bin/carbon-cache --config=/etc/carbon/carbon.conf --pidfile=/var/run/carbon-cache.pid --logdir=/var/log/carbon/ start

Mar 23 06:42:57 TG3 systemd[1]: Starting Graphite Carbon Cache...
Mar 23 06:42:57 TG3 systemd[1]: Started Graphite Carbon Cache.

 

The result of  "echo "PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 3.0 `date +%s`" | nc -v -q0 172.31.52.103" is: "nc: missing port number"

 

EDIT: found that port 2003 has to be used. Result:
Connection to 172.31.52.103 2003 port [tcp/cfinger] succeeded!

1) Do you see the fake entry we manually sent for eth0 % util (3%) on node TG2 in the dashboard charts?

2) On TG2 what is the status of:
systemctl status petasan-node-stats

3) On TG2, in PetaSAN.log, do you still see errors trying to send stats values (the same kind we sent manually)? Are the errors recent and repeating, or are they old?

4) On TG1 or TG3:
systemctl restart petasan-node-stats
Does that make that node's values start to show after a few minutes? (wait 5 min) Also do this for 1 node only so we can debug further if needed.

5) Is the cluster healthy? Any clock skew warnings? Are the browser and the servers in the same time zone, or do they at least have correct times?

1) Do you see the fake entry we manually sent for eth0 % util (3%) on node TG2 in the dashboard charts?

Yes.

https://cloud.technigroup.de/index.php/s/yfAAj4kfcsZtJVW

 

2) On TG2 what is the status of:
systemctl status petasan-node-stats

root@TG2:~# systemctl status petasan-node-stats
● petasan-node-stats.service - PetaSAN Node Stats Service
Loaded: loaded (/lib/systemd/system/petasan-node-stats.service; static; vendor preset: enabled)
Active: active (running) since Wed 2022-03-23 06:45:31 CET; 7h ago
Main PID: 2266 (node_stats.py)
Tasks: 1 (limit: 114318)
Memory: 33.2M
CGroup: /system.slice/petasan-node-stats.service
└─2266 /usr/bin/python3 /opt/petasan/scripts/node_stats.py

Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2773, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2649, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 55}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}
Mar 23 12:33:28 TG2 node_stats.py[2266]: {'nvme0n1': {'power_on_hours': 2774, 'temperature_celsius': 56}, 'nvme1n1': {'power_on_hours': 2650, 'temperature_celsius': 51}, 'sda': {}, 'sdb': {}, 'sdc': {}, 'sdd': {}, 'sde': {}, 'sdf': {}, 'sdg': {}, 'sdh': {}, 'sdi': {}, 'sdj': {}, 'sdk': {}}

 

3) On TG2, in PetaSAN.log, do you still see errors trying to send stats values (the same kind we sent manually)? Are the errors recent and repeating, or are they old?

There are still errors.
root@TG2:/opt/petasan/log# tail -f PetaSAN.log
Exception: Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 1.89 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -45.59 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 174.08 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 30.72 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
23/03/2022 14:48:51 ERROR Node Stats exception.
23/03/2022 14:48:51 ERROR Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 0.51 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -45.48 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 92.16 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 168, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 66, in get_stats
graphite_sender.send(leader_ip)
File "/usr/lib/python3/dist-packages/PetaSAN/core/common/graphite_sender.py", line 59, in send
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 0.51 `date +%s`" "
PetaSAN.NodeStats.TG2.memory.percent_util -45.48 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_received 92.16 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth0_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth1 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_received 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.percent_util.eth2 0.0 `date +%s`" "
PetaSAN.NodeStats.TG2.ifaces.throughput.eth2_received 0.0 `date +%s`" | nc -v -q0 172.31.52.103 2003
^C

4) On TG1 or TG3:
systemctl restart petasan-node-stats
Does that make that node's values start to show after a few minutes? (wait 5 min) Also do this for 1 node only so we can debug further if needed.

No, it did not.

5) Is the cluster healthy? Any clock skew warnings? Are the browser and the servers in the same time zone, or do they at least have correct times?

root@TG1:/opt/petasan/log# ceph health
HEALTH_OK

No warnings, no clock skew. Well, it had been working until the upgrade from 3.0.1 to 3.0.2.

 

I can confirm that with 3.0.2 some charts are working and others are not. We are working on a fix for this.

Thank you for your help. We identified an issue with the netcat command: Ubuntu 20.04 switched to the OpenBSD version, which behaves differently from the previous version used in 18.04.
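For illustration only (this is not the actual graphite_sender.py code, and the exact check it performs is an assumption on my part): the sketch below runs the same kind of echo ... | nc pipeline shown in the logs above. With the newer netcat, the "Connection to ... succeeded!" message is printed on stderr, so any logic that treats non-empty stderr as a failure will raise "Error running echo command" even though the metric was actually delivered.

import subprocess

# same shape of command as in PetaSAN.log above; adjust the IP for your cluster
cmd = ('echo "PetaSAN.NodeStats.TG2.cpu_all.percent_util 3.1 `date +%s`"'
       ' | nc -v -q0 172.31.52.103 2003')

result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print("exit code:", result.returncode)       # 0 -> metric was sent
print("stderr   :", result.stderr.strip())   # e.g. "Connection to ... succeeded!"

if result.stderr:
    # a naive "stderr is non-empty => error" check misreads this as a failure
    print("a naive stderr check would flag this successful send as an error")

Discarding nc's verbose output (or checking the exit code instead of stderr) avoids the false positive, which appears to be what the netcat-openbsd-ignore-stderr patch below does.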

Can you please test following patch:
https://drive.google.com/file/d/1_lAyaZOUh3rz9aYg91Mrmi0a_1ZgsiWr/view?usp=sharing

Apply it with:
patch -p1 -d / < netcat-openbsd-ignore-stderr.patch
systemctl restart petasan-node-stats
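
To check afterwards whether the errors have stopped, here is a minimal sketch (not a PetaSAN tool; it assumes the log path and the error line format shown earlier in this thread) that lists the timestamps of recent Node Stats errors in PetaSAN.log:

# count "Node Stats exception" errors and show when the last few occurred
LOG = "/opt/petasan/log/PetaSAN.log"

with open(LOG, errors="replace") as f:
    hits = [line.split(" ERROR")[0] for line in f
            if "ERROR Node Stats exception" in line]

print("%d Node Stats errors found, most recent:" % len(hits))
for ts in hits[-5:]:
    print(ts)

If no new timestamps appear after the restart, the sender is working again.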

Please do let us know if this fixes the issue.

Perfect! All node stats are working as expected now.

Thank you very much for tracking down this issue. I highly appreciate your work.