v3.0.1 - No graph statistics from nodes 4 and 5

Hi!

I installed a PetaSAN 3.0.1 cluster with 5 nodes: the first three nodes have the MON+OSD roles, and the last two nodes have only the OSD role.

If I choose "Dashboard" -> "View Chart:" -> "Node statistics" and then select any of my first three nodes in the "Node:" drop-down list, everything is OK. I see graphs.

But if I select either of my last two nodes (4 or 5) in the "Node:" drop-down list, I see empty graphs. No errors. No messages about "no data". Just empty graphs.

I logged in to the SSH console of nodes 4 and 5 and checked the gluster peers, but none were listed:

# gluster peer status
Number of Peers: 0

The instructions at https://www.petasan.org/forums/?view=thread&id=181&part=2#postid-871 did not help me. 🙁

Could you please tell me how to fix this problem?


Fixed it myself: I rebooted all 5 nodes and the graphs started working. Rebooting everything is not a good solution, I think, but right now it is the only one I have found. 🙁

Thanks a lot for your feedback. We did find a bug, and the fix will be included in the next bug-fix release, which is due in a couple of days. It is related to Ubuntu 20.04 now using a different, BSD-based version of netcat that behaves slightly differently. The node stats service uses netcat to communicate with the stats server, and the new version does not behave the same way in case of failover.

The gluster peer status output is correct: the command works only on the first 3 nodes. They run the server component of gluster, which stores the shared data for stats; nodes 4 and above are clients only.

Issue has been fixed in Release 3.0.2.

I upgraded my PetaSAN cluster to 3.0.2 (and rebooted each node after the upgrade), but it didn't help.

Some graphs, like "Disk utilization", "Disk IOPS" and "Disk throughput", randomly stop showing data on a random node.

I haven't found any pattern in this malfunction yet, but it can happen on any node at any time.

Guys, your PetaSAN is a really great and very useful product, but its monitoring features are poor and unstable. Maybe you could think about rebuilding this feature altogether?

I do not think it is a very complex task. For example, you could use a Prometheus + Node Exporter (plus any other exporters) + Grafana solution to collect and show any metrics you need.

Your issue is not related to the fix we did in 3.0.2. It could be one of many issues: network connectivity, hardware load, not enough RAM...

When you see the issue, do all stats from a specific node stop, or do some stats show up but not others on that node? Do they work again after a while, or once they stop do they stay stopped?

On a node that is not working:

What is the status of
systemctl status petasan-node-stats

Do you see any errors in /opt/petasan/log/PetaSAN.log?
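
The two checks above can be scripted. This is only a minimal sketch: the service name and log path are the ones quoted above, while recent_log_errors, the plain "error" pattern and the 20-line default are illustrative assumptions, not PetaSAN tools or its documented log format.

```shell
#!/bin/sh
# Print the last N lines of a log file that mention "error" (case-insensitive).
# recent_log_errors is a hypothetical helper name, not part of PetaSAN.
recent_log_errors() {
    log_file="$1"
    max_lines="${2:-20}"   # assumed default: show at most 20 matching lines
    grep -i 'error' "$log_file" | tail -n "$max_lines"
}
```

On a node whose charts are empty, run "systemctl status petasan-node-stats" first, then "recent_log_errors /opt/petasan/log/PetaSAN.log".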

Try to manually write a fake 50% CPU value from this node, and see whether you get errors and whether it shows up on the chart.

First, get the stats server IP from
/opt/petasan/scripts/util/get_cluster_leader.py

Then send the command via netcat. The syntax is
echo "PetaSAN.NodeStats.NODE_NAME.cpu_all.percent_util 50 `date +%s`" | nc -v -q0 STATS_SERVER_IP 2003
Example:
echo "PetaSAN.NodeStats.ps-node-01.cpu_all.percent_util 50 `date +%s`" | nc -v -q0 10.0.1.13 2003
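
The command above sends one line in Graphite's plaintext format ("metric.path value unix_timestamp") to the Carbon line receiver on port 2003. If you need to repeat the test on several nodes, it could be wrapped roughly as follows. This is a sketch only: graphite_line and send_fake_cpu are hypothetical helper names, and the nc flags are copied from the command above on the assumption that your nc build accepts -q.

```shell
#!/bin/sh
# Build one line in Graphite's plaintext format: "<metric path> <value> <unix timestamp>".
# graphite_line and send_fake_cpu are hypothetical helper names, not PetaSAN tools.
graphite_line() {
    node_name="$1"
    value="$2"
    timestamp="${3:-$(date +%s)}"   # default to the current time
    printf 'PetaSAN.NodeStats.%s.cpu_all.percent_util %s %s\n' \
        "$node_name" "$value" "$timestamp"
}

# Send a fake CPU reading (default 50) to the stats server on port 2003,
# using the same nc flags as the command above.
send_fake_cpu() {
    node_name="$1"
    stats_server_ip="$2"
    value="${3:-50}"
    graphite_line "$node_name" "$value" | nc -v -q0 "$stats_server_ip" 2003
}
```

With the example values above this would be called as "send_fake_cpu ps-node-01 10.0.1.13". Printing the line with graphite_line first lets you check the metric path before anything goes over the network.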

The current stats module has been working for a long time and has been stable. We use a grafana/graphite/carbon stack, which is quite robust; in addition, it supports high availability with 3x data redundancy provided via a Gluster shared filesystem external to Ceph. We need to understand your issue better.

Any chance of testing the above?

Sorry, I have now decided to install native Ceph Pacific v16.2.7 on my 5 nodes.

PetaSAN is a great product, but its unstable statistics graphs spoil everything :(