
High CPU usage on GlusterFS


Hi,

I set up a small cluster for testing... The only issue I have right now is that, at random, the CPU usage on one node goes to 100% for about 5 to 10 minutes and then drops back to normal... I saw something about data collection running every now and then on the master node...

Is this something we can disable without affecting Ceph?

The problem is that whenever this happens, I start getting iSCSI errors on my vSphere machines...


I have confirmed with the top command that glusterfs is the one using all the CPU time...
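In case it helps narrow things down, something like this should show exactly which glusterfs threads are spinning (standard top/pidstat; on some builds the brick processes show up as glusterfsd):

top -H -p $(pgrep -d, -f gluster)          # per-thread CPU for all gluster processes
pidstat -t -p $(pgrep -d, -f gluster) 2    # same view sampled every 2 seconds (sysstat)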

The stats should not put load on your system unless it is very slow or has very limited RAM. I recommend trying to find out whether something else is causing this. If you need to disable all stats reporting (not recommended), you can do the following on all nodes:

/opt/petasan/scripts/stats-stop.sh

systemctl stop petasan-cluster-leader
systemctl stop petasan-node-stats
systemctl stop petasan-mount-sharedfs

cd /lib/systemd/system/
mv petasan-cluster-leader.service petasan-cluster-leader.back
mv petasan-node-stats.service petasan-node-stats.back
mv petasan-mount-sharedfs.service petasan-mount-sharedfs.back
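
To re-enable the stats later, reversing the renames and reloading systemd should bring everything back:

cd /lib/systemd/system/
mv petasan-cluster-leader.back petasan-cluster-leader.service
mv petasan-node-stats.back petasan-node-stats.service
mv petasan-mount-sharedfs.back petasan-mount-sharedfs.service
systemctl daemon-reload
systemctl start petasan-mount-sharedfs petasan-node-stats petasan-cluster-leader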


We are limited on RAM and budget... I was wondering, would it also be possible to add a monitoring-only node dedicated to stats collection?

So that stats collection would be offloaded and never have any chance of affecting Ceph performance?

We only see this issue on the cluster leader...

In this case you can perform just the cluster-leader-related steps on 2 management nodes and leave one. That node would be a management node acting as a Ceph monitor / Consul server and stats server, but with no storage. If building a new cluster, do not assign any storage to this node; if this is an existing node with storage, move the OSDs in steps from this node and physically place them in the other nodes. However, adding more RAM may solve your issue and may be cheaper in the end, so I am not recommending the above steps.
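Moving the OSDs in steps would look roughly like this for each OSD (standard Ceph commands; N is the OSD id, and the exact daemon unit name can differ between releases):

ceph osd out N              # start rebalancing data off the osd
ceph -s                     # wait until health returns to HEALTH_OK
systemctl stop ceph-osd@N   # then stop the daemon and pull the disk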

Would it be possible to just add a new node with no OSD or journal drives and run only management on that new node?

In PetaSAN, management functions are restricted to the first 3 nodes. You can make your new node dedicated to storage.

OK, so I have added memory and 8 CPU cores to my 3 nodes.


I checked the monitoring on all my nodes for the last 12 hours:

Memory usage never goes over 3-4 GB.

CPU usage now only tops out at 50%.

But I still see disk usage on the OS drive and OSD drive going to 100% for periods of 8 to 10 minutes when the CPU tops out at 50%.

Again, this is on the master node only.
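If anyone wants to catch the window live, watching extended device stats during a spike should show which drive saturates and whether it is reads or writes (iostat from the sysstat package):

iostat -x 2    # extended stats every 2 seconds; watch the %util and await columns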

This seems to be the type of issue I'm seeing...

https://github.com/hashicorp/consul/issues/3552

I am not sure it is related. We do quite a bit of testing, so I doubt it... but of course I may be wrong 🙂

It is suspicious that your OSD disk goes to 100%; there is nothing special an OSD does on a leader node, so its load should be the same as the other OSDs. Are the other OSDs in the cluster OK in terms of load? Does it get busy when client IO is high, or do you think it is related to the stats service running?
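To compare OSD load across the cluster, these standard Ceph commands can be run from any management node:

ceph osd perf      # commit/apply latency per osd; a struggling disk stands out
ceph osd df tree   # per-osd utilization grouped by host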

There are 2 stats functions running on the node: one gathering cluster stats, the other gathering node stats (local CPU/disk/etc.). If you stop the node stats:

systemctl stop petasan-node-stats

Does this reduce the disk load on the OSD disk? If so, it may be something in the disk or the disk driver when the stats service reads the disk statistics. Otherwise it is probably the cluster stats loading the system (although, as stated, that should not reflect on the OSD disk). There are a number of services that could be involved, but I would suggest this:

systemctl stop collectd

Let me know if any of those help. Also, as you suggested, you wanted to separate storage and stats collection, so if the load is due to both client IO and stats, you may move your storage disks and place them in a new node.
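A simple before/after comparison should tell which of the two services is responsible (iostat from sysstat; the 60-second samples are just an example):

iostat -x 60 2                     # baseline with both services running
systemctl stop petasan-node-stats
iostat -x 60 2                     # node stats stopped
systemctl stop collectd
iostat -x 60 2                     # collectd stopped as well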

It seems like it starts off with high IO latency... maybe from my network... I will have to check it out...

That's when I start seeing messages about slow requests being blocked, and drive utilization spiking with no real IO showing up in iostat or iotop.
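For reference, these standard Ceph commands show which OSDs are involved when those messages appear (replace N with the reported OSD id; the daemon command runs on the node hosting that OSD):

ceph health detail                    # lists the osds with blocked requests
ceph daemon osd.N dump_historic_ops   # recent slow ops with per-step timings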

I'm thinking of starting over with version 1.5, without BlueStore.
