High CPU usage on GlusterFS
mcharlebois
11 Posts
April 6, 2018, 9:26 pm
Hi,
I set up a small cluster for testing... The only issue I have right now is that randomly, the CPU usage on one node goes to 100% for about 5 to 10 minutes and then drops back to normal... I saw something about data collection running every now and then on the master node...
Is this something we can disable without affecting Ceph?
The problem is that whenever this happens, I start getting iSCSI errors on my vSphere machines...
I have confirmed with the top command that glusterfs is the process using all the CPU time...
Last edited on April 6, 2018, 9:27 pm by mcharlebois · #1
admin
2,930 Posts
April 7, 2018, 6:42 am
The stats should not put load on your system. Unless the node is very slow or has very limited RAM, I recommend trying to find out whether something else is causing this. If you need to disable all stats reporting (not recommended), you can run the following on all nodes:
/opt/petasan/scripts/stats-stop.sh
systemctl stop petasan-cluster-leader
systemctl stop petasan-node-stats
systemctl stop petasan-mount-sharedfs
cd /lib/systemd/system/
mv petasan-cluster-leader.service petasan-cluster-leader.back
mv petasan-node-stats.service petasan-node-stats.back
mv petasan-mount-sharedfs.service petasan-mount-sharedfs.back
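For convenience, the steps above could be wrapped in a small script along these lines. This is just a sketch that loops over the same commands, not an official PetaSAN script; the script path and unit names are taken verbatim from the steps above:

```shell
#!/bin/sh
# Sketch only: disable PetaSAN stats reporting on a node (not recommended).
# Uses the same script path and systemd unit names as the steps above.

/opt/petasan/scripts/stats-stop.sh

for svc in petasan-cluster-leader petasan-node-stats petasan-mount-sharedfs; do
    systemctl stop "$svc"
    # Rename the unit file so the service is not started again on boot
    mv "/lib/systemd/system/${svc}.service" "/lib/systemd/system/${svc}.back"
done
```

Renaming the `.service` files (rather than `systemctl disable`) matches the original instructions; to undo, rename the `.back` files back and start the services again.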
mcharlebois
11 Posts
April 7, 2018, 12:16 pm
We are limited on RAM and budget... I was wondering, would it also be possible to add a monitoring-only node dedicated to stats collection?
That way stats collection would be offloaded and would never have a chance of affecting Ceph performance.
We only see this issue on the cluster leader...
Last edited on April 7, 2018, 12:19 pm by mcharlebois · #3
admin
2,930 Posts
April 7, 2018, 7:17 pm
In this case you can perform just the cluster-leader-related steps on 2 of the management nodes and leave one untouched. That one would be a management node acting as a Ceph monitor / Consul server and stats server, but with no storage. If building a new cluster, do not assign any storage to this node; if this is an existing node with storage, move the OSDs off it in steps and physically place them in the other nodes. However, adding more RAM may solve your issue and may be cheaper in the end, so I am not recommending the above steps.
Last edited on April 7, 2018, 7:20 pm by admin · #4
mcharlebois
11 Posts
April 7, 2018, 11:22 pm
Would it be possible to just add a new node with no OSD or journal drives and run only management on that new node?
admin
2,930 Posts
April 8, 2018, 8:31 am
In PetaSAN, management functions are restricted to the first 3 nodes. You can make your new node dedicated to storage.
Last edited on April 8, 2018, 8:33 am by admin · #6
mcharlebois
11 Posts
April 9, 2018, 11:01 am
OK, so I have added memory and 8 CPU cores to my 3 nodes.
I checked the monitoring of all my nodes for the last 12 hours:
Memory usage never goes over 3-4 GB.
CPU usage now only tops out at 50%.
But I still see disk usage on the OS drive and OSD drive going up to 100% for a period of 8 to 10 minutes while the CPU tops out at 50%.
Again, this is on the master node only.
Last edited on April 9, 2018, 11:01 am by mcharlebois · #7
mcharlebois
11 Posts
April 9, 2018, 5:54 pm
This seems to be the type of issue I'm seeing...
https://github.com/hashicorp/consul/issues/3552
admin
2,930 Posts
April 9, 2018, 7:32 pm
I am not sure it is related. We do quite a bit of testing so I doubt it... but of course I may be wrong 🙂
I am suspicious that your OSD disk goes to 100%: there is nothing special an OSD does on a leader node, so its load should be the same as the other OSDs. Are the other OSDs in the cluster OK in terms of load? Does the disk get busy when client i/o is high, or do you think it is related to the stats services running?
There are 2 stats functions running on the node: one gathering cluster stats, the other gathering node stats (local CPU/disk/etc.). If you stop the node stats:
systemctl stop petasan-node-stats
does this reduce the load on the OSD disk? If so, it may be something in the disk or the disk driver when reading the disk stats. Otherwise, it is probably the cluster stats loading the system (although, as stated, that should not show up on the OSD disk); there are a number of services that could be involved, but I would suggest:
systemctl stop collectd
Let me know if either of those helps. Also, as you suggested, you want to separate storage and stats collection, so if the load is due to both client i/o and stats, you may move your storage disks into a new node.
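To make the before/after comparison concrete, one way to sample the OSD disk's %util around stopping each service is a sketch like the following. The device name `sdb` is an assumption; replace it with your actual OSD disk:

```shell
#!/bin/sh
# Sketch: sample a disk's %util (last column of iostat's extended device report)
# before and after stopping each stats service, to see which one drives the load.
DEV=sdb   # assumption: replace with your OSD device

sample_util() {
    # Take two 5-second samples and keep the last (steady-state) one;
    # %util is the final field of the device line.
    iostat -dx "$DEV" 5 2 | awk -v d="$DEV" '$1 == d { u = $NF } END { print u }'
}

echo "baseline util: $(sample_util)%"
systemctl stop petasan-node-stats
echo "after stopping node stats: $(sample_util)%"
systemctl stop collectd
echo "after stopping collectd: $(sample_util)%"
```

If the number drops sharply after one of the stops, that service is the likely culprit; remember to start the services again afterwards.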
Last edited on April 9, 2018, 7:35 pm by admin · #9
mcharlebois
11 Posts
April 9, 2018, 7:55 pm
It seems like it starts off with high i/o latency... maybe from my network... will have to check it out...
That's when I start seeing messages about slow requests being blocked, and drive utilization spikes with no real i/o showing up in iostat or iotop.
I'm thinking of starting over with version 1.5 without BlueStore.
Last edited on April 9, 2018, 7:56 pm by mcharlebois · #10