High utilization with consul agent, systemd-journald, & rsyslogd
quantumschema
7 Posts
January 15, 2021, 3:04 pm
Hello!
We're seeing some rather high write latency when writing to our PetaSAN environment. Things seem to be okay across the 3 management nodes but on one I noticed extremely high utilization coming from consul agent, systemd-journald, and rsyslogd. Is this normal or to be expected?
Last edited on January 15, 2021, 3:06 pm by quantumschema · #1
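If it helps anyone chasing a similar symptom, a quick way to confirm which processes are actually consuming the CPU is to sort by %cpu. A minimal sketch, assuming the standard procps `ps` is present on the node:
#!/usr/bin/env python3
# Minimal sketch: list the top CPU/memory consumers on a node.
# Assumes procps `ps` is available (standard on PetaSAN/Ubuntu installs).
import subprocess

def top_processes(count=10):
    out = subprocess.run(
        ["ps", "-eo", "pid,comm,%cpu,%mem", "--sort=-%cpu"],
        capture_output=True, text=True, check=True
    ).stdout.splitlines()
    header, rows = out[0], out[1:count + 1]
    print(header)
    for row in rows:
        print(row)

if __name__ == "__main__":
    top_processes()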
admin
2,930 Posts
January 15, 2021, 4:11 pm
What version are you using?
What hardware, network, and how many OSDs (SSD/HDD)?
Do you see high %disk/%RAM/%CPU utilization on the charts on any node?
Is this a new cluster? Did it just happen? Any changes in load?
quantumschema
7 Posts
January 15, 2021, 4:48 pm
Quote from admin on January 15, 2021, 4:11 pm
What version are you using?
What hardware, network, and how many OSDs (SSD/HDD)?
Do you see high %disk/%RAM/%CPU utilization on the charts on any node?
Is this a new cluster? Did it just happen? Any changes in load?
Hi!
We're on version 2.5.3.
For hardware: PowerEdge R640s (dual Xeon 5122s with 192 GB of RAM), dual 25 Gbps Mellanox NICs.
For OSDs: 143 HDD & 29 SSD.
We're not seeing this on the other nodes I've checked so far.
The particular node in the screenshot is one of the monitors but not the manager.
It's an existing cluster. I'm not sure how long the high utilization of those processes has been going on, but we only just noticed it. We noticed performance issues with a few VMs (confirmed high read/write latencies via esxtop per VM). There is a robocopy currently running within a VM, moving a few terabytes of data from a different SAN to a PetaSAN volume.
According to ceph status, the cluster is seeing the following read/write load:
client: 210 MiB/s rd, 79 MiB/s wr, 2.95k op/s rd, 1.28k op/s wr
Last edited on January 15, 2021, 4:50 pm by quantumschema · #3
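As an aside, those client throughput figures can also be pulled programmatically rather than read off ceph status by hand. A minimal sketch, assuming the pgmap field names Nautilus-era Ceph emits (read_bytes_sec, write_bytes_sec, read_op_per_sec, write_op_per_sec); they only appear while there is active client I/O, so the code defaults them to zero:
#!/usr/bin/env python3
# Sketch: read client I/O load from `ceph status` JSON output.
# Field names are those emitted by Nautilus-era Ceph (the release under
# PetaSAN 2.5.x); they are only present while there is client I/O.
import json
import subprocess

def client_io():
    raw = subprocess.run(
        ["ceph", "status", "--format", "json"],
        capture_output=True, text=True, check=True
    ).stdout
    pgmap = json.loads(raw).get("pgmap", {})
    return {
        "read_MiB_s": pgmap.get("read_bytes_sec", 0) / 2**20,
        "write_MiB_s": pgmap.get("write_bytes_sec", 0) / 2**20,
        "read_op_s": pgmap.get("read_op_per_sec", 0),
        "write_op_s": pgmap.get("write_op_per_sec", 0),
    }

if __name__ == "__main__":
    print(client_io())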
admin
2,930 Posts
January 15, 2021, 5:15 pm
What is the %utilization and raw IOPS on the HDD charts?
What is the high write latency you are seeing?
Last edited on January 15, 2021, 5:17 pm by admin · #4
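If the dashboard charts are in doubt, per-disk %util and IOPS can be cross-checked directly on a node. A minimal sketch, assuming sysstat's iostat is installed:
#!/usr/bin/env python3
# Sketch: cross-check per-disk utilization and IOPS outside the dashboard.
# Assumes sysstat's `iostat` is installed on the node.
import subprocess

def disk_stats(interval=5):
    # Two samples: the first report is the since-boot average, the second is live.
    out = subprocess.run(
        ["iostat", "-dx", str(interval), "2"],
        capture_output=True, text=True, check=True
    ).stdout
    # Print only the last (live) report block.
    reports = out.strip().split("\n\n")
    print(reports[-1])

if __name__ == "__main__":
    disk_stats()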
quantumschema
7 Posts
January 15, 2021, 5:42 pm
Here's what I see for IOPS on the HDDs.
This is what it looks like from the node in the very first screenshot in the OP:
We're seeing between 7,000 ms and 10,000 ms write latency on the ESXi host for the VM that is writing to PetaSAN:
EDIT: I have also moved the VM between ESXi hosts in the vSphere cluster and the latency follows the VM, leading me to believe the issue is within the PetaSAN cluster.
Last edited on January 15, 2021, 5:48 pm by quantumschema · #5
admin
2,930 Posts
January 15, 2021, 7:51 pm
You can try to restart consul on the node: kill the process and, assuming it is a management node, start it via:
/opt/petasan/scripts/consul_start_up.py
I would also look at the OSD Latency charts under the cluster stats to see whether PetaSAN itself reports the high latency that ESXi reports. Typically, if the system does report high latency, there would be high %utilization on disks/CPU/RAM on one of the nodes, or in some cases network issues.
Last edited on January 15, 2021, 7:52 pm by admin · #6
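For reference, a minimal sketch of that kill-and-restart sequence, using the script path from the post. As the next reply shows, stopping consul on a management node can trip the fencing logic, so treat this as illustration rather than a safe procedure; the pkill pattern and the assumption that the script is directly executable are mine:
#!/usr/bin/env python3
# Sketch of the consul restart suggested above (script path taken from the post).
# WARNING: as the follow-up reply shows, stopping consul on a management node
# can trip PetaSAN's fencing and power the node off.
import subprocess

CONSUL_START_SCRIPT = "/opt/petasan/scripts/consul_start_up.py"

def restart_consul():
    # Kill any running consul agent; ignore the exit code if none is running.
    subprocess.run(["pkill", "-f", "consul agent"], check=False)
    # Relaunch via the PetaSAN startup script (management nodes only).
    # Assumes the script is executable; otherwise invoke it with the node's
    # Python interpreter.
    subprocess.run([CONSUL_START_SCRIPT], check=True)

if __name__ == "__main__":
    restart_consul()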
quantumschema
7 Posts
January 15, 2021, 9:33 pm
Okay. I tried killing consul and using "/opt/petasan/scripts/consul_start_up.py" to restart the service, but the ring-fence protection kicked in and the host shut down. After powering back on, it rejoined the cluster and process utilization seems to be okay now. Latency on that one particular VM has dropped from 7,000-10,000 ms to 4,000-7,000 ms.
Memory utilization is roughly 50% across the 12 nodes, CPU utilization is roughly 25%, and network utilization is around 75 Mbps on all nodes.
Here's what the Commit and Apply latencies look like as well:
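The same commit/apply latencies can also be pulled from the CLI to spot outlier OSDs without the charts. A hedged sketch: the JSON layout of ceph osd perf has shifted between releases, so both layouts I know of are handled, and the 100 ms threshold is arbitrary:
#!/usr/bin/env python3
# Sketch: flag OSDs whose commit/apply latency stands out, via `ceph osd perf`.
# The JSON layout differs between Ceph releases (top-level "osd_perf_infos"
# vs. nested under "osdstats"); adjust if yours differs.
import json
import subprocess

def osd_latencies():
    raw = subprocess.run(
        ["ceph", "osd", "perf", "--format", "json"],
        capture_output=True, text=True, check=True
    ).stdout
    data = json.loads(raw)
    infos = data.get("osd_perf_infos") or data.get("osdstats", {}).get("osd_perf_infos", [])
    for info in infos:
        stats = info.get("perf_stats", {})
        yield info.get("id"), stats.get("commit_latency_ms", 0), stats.get("apply_latency_ms", 0)

if __name__ == "__main__":
    for osd_id, commit_ms, apply_ms in sorted(osd_latencies(), key=lambda t: -t[1]):
        if commit_ms > 100 or apply_ms > 100:  # arbitrary threshold for illustration
            print(f"osd.{osd_id}: commit {commit_ms} ms, apply {apply_ms} ms")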
admin
2,930 Posts
January 15, 2021, 9:54 pm
The OSD latency does not show any correlation to the 4,000-10,000 ms you see on the ESXi side. Also, the OSD side does not treat objects from this particular VM any differently than others, and you would also see slow op warnings from Ceph if that were the case. It could be an issue at the iSCSI layer, though that is not likely; you can check for errors in the kernel dmesg on the gateways serving the disk paths. But if the issue is not happening on other VMs, it is likely not cluster related.
Last edited on January 15, 2021, 9:55 pm by admin · #8
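To close the loop on the dmesg suggestion: a minimal sketch that scans the kernel log on an iSCSI gateway for lines that commonly point at path or transport trouble. The keyword list is illustrative, not exhaustive, and reading dmesg may require root:
#!/usr/bin/env python3
# Sketch: scan dmesg on an iSCSI gateway for lines hinting at path/transport
# trouble. Keyword list is illustrative only; run as root if dmesg is restricted.
import subprocess

KEYWORDS = ("iscsi", "scsi", "abort", "timeout", "reset", "i/o error")

def suspicious_dmesg_lines():
    out = subprocess.run(
        ["dmesg", "--ctime"], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if any(key in line.lower() for key in KEYWORDS):
            yield line

if __name__ == "__main__":
    for line in suspicious_dmesg_lines():
        print(line)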