
High utilization with consul agent, systemd-journald, & rsyslogd

Hello!

We're seeing some rather high latency when writing to our PetaSAN environment. Things seem to be okay across the 3 management nodes overall, but on one of them I noticed extremely high utilization from the consul agent, systemd-journald, and rsyslogd. Is this normal or to be expected?
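For anyone wanting to reproduce the check, something like the following shows per-process utilization (generic Linux commands, nothing PetaSAN-specific; the sort field, sample interval, and counts are arbitrary choices):

# one-shot snapshot of processes sorted by CPU usage
top -b -n 1 -o %CPU | head -n 20

# per-process CPU and disk I/O, 3 samples at 5-second intervals (sysstat package)
pidstat -u -d 5 3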

[screenshot: per-process utilization on the affected node]

What version are you using?
What hardware, network, and number of OSDs (SSD/HDD)?
Do you see any high %disk/%RAM/%CPU utilization on the charts on any node?
Is this a new cluster? Did it just happen? Any changes in load?

Quote from admin on January 15, 2021, 4:11 pm

What version are you using?
What hardware, network, and number of OSDs (SSD/HDD)?
Do you see any high %disk/%RAM/%CPU utilization on the charts on any node?
Is this a new cluster? Did it just happen? Any changes in load?

Hi!

We're on version 2.5.3.

For hardware: PowerEdge R640s (dual Xeon 5122s with 192GB of RAM) and dual 25Gbps Mellanox NICs.

For OSDs: 143 HDD & 29 SSD

We're not seeing this on the other nodes so far.

The particular node in the screenshot is one of the monitors but not the manager.

It's an existing cluster. I'm not sure how long the high utilization of those processes has been going on, but we only just noticed it after seeing performance issues with a few VMs (high read/write latencies confirmed per VM via esxtop). There is currently a robocopy running within a VM, moving a few terabytes of data from a different SAN to a PetaSAN volume.

According to ceph status, the cluster is seeing the following read/write load:

client:   210 MiB/s rd, 79 MiB/s wr, 2.95k op/s rd, 1.28k op/s wr
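(That line comes from the io section of ceph status output; to watch it live, plain Ceph commands like these work from any monitor/management node:)

# one-shot cluster summary, including client throughput and IOPS
ceph status

# refresh the short form every 2 seconds
watch -n 2 ceph -s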


What is the %utilization and raw IOPS on the HDD charts?

What is the high write latency you are seeing?
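If it is easier than reading the charts, the same raw numbers can also be pulled directly on the OSD nodes with iostat (a generic sysstat command, not PetaSAN-specific; device names will differ per node):

# extended per-device stats: r/s, w/s and %util, sampled every 5 seconds
iostat -x 5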

Here's what I see for IOPS on the HDDs:

This is what it looks like from the node in the very first screenshot in the OP:

We're seeing write latencies between 7,000ms and 10,000ms on the ESX node for the VM that is writing to PetaSAN:


EDIT: I have also moved the VM between ESX nodes in the vSphere cluster, and the latency follows the VM, which leads me to believe the issue is within the PetaSAN cluster.

You can try restarting consul on the node: kill the process and then, assuming it is a management node, start it again via:

/opt/petasan/scripts/consul_start_up.py
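In other words, roughly this sequence (a sketch only; it assumes consul is running as a single process on that node and that pidof can find it):

# stop the running consul agent
kill $(pidof consul)

# start it again via the PetaSAN startup script (management nodes)
/opt/petasan/scripts/consul_start_up.py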

I would also look at the OSD Latency charts under the cluster stats to see whether PetaSAN reports latency as high as ESXi does. Typically, if the system does report high latency, there would be some high %utilization on disks/CPU/RAM on one of the nodes, or in some cases network issues.


Okay. I tried killing consul and using "/opt/petasan/scripts/consul_start_up.py" to restart the service, but the ring-fence protection kicked in and the host shut down. After powering back on, it rejoined the cluster and process utilization seems to be okay now. Latency on that one particular VM has dropped from 7,000ms-10,000ms to 4,000ms-7,000ms.

Memory utilization is roughly 50% across the 12 nodes, CPU utilization is roughly 25%, and network utilization is around 75Mbps on all nodes.

Here's what the Commit and Apply latencies look like as well:

The OSD latency does not correlate with the 4,000-10,000ms you see on the ESXi side. Also, from the OSD side, objects from that particular VM are not treated any differently than others, and you would be getting slow op warnings from Ceph if the cluster itself were that slow. It could be an issue at the iSCSI layer, though that is not likely; you can check for errors in the kernel dmesg on the gateways serving the disk's paths. But if the issue is not happening on other VMs, I would think it is likely not cluster related.
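For the dmesg check, something along these lines on each gateway currently serving the disk's paths is usually enough (plain kernel log commands; the grep keywords are just common SCSI/iSCSI error patterns, not an exhaustive list):

# recent kernel messages with readable timestamps
dmesg -T | tail -n 100

# filter for iSCSI/SCSI related errors, aborts or timeouts
dmesg -T | grep -iE 'iscsi|scsi|abort|timeout' | tail -n 50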