
OSD latency problems (VMware stuck)


Hi,

First of all, I have to say thank you for the information on how to gather data from Grafana here.
We are facing some strange issues.

Five weeks ago we had a problem where a few VMs got stuck (high CPU) for about 30-45 minutes.
Today we had the same problem, but not as severe as last time.

We looked at the VMs and vCenter. Everything there indicated that the Ceph cluster had a higher latency than normal.
So we checked the PetaSAN cluster. Everything there looked normal. The only thing we saw were some OSDs with a really high commit latency of 2-3 seconds.

Based on these two incidents we have identified 6 OSDs that show problems, and we are thinking about replacing them.
Because the "problem OSDs" are spread across 2 nodes, I would not suspect the RAID controller.

Or are we heading in the wrong direction?

One other thing I found is this:
https://imgur.com/f9xQoUj


Here it looks like the OSD commit latency is growing after a reboot of the nodes.
- 26.06 was the cluster update
- 17.09 was a controller update (one day after the problem occurred for the first time)

Could it also be a runtime problem?

How many nodes/OSDs do you have, and what type (HDD/SSD)?

Can you show the OSD latency charts together with disk % utilization (% busy) for the last hour? If it does show the issue, can you go back 1 day and zoom into the area of the problem (right-click with the mouse to select the area)? It is possible to go further back in time, but the more recent data is sampled at higher precision.

We have 3 HP ProLiant DL380p Gen8 nodes with 11 OSDs each, so 33x 2TB SSDs in total.

I played around a bit with Grafana.
Here is the overview of the OSD latency and the disk utilization per node from today's problem:

https://imgur.com/CZI5Hhs

And here is the same from the previous incident:

https://imgur.com/GBX1eYN

 

But I may have found one thing.
We added 6 OSDs per node a few months ago. When I tried to check the SMART values of the SSDs today, I found out that the RAID controller caching was not active for these 6 OSDs. And indeed, 4 of the "problem OSDs" are among the SSDs without caching.
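
For anyone checking the same thing, something like this should show the cache and SMART state on an HP Smart Array controller (assuming ssacli and smartmontools are installed; the drive number is only a placeholder):

ssacli ctrl all show config detail      # controller, array and logical drive details, incl. caching
smartctl -a -d cciss,0 /dev/sda         # SMART data of physical drive 0 behind the Smart Array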

The other point that just came to my mind is the number of PGs.
We started with 512 PGs for our pool. If I redo the calculation with the additional SSDs, it should be set to 1024, or am I wrong about that?

What model of SSDs do you use?

Try disabling scrub and deep-scrub for some time and see if this improves it.
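
For example, using the standard cluster-wide flags (add --cluster <name> if your setup requires it, as with the other ceph commands further down):

ceph osd set noscrub
ceph osd set nodeep-scrub

and re-enable later with:

ceph osd unset noscrub
ceph osd unset nodeep-scrub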

Do use the controller cache if your SSDs are not fast at sync writes.
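
If you want to measure that, a quick sync-write test with fio looks roughly like this (warning: this writes directly to the device and destroys its data, so only run it on an empty disk that is not an OSD):

fio --name=sync-write-test --filename=/dev/sdX --rw=write --bs=4k --iodepth=1 --numjobs=1 --direct=1 --sync=1 --runtime=60 --time_based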

Try to disable any on-disk caching, if present, via:

hdparm -W 0 /dev/sdX

We are using consumer SSDs: Crucial MX500 2TB

I know that you can change the time frame of the scrub and deep scrub.
Is there a way to change or force the scrubbing interval so it runs on a Saturday or Sunday, e.g. with a cron job or a manual run on the weekend?

The recommended way is to control sleep time and threshold:
osd_scrub_sleep = 1
osd_scrub_load_threshold = 0.3
You can change the values depending on the impact on your client IO: increasing osd_scrub_sleep and decreasing osd_scrub_load_threshold will slow scrub processing down. You cannot throttle it too much, though, as scrubs should be done every day and deep scrubs every week.
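
One way to apply these at runtime without restarting the OSDs is the generic injectargs mechanism (put the same values in the [osd] section of ceph.conf to make them persistent):

ceph tell osd.* injectargs '--osd_scrub_sleep 1 --osd_scrub_load_threshold 0.3'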

You can also use:
osd_scrub_begin_week_day
osd_scrub_end_week_day
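
For example, to restrict scrubbing to the weekend, something along these lines should work, but check the day numbering for your exact Ceph release (in older releases 0 = Sunday and 6 = Saturday, and the end day is exclusive):

osd_scrub_begin_week_day = 6
osd_scrub_end_week_day = 1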

To take full control, you can run
ceph pg deep-scrub PG
yourself in a script via a cron job. You can loop through all PGs via
ceph pg dump
and sort them by DEEP_SCRUB_STAMP.
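
A rough, untested sketch of such a cron script (the field numbers for the DEEP_SCRUB_STAMP column differ between Ceph versions, so check the header of ceph pg dump pgs and adjust; the count of 20 PGs and the sleep are arbitrary):

# deep-scrub the 20 PGs with the oldest deep-scrub stamp
ceph pg dump pgs 2>/dev/null | tail -n +2 | sort -k 21,22 | head -n 20 | awk '{print $1}' | \
while read -r pg; do
    ceph pg deep-scrub "$pg"
    sleep 60    # spread the scrubs out a bit
done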

Thanks for the scrub config information.

The last open question is this one:

The other point that just came to my mind is the number of PGs.
We started with 512 PGs for our pool. If I redo the calculation with the additional SSDs, it should be set to 1024, or am I wrong about that?

Wouldn't too few PGs result in an imbalance across the OSDs, and therefore higher or lower usage of some SSDs?

Yes, 1024 would be more correct; however, I would leave it at 512 unless you plan to add more disks. Increasing the PG count does re-balance data, so it is something you have to evaluate.
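
That number comes from the usual rule of thumb (assuming your pool uses replica size 3 and you target roughly 100 PGs per OSD):

33 OSDs x 100 target PGs per OSD / 3 replicas = 1100, rounded to the nearest power of two = 1024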

You can see the relative usage of your disks with:

ceph osd df              (PetaSAN 2.3.1)

ceph osd df --cluster xx (PetaSAN 2.3.0)

Yes, increasing the PGs will tend to balance objects more evenly.

We did a reboot of all cluster nodes last night (one after the other, in maintenance mode).
And now I see a similar picture to the one I showed in the beginning.

The OSD commit latency is increasing enormously, and all OSDs are showing normal latency values (last 24h chart):

https://imgur.com/VVWNl5m

My guess is that this will slowly increase again, like it did the previous times.

Do you have an idea what could cause this?

To clarify: after a reboot you see normal OSD latency, but it increases with time? Does % utilization also increase with time?
