
dm-writecache slow to flush

Hi,

We have a 5-node cluster, mostly 4TB SAS HDDs, with the write cache on NVMes. 50 OSDs in total. Each node has 256GB RAM and 36 cores (72 with HT); memory and CPU usage both seem low.

We're seeing the write cache fill up ("lvs" shows Data% at 100), yet the underlying disk utilisation is still very low (<20%) and the wMB/s is under 3MB/s. This results in massive latency for our VM and database workloads (presented over iSCSI).

Are there any settings that would throttle the speed at which dm-writecache pushes writes out to the underlying disks? I would have expected that if the HDDs were the cause of the slowness, their utilisation would be much higher.

Thanks,

Will

I notice that writeback_jobs is set to 15360 by PetaSAN - could this be bottlenecking the flushing of the cache to the underlying disks? If I read the dm-writecache documentation correctly, the default is for this to be unlimited.

How big are your cache partitions?
Large cache partitions can stress the disks while flushing, and the memory requirement is 2.5% of the partition size.
A 100 GB partition is enough to get good cache latency, consumes 2.5 GB RAM, and will not stress the disks too much during flushes.

You can change the tuning params in:
/opt/petasan/config/tuning/current/writecache

then run the script:
/opt/petasan/scripts/tuning/writecache_tune.py
Details of the tuning params are explained at the top of the script file.
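
For example, something like this (illustrative only - I am assuming the script can be invoked with python3 directly; check the top of the script for the exact usage):

nano /opt/petasan/config/tuning/current/writecache
python3 /opt/petasan/scripts/tuning/writecache_tune.py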

Hi,

The cache partitions are 90GB in size. Memory usage seems fine (10 OSDs and 256GB RAM per node).
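(Rough check of the 2.5% rule of thumb: 90 GB x 2.5% ≈ 2.25 GB per cache device, x 10 OSDs ≈ 22.5 GB per node, which is well within 256 GB.)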

I'll check these params in that file - have you observed situations in the past where caches have been slow to flush like this?

As I say, the raw HDDs are showing low utilisation - would you expect this to be high if they were the bottleneck?

Thanks,

Will

So the settings currently in /opt/petasan/config/tuning/current/writecache (which is unmodified, so just whatever the default is) are not what is actually set on the write caches:

{
    "dm_writecache_throttle": "50",
    "high_watermark": "50",
    "low_watermark": "49",
    "writeback_jobs": "512"
}

root@gl-san-02d:~# lvs -o+cache_policy,cache_settings
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert CachePolicy CacheSettings
osd-block-f6d3451f-e4fb-45bd-aa8b-f6f8f1cd3f2f ceph-34d9987c-44cf-4723-a6c0-b3cdb5008923 -wi-ao---- 894.25g
osd-block-74f49683-32e0-45c7-8f87-0e646805fc0c ceph-deb0043d-7913-4f2b-b2c9-f67b194b6720 -wi-ao---- 894.25g
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.18 Cwi-aoC--- <3.64t [cache_cvol] [main_wcorig] 51.19 writeback_jobs=15360
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.26 Cwi-aoC--- <3.64t [cache_cvol] [main_wcorig] 49.70 writeback_jobs=15360
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.33 Cwi-aoC--- <3.64t [cache_cvol] [main_wcorig] 68.44 writeback_jobs=15360
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.34 Cwi-aoC--- <3.64t [cache_cvol] [main_wcorig] 74.54 writeback_jobs=15360
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.36 Cwi-a-C--- <3.64t [cache_cvol] [main_wcorig] 82.29 writeback_jobs=15360
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.37 Cwi-aoC--- <3.64t [cache_cvol] [main_wcorig] 54.74 writeback_jobs=15360
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.47 Cwi-aoC--- <953.87g [cache_cvol] [main_wcorig] 49.29 writeback_jobs=15360
main ps-671676d6-6306-4c60-9a67-31733fa5d7ed-wc-osd.48 Cwi-aoC--- <953.87g [cache_cvol] [main_wcorig] 49.26 writeback_jobs=15360

(Apologies for bad formatting, wasn't sure how to preserve layout).

Is this expected if that script has never been run?

What would the default dm_writecache_throttle therefore be set to? (And is this a % of a single core?)

If we're seeing the cache being slow to flush, do you think increasing writeback_jobs would help, or would increasing dm_writecache_throttle higher (towards 100) perhaps help?

Thanks,

Will

I found that the dm_writecache_throttle was set to 50.

Changing it to 95 has increased utilisation on the disks from 15% to around 85%, and throughput from 1MB/s to 10MB/s! My understanding is that this is a percentage of disk IO time, rather than CPU time.

The caches are now emptying again - disaster averted, success! Thank you for your advice on this!
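
For anyone else watching a cache drain, something like this gives a 2-second view of the Data% (using the standard lvs fields; adjust the grep to your VG naming):

watch -n 2 "lvs -o vg_name,lv_name,data_percent | grep wc-osd"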

I would still appreciate your feedback on the difference between the writeback_jobs setting in the config file and what is shown by "lvs -o+cache_policy,cache_settings" (presumably what's live).

Thanks,

Will

 

Glad it helped 🙂

Cache tuning depends on many factors; a 95% throttle could be too stressful in some cases.

The throttle is the % IO time of the flush thread, of which there is 1 per cache device. It should affect the flush operation only; client IO should continue to operate whether the flush is active or throttled. But as you see, there are several variables that affect overall performance.

One recommendation when you measure the % utilization of the disks is to use sysstat/sar/iostat/atop directly and not rely on the dashboard charts. The charts use a 1 min sample, but for cache flushes you need to monitor at a 1 or 2 sec sample interval because of their spiky nature.
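
For example, at a 2 second interval (standard sysstat flags):

iostat -dxm 2    # extended per-disk stats in MB/s every 2s - watch %util and wMB/s
sar -d 2         # per-device activity every 2s, also shows %util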

Cache policy and cache settings are LVM options that apply to dm-cache; for dm-writecache it is better to read the options from the devices directly:

vgs | grep "wc-osd" | cut -d ' ' -f3 | xargs -I{} dmsetup table {}/main
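
The output per device looks roughly like this (device numbers and sizes here are made up for illustration; only the optional args that were actually passed appear at the end):

0 7814037168 writecache s 253:5 253:4 4096 2 writeback_jobs 15360

i.e. start, length in sectors, target name, s for SSD cache, origin device, cache device, block size, then the count of optional args followed by the args themselves.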

Perfect - thanks again for all your help, it's much appreciated!