Writecache behaviour is erratic
dbutti
28 Posts
March 15, 2024, 11:23 am
Thank you for your reply 🙂
This is the output from uname -r: 5.14.21-04-petasan
Until recently the cache devices were quite small (about a 10GB partition for each OSD), so I thought it was more or less normal to have them constantly near 100% occupation. The cluster worked, but write operations almost always went directly to the HDDs and sync write performance was pretty bad. We didn't care much, because we used the cluster mostly to store backup data.
Then the hardware was upgraded and new cache SSDs were installed, so that every OSD now has a 50GB cache partition, which could be increased further. But even after the upgrade write performance was still sluggish, so I started investigating and found that the cache volumes show this bizarre behaviour.
From time to time I run a flush command manually, and of course it makes 100% of the cache blocks free again. But usage then climbs back up, and when the high_watermark is reached, flushing sometimes occurs, but most of the time it does not.
Is there anything I could do to trace what happens within the writecache module?
Thank you,
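As a first step toward the tracing question above, polling a cache device's status line shows occupancy and writeback activity over time. A minimal sketch, assuming a device named like the wc-osd volumes that appear later in this thread; the field layout is taken from the kernel's dm-writecache documentation for 5.14:

# dm-writecache status fields (kernel 5.14):
#   <error> <total blocks> <free blocks> <blocks under writeback>
# Device name is a placeholder; substitute your wc-osd volume group.
watch -n 1 'dmsetup status wc-osd-example/main'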
dbutti
28 Posts
March 15, 2024, 10:44 pm
Hello, I finally found out exactly what is going on with writecache.
Short summary: even when "dmsetup" shows "pause_writeback 0", this parameter is actually non-zero unless somebody sets it to an explicit value. Because of this, writecache suspends writeback for a while whenever it notices I/O activity on the slow device. In my case this was enough to push the fast device over the high_watermark, and often all the way to 100%.
Line 33 of drivers/md/dm-writecache.c in the kernel source has:
#define PAUSE_WRITEBACK (HZ * 3)
This non-zero value (HZ * 3 jiffies, i.e. 3 seconds) is applied by default, even though the "table" command later shows the parameter as zero. I would say this is a bug in the kernel module: two different variables are used, one for display and one for the real logic.
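As a sanity check, the mismatch can be examined on a single device like this (a sketch; the device name is an example, and the node-wide one-liner follows below):

# Device name is an example; PetaSAN's cache volume groups match wc-osd.
DEV=wc-osd-example/main
# The table line may report "pause_writeback 0" even while the
# compiled-in default (HZ * 3 jiffies, i.e. 3 seconds) is in effect:
dmsetup table "$DEV"
# Sending the message sets the value the writeback logic actually uses:
dmsetup message "$DEV" 0 pause_writeback 0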
I could restore the expected normal behaviour by running this command on each node:
vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup message {}/main 0 pause_writeback 0
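For readers who want the one-liner unpacked, an equivalent spelled out as a loop (assuming PetaSAN's naming, where each cache volume group matches wc-osd and exposes a main device):

#!/bin/bash
# List volume group names, keep the writecache ones, and reset the
# pause on each cache device's "main" target.
for vg in $(vgs --noheadings -o vg_name | grep wc-osd); do
    dmsetup message "${vg}/main" 0 pause_writeback 0
done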
Maybe other users never run into this issue because they have faster devices/processors, I have no idea. In any case, we should be aware that writecache gives misleading information in this regard.
Even better, PetaSAN could add pause_writeback to the set of cache parameters tuned by /opt/petasan/config/tuning/current/writecache, so it always has a predictable value.
Thank you for your support.
Last edited on March 15, 2024, 10:45 pm by dbutti · #12
admin
2,929 Posts
March 16, 2024, 10:16 am
Thank you very much for this detailed feedback 🙂
As suggested, it is better/safer for us to always set pause_writeback 0 in our tune script.
Looking at the original kernel commits that added this feature, the idea seems to be: if your cache partition gets full and client writes have to go directly to the slow device, pause all background flushes so they do not compete with the client writes, and resume only once pause_writeback (3 seconds) has passed since the last client write. This means it could pause forever if client writes persist.
Maybe the issue you saw was due to hitting a cache-full situation, or maybe there is another bug associated with the above feature; I would not doubt the latter, as the added code, its integration and its logic may well have flaws.
In the longer term we will run more tests to replicate this condition and, if confirmed, report it to kernel upstream or suggest a fix.
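A sketch of how such a test might generate the sustained-write condition while the cache is observed (device and image names are placeholders; point fio at a scratch RBD image, since it writes raw data to it):

# Sustained small direct writes keep client-write activity non-stop,
# which per the logic above should keep writeback paused:
fio --name=sustained-writes --filename=/dev/rbd0 --rw=randwrite \
    --bs=4k --iodepth=16 --ioengine=libaio --direct=1 \
    --time_based --runtime=120
# In a second shell, watch free blocks shrink while the
# writeback-blocks count stays at zero:
watch -n 1 'dmsetup status wc-osd-example/main'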
Thanks again!
Last edited on March 16, 2024, 10:18 am by admin · #13
robindewolf
9 Posts
April 22, 2024, 11:57 am
Hi all,
We had exactly the same situation on our PetaSAN cluster. Running the following command on each node cleared the cache and solved our issue:
vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup message {}/main 0 pause_writeback 0
We saw that after rebooting a node the cache starts filling up again. Is there a way to make the setting persistent?
Thanks!
kr,
Robin
dbutti
28 Posts
April 26, 2024, 2:32 pm
Hello, to make the change persistent you can, for example, create a service definition of your own under /lib/systemd/system, using something like:
[Unit]
Description=Custom Petasan tuning
After=ceph.target
[Service]
Type=simple
ExecStart=/root/reset-writecache-pause
Restart=on-failure
RestartSec=60
RemainAfterExit=yes
[Install]
WantedBy=ceph.target
You can save that, for example, as custom-tuning.service, then run systemctl daemon-reload to update the config and enable the unit so it runs at boot (see the commands below).
And the content of /root/reset-writecache-pause (the file should be executable):
#!/bin/bash
vgs | grep wc-osd | cut -d ' ' -f3 | xargs -I{} dmsetup message {}/main 0 pause_writeback 0
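To complete the recipe, the install-and-enable steps (standard systemd commands; filenames as given above):

# Make the helper script executable and activate the unit:
chmod +x /root/reset-writecache-pause
systemctl daemon-reload
systemctl enable --now custom-tuning.service
# Verify it ran on this node:
systemctl status custom-tuning.service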