
Async Journal writes


Hi,

Just a quick question: how can I disable sync writes to the journal drive for testing purposes? I want to test the performance difference between sync and async writes to my journaling device.
I have the problem that my cluster has an Intel Optane 900P SSD, and it seems to perform really badly with sync writes.

cheers

Alienn

There is a disk test in PetaSAN in the blue console menu. It will test sync and normal speeds. It uses only unused disks; if you do not have an unused disk and need to test on a running disk, you can use the fio tool yourself, but make sure you do not write over existing data.
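If it helps, here is a rough fio sketch of that kind of comparison for running by hand on a completely unused disk (/dev/nvmeXn1 is a placeholder; this writes over the whole device, so never point it at a disk that holds data):

# 4k sync writes at queue depth 1 - roughly what a journal/WAL workload looks like
fio --name=sync-test --filename=/dev/nvmeXn1 --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting

# the same workload without O_DSYNC, for comparison
fio --name=nosync-test --filename=/dev/nvmeXn1 --direct=1 --sync=0 --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting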

I think we are talking about different subjects.

From the beginning I've been struggling with bad performance in my cluster, and from the beginning my journaling SSD (Intel Optane 900P 280GB) has been 100% busy on each node. At first I thought this must be some kind of erroneous measurement, given the famously low latency of the Optanes.

Now I found a post on the Proxmox forum describing the same problem I'm facing (sorry, German: http://bit.ly/2qkYQ9a). The first answer seems to hold the explanation (http://bit.ly/33jXCtp; crude translation follows):

"The problem lies with the (missing) power protection of consumer ssds. SSD, ans especially NVMe, profit massivly from parallelism. One single nand is always slow. Thats why ssd controller use generally at least 8 channels. To use these properly ssd usually have extreme big caches (min 512MB). The problem now arises with sync write (which are utilized by ceph journal and waldb), as the cache gets deactived (de facto) and the channel cannot be filles properly due to missing data supply. Expensice enterpise ssd have something like a bbu for the internal ssd cache and still use the cache despite of the instructions send by the os, because they can guarantee that the data in the cache will be properly written to the nand."

 

In another post (that I cannot find the link to anymore) I found the statement that even very fast NVMe SSDs will drop to several hundred (perhaps low thousands of) IOPS in sync write mode.

These two pieces of information combined would explain the problem I'm facing. Before I go out and buy another SSD, I'd like to test what happens when I switch from a sync-written WAL/RocksDB to an async-written WAL/RocksDB. I'm hoping that there is some kind of config option in Ceph that I can easily change. 🙂 As my nodes are on redundant UPSs, I'm fairly certain that I can enable async writes with reasonable risk.
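To be clear about what I'd be testing: as far as I can tell there is no supported Ceph option that simply makes the BlueStore WAL/RocksDB writes async, so the closest thing I can think of is a kernel-level workaround, telling the block layer that the device has no volatile cache so it stops sending flushes. This is only a sketch for a throw-away test, not anything PetaSAN supports, and it risks data loss or OSD corruption on a crash or power loss (it also reverts on reboot):

# TEST ONLY - sync writes will then be acknowledged from the drive's volatile cache
echo "write through" > /sys/block/nvme0n1/queue/write_cache
# to restore the default flushing behaviour:
echo "write back" > /sys/block/nvme0n1/queue/write_cache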

EDIT:

Just found the link again (from a ZFS forum regarding SLOG/ZIL Optane performance; http://bit.ly/36JjAIy):
"You have to remember that the POSIX sync write requires a guarantee that the data has been committed to stable storage. This can be actually written to disk, or to an intermediate cache of some sort, but once written, the hardware and operating system are guaranteeing that it will be retrievable in the written format even under adverse conditions such as power loss. This is inherently going to be a hell of a lot slower, meaning lots fewer IOPS, than if you just queue up write commands without sync.

Many people are stunned that their "capable of billion IOPS" device works out to a few thousand (or even just high hundreds) in practice, but there are so many layers to go through."

 

Cheers,

Alienn

I think the disk test in the PetaSAN menu is valid for your question: it will show the different speeds you get from both sync and async writes. Again, it requires a free unused disk; if you do not have one, you can use the fio tool yourself on a new partition you create on the journal.

Yes, the different SSD types make a big difference for performance, durability and power loss protection.

If you are using versions prior to 2.3.1, the kernel may not report the % utilization for some NVMe drives correctly, as per

https://access.redhat.com/solutions/3901291

This is fixed in 2.3.1.

Have you done any benchmarks from the PetaSAN benchmark menu? Was the performance bad? Was there resource load aside from the NVMe?


I upgraded to PetaSAN 2.3.1 last week and NVMe utilization is still at 99%.

As I do not have any free NVMe SSD, do you have a guide I can follow to get sane fio benchmark values on the in-use NVMe journal drive? There should be enough room to create an empty 20-40 GB partition for this test that I can decommission afterwards.

But the main question is still: is it possible to switch the BlueStore WAL/RocksDB to async writes at all?

For the record:

Disk utilization last week (on node 1, but the same is true for nodes 2 and 3):
[screenshot: disk utilization last week]

Benchmark test on node 3 against nodes 1 and 2 (the cluster was in use at the time, but values are not much better during off hours):
[screenshot: benchmark test]

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
Instead of a whole device, create a new partition, format it, and use a large file to pass to --filename=.
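Something like the following, as a sketch (the partition name, mount point and file size are just examples; adjust them to your layout):

mkfs.xfs /dev/nvme0n1p5            # format the new test partition
mkdir -p /mnt/fio-test
mount /dev/nvme0n1p5 /mnt/fio-test

# 4k sync writes against a large file, queue depth 1, as in the linked article
fio --name=journal-test --filename=/mnt/fio-test/testfile --size=10G --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting

Afterwards unmount and delete the partition so it does not get used for anything else.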

The numbers do look very low. It is hard to say if the NVMe utilization of 100% is a correct reading or due to the kernel NVMe issue outlined earlier, but with 2.3.1 we tested a couple of NVMes and they were reporting correct results. Do you recall whether this utilization was seen before the production deployment? Any performance numbers from before you deployed to production?

I followed the instructions on the site (except for the hdparm part, as it resulted in "Inappropriate ioctl for device"). The results can be seen here: https://pastebin.com/mi9tPyEx

These values were created while the cluster was running and the NVMe was 100% busy according to the dashboard.

And yes, I think these values are way too low for what is inside these boxes. But I do not have "clean" results from before deploying.

What is your current production IOPS load from the dashboard, under the cluster IOPS chart?

Can you run atop and check the %busy for disk and CPU, as it measures at a 1 or 2 second interval?
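For example (the 2 second interval and the 'd' key are standard atop usage; the device names shown will be whatever your system has):

atop 2        # refresh every 2 seconds; the DSK lines show per-disk busy %
# press 'd' inside atop to switch to the per-process disk view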

How many OSDs do you have in total? All the same size?

Production IOPS:
[screenshot: production IOPS]

atop (the NVMe is not listed here, as far as I can see):
[screenshot: atop]

How many OSDs? HDDs? All the same capacity?

If you can, run the UI benchmark at a time when you can stop client IO, or do not mind client IO experiencing slow response: run the 4k IOPS benchmark for 5 minutes, using approximately 10 threads per OSD (for example, 120 threads for 12 OSDs), and choose 2 stress clients. Then show the benchmark result, plus charts for cluster IOPS, disk utilization and CPU utilization on one node, plus a screenshot of atop during the test.
