
Async Journal writes


Is this the same cluster as

http://www.petasan.org/forums/?view=thread&id=473

 

Yes, it is the same cluster. I never got the opportunity to test the cluster without active VMs.

It is also the same cluster that has 24 OSDs with BlueFS spillover, so each OSD only has 20 GB of space on the journal.

But I have the feeling that the performance of the cluster has somehow deteriorated over time.

So I have to gather the following intel:

  • Benchmark: 5 min, 120 threads, 2 clients, 4k IOPS
    • Chart of cluster IOPS (from the dashboard)
    • Disk utilization of 1 node (from the dashboard)
    • CPU utilization of 1 node (from the dashboard)
    • Screenshot of atop from 1 node (not client), started shortly before the benchmark
    • Screenshot of atop from 1 node (client), started shortly before the benchmark

As the fio test gave me the expected results (even though sync was used), I also don't think that the NVMe is the bottleneck. But something is not right...
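
For context, the fio test was a 4k sync write test roughly along these lines (the exact parameters may have differed; /dev/nvme0n1pX is only a placeholder, and fio writes must never be pointed at a partition a live OSD is using):

    # CAUTION: only write to a spare partition or a scratch file,
    # never to a partition that belongs to a running OSD.
    fio --name=journal-sync-4k --filename=/dev/nvme0n1pX \
        --rw=randwrite --bs=4k --ioengine=libaio --iodepth=1 \
        --direct=1 --sync=1 --runtime=60 --time_based --group_reporting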

Hi,

during a break when not many people were in the office, I let the benchmark run. For some reason I did not get the final overview of the results, but I can provide the rest of the screenshots.

Cluster - Stats

Ceph01 - Stats (Server)

Ceph02 - Stats (Client)

Ceph01 - Atop

Ceph02 - Atop

 

You get around 20k IOPS read and 5-6k IOPS write. Write is lower since each client write op results in 2 or 3 backend IOPS for the replicas.
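
As a rough sanity check on those numbers: at 2-3 replica writes per client write, 5-6k client write IOPS correspond to roughly 10-18k backend write operations, which is in the same ballpark as the ~20k IOPS the same disks deliver for reads.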

This is OK for 24 HDDs. Some things you can do: I see the disk % utilization during writes on one node is higher than on the other; just double-check this is not always the case, else you may have slower disks on that node, or, if you use different disk sizes, the larger disks will get more load. Similarly, the atop disk % utilization is not the same across all disks; again, it is OK if this happens randomly.
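
If you want to watch this outside the dashboard, iostat (from the sysstat package) reports the same per-disk utilization; run it on each node during the benchmark and compare:

    # extended per-device statistics every 5 seconds; compare the %util
    # column across the HDDs and across the nodes while the benchmark runs
    iostat -x 5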

The other thing you mentioned is the spillover of metadata; this has the potential to use an extra IOP for metadata reads during each read/write IOP. The tool outlined earlier can expand the journal size.
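
To see how much metadata has actually spilled over per OSD (assuming a Ceph release that exposes the BlueFS counters; osd.12 is just an example id, and the second command has to run on the node hosting that OSD):

    # cluster-wide: spillover is reported as a BLUEFS_SPILLOVER health warning
    ceph health detail
    # per OSD: a non-zero slow_used_bytes means DB data is sitting on the HDD
    ceph daemon osd.12 perf dump | grep -A 20 '"bluefs"'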

Another relevant value is a 4k test with only 1 thread; this can be run at any time in production. It lets you measure the latency, or the IOPS, that a single client thread sees.

 

All cluster nodes are identical, so the HDDs are all the same. Usually the disk activity is about the same everywhere, at around 30-40%.

Regarding the resize of the SSD journal, the only thing I can find is bluefs-bdev-expand. But that requires the underlying block device to grow first. As each OSD has a separate partition on the SSD, and these sit directly next to each other, I am not able to expand them easily.

How would you do it? And as I only have 280 GB available, I can only expand the partitions to 30 GB each (8 OSDs × 30 GB = 240 GB). Then I could still add one additional OSD at some point in the future.

Do you have an idea of how to fix the incorrect NVMe busy values? Can I provide some data that would make it easier to adapt the existing patch for this NVMe?

Here is the stats page (which did load this time) for 1 Thread, 2 Clients, 1 Minute:

Benchmark - 1 Thread, 2 Clients, 1 Minute

This looks reasonable (you should divide by 2; I meant to run this test using 1 client only). This is the latency a single client operation sees (1 / IOPS).
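
Purely as an illustration with a made-up number: if a single thread reports about 250 IOPS, each operation takes roughly 1 / 250 s = 4 ms.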

The utilization issue on some NVMe models has been a problem with many recent kernels, as per the earlier Red Hat link. We did test some models with our latest kernel and they all worked.

For expanding partition sizes, you really need to work this out yourself.
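
Very roughly, the moving parts are something like this (untested sketch; assumes systemd-managed OSDs as on a stock Ceph install and a ceph-bluestore-tool that supports bluefs-bdev-expand; osd.12 and the partition layout are placeholders):

    # keep data in place while the OSD is down
    ceph osd set noout
    systemctl stop ceph-osd@12
    # grow the partition backing block.db first (e.g. with parted/sgdisk); on a
    # fully packed SSD this usually means moving the data to a new, larger
    # partition rather than resizing in place, in which case the block.db
    # symlink in the OSD directory must point at the new device before starting
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12
    systemctl start ceph-osd@12
    ceph osd unset noout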
