RBD image 1M block size

Hi,

We’ve recently put our PetaSAN cluster into production and it’s generally great.

We have 5 nodes, each with 9 x 4TB SAS HDDs and 4 NVMe drives. The NVMe drives are used as journal and write caches for the 9 OSDs.

When benchmarking from iSCSI clients we see 2GB/s reads and writes (basically maxing out the 2x10G links).

However, it seems that reading “stale” data which needs to come from the underlying HDDs is painfully slow (20-50MB/s). This is especially pronounced during MS SQL backups, which slow to a crawl, but it also affects file copies and VM migrations.

I believe this is perhaps due to the 4M RBD object size, meaning that even with some read-ahead on the client, all requests are hitting 1 or maybe 2 OSDs and that’s limiting the speed. We use a 64k NTFS allocation unit size, which is the recommended setting for SQL Server.

Is there a way to change the object size to 1M for the RBD image? I’m happy to create it manually, but don’t want to if it’ll cause issues.
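
(For what it’s worth, by “create it manually” I’m imagining something like the rough sketch below, using the Ceph Python bindings. The pool and image names are just placeholders and I haven’t tried this against PetaSAN:)

    # Rough sketch: create an RBD image with 1 MB objects (order=20, i.e. 2^20 bytes)
    # instead of the default 4 MB (order=22). Pool/image names are placeholders.
    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')           # placeholder pool name
        try:
            rbd.RBD().create(ioctx,
                             'sql-data-1m',         # placeholder image name
                             4 * 1024 ** 4,         # 4 TiB image
                             order=20,              # object size = 2^20 = 1 MiB
                             old_format=False)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()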

Or if there are any other suggestions as to how we may improve our read performance?

Thanks!

Will

I do not believe reducing object size to 1MB will improve your case.

I would suggest, if running on Windows:

  1. Increase the iSCSI maximum transfer length from 256KB to 4MB (a scripted sketch follows after this list):

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<Instance>\Parameters]

    <Instance> here is an id like 00001 or 00002, where
    <Instance>\DriverDesc = Microsoft iSCSI Initiator

    MaxTransferLength:
    default is 0x00040000 hex (262144 bytes)
    change to 0x00400000 hex (4194304 bytes)

    A reboot is required.

  2. You could try creating a RAID 0/5 volume in Windows from the iSCSI disks.
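
If you prefer to script the registry change rather than edit it by hand, something along these lines should work (a rough sketch using Python's winreg; the instance id "00001" is a placeholder, use whichever instance has DriverDesc = Microsoft iSCSI Initiator, and run it elevated):

    # Rough sketch: set MaxTransferLength to 4 MB for the Microsoft iSCSI Initiator instance.
    # "00001" is a placeholder instance id - pick the one whose DriverDesc matches.
    # Run as Administrator; a reboot is still required afterwards.
    import winreg

    INSTANCE = "00001"  # placeholder
    key_path = (r"SYSTEM\CurrentControlSet\Control\Class"
                r"\{4D36E97B-E325-11CE-BFC1-08002BE10318}"
                "\\" + INSTANCE + r"\Parameters")

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE) as key:
        # 0x00400000 = 4194304 bytes = 4 MB (default 0x00040000 = 262144 bytes = 256 KB)
        winreg.SetValueEx(key, "MaxTransferLength", 0, winreg.REG_DWORD, 0x00400000)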

 

Hi,

Thanks for the reply and suggestions.

Unfortunately I'd already set the MaxRequestSize for iSCSI to 4M in Windows during my earlier troubleshooting and it hasn't improved matters. Throughput for "stale" data hovers around 30-50MB/s.

If I write some data, reboot to clear the client caches, and then read that data back, throughput is vastly improved (presumably because it's now coming from the NVMe dm-writecache).

As these disks are used by Failover Cluster Manager, I'm unable to software-RAID them in Windows.

What would be the expected throughput for a setup such as ours? I'm keen to know if you've seen better read performance than this for HDD-backed pools, or if this level is to be expected.

Thanks,

Will

It also depends on the client io size; small block sizes will give lower MB/s.

If you can, test with a tool like CrystalDiskMark, which shows the performance for several client io sizes. Try to test while the cluster is not loaded, else the results will be skewed.
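
If you want something scriptable instead of the CrystalDiskMark GUI, a rough sketch along these lines would sweep several io sizes (this assumes fio is installed on the test client; the test file path is a placeholder on the iSCSI-backed volume):

    # Rough sketch: sweep sequential-read block sizes with fio and print MB/s for each.
    # TEST_FILE is a placeholder - point it at a file on the iSCSI-backed volume.
    import json
    import subprocess

    TEST_FILE = r"D:\fio-test.bin"  # placeholder

    for bs in ("4k", "64k", "256k", "1M", "4M"):
        result = subprocess.run(
            ["fio", "--name=seqread", "--filename=" + TEST_FILE,
             "--rw=read", "--bs=" + bs, "--direct=1",
             "--size=4G", "--runtime=30", "--time_based",
             "--output-format=json"],
            capture_output=True, text=True, check=True)
        job = json.loads(result.stdout)["jobs"][0]
        print(f"{bs:>4}: {job['read']['bw'] / 1024:.1f} MB/s")  # fio reports bw in KiB/s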

The issue with testing and benchmarking with CrystalDiskMark is that it just tests against the dm-writecache. It shows read and write speeds of 2GB/s.
Or is there a way to benchmark and test the read speeds for data coming from the HDDs?

Is 30-50MB/s what you’d expect for sequential reads over iSCSI when backed by HDDs, or does it sound like there’s another issue?

It really depends on the block size of the reads. If the latency for an HDD OSD is 20 ms, a 1MB block size read will give 50 MB/s, a 4KB block size will give 0.2 MB/s, and 4MB will give over 100 MB/s; the larger the block size, the faster. Also, above a 1MB block size, disk throughput becomes the dominant factor rather than latency.

Note SSD OSDs give around 0.3 ms read latency so reads are much faster for small block sizes.
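
As a quick sanity check of those numbers, the arithmetic is just block size divided by latency (this rough sketch ignores the disk throughput ceiling that takes over above ~1MB blocks, and any read-ahead or parallelism):

    # Back-of-the-envelope: for latency-bound serial reads, throughput ~= block_size / latency.
    hdd_latency = 0.020   # ~20 ms per read on an HDD OSD
    ssd_latency = 0.0003  # ~0.3 ms per read on an SSD OSD

    for label, latency in (("HDD", hdd_latency), ("SSD", ssd_latency)):
        for bs in (4 * 1024, 64 * 1024, 1024 ** 2, 4 * 1024 ** 2):
            mb_per_s = bs / latency / 1e6
            print(f"{label} {bs // 1024:>5} KB blocks -> ~{mb_per_s:.1f} MB/s")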

In some cases you may know the client read block size; for example, some backup applications let you control it. But for many apps, like databases, you have no control.

Yes, testing with a write cache and trying to bypass it to get real HDD performance can be tricky (which is actually a good thing, as it means the cache is being effective).