
EC RBD over iSCSI performance


1 MB is a good value.

No, the 512 is the sector size in bytes.
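
For reference, one way to confirm the 512-byte logical sector size from the initiator side; the /dev/sdX device name is just a placeholder for the mapped iSCSI disk:

blockdev --getss /dev/sdX
cat /sys/block/sdX/queue/logical_block_size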

Quote from admin on March 18, 2019, 9:12 am

1 MB is a good value.

No, the 512 is the sector size in bytes.

So how do I set the block size in LIO and the Linux initiator (open-iscsi)?

Are these settings related?

MaxRecvDataSegmentLength
MaxXmitDataSegmentLength
FirstBurstLength
MaxBurstLength

What is the equivalent setting to "MaxIoSizeKB" in open-iscsi?

Thank you

Yes, set those to 1048576.

Also set these:
"ImmediateData": "Yes",
"InitialR2T": "No",
"MaxOutstandingR2T": "8",

Quote from admin on March 18, 2019, 12:11 pm

Yes, set those to 1048576.

Also set these:
"ImmediateData": "Yes",
"InitialR2T": "No",
"MaxOutstandingR2T": "8",

Thanks.

For small block size sequential writes, is it possible to buffer the write on the target side and then write to RBD in a larger block size?

For small block size sequential writes, is it possible to buffer the write on the target side and then write to RBD in a larger block size?

Interesting question. With this, your iops (for sequential small block size writes) will soar, and I am sure some people would use this to boast about their numbers. Actually this question is at the core of storage architecture; some points:

  • A cache will work well with sequential io, but worse with random io. Most virtualization workloads are random, and most apps that do a lot of sequential io (backup, streaming) will not be using a small block size anyway.
  • If you cache in memory and your caching node fails, at best you lose your cached data, at worst you may get filesystem corruption, so you need to persist it.
  • If you try to support high availability between nodes, you need to distribute the (persisted) cache.
  • If you do support the above, it is better to do it at the OSD/RADOS level. Ceph did support cache tiering, but like most distributed caching it worked well for some io patterns and worse for others, so it was deprecated.
  • It is better to let the client/guest vm handle the caching on the client side (OS file system / page cache) rather than in the storage network.
  • As stated, a tiering approach was tried in Ceph without success; a move toward caching at the block level via dm-cache or bcache rather than at the pool/tier level was shown to be better, however with the move to all flash this is becoming less of a need. For spinning disks you can also use a controller with write-back cache (a minimal bcache sketch follows below).
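
If you do experiment with block-level caching under a spinning OSD disk, a minimal bcache sketch might look like the following; the device names are placeholders and this is an illustration, not a production recipe:

# assumption: /dev/sdb is the spinning backing disk, /dev/nvme0n1p1 is the SSD cache partition
make-bcache -B /dev/sdb
make-bcache -C /dev/nvme0n1p1
# attach the cache set to the backing device (UUID from bcache-super-show /dev/nvme0n1p1)
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
# then create the OSD on /dev/bcache0 instead of /dev/sdb
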
Quote from admin on March 19, 2019, 6:51 am

For small block size sequential writes, is it possible to buffer the write on the target side and then write to RBD in a larger block size?

Interesting question. With this, your iops (for sequential small block size writes) will soar, and I am sure some people would use this to boast about their numbers. Actually this question is at the core of storage architecture; some points:

  • A cache will work well with sequential io, but worse with random io. Most virtualization workloads are random, and most apps that do a lot of sequential io (backup, streaming) will not be using a small block size anyway.
  • If you cache in memory and your caching node fails, at best you lose your cached data, at worst you may get filesystem corruption, so you need to persist it.
  • If you try to support high availability between nodes, you need to distribute the (persisted) cache.
  • If you do support the above, it is better to do it at the OSD/RADOS level. Ceph did support cache tiering, but like most distributed caching it worked well for some io patterns and worse for others, so it was deprecated.
  • It is better to let the client/guest vm handle the caching on the client side (OS file system / page cache) rather than in the storage network.
  • As stated, a tiering approach was tried in Ceph without success; a move toward caching at the block level via dm-cache or bcache rather than at the pool/tier level was shown to be better, however with the move to all flash this is becoming less of a need. For spinning disks you can also use a controller with write-back cache.

Thank you for sharing the knowledge.

The reason I am trying to find a cache for small sequential writes is that I find that, even for sequential writes, the block size affects the throughput very much.

This is my test result for a native EC RBD on the same Ceph cluster:

fio parameters:

[seq-write]
description="seq-write"
direct=1
ioengine=libaio
numjobs=8
iodepth=16
group_reporting
rw=write

The test only changes the bs (block size).

bs            4m            1m            512k          128k
throughput    893.69 MB/s   513.94 MB/s   326.42 MB/s   106.61 MB/s
lat           572.69 ms     249.01 ms     195.98 ms     150.02 ms

 

Yes, as per my previous message, try caching on the client OS side: do not use the direct=1 or sync flags, which tell fio not to cache. For spinning disks you can also use a controller with cache.
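
For example, a buffered variant of your job, just as a rough sketch: direct=1 is dropped so the client page cache can absorb and merge the small sequential writes; end_fsync and size are assumptions added here so the run still flushes to the backing store and terminates:

[seq-write-buffered]
description=seq-write-buffered
ioengine=libaio
rw=write
bs=128k
numjobs=8
iodepth=16
; no direct=1, so writes go through the client page cache and get merged before hitting rbd
; flush at the end so the cached writes are still counted in the result
end_fsync=1
; assumption: set size= (or filename=) to match your test device or image
size=1g
group_reporting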

I am trying to understand why Ceph is so sensitive to the block size for sequential writes. I tried the same fio parameters to benchmark a SATA HDD, and there the block size doesn't matter much (64kb, 128kb, and 4m give almost the same result).

Technically a spinning hdd can do about 100 iops, so with 64kb io you should only see around 100 x 64 KB ≈ 6 MB/s. The reason you see higher with sequential writes is that the io path is so direct that the io scheduler + controller can easily concatenate the small blocks into larger ones; this does not happen for random writes, where you would see something close to the 6 MB/s.

With a scale-out SDS solution like Ceph, the io involves software daemons talking across the wire and database access to read/write the metadata, so the io pattern is much more complex.

Pure spinning disks will perform poorly by themselves. An SSD journal will help, and a controller with cache will greatly help. All flash will be much better as it does not have such low raw disk iops.

I see. Do you have any test results of sequential writes with different block sizes that you can share? And a comparison with and without a controller with cache?

 
