EC RBD over iSCSI performance
admin
2,930 Posts
March 18, 2019, 9:12 am
1 MB is a good value.
No, the 512 is the sector size in bytes.
shadowlin
67 Posts
March 18, 2019, 9:38 am
Quote from admin on March 18, 2019, 9:12 am
1 MB is a good value.
No, the 512 is the sector size in bytes.
So how do I set the block size in LIO and in the Linux initiator (open-iscsi)?
Are these settings related?
MaxRecvDataSegmentLength
MaxXmitDataSegmentLength
FirstBurstLength
MaxBurstLength
What is the equivalent setting of "MaxIoSizeKB" in open-iscsi?
Thank you
Last edited on March 18, 2019, 9:39 am by shadowlin · #12
admin
2,930 Posts
March 18, 2019, 12:11 pm
Yes, set those to 1048576.
Also set these:
"ImmediateData": "Yes",
"InitialR2T": "No",
"MaxOutstandingR2T": "8",
shadowlin
67 Posts
March 19, 2019, 2:48 am
Quote from admin on March 18, 2019, 12:11 pm
Yes, set those to 1048576.
Also set these:
"ImmediateData": "Yes",
"InitialR2T": "No",
"MaxOutstandingR2T": "8",
Thanks.
For small-block-size sequential writes, is it possible to buffer the writes on the target side and then write to RBD in a larger block size?
admin
2,930 Posts
March 19, 2019, 6:51 am
For small-block-size sequential writes, is it possible to buffer the writes on the target side and then write to RBD in a larger block size?
Interesting question. With this, your IOPS (for sequential small-block-size writes) would soar, and I am sure some people would use this to boast about their numbers. Actually this question is at the core of storage architecture. Some points:
- A cache works well with sequential IO but worse with random IO. Most virtualization workloads are random, and most apps that do a lot of sequential IO (backup, streaming) will not be using small block sizes anyway.
- If you cache in memory and your caching node fails, at best you lose your cached data, at worst you may get filesystem corruption, so you need to persist the cache.
- If you want to support high availability between nodes, you need to distribute the (persisted) cache.
- If you are going to support the above, it is better to do it at the OSD/RADOS level. Ceph did support cache tiering, but like most distributed caching it worked well for some IO patterns and worse for others, so it was deprecated.
- It is better to let the client/guest VM handle the caching on the client side (OS filesystem / page cache) rather than in the storage network.
- As stated, a tiering approach was tried in Ceph without success; a move toward caching at the block level via dm-cache or bcache rather than at the pool/tier level was shown to be better (a rough bcache sketch is below). However, with the move to all-flash this is becoming less of a need. For spinning disks you can also use a controller with write-back cache.
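For completeness, a minimal bcache writeback sketch on an OSD node could look like this, assuming /dev/sdb is the backing HDD and /dev/nvme0n1p1 is the caching SSD partition (device names are purely illustrative, and the disk has to be prepared this way before it is used as an OSD):
# format the cache and backing devices (bcache-tools)
make-bcache -C /dev/nvme0n1p1
make-bcache -B /dev/sdb
# attach the backing device to the cache set (UUID from: bcache-super-show /dev/nvme0n1p1)
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
# bcache defaults to writethrough; switch to writeback caching
echo writeback > /sys/block/bcache0/bcache/cache_mode
The resulting /dev/bcache0 device is then used in place of the raw HDD.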
Last edited on March 19, 2019, 6:59 am by admin · #15
shadowlin
67 Posts
March 20, 2019, 2:15 am
Quote from admin on March 19, 2019, 6:51 am
For small-block-size sequential writes, is it possible to buffer the writes on the target side and then write to RBD in a larger block size?
Interesting question. With this, your IOPS (for sequential small-block-size writes) would soar, and I am sure some people would use this to boast about their numbers. Actually this question is at the core of storage architecture. Some points:
- A cache works well with sequential IO but worse with random IO. Most virtualization workloads are random, and most apps that do a lot of sequential IO (backup, streaming) will not be using small block sizes anyway.
- If you cache in memory and your caching node fails, at best you lose your cached data, at worst you may get filesystem corruption, so you need to persist the cache.
- If you want to support high availability between nodes, you need to distribute the (persisted) cache.
- If you are going to support the above, it is better to do it at the OSD/RADOS level. Ceph did support cache tiering, but like most distributed caching it worked well for some IO patterns and worse for others, so it was deprecated.
- It is better to let the client/guest VM handle the caching on the client side (OS filesystem / page cache) rather than in the storage network.
- As stated, a tiering approach was tried in Ceph without success; a move toward caching at the block level via dm-cache or bcache rather than at the pool/tier level was shown to be better. However, with the move to all-flash this is becoming less of a need. For spinning disks you can also use a controller with write-back cache.
Thank you for sharing the knowledge.
The reason I am trying to find a cache for small sequential writes is that I find that, even for sequential writes, the block size affects throughput very much.
These are my test results for native EC RBD on the same Ceph cluster.
fio parameters:
[seq-write]
description="seq-write"
direct=1
ioengine=libaio
numjobs=8
iodepth=16
group_reporting
rw=write
The test only changes the bs (block size):
bs          | 4m          | 1m          | 512k        | 128k
throughput  | 893.69 MB/s | 513.94 MB/s | 326.42 MB/s | 106.61 MB/s
lat         | 572.69 ms   | 249.01 ms   | 195.98 ms   | 150.02 ms
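For reference, one point of this sweep corresponds to a command line like the following (the target device /dev/rbd0 and the size are illustrative; only --bs changes between runs):
fio --name=seq-write --description=seq-write --filename=/dev/rbd0 --size=10G \
    --rw=write --bs=128k --direct=1 --ioengine=libaio \
    --numjobs=8 --iodepth=16 --group_reporting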
admin
2,930 Posts
March 20, 2019, 7:04 am
Yes, as per my previous message, try caching on the client OS side: do not use the direct=1 or sync flags, which tell fio not to use the cache. For spinning disks you can also use a controller with cache.
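For example, a buffered variant of the earlier job could look like this (illustrative; note that libaio behaves more or less synchronously without direct=1, and end_fsync=1 forces a flush at the end so the result is not purely page-cache speed):
[seq-write-buffered]
description="seq-write through the page cache"
rw=write
bs=128k
# direct=0 lets the page cache absorb and merge the small writes
direct=0
# flush dirty pages at the end of the run so the numbers include the writeback
end_fsync=1
ioengine=libaio
numjobs=8
iodepth=16
group_reporting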
shadowlin
67 Posts
March 20, 2019, 7:26 am
I am trying to understand why Ceph is so sensitive to block size for sequential writes. I tried the same fio parameters to benchmark a single SATA HDD, and the block size doesn't matter much (64k, 128k, and 4m give almost the same result).
admin
2,930 Posts
March 20, 2019, 8:11 am
Technically a spinning HDD can do about 100 IOPS, so with 64 KB IO you should only see about 6 MB/s (100 x 64 KB ≈ 6 MB/s). The reason you see higher numbers with sequential writes is that the IO path is so direct that the IO scheduler and controller can easily merge the small blocks into larger ones; this does not happen for random writes, where you would see something close to the 6 MB/s.
With a scale-out SDS solution like Ceph, the IO involves software daemons talking across the wire and database access to read/write the metadata, so the IO pattern is much more complex.
Pure spinning disks will perform poorly by themselves. An SSD journal will help, and a controller with cache will help greatly. All-flash will be much better, as it does not have such low raw disk IOPS.
Last edited on March 20, 2019, 8:12 am by admin · #19
shadowlin
67 Posts
March 20, 2019, 8:40 am
I see. Do you have any sequential-write test results with different block sizes that you can share? And a comparison with and without a controller cache?