
About PetaSAN and Ceph and iSCSI


We did do some evaluation of SPDK on the iSCSI target component; the core code seems to be adapted from ISTGT. What is still missing is Ceph/Bluestore support for SPDK; there has been early work, but it is still on the Ceph to-do list.

Quote from admin on December 21, 2017, 7:20 am

We did do some evaluation of SPDK on the iSCSI target component; the core code seems to be adapted from ISTGT. What is still missing is Ceph/Bluestore support for SPDK; there has been early work, but it is still on the Ceph to-do list.

I am wondering how the SPDK iSCSI target performs with an RBD backend.

 

 

http://7xweck.com1.z0.glb.clouddn.com/cephdaybeijing201608/04-SPDK%E5%8A%A0%E9%80%9FCeph-XSKY%20Bluestore%E6%A1%88%E4%BE%8B%E5%88%86%E4%BA%AB-%E6%89%AC%E5%AD%90%E5%A4%9C-%E7%8E%8B%E8%B1%AA%E8%BF%88.pdf

 

Quote from admin on December 21, 2017, 8:36 am

 

http://7xweck.com1.z0.glb.clouddn.com/cephdaybeijing201608/04-SPDK%E5%8A%A0%E9%80%9FCeph-XSKY%20Bluestore%E6%A1%88%E4%BE%8B%E5%88%86%E4%BA%AB-%E6%89%AC%E5%AD%90%E5%A4%9C-%E7%8E%8B%E8%B1%AA%E8%BF%88.pdf

 

I have read this one, but the performance charts in this presentation are mainly about SPDK with an NVMe backend, not about the SPDK target with an RBD backend.

True, it does not cover the complete cycle, but in addition to the NVMe backend driver it also has detailed iSCSI target comparisons, which was our main area of interest. Note that the RBD backend is on the client side; to get the best cluster performance, the server-side OSD/Bluestore needs to support the NVMe backend driver, so those charts are also very relevant. The initial code for Bluestore support was done by one of the presenters, but it is still a work in progress for Ceph.

I downloaded and installed PetaSAN to use lrbd for iSCSI testing, and I found a strange problem.

I used fio to test the performance with these parameters:

fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest

At the beginning everything worked fine, but after about 10 mins the performance dropped.

I used Grafana and Graphite to monitor my Ceph cluster's performance and got this graph.

The apply latency and commit latency of osd2 increased significantly. I first thought there was something wrong with osd2, but I found nothing.
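For reference, these per-OSD commit and apply latencies can also be read directly on the command line with:

ceph osd perf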

Then I did the same test on the same machine, with the same fio parameters and for about the same length of time, and got this result.

osd2 seemed ok when using rbd to do the same test.
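For anyone reproducing this, one way to run the same workload directly against RBD (bypassing the iSCSI target) is to map the image with krbd and point the same fio job at a filesystem on the mapped device; the device name and mount point below are only examples:

rbd map rbd_pool/image          # maps the image, for example as /dev/rbd0
mkfs.xfs /dev/rbd0              # only if the image is still empty; this destroys any existing data
mount /dev/rbd0 /mnt/rbdtest
fio -directory=/mnt/rbdtest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest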

I did this test several times and the result is repeatable.

Is this some kind of bug in target_core_rbd, or can I fix it with some tuning?

You are using 8 magnetic disks, correct?
Are you using the same disk model / size for all disks?
Is your entire Ceph cluster built with PetaSAN, or are you testing an existing cluster with a PetaSAN node as client?
If you use a PetaSAN cluster, can you send the charts for raw disk IOPS and disk % busy for the disk in question?
If not, are you using Bluestore or Filestore? Do you have external journals?

Performance should not drop with time; however, it may result from Ceph scrubbing jobs, which is especially true for magnetic disks. But it should not be seen on a particular disk all the time. You can check by disabling scrubbing
ceph osd set nodeep-scrub
ceph osd set noscrub
and see if this helps.
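Once the test is done, scrubbing can be re-enabled with
ceph osd unset nodeep-scrub
ceph osd unset noscrub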

A case where you can see higher usage on one OSD than on the others is when you are working on a single rbd image and there are operations that require constant reads/writes to the image header object, such as object map or exclusive lock. target_core_rbd does make use of the image header object to support clustered active/active persistent reservations, so maybe this is what you are seeing, although I doubt it: it should not be dependent on time, and it should not have this much effect, since after writing to disk, future reads should get it from the page cache if it has not changed. The good thing is that we have made changes to the target_core_rbd module in v1.5 (due the first week of Jan) so that it does not constantly read the object header (it relies on Ceph watch/notify to communicate PR changes), which saves the extra round-trip read operation.

Can you please check which OSD is the acting primary for the rbd object header using the following:

rbd info image-00001
rbd image 'image-00001':
size 20480 MB in 5120 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.d38b2443a858 <---------- header name
format: 2
features: layering, striping
flags:
stripe unit: 4096 kB
stripe count: 1

ceph osd map rbd rbd_data.d38b2443a858
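As a side check, if I am not mistaken the header object itself on format 2 images is named rbd_header.<id>, using the same id that appears in block_name_prefix, so the same lookup can also be run against it:

ceph osd map rbd rbd_header.d38b2443a858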

I am using 12 magnetic disks and the pool size is 2.

I am using an existing Ceph (Luminous 12.2.2) cluster with Bluestore and no external journals.

All of the disks and OSD nodes (ARM based, with the same spec) are the same model and size.

I have already disabled scrubbing.
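The cluster flags (including noscrub and nodeep-scrub) can be verified with:

ceph osd dump | grep flags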

It seems the acting primary OSD is not the same OSD as the one with high latency.

rbd info rbd_pool/image
rbd image 'image':
size 51200 GB in 13107200 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8d7b74b0dc51
format: 2
features: layering
flags:
create_timestamp: Fri Dec 15 21:09:46 2017

ceph osd map rbd_pool rbd_data.8d7b74b0dc51
osdmap e375 pool 'rbd_pool' (3) object 'rbd_data.8d7b74b0dc51' -> pg 3.74f868af (3.af) -> up ([11,7], p11) acting ([11,7], p11)

 

I am not sure why this would be happening. Try to get the raw (OS) disk % busy and raw disk IOPS: is that disk busier than the other raw disks? If the IOPS are the same but it is busier, it could be a hardware issue; otherwise something else is causing more I/Os.
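On a non-PetaSAN node, something like iostat from the sysstat package will show both per-disk IOPS (the r/s and w/s columns) and % busy (the %util column), sampled here every 5 seconds:

iostat -x -d 5

You can then compare the disk behind osd2 against the other raw disks.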

Another general thing that is often overlooked: in many cases the results will be affected by client resources or by how the client is configured. One client will often not be enough to stress the cluster, and sometimes adding too many jobs on a single client will give lower values.

I would recommend running several tests as follows:
-iodepth 8 -numjobs=1
-iodepth 16 -numjobs=1
-iodepth 32 -numjobs=1
Then stop if it saturates or converges to the point where there is very little difference,
and
-iodepth 1 -numjobs=8
-iodepth 1 -numjobs=16
-iodepth 1 -numjobs=32
Then stop if it saturates or converges to the point where there is very little difference.

Choose the max configuration, then run it concurrently on 2 clients and add up their combined results; keep adding clients until the result saturates or converges.
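A rough scripted version of the single-client sweeps, reusing the fiotest directory from the earlier run (the size and runtime values here are only placeholders to adjust to your setup):

# sweep queue depth with a single job
for qd in 8 16 32; do
fio -directory=fiotest -direct=1 -rw=write -ioengine=libaio -bs=1m -size=20G -runtime=60 -time_based -iodepth=$qd -numjobs=1 -group_reporting -name=qd_$qd
done

# sweep the number of jobs at queue depth 1
for jobs in 8 16 32; do
fio -directory=fiotest -direct=1 -rw=write -ioengine=libaio -bs=1m -size=20G -runtime=60 -time_based -iodepth=1 -numjobs=$jobs -group_reporting -name=jobs_$jobs
done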

The PetaSAN cluster benchmark page makes this process much easier by running client sweeps, aggregating the results, and showing resource bottlenecks.

Thanks for the advice. I will add iostat to my Grafana dashboard and do more tests with different fio parameters.
