
About PetaSAN and Ceph and iSCSI


We did do some evaluation of SPDK on the iSCSI target component; the core code seems to be adapted from ISTGT. What is still missing is Ceph/Bluestore support for SPDK; there has been early work, but it is still on the Ceph to-do list.

Quote from admin on December 21, 2017, 7:20 am

We did do some evaluation of SPDK on the iSCSI target component; the core code seems to be adapted from ISTGT. What is still missing is Ceph/Bluestore support for SPDK; there has been early work, but it is still on the Ceph to-do list.

I am wondering how the SPDK iSCSI target performs with an RBD backend.

 

 

http://7xweck.com1.z0.glb.clouddn.com/cephdaybeijing201608/04-SPDK%E5%8A%A0%E9%80%9FCeph-XSKY%20Bluestore%E6%A1%88%E4%BE%8B%E5%88%86%E4%BA%AB-%E6%89%AC%E5%AD%90%E5%A4%9C-%E7%8E%8B%E8%B1%AA%E8%BF%88.pdf

 

Quote from admin on December 21, 2017, 8:36 am

 

http://7xweck.com1.z0.glb.clouddn.com/cephdaybeijing201608/04-SPDK%E5%8A%A0%E9%80%9FCeph-XSKY%20Bluestore%E6%A1%88%E4%BE%8B%E5%88%86%E4%BA%AB-%E6%89%AC%E5%AD%90%E5%A4%9C-%E7%8E%8B%E8%B1%AA%E8%BF%88.pdf

 

I have read this one, but the performance charts in this presentation are mainly about SPDK with an NVMe backend, not about the SPDK target with an RBD backend.

True, it does not cover the complete cycle, but in addition to the NVMe backend driver it also has detailed iSCSI target comparisons, which was our main area of interest. Note that the RBD backend is on the client side; to get the best cluster performance, the server-side OSD/Bluestore needs to support the NVMe backend driver, so those charts are also very relevant. The initial code for Bluestore support was done by one of the presenters, but it is still a work in progress for Ceph.

I downloaded and installed PetaSAN to use lrbd for iSCSI testing, and I found a strange problem.

I used fio to test the performance with these parameters:

fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest

At the beginning everything worked fine, but after about 10 mins the performance dropped.

I used Grafana and Graphite to monitor my Ceph cluster's performance and got this graph.

The apply latency and commit latency of osd2 increased significantly. I first thought there was something wrong with osd2, but I found nothing.
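For reference, these per-OSD commit and apply latencies can also be read directly on the command line with:

ceph osd perf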

Then I did the same test on the same machine, with the same fio parameters and for about the same length of time, and got this result.

osd2 seemed ok when using rbd to do the same test.
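For anyone reproducing this, one way to run the same workload directly against RBD (bypassing the iSCSI target) is to map the image with krbd and point the same fio job at a filesystem on the mapped device; the device name and mount point below are only examples:

rbd map rbd_pool/image          # maps the image, for example as /dev/rbd0
mkfs.xfs /dev/rbd0              # only if the image is still empty; this destroys any existing data
mount /dev/rbd0 /mnt/rbdtest
fio -directory=/mnt/rbdtest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest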

I did this test several times and the result is repeatable.

Is this some kind of bug in target_core_rbd, or can I fix it with some tuning?

You are using 8 magnetic disks, correct?
Are you using the same disk model / size for all disks?
Is your entire Ceph cluster built with PetaSAN, or are you testing an existing cluster with a PetaSAN node as client?
If you use a PetaSAN cluster, can you send the charts for raw disk IOPS and disk % busy for the disk in question?
If not, are you using Bluestore or Filestore? Do you have external journals?

Performance should not drop with time; however, it may result from Ceph scrubbing jobs, which is especially true for magnetic disks. But it should not be seen on a particular disk all the time. You can check by disabling scrubbing
ceph osd set nodeep-scrub
ceph osd set noscrub
and see if this helps.
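Once the test is done, scrubbing can be re-enabled with
ceph osd unset nodeep-scrub
ceph osd unset noscrub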

A case where you can see higher usage on one OSD than on the others is when you are working on a single rbd image and there are operations that require constant reads/writes to the image header object, such as object map or exclusive lock. target_core_rbd does make use of the image header object to support clustered active/active persistent reservations, so maybe this is what you are seeing, although I doubt it: it should not be dependent on time, and it should not have this much effect, since after writing to disk, future reads should get it from the page cache if it has not changed. The good thing is that we have made changes to the target_core_rbd module in v1.5 (due the first week of Jan) so that it does not constantly read the object header (it relies on Ceph watch/notify to communicate PR changes), which saves the extra round-trip read operation.

Can you please check which OSD is the acting primary for the rbd object header using the following:

rbd info image-00001
rbd image 'image-00001':
size 20480 MB in 5120 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.d38b2443a858 <---------- header name
format: 2
features: layering, striping
flags:
stripe unit: 4096 kB
stripe count: 1

ceph osd map rbd rbd_data.d38b2443a858
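As a side check, if I am not mistaken the header object itself on format 2 images is named rbd_header.<id>, using the same id that appears in block_name_prefix, so the same lookup can also be run against it:

ceph osd map rbd rbd_header.d38b2443a858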

I am using 12 magnetic disks and the pool size is 2.

I am using an existing Ceph (Luminous 12.2.2) cluster with Bluestore and no external journals.

All of the disks and OSD nodes (ARM based, with the same spec) are the same model and size.

I have already disabled scrubbing.
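The cluster flags (including noscrub and nodeep-scrub) can be verified with:

ceph osd dump | grep flags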

It seems the acting primary OSD is not the same OSD as the one with high latency.

rbd info rbd_pool/image
rbd image 'image':
size 51200 GB in 13107200 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8d7b74b0dc51
format: 2
features: layering
flags:
create_timestamp: Fri Dec 15 21:09:46 2017

ceph osd map rbd_pool rbd_data.8d7b74b0dc51
osdmap e375 pool 'rbd_pool' (3) object 'rbd_data.8d7b74b0dc51' -> pg 3.74f868af (3.af) -> up ([11,7], p11) acting ([11,7], p11)

 

I am not sure why this would be happening. Try to get the raw (OS) disk % busy and raw disk IOPS: is that disk busier than the other raw disks? If the IOPS are the same but it is busier, it could be a hardware issue; otherwise something else is causing more I/Os.
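On a non-PetaSAN node, something like iostat from the sysstat package will show both per-disk IOPS (the r/s and w/s columns) and % busy (the %util column), sampled here every 5 seconds:

iostat -x -d 5

You can then compare the disk behind osd2 against the other raw disks.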

Another general thing that is often overlooked: in many cases the results will be affected by client resources or by how the client is configured. One client will often not be enough to stress the cluster, and sometimes adding too many jobs on a single client will give lower values.

I would recommend running several tests as follows:
-iodepth 8 -numjobs=1
-iodepth 16 -numjobs=1
-iodepth 32 -numjobs=1
Then stop if it saturates or converges to the point where there is very little difference,
and
-iodepth 1 -numjobs=8
-iodepth 1 -numjobs=16
-iodepth 1 -numjobs=32
Then stop if it saturates or converges to the point where there is very little difference.

Choose the max configuration, then run it concurrently on 2 clients and add up their combined results; keep adding clients until the result saturates or converges.
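A rough scripted version of the single-client sweeps, reusing the fiotest directory from the earlier run (the size and runtime values here are only placeholders to adjust to your setup):

# sweep queue depth with a single job
for qd in 8 16 32; do
fio -directory=fiotest -direct=1 -rw=write -ioengine=libaio -bs=1m -size=20G -runtime=60 -time_based -iodepth=$qd -numjobs=1 -group_reporting -name=qd_$qd
done

# sweep the number of jobs at queue depth 1
for jobs in 8 16 32; do
fio -directory=fiotest -direct=1 -rw=write -ioengine=libaio -bs=1m -size=20G -runtime=60 -time_based -iodepth=1 -numjobs=$jobs -group_reporting -name=jobs_$jobs
done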

The PetaSAN cluster benchmark page makes this process much easier by running client sweeps, aggregating the results, and showing resource bottlenecks.

Thanks for the advice. I will add iostat to my Grafana dashboard and do more tests with different fio parameters.
