Cluster performance after upgrade
BonsaiJoe
53 Posts
February 19, 2018, 6:39 pm
Hi,
we could not wait to upgrade our cluster to 2.0. Thanks for your great work!
But we are a bit confused when comparing cluster performance between version 1.5 and 2.0. Cluster Benchmark Test: 1 client, 32 threads, 1 minute.
In version 1.5 we got 5,000 write IOPS, 28,100 read IOPS, and 940 MB/s write, 2,260 MB/s read.
In version 2.0 we get 2,500 write IOPS, 23,600 read IOPS, and 920 MB/s write, 1,800 MB/s read.
For version 2.0 the system has been reinstalled using the same configuration.
The cluster has 4 nodes; each node has 20x 1.8 TB 10K SAS OSDs and 4x 400 GB SSD journals.
Backend network is a 20G bond (2x 10G).
iSCSI is 2x 10G.
Management is 2G (2x 1G bond).
Any idea?
admin
2,930 Posts
February 19, 2018, 9:35 pm
In most cases bluestore will be faster; filestore will be in some cases.
Bluestore is faster when you use an all-SSD/NVMe setup.
Filestore could be faster in an SSD journal / HDD OSD setup, particularly when your queue depth / client thread count is lower than the number of OSDs. In your setup, bluestore will likely be faster if the total threads across all your clients are several times larger than your total disk count of 80.
Lastly, check that you have enough CPU cores to handle the 20 disks per node.
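As a quick way to compare core count against disk count on a storage node, something like the following sketch can help (it assumes the default Ceph OSD data path under `/var/lib/ceph/osd` and a cluster named "ceph"; adjust if yours differ):

```shell
# Compare available CPU threads to the number of OSDs on this node.
# /var/lib/ceph/osd/ceph-* is the default OSD data path (an assumption;
# adjust the glob if your cluster name is not "ceph").
cores=$(nproc)
osds=$(ls -d /var/lib/ceph/osd/ceph-* 2>/dev/null | wc -l)
echo "cores=$cores osds=$osds"
```

With 20 OSDs per node, a result well below 20 cores suggests CPU could be the bottleneck during heavy concurrent IO.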
Last edited on February 19, 2018, 9:39 pm by admin · #2
BonsaiJoe
53 Posts
February 19, 2018, 10:05 pm
Perfect, thanks for the information. You are right: if we increase the threads to 64 and use 3 of our nodes as clients, we get 7,000 write IOPS, 32,400 read IOPS, and 1,500 MB/s write, 2,150 MB/s read.
Each of the nodes has 2x 8-core CPUs (32 threads) and 64 GB RAM.
Now we have added the cluster to our VMware cluster and run into the next problem: if we move a VM from a single SSD-only store to the PetaSAN cluster, throughput is max 80 MB/s,
but if we move the VM between 2 other storages we get 500-800 MB/s.
Btw, we use 8x multipathing (2 paths on each node) with round robin; the IOPS limit is set to 1 on all ESX hosts for the PetaSAN naa.
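For reference, the per-device round-robin IOPS limit described above is typically applied on each ESXi host with `esxcli`; a sketch (the `naa.xxxxxxxx` device identifier is a placeholder for the actual PetaSAN LUN):

```shell
# Switch the round-robin PSP to rotate paths after every 1 IO
# for the given device (naa.xxxxxxxx is a placeholder identifier).
esxcli storage nmp psp roundrobin deviceconfig set \
    --type=iops --iops=1 --device=naa.xxxxxxxx
```

The current setting can be checked afterwards with `esxcli storage nmp psp roundrobin deviceconfig get --device=naa.xxxxxxxx`.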
admin
2,930 Posts
February 19, 2018, 10:34 pm
How is the speed tested? Are you running several threads inside the VM, or is it a single-threaded test like a file copy?
As a side note: you can still use 2x multipathing and have each of the 4 ESX hosts use the same 2 paths.
BonsaiJoe
53 Posts
February 19, 2018, 11:20 pm
Both: if we move a VM (Storage vMotion) we get max 80 MB/s write, and also if we use CrystalDiskMark inside a VM, read is OK with 460 MB/s but write is still max 80 MB/s (test with sequential, 32 queues and 1 thread). A test with 32 queues and 8 threads results in 1,450 MB/s read but only 100 MB/s write.
Sorry for the misunderstanding. Our idea with 8-path multipathing was to spread the iSCSI load over all 4 PetaSAN nodes (each PetaSAN node hosts 2 of the paths), and each of the 6 ESX hosts is connected with all 8 paths.
admin
2,930 Posts
February 20, 2018, 7:23 am
- For single-threaded IO, like a file copy, Ceph will not give high results: for reads you will get at most the speed of a single disk; for writes it will be about 3 times slower. For a single IO, your disks are idle while 1 disk serves the IO; this is unlike RAID. But Ceph scales when you have many concurrent IOs at the same time. vMotion will not be fast for a single VM: it does its transfer in sequential 64K IO sizes, which keeps hitting the same OSD disk.
- What block size are you using in CrystalDiskMark? For a throughput test, use 4M. Also increase the thread count from 8 to 32. My understanding of CrystalDiskMark is that threads simulate separate random clients, while the queue issues IO for the same client sequentially, which keeps hitting the same OSDs and does not utilize all cluster disks.
- Can you keep the test running for 10 min and observe the resource load on a PetaSAN node, specifically CPU?
- In ESX, if you run the test from 2 or more VMs, do you still see the same numbers?
- If you can, run CrystalDiskMark on 1 or 2 non-VM external machines, like physical Windows 2016/2012 with MPIO, to see if there are any configuration issues on the ESX side.
- If you need only a few concurrent IOs and still want max Ceph performance, it may be possible to use rbd striping, where each disk image is striped at smaller sizes such as 32K bytes (default is 4M). In that case a single client IO is split into several concurrent OSD requests, so it can access several disks in parallel for a single IO. This feature is not supported in the Linux kernel, but we added this support in our kernel since version 1.5. It is not fully supported yet, and you need manual CLI commands to set it.
Last edited on February 20, 2018, 11:19 am by admin · #6
yudorogov
31 Posts
February 21, 2018, 6:28 am
Quote from admin:
- If you need only a few concurrent IOs and still want max Ceph performance, it may be possible to use rbd striping, where each disk image is striped at smaller sizes such as 32K bytes (default is 4M). In that case a single client IO is split into several concurrent OSD requests, so it can access several disks in parallel for a single IO. This feature is not supported in the Linux kernel, but we added this support in our kernel since version 1.5. It is not fully supported yet, and you need manual CLI commands to set it.
How can I get the CLI commands for this?
admin
2,930 Posts
February 21, 2018, 12:34 pm
This striping feature is still experimental. Before you try:
Make sure you actually need to split a single client IO request. This is only needed in very specific cases where you have a fixed application load and more disks than IO requests, so most of your disks are idle. Under real production load, if you end up with many concurrent clients, striping could give worse results: all disks would be busy anyway, and the splitting would add unneeded overhead.
To make use of striping, you need to know what block sizes your application is using (you can measure this by taking your bandwidth and dividing it by your IOPS). You will need to create stripes no larger than your block size.
To try this, create a striped image, for example:
rbd create test-image --size 10240 --image-feature striping --image-feature layering --stripe-unit 32768 --stripe-count 16 --cluster XX
The stripe size (--stripe-unit) is 32K: client requests larger than 32K will be split and sent to different OSDs/disks. If a request spans more than 16 stripe units (--stripe-count), it cycles back to the first OSD, then the second.
After this you need to attach this manually created image in PetaSAN: in the iSCSI disk list use the "Attach" action, which creates an iSCSI target disk, then click the "Start" action.
Again, this is experimental.
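The bandwidth-divided-by-IOPS rule above can be sketched in shell. The numbers here are illustrative, not taken from this thread:

```shell
# Estimate the client IO block size from observed throughput and IOPS.
# Illustrative measurements: 100 MiB/s of throughput at 1,600 IOPS.
bw_kib_per_s=$((100 * 1024))        # throughput in KiB/s
iops=1600
block_kib=$((bw_kib_per_s / iops))  # 102400 / 1600 = 64
echo "approx block size: ${block_kib} KiB"
```

For these numbers the estimate is 64 KiB, so a --stripe-unit of 32768 (32K) or 65536 (64K) would satisfy the "no larger than your block size" guideline.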
Last edited on February 21, 2018, 12:34 pm by admin · #8