
All-flash performance

Hello Admin,

We set up a PetaSAN cluster and connected it via iSCSI to a VMware cluster.
We used 3 identical nodes (ProLiant DL380p Gen8):

- 2x Intel(R) Xeon(R) CPU E5-2640
- 64 GB RAM
- 1x 1Gbit network card (eth0: management)
- 2x 10Gbit network cards (eth4: iscsi-1, backend-1; eth5: iscsi-2, backend-2)
- Smart HBA H240 (HBA mode)
- 5x Crucial MX500 2TB (OSDs)
- 1x 512GB Samsung 970 Pro NVMe (Journal)

We created a replicated pool with a size of 3 and a min_size of 2.
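For reference, these settings can be verified from any node's shell; "rbd" below is just a placeholder pool name, not necessarily the pool we created:

ceph osd pool get rbd size
ceph osd pool get rbd min_size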

Right now we are moving a few VMs onto the storage to run some performance tests, and the first impression is disappointing.
The performance in the VMs is worse than or similar to our old setup (ZFS, 10k SAS, RAID10).
The result of PetaSAN's built-in Cluster Benchmark is also not satisfying:
IOPS with 64 threads, 10 minutes: 11k write and 25k read

As suggested in an older forum post, we looked at the CPU and disk utilization during the test.
Please have a look at the screenshots:
https://drive.google.com/open?id=1oFbg5Fr_QYWTx7qYfobmdAEy8gRqqxOX
The test started at 15:27 and ran for 10 minutes (HBPS01 was the client).

What are normal IOPS values for an all-flash PetaSAN? What can we change to improve the performance?

Thank you for your help.
Trexman

Not all SSDs perform the same; with Ceph the difference can be large, as described in:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
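The test from that article boils down to a single 4k sync write job with fio; a minimal sketch (destructive, so only run it against a blank disk, and /dev/sdX is a placeholder):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test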

Also, generally RAID will be faster when you have a low number of concurrent i/o; Ceph will be slower per i/o but will scale better.

We see all-flash systems giving random iops from 30k/10k to 60k/60k read/write for a 3-node system, and this will scale linearly per node.

For your hardware, I am not sure how the Crucial SSDs perform on sync writes; you can test this via the PetaSAN node console (blue screen) on a raw blank disk (it cannot be done after the disk has been added as an OSD). However, what stands out is the NVMe device being at 100% busy! If this is true then it will be a bottleneck. It could be something wrong with the device, there may be an issue in the kernel with this specific device, or (better) our measuring code (based on sysstat/sar) is reporting wrong values. You can run a different reporting tool like atop or collectl (included with PetaSAN) alongside sysstat and see if they give more normal values. Another thing: with good SSD drives you really do not need an NVMe journal, so I also recommend removing the NVMe and using the SSDs directly without an external journal.
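For example, to cross-check the %busy figure outside the dashboard (sample intervals are arbitrary):

iostat -x 5          # extended device stats from sysstat, the same source as our graphs
sar -d -p 5          # per-device utilization, also sysstat
atop 5               # independent view, shipped with PetaSAN
collectl -sD         # detailed disk stats, also shipped with PetaSAN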

 

Trexman, whatever came of this? I am also using the same exact servers and was wondering what you came up with.
Quote from trexman on January 14, 2019, 4:14 pm


Ceph does its writes using sync=1; this makes sure writes hit the storage media and are not cached on the disk, so as to protect against data loss/corruption in case of power failure. Most consumer SSDs may have very high iops for normal writes but may give only 600-2000 iops for sync writes; in contrast, some enterprise SSDs give 500K+ sync iops, so not all SSDs work well with Ceph. I believe the MX500 will give around 2k sync iops; EVOs typically have low sync iops, but I am not sure about the 970 Pro. If you have empty/raw disks you can use the PetaSAN blue console menu to test sync write speed.
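To see that gap for yourself, you can compare a plain 4k random write run against a sync write run on a raw, blank disk (destructive; /dev/sdX is a placeholder):

# plain 4k random writes at queue depth 32
fio --filename=/dev/sdX --direct=1 --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting --name=plain-write

# the same drive with sync writes, which is what matters for Ceph
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=1 --runtime=60 --time_based --group_reporting --name=sync-write

On consumer drives the second run is typically a small fraction of the first; on good enterprise drives the two numbers are much closer.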

Options for write iops speed:

  • remove the 970 Pro journal and just use the MX500s; instead of having 5 per host, have 10-16 (or until your CPU starts to saturate)
  • get a controller with a battery-backed write-back cache; this will internally access the disks without the sync write flag, as the writes are cached
  • get higher-end disks

For reads: we configure read_ahead_kb in /etc/udev/rules.d/90-petasan-disk.rules to 1M. Lowering this should give better random read iops, but it is better to leave it as is, since it significantly increases sequential iops. The PetaSAN benchmark is 4k random.
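If you do want to experiment, read_ahead_kb can be changed per device at runtime before touching the udev rule (sdX is a placeholder; 1024 is the 1M default mentioned above):

cat /sys/block/sdX/queue/read_ahead_kb          # current value in KB
echo 128 > /sys/block/sdX/queue/read_ahead_kb   # temporarily lower it to test random reads
echo 1024 > /sys/block/sdX/queue/read_ahead_kb  # back to the PetaSAN default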

 

Hi,

we are still working on our all-flash PetaSAN. In the meantime we increased the number of SSDs from 5 to 11 per node.
The write-back cache of the controller that we have been using since then was a real boost for the write IOPS.
Unfortunately it is hard to say if or how much the 6 additional SSDs per node improved performance, because the SAN is now fully in production.

Because of some changes we have the opportunity to switch from the Crucial MX500 2TB to some enterprise SSDs, but we are a little bit lost now.
As you explained, consumer SSDs may have very high IOPS for normal writes but much lower IOPS for sync writes. If you try to look this up in datasheets you get nowhere.
So we took some references from the Sébastien Han website linked above.

But we are still not sure which 2TB enterprise 6G SATA SSDs are good for a reasonable price.

My idea now was to buy 2-3 different models, put them into a PetaSAN node and run the test from the "blue console" menu.
But is this test meaningful for the use of e.g. 30 OSDs in a PetaSAN/Ceph cluster? Or would it just give us (or confirm) the values you can find in the manufacturers' datasheets?

I find the advice "better use enterprise SSDs" really hard to act on when you have to choose one 😉

Thanks for your help and ideas.

For high-end flash setups, we reach 25k random write client iops per server. Improving performance on an existing system can be a never-ending cycle: whenever you improve some hardware, something else becomes the bottleneck, so it is important to weigh this against how much your workload requires as well as your budget.

If you add very fast flash disks but your CPU is at 100%, the extra disks will not improve iops. Aside from performance, enterprise SSDs offer durability and power-loss protection. If your workload does require the lowest latencies (iops per client), then fast SSDs + CPUs + NICs are needed; this is the most expensive setup.
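A quick way to see which resource is the ceiling during a benchmark run, using the same sysstat tools (sample interval/count are arbitrary):

sar -P ALL -u 5 3    # per-core CPU utilization
sar -n DEV 5 3       # per-NIC throughput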

FYI, some flash devices known to work well:

Samsung PM/SM series: 863, 883, 963, 983, 1633, 1725
Intel: S3510, P3600, S3610, P3700, S4500, P/S4510, S4600, P/S4610