
Very poor performance on PetaSAN cluster

Hello all,

I apologize for coming on here with a complaint, but at this point I am running out of options and I need some help. I have two 3-node clusters from 45Drives (Storinator Q30s). After running into performance problems we followed their suggestion and switched from plain Ceph with an iSCSI target server to PetaSAN. When we first installed each system we were seeing reads in the 4.8 GB/s range and writes in the 1.8 GB/s range (both sequential 4K). I am sad to say that after using the holiday yesterday to shut down the complete load on both sides, I returned with less than flattering results. On one cluster (main) we landed at approx 700 MB/s read and 71 MB/s write, and about 150 MB/s read and 120 MB/s write on the other. I have many VMs that have slowed to a crawl, including a mission-critical SQL system that is hitting disk latencies in the 100s of seconds (not milliseconds). At this point I really need some help with this. Below are the specs of the two clusters (generalized) as well as a bit of relevant config.

Node Hardware - Cluster 1 (main) (all 3 nodes):

  • 2019 Era Storinator Q30
  • 2x - Intel Xeon E5-2620 v4
  • 2x 1Gb/s Intel C610/X99 for management

Node Hardware - Cluster 2 (secondary) (all 3 nodes):

  • 2020 Era Storinator Q30
  • 2x - Intel Xeon Silver 4210
  • 2x 1Gb/s Intel X722 for management

Node Hardware Continued (Both Clusters) -

  • 256GB DDR4
  • 2x 40Gb/s QSFP+ adapters (Intel XL710)

Disk Configuration (main) x3:

  • 2x - 256 GB SSD for boot / redundancy
  • 2x - 12TB HDDs
  • 4x - 12TB HDDs with Journal to sdr and Cache to sdq
  • 2x - 16TB HDDs with Journal to sdr and Cache to sdq
  • 6x - 4TB Micron Enterprise SSDs
  • 2x - 8TB Micron Enterprise SSDs with Journaling enabled to sdr
  • 1x - 8TB Samsung SSD used for HDD Cache - sdq
  • 1x  - 2TB Samsung SSD used for Journal - sdr

Disk Configuration (secondary) x3:

  • 2x - 256 GB SSD for boot / redundancy
  • 6x - 18TB HDDs with Journal to sdo and Cache to sdm
  • 5x - 4TB Micron Enterprise SSDs with Journaling enabled to sdo
  • 1x - 2TB Samsung SSD used for HDD Cache - sdm
  • 1x  - 2TB Samsung SSD used for Journal - sdo

Pools (on both):

  • rbd - Not Used - PGs 1024 - Size 3 - Min Size 2 - Rule: replicated_rule
  • rbd_hdd2 - PGs 512 - Size 2 - Min Size 2 - Rule: replicated_hdd
  • rbd_hdd3 - PGs 256 - Size 3 - Min Size 2 - Rule: replicated_hdd
  • rbd_ssd2 - PGs 256 - Size 2 - Min Size 2 - Rule: replicated_ssd
  • rbd_ssd3 - PGs 128 - Size 3 - Min Size 2 - Rule: replicated_ssd
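(For reference, the equivalent Ceph CLI for a pool like rbd_ssd3 would be roughly the following; a minimal sketch, assuming the CRUSH rules shown below already exist:)

# create a replicated pool with 128 PGs on the SSD rule, then match the settings above
ceph osd pool create rbd_ssd3 128 128 replicated replicated_ssd
ceph osd pool set rbd_ssd3 size 3
ceph osd pool set rbd_ssd3 min_size 2
# tag the pool for RBD use
ceph osd pool application enable rbd_ssd3 rbd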

Crush Rules:

  1. replicated_hdd

    rule replicated_hdd {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class hdd
        step chooseleaf firstn 0 type host
        step emit
    }

  2. replicated_ssd

    rule replicated_ssd {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
    }
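(These can be re-checked on any node with the standard CRUSH tooling; a minimal sketch:)

# export the compiled CRUSH map and decompile it to text
ceph osd getcrushmap -o /tmp/crushmap
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
# or dump a single rule as JSON
ceph osd crush rule dump replicated_ssd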

 

The images below show the results of the nodes under a full-speed test load against the "rbd_ssd3" pool. This is the pool I have previously witnessed performing far better.
The attached images showed me something I believe to be strange: no single SSD being tested reached a load over 50 MB/s at any point during the test (approx 16:25), yet the network adapters on nodes 2 & 3 were highly utilized even though the iSCSI mount point was only on node 1. As a side note, although I did not catch it in my screenshots, memory was at a reasonable level and CPU was under 15% load at all times.

Please let me know if you would like any more information... I have old cluster benchmarks, but unfortunately nothing recent, as I forgot to run them yesterday when I had everything down. If you would like me to upload those, let me know and I'll see if I can find them.

Thank you all in advance for your help!

SPraus

 

The images below show the results of the nodes under a full-speed test load against the "rbd_ssd3" pool

How was this test done?

The graphs do not show the cluster is loaded; was there no other activity besides the test?

Are the disks shown in the graph SSDs or HDDs?

Can you show the Disk % Util graph for this test?

What version of PetaSAN do you use?

 

Hi there, thanks for the quick response! Below are the replies to your questions:

How was this test done?

  • I have tried several different ways, but this specific test was done from a VMware VM using an iSCSI "RDM disk" mount. In the past I have tried direct iSCSI to Windows and/or Linux, as well as the built-in cluster benchmark, and the results have always been within 10% of each other. (I believe the built-in benchmark wraps rados bench; a roughly comparable run is sketched below.)
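(A minimal sketch, run from a node shell against the pool above; duration and thread count are only examples:)

# 60-second sequential write test (4 MiB objects by default), 16 threads
rados bench -p rbd_ssd3 60 write -t 16 --no-cleanup
# matching sequential read pass over the objects just written
rados bench -p rbd_ssd3 60 seq -t 16
# remove the benchmark objects afterwards
rados -p rbd_ssd3 cleanup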

The graphs do not show the cluster is loaded; was there no other activity besides the test?

  • That is correct, I ran these tests under no load... My usual load is around 40 VMs running on the cluster; I used the holiday as an excuse to shut them all down and place only the single test load on it to get the best possible single-point test. Under load it is far worse: at this moment (with the load) I am getting about 100 MB/s read and about 10 MB/s write, with disk latencies upwards of 5-7 seconds (5000-7000 ms).

Are the disks shown in the graph SSDs or HDDs?

  • Although all disks are shown, the ones with activity spikes were just the SSDs, including the journal disk.

Can you show the Disk % Util graph for this test?

What version of PetaSAN do you use?

  • I use version 2.8.1-45drives1, which I believe is just the 45Drives fork that adds their tools for their chassis.

 

*Begin Edit
I have also uploaded this image: https://ibb.co/XVkpxv5. On this node:
sda, sdb are the OS disks
sdc, sdd, sde, sdf, sdg, sdh are SSD OSDs (12-17)
sds, sdt are SSD OSDs (34, 40) using the journal
sdr is the SSD journal
*End Edit

 

Thank you very much for your help!!!
SPraus

How was this test done?

I have tried several different ways, but this specific test was done from a VMware VM using an iSCSI "RDM disk" mount. In the past I have tried direct iSCSI to Windows and/or Linux, as well as the built-in cluster benchmark, and the results have always been within 10% of each other.

Still not clear... what did you run from the VM: a file copy operation? A 4k file? A large file? What block size? How many concurrent operations/threads?

Although all disks are shown, the ones with activity spikes were just the SSDs, including the journal disk.

Do you use a journal for your OSD SSDs?

Can you show the Disk % Util currently under load?
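(If the dashboard charts are hard to capture, the same numbers can be pulled on each node with iostat from sysstat, e.g.:)

# extended device stats every 5 seconds; %util is the rightmost column
iostat -x 5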

Still not clear... what did you run from the VM: a file copy operation? A 4k file? A large file? What block size? How many concurrent operations/threads?

Sorry for not being clear. In this specific case I ran CrystalDiskMark with the following configuration (a roughly equivalent fio invocation is sketched after the list):

Test

  • Read:
    • Queue=8, Threads=1, Seq 1M -> 791 MB/s
    • Queue=1, Threads=1, Seq 1M -> 297 MB/s
    • Queue=32, Threads=1, Rand 4K -> 243 MB/s
    • Queue=1, Threads=1, Rand 4K -> 17.3 MB/s
  • Write:
    • Queue=8, Threads=1, Seq 1M -> 71 MB/s
    • Queue=1, Threads=1, Seq 1M -> 44 MB/s
    • Queue=32, Threads=1, Rand 4K -> 21 MB/s
    • Queue=1, Threads=1, Rand 4K -> 3.0 MB/s
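(For reproducing these numbers from a Linux guest, something like the following fio runs should be roughly equivalent; the test file path is a placeholder:)

# 1 MiB sequential reads at queue depth 8, roughly CrystalDiskMark SEQ1M Q8T1
fio --name=seq-read --filename=/mnt/test/fio.bin --size=4G --direct=1 --ioengine=libaio --rw=read --bs=1M --iodepth=8 --numjobs=1 --runtime=60 --time_based
# 4 KiB random writes at queue depth 1, roughly RND4K Q1T1
fio --name=rand-write --filename=/mnt/test/fio.bin --size=4G --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based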

Do you use a journal for your OSD SSDs?

I am currently using a journal for 2 of the 8 SSDs, as I hadn't fully switched over since adding it. Honestly, I saw a performance decrease after adding it to those 2, so I held off on adding it to the rest.

Can you show the Disk % Util currently under load?

No problem, here are all 3 nodes:

Again thank you so much for your help and your speedy responses!

SPraus

 

PS: As for the current load, here is a CrystalDiskMark test performed a few hours ago:

  • [Read]
    SEQ 1MiB (Q= 8, T= 1): 213.513 MB/s [ 203.6 IOPS] < 34555.91 us>
    SEQ 128KiB (Q= 32, T= 1): 252.746 MB/s [ 1928.3 IOPS] < 16557.61 us>
    RND 4KiB (Q= 32, T=16): 55.607 MB/s [ 13575.9 IOPS] < 36599.42 us>
    RND 4KiB (Q= 1, T= 1): 3.436 MB/s [ 838.9 IOPS] < 1189.75 us>

    [Write]
    SEQ 1MiB (Q= 8, T= 1): 28.733 MB/s [ 27.4 IOPS] <275903.91 us>
    SEQ 128KiB (Q= 32, T= 1): 29.232 MB/s [ 223.0 IOPS] <140706.10 us>
    RND 4KiB (Q= 32, T=16): 4.103 MB/s [ 1001.7 IOPS] <318612.06 us>
    RND 4KiB (Q= 1, T= 1): 0.248 MB/s [ 60.5 IOPS] < 15548.06 us>

Do you have support with 45Drives? If so, I would recommend you open an issue with them and they will get us involved.

Hi there,

So yes, I have support through 45Drives; unfortunately, I spent nearly 12 weeks speaking with them about this problem with no resolution. That was about 3 months ago, as we just went through a restructuring which caused some delays. I just finished testing on our 2nd cluster (I was able to remove the load) and got around 2.5 GB/s read and 900 MB/s write when using 16 threads. Unfortunately, I cannot achieve those write numbers on the primary. I have to ask whether you think this may be due to the fact that 100% of the SSDs on our secondary have a journal disk. Please let me know if you think this may be the reason.

Thanks,

SPraus

Hi,

We have exactly the same problem. A write benchmark on one of the Ceph nodes gives a write speed of 500 MB/s, and iperf from ESXi to a Ceph node and vice versa reaches maximum speed (a typical check is sketched below).
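(A minimal iperf3 sketch of such a check; the address and flags are examples only:)

# on the Ceph node
iperf3 -s
# from the ESXi side, 4 parallel streams for 30 seconds
iperf3 -c 192.168.10.11 -P 4 -t 30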

Write within a VM swings between 20 and 100 MB/s.

Writing with dd from an ESXi node to the mounted directory gives even worse values. For testing, I created an NFS export and just moved a VM there: 10 MB/s.
When the move is done, I will test the speed inside the VM, but I already conclude that PetaSAN and VMware are not meant for each other.
The same hardware (and not junk) running vSAN writes at 450-500 MB/s and reads at over 1.2 GB/s...

 

Ceph Nodes
2 x AMD EPYC 7252 8-Core Processor
128 GB RAM
8 x HGST 6 TB HDD
4 x 1 TB NVMe as journal
10 Gbit/s network

Network Hardware
Juniper EX4550

ESXi
2 x AMD EPYC 7252 16-Core Processor
512 GB RAM

EDIT: the move is done

erc ~ # dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 14.2822 s, 75.2 MB/s

Write within a VM swings between 20 and 100 MB/s.

How is this tested? Is this a file copy? Do you use a testing tool?