
Removing journals from live pool


I made a bad call putting a cluster into production with some very poorly performing SSDs as the journals. I can't get single IO streams, such as a file copy to the cluster, to perform any better than 20-30MB/s. Here are some details about the cluster:

  • PetaSAN 2.0
  • 4 nodes, dual Xeon E5-2690 v2
  • 128GB DDR3
  • 24x 3TB 7200 RPM SAS2 drives per node via an LSI 9211-8i in IT mode
  • 2x 512GB NVMe (Samsung 870 Pro) journals per node via PCIe-to-M.2 cards, so 12 spindles per journal (I know, also not ideal, but the SSDs seemed to be keeping up during testing)
  • 4x 10GbE ports per node, plus a 2x 1GbE LAG configured as the management network. There are two 10GbE LAGs for the iSCSI and Cluster networks (i.e., iSCSI path 1 and Cluster network 1 on one LAG, iSCSI 2 and Cluster 2 on the other)

The cluster benchmarks as follows for 4M throughput, 16 threads, for 1 min: write 1100MB/s, read 1400MB/s. Memory util on all nodes is ~40%, CPU util on all nodes is below 20%, network is ~25%, and disk util is under 40% across the board. This benchmark was run on a live system with ~70 VMs running, and these percentages are all MAX values, not averages.

I'm thinking I would be better off just banking on spindle speed rather than the journals; that way I should see at least 100MB/s, since these disks can write around 200MB/s individually and I'm only using 2 replicas. Am I correct in that logic?

If that is correct, how do I go about removing the journals from the pool? Would the best way be to reweight all OSDs on one node, remove them all, remove the journal, then re-add all OSDs one node at a time?
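
For reference, this is roughly the drain-and-remove sequence I have in mind per node, straight from the Ceph CLI. It's just a sketch I haven't run, assuming that node's OSDs are osd.0 through osd.23 and that manual ceph commands don't conflict with how PetaSAN manages its OSDs:

# gradually move data off one node's OSDs (osd IDs are placeholders for that node's 24 OSDs)
for id in $(seq 0 23); do ceph osd crush reweight osd.$id 0; done

# watch "ceph -s" until recovery finishes, then take the OSDs out and remove them
for id in $(seq 0 23); do
    ceph osd out osd.$id
    systemctl stop ceph-osd@$id              # run on the OSD node itself
    ceph osd purge osd.$id --yes-i-really-mean-it
done

Then I would pull the journals and re-add the OSDs through the UI before moving on to the next node.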

As an aside, other than these performance issues, PetaSAN is FANTASTIC. I'm really looking forward to upgrading to 2.1 after testing, and really looking forward to future releases especially once snapshotting is available. I really can't stress how much I love this software, keep up the EXCELLENT work!!!

 

Can you run a 4M benchmark with 1 thread for 1 min, choosing 1 client node? What is the read/write bandwidth?

Can you run 4M and 4K benchmarks with 64 threads for 5 min, choosing 2 nodes as clients? What are the read/write bandwidth and IOPS, and what are the CPU %util and disk %util for both journals and OSDs?

When not running the benchmark, from the charts, what throughput and IOPS do your production VMs put on the cluster?

When you do a file copy operation, how is it done? From a VM or from an external system, and if the latter, what type: Windows/Linux/ESXi?

For 4M / 1 thread / 1 min / 1 node: write 76MB/s, read 141MB/s

For 4M / 64 threads / 5 min / 2 nodes: write 2475MB/s, read 1771MB/s, CPU % max on both selected nodes ~20%, disk % max on both <70%

For 4K / 64 threads / 5 min / 2 nodes: write 8983 IOPS, read 9282 IOPS, CPU % max on both ~50%, disk % max on both ~50%

I'm not seeing any details on journal usage in the benchmark results; am I not looking in the right place?

Idle throughput with all the VMs is averaging 10MB/s r/w. IOPS averaging 200 r/w.

When performing a copy, I am copying from a local disk (an SSD, verified to read at 600MB/s) on a Hyper-V host to the cluster via iSCSI. There are 5 hosts configured in a failover cluster with MPIO, and I have verified there are 8 functional paths. Each host has a LAG with 2x 10GbE.

Thanks!

You can get the %util for the journals from the node stats of the nodes that were not chosen as clients; make sure you select the correct time period, which should be apparent from the surge in bandwidth/IOPS.
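
If you prefer a shell-level spot check, iostat from the sysstat package shows the same %util per device. For example, on an OSD node (assuming sysstat is installed and your journals appear as nvme0n1/nvme1n1; adjust the device names to your setup):

# watch the %util column while the benchmark is running
iostat -xd 2 nvme0n1 nvme1n1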

Oh, I see, you mean just grab that from the dashboard graphs...

The NVMes are maxed out at 99-100% during the benchmark.

From the benchmark, overall performance is OK, and it does scale with many IO streams. The NVMes are a bottleneck if you want the cluster to scale higher.
For single-stream performance: first, remember Ceph is not like RAID; your single-stream performance will be less than your disk speed. To get better performance you either need an all-SSD cluster or, in the case of spinners, a controller with write cache. If you have access to a controller with cache (I don't think the LSI 9211 has one), you should try this before changing your journals.

Your current 76/141 MB/s write/read benchmark result for a single thread is low; I believe this correlates with the file copy speed you see. Note the cluster benchmark uses a standard 4M block size, while a Windows copy uses 256K or 512K. You can see the effect of block size on your existing cluster by creating a test disk image-000XX via the UI and doing rbd-level tests:

rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size 256K --io-pattern rand --rbd_cache=false --cluster xxx
rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size 512K --io-pattern rand --rbd_cache=false --cluster xxx
rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size 4M --io-pattern rand --rbd_cache=false --cluster xxx

You will get lower speed at lower block sizes; the effect is more pronounced with spinning disks without write-back cache due to their high latency. For spinners, write-back cache can boost write performance by 4-5x at small block sizes (32-64K).
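
If you want to run the three tests above in one pass, a small loop over the block sizes works; this is just a convenience wrapper around the same commands (image-000XX and the --cluster name are placeholders from your setup):

# sweep write block sizes against the same test image
for bs in 256K 512K 4M; do
    echo "=== io-size $bs ==="
    rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size $bs --io-pattern rand --rbd_cache=false --cluster xxx
done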

Ok, so just to round out...

  1. You don't think the journal SSDs are currently a bottleneck?
  2. Your recommendation for gaining single-IO performance is to move from an HBA to a controller with write cache? Any recommendation on which controller?
  3. You no longer recommend a spinning cluster with SSD journals, but instead recommend either all-SSD, or all-HDD with a write-cache controller?
  4. If going all-SSD, do you recommend using journals at all?
  5. The only way to increase single-IO speed would be to go all-SSD, with enterprise-grade SSDs?

Thanks!

 

Improving performance bottlenecks can be a never-ending thing; there are many factors and it is very much hardware dependent. It also depends on how to get the best improvement quickly and without too much cost. So, to your questions:

  1. They are, if you need to increase your total cluster performance under load; they were at 100% while your OSDs were at 50%. Better journals would probably increase your single-stream performance by some factor, but I believe write cache will make a larger difference.
  2. LSI / Areca would be good. Some models require you to use RAID 0; others can use JBOD.
  3. Our hardware guide recommends all SSDs. For spinners, SSD journals will give you about 2x the speed, and write cache can give 4-5x at small block sizes (32K-64K); you need both to get decent performance with HDDs.
  4. With bluestore, most users do not use an external journal/DB disk when using SSDs; still, Red Hat recommends an external NVMe at a ratio of 1 NVMe per 4 SSDs (see the ceph-volume sketch after this list).
  5. If you have the budget; otherwise use spinners as per the above to get something halfway there.
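
For point 4, if you did add an external DB device outside the UI, a bluestore OSD with its RocksDB/WAL on NVMe is created along these lines with ceph-volume. Treat it as a sketch only: the /dev/sdb data device and /dev/nvme0n1p1 DB partition are placeholders, and PetaSAN normally creates OSDs for you.

# bluestore OSD: data on the SSD, DB/WAL on a carved-out NVMe partition (placeholder devices)
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1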

 

That all makes sense; just one more clarification on #3: you really recommend both SSD journals AND a write-cache controller for spinners?

My plan in the long run is to have a large spinning cluster for object storage (this one), and a smaller all-flash cluster for boot volumes/high-performance loads. Unfortunately, my budget isn't great, so I'm trying to take it one step at a time, with the best improvement each time.

For point 3, yes.

Another point: if you have a special case where you do not have many concurrent streams (which is what Ceph is good at) but have a small number of IO streams you want better performance from, you can:

1. Use the rbd striping feature (it requires manual image creation; see the sketch below).

2. Create a RAID 0 disk in your client OS made of several PetaSAN iSCSI disks.

But these are very special cases (for example, video streaming) and do not scale well if you have many IOs, as in a VM environment.
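
For option 1, a manually created image with striping looks roughly like this; only a sketch, with the pool/image name, size, and stripe layout as placeholders you would tune, and the same --cluster placeholder as in the rbd bench commands above:

# striped image: 64K stripe unit spread across 4 objects (names and sizes are placeholders)
rbd create rbd/stripe-test --size 100G --stripe-unit 65536 --stripe-count 4 --cluster xxx

A single write stream then fans out across stripe-count objects at once, which is where the single-stream boost comes from.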
