
Removing journals from live pool


I made a bad call putting a cluster into production with some very poorly performing SSDs as the journals. I can't get single IO streams, such as a file copy to the cluster, to perform any better than 20-30MB/s. Here are some details about the cluster:

  • PetaSAN 2.0
  • 4 nodes, dual Xeon E5-2690 v2
  • 128GB DDR3
  • 24x 3TB 7200 RPM SAS2 drives per node via an LSI 9211-8i in IT mode
  • 2x 512GB NVMe (Samsung 870 Pro) journals per node via PCIe-to-M.2 cards, so 12 spindles per journal (I know, also not ideal, but the SSDs seemed to be keeping up during testing)
  • 4x 10GbE ports per node, plus a 2x 1GbE LAG configured as the management network. There are two 10GbE LAGs for the iSCSI and Cluster networks (i.e., iSCSI path 1 and Cluster network 1 on one LAG, iSCSI 2 and Cluster 2 on the other)

The cluster benchmarks as follows for 4M throughput, 16 threads, for 1 min: write 1100MB/s, read 1400MB/s. Memory util on all nodes is ~40%, CPU util on all nodes is below 20%, network is ~25%, and disk util is under 40% across the board. This benchmark was run on a live system with ~70 VMs running, and these percentages are all MAX values, not averages.

I'm thinking I would be better off just banking on spindle speed rather than the journals; that way I should see at least 100MB/s, since these disks can write around 200MB/s individually and I'm only using 2 replicas. Am I correct in that logic?

If that is correct, how do I go about removing the journals from the pool? Would the best way be to reweight all OSDs on one node, remove them all, remove the journal, then re-add all OSDs one node at a time?
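
For reference, this is roughly the drain-and-remove sequence I have in mind per node, straight from the Ceph CLI. It's just a sketch I haven't run, assuming that node's OSDs are osd.0 through osd.23 and that manual ceph commands don't conflict with how PetaSAN manages its OSDs:

# gradually move data off one node's OSDs (osd IDs are placeholders for that node's 24 OSDs)
for id in $(seq 0 23); do ceph osd crush reweight osd.$id 0; done

# watch "ceph -s" until recovery finishes, then take the OSDs out and remove them
for id in $(seq 0 23); do
    ceph osd out osd.$id
    systemctl stop ceph-osd@$id              # run on the OSD node itself
    ceph osd purge osd.$id --yes-i-really-mean-it
done

Then I would pull the journals and re-add the OSDs through the UI before moving on to the next node.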

As an aside, other than these performance issues, PetaSAN is FANTASTIC. I'm really looking forward to upgrading to 2.1 after testing, and really looking forward to future releases especially once snapshotting is available. I really can't stress how much I love this software, keep up the EXCELLENT work!!!

 

Can you run a 4M benchmark with 1 thread for 1 min, choosing 1 client node? What is the read/write bandwidth?

Can you run 4M and 4K benchmarks with 64 threads for 5 min, choosing 2 nodes as clients? What are the read/write bandwidth and IOPS, and what are the CPU %util and disk %util for both journals and OSDs?

When not running the benchmark, from the charts, what throughput and IOPS do your production VMs put on the cluster?

When you do a file copy operation, how is it done? From a VM or from an external system, and if the latter, what type: Windows/Linux/ESXi?

For 4M / 1 thread / 1 min / 1 node: write 76MB/s, read 141MB/s

For 4M / 64 threads / 5 min / 2 nodes: write 2475MB/s, read 1771MB/s, CPU % max on both selected nodes ~20%, disk % max on both <70%

For 4K / 64 threads / 5 min / 2 nodes: write 8983 IOPS, read 9282 IOPS, CPU % max on both ~50%, disk % max on both ~50%

I'm not seeing any details on journal usage in the benchmark results; am I not looking in the right place?

Idle throughput with all the VMs is averaging 10MB/s r/w. IOPS averaging 200 r/w.

When performing a copy, I am copying from a local disk (an SSD, verified to read at 600MB/s) on a Hyper-V host to the cluster via iSCSI. There are 5 hosts configured in a failover cluster with MPIO, and I have verified there are 8 functional paths. Each host has a LAG with 2x 10GbE.

Thanks!

You can get the %util for the journals from the node stats of the nodes that were not chosen as clients; make sure you select the correct time period, which should be apparent from the surge in bandwidth/IOPS.
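
If you prefer a shell-level spot check, iostat from the sysstat package shows the same %util per device. For example, on an OSD node (assuming sysstat is installed and your journals appear as nvme0n1/nvme1n1; adjust the device names to your setup):

# watch the %util column while the benchmark is running
iostat -xd 2 nvme0n1 nvme1n1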

Oh, I see, you mean just grab that from the dashboard graphs...

The NVMes are maxed out at 99-100% during the benchmark.

From the benchmark, overall performance is OK, and it does scale with many IO streams. The NVMes are a bottleneck if you want the cluster to scale higher.
For single-stream performance: first, remember Ceph is not like RAID; your single-stream performance will be less than your disk speed. To get better performance you either need an all-SSD cluster or, in the case of spinners, a controller with write cache. If you have access to a controller with cache (I don't think the LSI 9211 has one), you should try this before changing your journals.

Your current 76/141 MB/s write/read benchmark result for a single thread is low; I believe this correlates with the file copy speed you see. Note the cluster benchmark uses a standard 4M block size, while a Windows copy uses 256K or 512K. You can see the effect of block size on your existing cluster by creating a test disk image-000XX via the UI and doing rbd-level tests:

rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size 256K --io-pattern rand --rbd_cache=false --cluster xxx
rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size 512K --io-pattern rand --rbd_cache=false --cluster xxx
rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size 4M --io-pattern rand --rbd_cache=false --cluster xxx

You will get lower speed at lower block sizes; the effect is more pronounced with spinning disks without write-back cache due to their high latency. For spinners, write-back cache can boost write performance by 4-5x at small block sizes (32-64K).
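
If you want to run the three tests above in one pass, a small loop over the block sizes works; this is just a convenience wrapper around the same commands (image-000XX and the --cluster name are placeholders from your setup):

# sweep write block sizes against the same test image
for bs in 256K 512K 4M; do
    echo "=== io-size $bs ==="
    rbd bench --io-type write rbd/image-000XX --io-threads=1 --io-size $bs --io-pattern rand --rbd_cache=false --cluster xxx
done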

Ok, so just to round out...

  1. You don't think the journal SSDs are currently a bottleneck?
  2. Your recommendation for gaining single-IO performance is to move from an HBA to a controller with write cache? Any recommendation on which controller?
  3. You no longer recommend a spinning cluster with SSD journals, but instead recommend either all-SSD, or all-HDD with a write-cache controller?
  4. If going all-SSD, do you recommend using journals at all?
  5. The only way to increase single-IO speed would be to go all-SSD, with enterprise-grade SSDs?

Thanks!

 

Improving performance bottlenecks can be a never-ending thing; there are many factors and it is very much hardware dependent. It also depends on how to get the best improvement quickly and without too much cost. So, to your questions:

  1. They are, if you need to increase your total cluster performance under load; they were at 100% while your OSDs were at 50%. Better journals would probably increase your single-stream performance by some factor, but I believe write cache will make a larger difference.
  2. LSI / Areca would be good. Some models require you to use RAID 0; others can use JBOD.
  3. Our hardware guide recommends all SSDs. For spinners, SSD journals will give you about 2x the speed, and write cache can give 4-5x at small block sizes (32K-64K); you need both to get decent performance with HDDs.
  4. With bluestore, most users do not use an external journal/DB disk when using SSDs; still, Red Hat recommends an external NVMe at a ratio of 1 NVMe per 4 SSDs (see the ceph-volume sketch after this list).
  5. If you have the budget; otherwise use spinners as per the above to get something halfway there.
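
For point 4, if you did add an external DB device outside the UI, a bluestore OSD with its RocksDB/WAL on NVMe is created along these lines with ceph-volume. Treat it as a sketch only: the /dev/sdb data device and /dev/nvme0n1p1 DB partition are placeholders, and PetaSAN normally creates OSDs for you.

# bluestore OSD: data on the SSD, DB/WAL on a carved-out NVMe partition (placeholder devices)
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1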

 

That all makes sense; just one more clarification on #3: you really recommend both SSD journals AND a write-cache controller for spinners?

My plan in the long run is to have a large spinning cluster for object storage (this one), and a smaller all-flash cluster for boot volumes/high-performance loads. Unfortunately, my budget isn't great, so I'm trying to take it one step at a time, with the best improvement each time.

For point 3, yes.

Another point: if you have a special case where you do not have many concurrent streams (which is what Ceph is good at) but have a small number of IO streams you want better performance from, you can:

1. Use the rbd striping feature (it requires manual image creation; see the sketch below).

2. Create a RAID 0 disk in your client OS made of several PetaSAN iSCSI disks.

But these are very special cases (for example, video streaming) and do not scale well if you have many IOs, as in a VM environment.
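
For option 1, a manually created image with striping looks roughly like this; only a sketch, with the pool/image name, size, and stripe layout as placeholders you would tune, and the same --cluster placeholder as in the rbd bench commands above:

# striped image: 64K stripe unit spread across 4 objects (names and sizes are placeholders)
rbd create rbd/stripe-test --size 100G --stripe-unit 65536 --stripe-count 4 --cluster xxx

A single write stream then fans out across stripe-count objects at once, which is where the single-stream boost comes from.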
