Tuning advice
jeenode
27 Posts
March 14, 2018, 6:21 pm
Hi,
We are playing with a 2-node PetaSAN 2.0 setup with 38 disks in total, and its performance serving VMware datastores doesn't seem very impressive.
The setup is as follows:
- 2x Supermicro nodes, each with 2x CPU; one node has 8 physical cores in total (16 hyperthreaded), the other 12 physical cores (24 hyperthreaded)
- Each node runs ESXi and PetaSAN VMs:
- Separate management VMs (3)
- Separate disk storage VMs, one on each physical box, with a PCI passthrough controller for access to the disks. Each node has a total of 19 disks, 2.5" Seagate Constellation series, a mix of 500 GB and 1 TB. Each storage VM has 48 GB of RAM reserved for it
- Separate iSCSI VMs, one per physical host, 4 vCPU, 16 GB RAM (reserved)
- Did the following tuning:
- Set the tuning profile to "Low End Hardware" - maybe should have used "Mid-Range"?
- Set up a manually created RBD image with 32K striping
- VMware: set up round robin with IOPS limit = 1 (example commands for both at the end of this post)
The internal benchmarks seem quite good, but now that I am moving 3 VMs to the datastore backed by the setup above, the total write speed is at 35 MB/s (10-12 MB/s per VM).
We don't have any SSD caching set up, but with that many spindles I was expecting a bit more.
Any ideas as to where to look? Maybe try filestore instead of Bluestore? Anything else?
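For reference, the 32K striping and the round-robin path policy were set up with commands roughly like these (the image name, size, stripe count and the naa device ID are placeholders, not our exact values):
rbd create rbd/datastore01 --size 2T --stripe-unit 32768 --stripe-count 16
esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type=iops --iops=1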
admin
2,930 Posts
March 14, 2018, 8:32 pm
My recommendation: if you can, use a controller with a battery-backed write-back cache; this will improve performance by several factors. Next, using external SSDs for wal/db will increase speed by about 2x. If you do not use a write-back cache, you may find filestore with external SSD journals gives better results than Bluestore.
The internal benchmarks use 4M block sizes for throughput and 4k for iops, the standard block sizes for such tests. Spinning disks do have good 4M performance but low 4k values (for example, multiply your iops value by 4k and it will not give a high MB/s). Your VMs are probably writing at around 64K block size (divide your throughput by your iops, both shown on the dashboard, to get your average block size), which is far more sensitive to iops/latency on spinning disks.
If you look at your disk iops and disk utilization charts, are they high?
The more VM load you add to a spinning-disk cluster, the higher the total performance you will see, because the disks can re-order the ios to reduce seeks, but this will not improve per-VM speed.
In general, all io in Ceph becomes random (even if clients were doing sequential writes), so at small block sizes performance is governed by disk latency. Moreover, with Bluestore a single io operation requires many additional operations to write metadata to its database, so pure spinning disks suffer a lot from latency/disk seeks. A pure-SSD Bluestore will beat filestore; otherwise, try a controller with a write-back cache.
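For reference, outside the PetaSAN UI a Bluestore OSD with its DB/WAL placed on an SSD would be created with something along these lines on recent Luminous releases (device names are illustrative, and in practice PetaSAN manages OSD creation itself):
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1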
Last edited on March 14, 2018, 8:36 pm by admin · #2
jeenode
27 Posts
March 14, 2018, 8:59 pm
Thanks for your quick reply.
We do not have controllers with write cache, so we will look into using SSDs for WAL + RocksDB. What SSD-to-spindle ratio would you recommend?
admin
2,930 Posts
March 14, 2018, 9:29 pm
The general recommendation is 1 SSD per 4 HDDs.
Also, in your case, do compare with v1.5: filestore with SSD journals.
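As a rough sketch of the 1:4 ratio, a single SSD serving four HDD OSDs could be split into four equal DB/journal partitions, for example (device name and the equal split are illustrative):
parted -s /dev/nvme0n1 mklabel gpt
parted -s /dev/nvme0n1 mkpart db1 0% 25%
parted -s /dev/nvme0n1 mkpart db2 25% 50%
parted -s /dev/nvme0n1 mkpart db3 50% 75%
parted -s /dev/nvme0n1 mkpart db4 75% 100%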
jeenode
27 Posts
March 16, 2018, 3:46 pm
Thanks.
How can I use filestore with PetaSAN 2.0? Do I need to change ceph.conf so the default object store is filestore, or manually create the OSDs?
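(The ceph.conf setting I had in mind is something like the following, though I am not sure whether PetaSAN would honour it:)
[global]
osd objectstore = filestore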
admin
2,930 Posts
March 16, 2018, 6:01 pm
You can only create Bluestore OSDs from PetaSAN 2.0; for filestore you need v1.5. If you are benchmarking, do a v1.5 test and, when done, choose a "fresh" install of 2.0 rather than an upgrade, else it will preserve the existing filestore OSDs.
Last edited on March 16, 2018, 6:02 pm by admin · #6
gmc
2 Posts
March 26, 2018, 11:45 pm
Hi, I just wanted to chime in here to say that I have seen similar performance characteristics in my testing on 2.0.
My test setup was a virtual 3-node cluster on separate ESXi hosts, with 6 disks in each VM actually hosted on a backend SAN stack. VMs with 8 vCPU and 16 GB RAM.
VMware/storage is all 10Gb-connected, with jumbo frames enabled.
Note that this was an initial test install just to have a look at PetaSAN; I know that it isn't designed to work this way.
Internal benchmarking of the Ceph cluster was good, delivering the iops and throughput that I expected given the test infrastructure and SAN-based disks: about 5Gbps and 10000 iops.
However, when connecting either a separate ESXi host or a Windows host to the Ceph storage via iSCSI, I also saw 35MBps throughput and low iops.
I am the perfect target market for PetaSAN (i.e. a Microsoft/VMware/network corporate dude) rather than a Linux admin, and although I have access to guys in the same room who are awesome Linux admins managing hundreds of systems, including large-scale storage, we haven't done any troubleshooting due to "insert appropriate reason here".
Happy to help if needed, but would also welcome any advice.
gmc
admin
2,930 Posts
March 27, 2018, 8:41 am
Use a controller with a write-back cache; this will boost performance by several factors, or better yet use an all-SSD solution. Use real hardware with disks in JBOD, no RAID or SAN backstores.
There are 2 factors: small-block-size performance and single-threaded performance.
Disk latency will affect your small block sizes. The 5Gbps and 10000 iops figures are measured with 4M and 4k block sizes, which are standard test sizes. At 10K iops with 4k blocks, this translates to only 40 MBps of throughput. The wide variance from 40 MBps to 5Gbps spans the block size range of 4k to 4M, so if your ESX/Windows clients were writing at 64K, for example, you would not reach 5Gbps but rather something like 300 MBps in total.
Single-threaded performance: Ceph scales well with concurrent access, but for 1 (or a few) concurrent ios it will not give full or high performance, unlike RAID. You probably ran your internal benchmarks with a high thread count (64?); if you run them with 1 thread you will get much lower numbers. If you are able to run multiple ESX VMs / Windows workloads with high concurrency, you will get close to the internal performance for the same block size.
The effect of small-block disk latency is not as apparent on RAID devices, where the controller is able to re-order/assemble small blocks into contiguous larger writes. In Ceph the io pattern is more complex: each io operation requires many supporting ios (to read/write the RocksDB database), and the randomness is very high even for sequential client io. An all-SSD/NVMe solution, or HDDs behind a write-back cache, will solve this.
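To see both effects directly from a PetaSAN node, you could run rados bench by hand with different block sizes and thread counts, along these lines (the pool name is just an example):
rados bench -p rbd 30 write -b 4194304 -t 64 --no-cleanup
rados bench -p rbd 30 write -b 65536 -t 1 --no-cleanup
rados -p rbd cleanup
The first run uses 4M blocks with 64 threads, close to the built-in benchmark; the second uses 64K blocks with a single thread, which is roughly what one VM issuing 64K writes sees.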
Last edited on March 27, 2018, 9:24 am by admin · #8
gmc
2 Posts
March 28, 2018, 6:23 am
Thanks for the clear reply. Using the Windows DiskSpd tool I've been able to replicate the scenarios you've described.
I have a 4-node Supermicro SuperServer here on the bench with NVMe and spinning disk, so if I get a chance before it is retasked, I will run PetaSAN up on it and test again.
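For anyone wanting to reproduce the comparison, DiskSpd invocations along these lines show the difference between a single outstanding IO and a highly concurrent run at the same block size (the target file, test size and duration are just examples):
diskspd.exe -b64K -d60 -o1 -t1 -w100 -Sh -c10G E:\testfile.dat
diskspd.exe -b64K -d60 -o32 -t8 -w100 -Sh -c10G E:\testfile.dat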
Thanks again
gmc