
PetaSAN cluster with traditional SAN Shared Filesystem

Hi,

First of all, congratulations on an amazing project. It really has big potential, and hopefully commercial support will follow so I can recommend it to my workplace.

I'm thinking of building a test cluster, but after reading some other posts here regarding performance, especially that Ceph does not perform that well for a few clients with high performance requirements, I have some questions before I start.

I'm coming from a post-production background, especially video editing and playback, which has very high throughput requirements. For example, for real-time playback of files at 4096x2048 resolution we need in excess of 1 GB/s throughput from storage, and that is for a single client doing multi-threaded async direct I/O.

Now, since we are not necessarily the target industry for this solution (most of the posts here are from people using it for VMs, and those tools, e.g. ESXi, manage clustered storage themselves), I would like to know whether your platform would work with a SAN shared filesystem such as Quantum's StorNext, Apple's Xsan, or Clustered XFS (CXFS).

These filesystems are usually tuned for high throughput. Volumes are built from multiple LUNs, and client software (usually implemented as a kernel module) does software striping across those LUNs, which are actually separate RAID logical volumes. This helps achieve high throughput for multiple clients (also known as "multiple stream playback"; storage vendor specifications are quoted in how many streams of, say, HD can be played back from the storage).

The SAN software mentioned above is also optimized for streaming workloads: it can read ahead in files in order to cache them, in addition to the prefetching done by the RAID controllers.

Knowing this, would it be possible to configure Ceph / PetaSAN in a way that gives acceptable performance?

Thank you in advance,

 

Vedran

Thanks for your comments 🙂 We are building a support portal; I will send you our support plans via email.

Yes, I have seen the comments on performance. What I can say is that there is a large range of hardware people use for PetaSAN; you can see 100x performance variation per node because of this.

Another factor, as you pointed out, is that by design and unlike RAID, for a single client stream Ceph will give at most the read performance of a single disk, and a third of that (assuming 3 replicas) for writes. In many cases it will give less than this due to the network overhead of a distributed system, especially with small random block sizes (4k-32k bytes, common in database apps), but with good hardware, and for streaming (plus copy/backup apps) which typically use larger block sizes, you can get very close to those numbers. It is quite common for a couple of streams doing large block I/Os to saturate your 10G NICs.
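
If you want to see the block size effect yourself, a quick fio run from a client against the mapped multipath device (the device path below is just an example, adjust it to your setup) will show the gap between a streaming-style and a db-style workload:

    # large sequential reads with async direct I/O, similar to a streaming workload
    fio --name=stream --filename=/dev/mapper/mpatha --direct=1 --ioengine=libaio \
        --rw=read --bs=2M --iodepth=8 --numjobs=1 --runtime=60 --time_based --group_reporting

    # small random reads, similar to a database workload
    fio --name=db --filename=/dev/mapper/mpatha --direct=1 --ioengine=libaio \
        --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting

The first job will typically get close to the sequential limit of your disks/NICs, while the second will be bound by network round trips and latency.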

We have not tested the dedicated filesystems you describe, but from the sound of it, if they can split a single client stream into multiple parallel streams, then that would be the magic bullet to get the per-stream numbers you want.

So I would say with good hardware it will be possible. To get really high per-stream bandwidth you can also use some of the new NVMe drives; in that case the sky is the limit, or rather your NICs will be.

Hi,

That sounds good, and I forgot to mention that our workload block sizes are somewhere between 256K and 2M, so multi-threaded parallel I/O could in theory achieve high throughput for many clients. Of course everything depends on the number of spindles and the compute power of the hosts. The best part of Ceph is that we can scale linearly. But yes, you are right, the SAN software we use does exactly that: it runs multiple streams in parallel (since its volume is built from many LUNs).
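
Just as a rough back-of-envelope on why our block sizes help (using the ~1 GB/s per-stream target I mentioned earlier, approximate numbers):

    1 GB/s at 2 MB per request  ->  ~500 requests/s per stream
    1 GB/s at 4 KB per request  ->  ~250,000 requests/s per stream

So a single stream at our block sizes only needs a few hundred requests per second in flight, which is a very different problem from a 4K random workload.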

Since I was not familiar with Ceph, I did some research and found out that it used to perform very slowly for the RADOS block device, at least for workloads that need a high number of IOPS (one example is an OpenStack Summit talk where EMC demonstrated the difference between ScaleIO and Ceph at the time), more specifically because of the double writes it had to do on top of an underlying POSIX filesystem (XFS). That seems to be improved in the BlueStore engine, but I saw you recommend on the forum that BlueStore is better suited to all-SSD/NVMe configurations? That would leave us with the old FileStore engine, which apparently does not perform well...

In regard to the SAN software we use, how do you recommend configuring PetaSAN in order to have multiple LUNs? Would it be multiple volumes visible on different paths, or all LUNs visible on all paths?

Also, is there a way to change how many dedicated iSCSI networks we can create (from 2)? As I understand it, since I cannot achieve more than single-disk speed for single-threaded I/O, what happens if multi-threaded I/O goes over the same path? Maybe I'm wrong, but I'm worried that I could fill the 10 GbE bandwidth of a single port with just a single iSCSI path.

Finally, can you recommend a hardware configuration tuned for the high bandwidth requirements of multiple clients? We have a storage appliance here with about 36 SATA 7200 rpm drives that can achieve 6 GB/s sustained aggregate throughput, but usually we don't stress it nearly that much.

Thanks

The EMC comparison on small block size (4k random) IOPS was rather old; shortly afterwards a bug was discovered in the memory allocation library tcmalloc, and using jemalloc gave 4.2x better results:
https://ceph.com/planet/the-ceph-and-tcmalloc-performance-story/
PetaSAN up to v2.0 uses jemalloc; with 2.0, tcmalloc gives the same performance.
Aside from this, things have changed quite a lot, with many developers from flash companies enhancing the Ceph code; now some of these companies (Intel/SanDisk/Samsung/Micron) show IOPS benchmarks in the 1M+ range.

For max performance use 8 paths per iSCSI LUN; your client initiator will perform I/O in parallel across all 8, which means the load will be distributed over 8 nodes in parallel. You still use 2 subnets for MPIO, but you will have 4 concurrent connections on each; if you need a bigger network pipe, use 25/40G or NIC bonding on your 10G. So your single client initiator will be accessing the LUN over 8 storage nodes. In some applications that use a clustered filesystem (such as ESXi/Hyper-V) you can have initiators on different client machines accessing the LUN in parallel, but this is application and filesystem dependent.
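
As a sketch of what that looks like from a Linux client (the portal IPs are placeholders, and these are generic open-iscsi / dm-multipath settings rather than anything PetaSAN specific):

    # discover and log in to the target portals on both iSCSI subnets
    iscsiadm -m discovery -t sendtargets -p 10.0.1.10
    iscsiadm -m discovery -t sendtargets -p 10.0.2.10
    iscsiadm -m node --login

    # /etc/multipath.conf - spread I/O round-robin across all discovered paths
    defaults {
        path_grouping_policy  multibus
        path_selector         "round-robin 0"
        rr_min_io_rq          1
    }

Keep in mind 2 x 10G still caps a single client at roughly 2 to 2.5 GB/s of wire bandwidth, which is where the 25/40G or bonding suggestion comes in.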

BlueStore is designed with flash in mind. If you use spinning disks + SSD WAL/DB, you still need a controller with a write-back cache for VM/database workloads; for large block sizes you may not require this and a JBOD HBA may do.
Most new Ceph installs are all-flash, and in a year that will be the large majority. I would still recommend it if you can; good hardware will make a big difference.
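
For reference, this is the kind of OSD layout I mean by spinning data disk + SSD WAL/DB. PetaSAN handles OSD creation itself so you would not run this by hand, but in plain Ceph terms it corresponds to something like the following (device names here are just examples):

    # one OSD per HDD, with its BlueStore DB (and WAL) on a partition of a shared SSD/NVMe
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
    ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p2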

Hey, thanks for the recommendation. So I should use multiple volumes (more LUNs) and at least an 8-node cluster?

A 1M+ IOPS benchmark is not bad at all 🙂

Cheers,

Vedran

I cannot say too much about how the filesystems you describe split a logical stream into multiple volumes, they may have limits, etc., but generally the more volumes the better. From the PetaSAN side these volumes are viewed as separate/unrelated LUNs. 8 nodes will be great if you can (the more the better), but even with fewer nodes you will get a benefit; for example, each path connection to a LUN from a client will be assigned to a CPU core.