
Design question

Hi, I've played with Ceph before, but now I found PetaSAN and installed a demo cluster in VMs, and it was really smooth and hassle-free. I'm impressed!

We are considering a PetaSAN Ceph cluster for a couple of VMware vSphere clients (RBD+iSCSI). We have 2 small datacenters, both active, connected by 2 different links and a couple of km apart, so we think we'd create a single cluster over both DCs: 6 nodes altogether, 3 in each DC.

The proposed node config (not finalized yet) would be:
- 4x 10G NICs - 2x iSCSI + 2x cluster
- 1x 1G mgmt
- 16-core CPU + 64 GB RAM
- 1 or 2 NVMe disks for fast pool (F)
- 4-6 HDDs (8 TB) for slower pool (S) with SSD/NVMe cache
- 2 small SSDs for system

The vSphere setup is such that you can run any VM on any host/DC at any time.

For the 2 pools the CRUSH rules would be:
- F(ast) pool would be for important stuff and should stay available in the unlikely event of losing one DC.
- S(lower) pool would be an EC k+m pool for less important stuff.
Both pools would have a per-server constraint (different servers for PG placement).

Now I have some questions about the optimal (right) way to go:

1. For pool F, if I want it to be DC-disaster resilient, the only way to do this would be Rep4 (2+2, or higher), as Ceph puts the pool into read-only mode when only one replica is alive (in the case of Rep3 with the DC holding 2 of the replicas going down). Is that correct?

Is there another/better policy to go with for covering a DC disaster?

2. If we have 6 nodes and do EC 4+2 on pool S, that means the pool is degraded whenever one server is down. Is that something one should avoid (not having enough nodes to relocate copies to), or are you OK with this if you expect to get it fixed reasonably soon?
The idea here is that expansion should be done with additional nodes, but given the huge amount of disk space we'd be getting, I don't know if that will be happening soon.

3. The clients in both DCs will be accessing the same disks/RBDs all the time, as any VM can be run from any vSphere host.
Is it possible to have some kind of CRUSH-like rules for iSCSI path placement? I'd like to create perhaps 4 paths, have 2 per DC, and try to minimize the inter-DC communication (keep DC1 clients sticking to DC1 paths) when everything is up.
I can use the fixed path policy on vSphere, but then I can't utilize the 1-IO round-robin trick, right?
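For reference, by the 1-IO round-robin trick I mean switching the Round Robin path policy to change paths after every single IO instead of the default 1000 IOs, roughly like this on an ESXi host (the naa.* device ID is just a placeholder):

esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1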

4. I can't seem to find any sizing guide for the SSD cache for an HDD pool anywhere. I understand the SATA bus throughput part, but I have no pointers on how much space such a cache should have. Is it utilized as one common cache device for all HDDs, or do you have to partition it into as many parts as there are HDDs and give each one its own cache?

5. On the same track, the rule of thumb of one SSD cache per 4 HDDs relates to 550 MB/s SATA SSD vs 150 MB/s HDD throughput? Does that mean that if you use an NVMe cache you can have one for ~10+ HDDs? (strictly a theoretical question, not applicable here)

6. For the Ceph data network, is it better to have one network and use bonded NICs, or 2 separate networks with single NICs, or does it not really matter?

Thank you very much for any clarification.

Cheers, Jure

1- Yes, rep x4 is what you would need.

One tricky thing is how to maintain quorum for the Ceph mons in a 2-center setup: you would have to set up 1 of the mons, the first node, to be highly available across both centers, either via an HA setup with DRBD or by using a VM in an HA hypervisor setup with external storage.

The other thing is to make sure latency across the centres is not high, or it will impact latency/IOPS.

2- EC 4+2 will allow up to 2 host failures. The cluster will be degraded after any failure, which it will start to recover from by re-creating the lost data copies; how long this takes depends on the amount of data to copy and the backfill speed setting you can control. You cannot make EC 4+2 resilient to the failure of 1 DC; this would require a 2+2 or 2+4 profile, but in such a case EC will not be beneficial over a replicated pool.
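For reference, PetaSAN builds EC pools from its UI, but the underlying Ceph setup for a 4+2 profile with a per-host failure domain would look roughly like this (profile/pool names and PG counts are just examples):

ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create pool-s 256 256 erasure ec-4-2
ceph osd pool set pool-s allow_ec_overwrites true    # needed if RBD image data is placed on the EC pool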

3- We do not have a way to direct iSCSI across your centres in an automated way; however, you can use the path assignment page to manually distribute the paths in the manner required.

4- Typically you would use 4 partitions on the SSD as cache, each serving 1 HDD OSD. The larger the partition size the better, but the cache needs roughly 2% of its size in RAM to operate, so do not create it too large.

5- It is not recommended to go over 8 partitions (ideally 2-4). Note that the write cache helps with IOPS/latency, but in the case of large-block-size IO, like doing backups, your NVMe write throughput may be the limiting factor if you have 10 HDDs or more.
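As a rough sanity check on the numbers you quoted: 4 HDDs x ~150 MB/s ≈ 600 MB/s, which is about what a ~550 MB/s SATA SSD can absorb, hence the 1:4 rule of thumb; 10 HDDs x ~150 MB/s ≈ 1500 MB/s, so an NVMe cache would need to sustain roughly that much in steady-state writes to keep up with large sequential IO. Actual sustained write rates vary a lot between NVMe models, so treat this as an estimate only.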

Thanks a lot for your answer.

ad 1. I guess one could have 5 mons so you don't get left with just one active in case of a crash. Yes, latency is a known unknown at the moment :-)

ad 2. The second pool would not need to be DC-failure resilient. The thing I'm wondering about is whether it's an acceptable policy to have an EC rule that uses all the nodes, so that with one node down there is no spare to relocate the missing data to. I guess that's more of a "political" decision than a technical one.

Regarding SSD caching, in the Ceph docs I've found "Cache tiering will degrade performance for most workloads." What is your experience with this? Do you see any benefits from SSD caching in real life (iSCSI/ESX scenarios), or would you recommend starting without an SSD cache and applying it later if it turns out the performance is not OK?

Thank you

BR Jure


It is not just that you need more than 1 monitor; that by itself is not enough. You need a quorum, such as 2 out of 3 total or 3 out of 5 total. This is why you need to make the first host highly available. The quorum is needed by both the Ceph mons and the Consul servers.
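To put numbers on it: quorum needs a strict majority, i.e. floor(n/2) + 1 monitors. With 5 mons split 3+2 across the two centers, losing the center that holds 3 mons leaves only 2 alive, which is below the required 3, and the cluster becomes unavailable. That is why the deciding mon (the first node) has to be able to survive, or fail over, across both centers.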

You do not have to store more replicas per center; you can create a custom CRUSH rule to specify how many replicas to store in each center. Note that EC is good at requiring less storage overhead for most profiles, at the expense of performance; however, to have the data available on both centres you already need a profile that at least doubles the storage, so you need to consider whether EC is beneficial in this case.
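As a sketch of such a rule (this assumes your hosts are grouped under two datacenter buckets in the CRUSH map; the names and rule id are just examples), a replicated rule that places 2 copies in each center could look like:

rule replicated_2dc {
    id 10
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

Used with size 4 and min_size 2 on the pool (for example: ceph osd pool set pool-f crush_rule replicated_2dc, then size 4 and min_size 2), the pool keeps 2 copies per center and stays writable if one center goes down.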

Yes, Ceph/Red Hat deprecated cache tiering, where you create 2 pools: a fast pool acts as a cache for a slow pool. However, PetaSAN does not use this; we use a block-level cache, the kernel-based dm-writecache, to achieve caching at the block device layer outside of Ceph.
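Conceptually, a cached OSD is just a device-mapper writecache target stacked in front of the HDD, with the OSD sitting on the mapped device rather than on the raw disk. PetaSAN sets this up for you when you add the disks, but as an illustration of the mechanism (device names are placeholders only):

# size of the slow device in 512-byte sectors
SECTORS=$(blockdev --getsz /dev/sdb)
# 's' = SSD-class cache device, 4096-byte cache block size, no extra options
dmsetup create sdb-wcache --table "0 $SECTORS writecache s /dev/sdb /dev/nvme0n1p1 4096 0"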

Ok, got it. Thank you very much for these clarifications.

Cheers, Jure