Looking for opinions/experiences
psychodad
2 Posts
November 10, 2023, 1:28 pm
Hi everyone!
We have 18 x HPE Apollo 4510 (Proliant XL450 Gen10) servers, each with:
CPU: 2 x Intel(R) Xeon(R) Silver 4210R @ 2.40GHz (total 20 cores / 40 threads)
RAM: 384 GB DDR4
OS: 2 x 960GB 22.5G SAS SSD in RAID1
DATA: 2 x HPE 6.4TB PCIe x8 MU HH DS Card NVMe + 56 x HPE 18TB 12G SAS HDD
RAID: HPE Smart Array P408i-p SR Gen10 (data disks only)
NIC: HPE 10/25Gb 2-port 640FLR-SFP28 (using teaming to get 50G)
Servers are distributed equally across 3 geographically distant datacenters (let's call them DC-1a, DC-2 and DC-3) - 6 servers per datacenter - and are running an SDS solution from a known vendor which provides S3 and NFS to virtual machines (VMs) running in our environment.
In location 1 we have 2 DCs just a few kilometers apart; the other distances are as follows:
DC-2 <-- 400 km --> DC-1a/b <-- 200 km --> DC-3
All DCs are interconnected with multiple 100G links (both in location 1 and between locations).
RTT between DC-1a/b and DC-2 / DC-3 is a bit less than 10 ms.
RTT between DC-1a and DC-1b is around 0.5 ms.
We are very disappointed with the functionality, support and (most of all!) the performance of the SDS system in general, especially when it comes to files smaller than 10 MB - both read and write speeds are catastrophically low, in the range of 1-10 kB/s, which makes it almost unusable for general-purpose / everyday use (for example, serving a Laravel-based web site from an NFS disk).
NFS speeds with bigger files (100 MB+) are OK: write maxes out at around 500 MB/s and read at around 800 MB/s.
The current SDS system is configured like this: files smaller than 60 kB use replication (original + 2 replicas), while files larger than 60 kB are erasure coded with the ARC 7+5 algorithm, with parts distributed equally among the 3 locations.
The system is so slow when hosting smaller files that our users are unable to run "ls -l" or "du -sh" on the mounted NFS share(s) in a timely fashion; commands take an eternity to finish, file-level backups fail due to the slowness, and so on.
The GUI and API are buggy, so the system can only be managed from the CLI - which wouldn't be such a problem if we hadn't lost months writing and testing scripts that were supposed to work via the API, only to finally pry out of tech support that the GUI and API are buggy and shouldn't be used at all.
Taking all that into account, we are thinking about moving all the servers into DC-1a and DC-1b (9 servers in each DC) to reduce network latencies, and also replacing the paid SDS solution with something else - hopefully PetaSAN.
Each 18 TB HDD gives around 120 IOPS (+/- 5%) according to tests.
Each 6.4TB SSD gives between 25k and 40k IOPS according to tests.
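For a rough sense of scale, here is a back-of-the-envelope estimate of raw aggregate IOPS from the per-device figures above (it ignores replication/EC overhead and the SDS software layer entirely, so treat it as an upper bound, not a measurement):

```
# Raw per-node and cluster-wide IOPS from the measured per-device figures.
# Ignores replication/EC overhead and the software stack, so an upper bound only.
hdd_count, hdd_iops = 56, 120        # ~120 IOPS per 18 TB HDD (measured)
nvme_count, nvme_iops = 2, 25_000    # conservative end of the measured 25k-40k range

per_node = hdd_count * hdd_iops + nvme_count * nvme_iops    # 6,720 + 50,000
print(f"per node : {per_node:,} raw IOPS")                  # 56,720
print(f"cluster  : {per_node * 18:,} raw IOPS (18 nodes)")  # 1,020,960
```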
Has anyone had any experience with PetaSAN on similar HW, providing S3/NFS?
Under the assumption that all servers are moved to location 1 (half in DC-1a and the other half in DC-1b), what performance can we expect for smaller files - should we expect a big performance gain or just a marginal one with the HW described above?
Less important, but under the same assumption, what performance can we expect for bigger files - would it be similar to the existing system?
Just a quick note: if we give up on the existing SDS solution and go with PetaSAN, we will be left with 6 unused ProLiant DL20 Gen10 servers from LoadBalancer.org (https://pdfs.loadbalancer.org/datasheets/hardware/Loadbalancer_Enterprise50G_Datasheet.pdf) - is there any possible use for them if we go the PetaSAN way, or will they not be needed at all?
Every bit of information or experience you can share is very welcome!
Cheers,
psychodad
Last edited on November 10, 2023, 2:30 pm by psychodad · #1
admin
2,930 Posts
November 13, 2023, 10:28 pm
Without going into specific detail, some general points:
Yes, PetaSAN is able to scale performance very well.
For HDDs, it is always recommended to use flash as a journal device, with a ratio of 1 SSD : 4 HDDs or 1 NVMe : 10 HDDs. Each HDD requires a 300 GB partition on the flash journal. Journals will reduce latency and increase IOPS by a factor of 2 to 3.
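As a quick sanity check against the hardware described in this thread (2 x 6.4 TB NVMe per node, 300 GB journal partition per HDD) - a small sketch, assuming both NVMe cards are dedicated entirely to journals:

```
# How many HDDs can the per-node NVMe capacity journal at 300 GB each?
nvme_capacity_gb = 2 * 6400      # two 6.4 TB NVMe cards per node
journal_per_hdd_gb = 300         # recommended journal partition size per HDD

max_journaled_hdds = nvme_capacity_gb // journal_per_hdd_gb
print(f"{max_journaled_hdds} of the 56 HDDs per node can get an NVMe journal")  # 42
```

In practice you would leave some headroom on the journal devices, so a slightly lower count is a reasonable working number.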
Latency for I/O on journal/HDD will be approximately 15 ms; reads, if served from the OSD cache, will be approximately 0.25 ms. In contrast, an all-flash OSD setup will give 0.25 ms read and 0.8-1 ms write latency.
For IOPS: each thread (per unit of queue depth) will see IOPS = 1/latency. The total maximum number of useful threads scales with the number of OSDs, roughly x16.
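Putting rough numbers on that rule of thumb for illustration (reading "roughly x16" as about 16 threads per OSD, which is an approximation, not an exact limit):

```
# Per-thread IOPS from latency (IOPS = 1 / latency), per the rule of thumb above.
journaled_hdd_write_latency_s = 0.015    # ~15 ms
cached_read_latency_s = 0.00025          # ~0.25 ms

write_iops_per_thread = 1 / journaled_hdd_write_latency_s   # ~66
read_iops_per_thread = 1 / cached_read_latency_s            # ~4,000

# Useful concurrency scales with OSD count, roughly x16 per OSD.
osds_per_node, nodes = 56, 18
max_useful_threads = 16 * osds_per_node * nodes             # ~16,128 cluster-wide
print(f"write ~{write_iops_per_thread:.0f} IOPS/thread, "
      f"cached read ~{read_iops_per_thread:.0f} IOPS/thread, "
      f"max useful threads ~{max_useful_threads:,}")
```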
For CephFS (used for NFS), it is recommended to use a dedicated pool on flash OSDs for the metadata. For S3, the bucket index pool should also be on SSD. This is metadata only, so a single SSD per node may be enough to serve both pools.
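For anyone curious what that looks like at the Ceph level underneath PetaSAN, a hedged sketch (the pool names below are common Ceph/RGW defaults and are assumptions - check your actual pool names, and prefer PetaSAN's own tooling where it covers this):

```
# Hypothetical sketch: pin metadata-type pools to SSD-class OSDs via a CRUSH rule.
# Pool names are assumed defaults; verify with "ceph osd pool ls" before running.
import subprocess

commands = [
    # replicated CRUSH rule restricted to OSDs with device class "ssd"
    ["ceph", "osd", "crush", "rule", "create-replicated", "ssd-rule", "default", "host", "ssd"],
    # move CephFS metadata and the RGW bucket index onto that rule
    ["ceph", "osd", "pool", "set", "cephfs_metadata", "crush_rule", "ssd-rule"],
    ["ceph", "osd", "pool", "set", "default.rgw.buckets.index", "crush_rule", "ssd-rule"],
]

for cmd in commands:
    subprocess.run(cmd, check=True)  # fail loudly if the cluster rejects a command
```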
psychodad
2 Posts
November 15, 2023, 10:16 am
Thank you for your reply and recommendations!
Taking everything into account, I think we can build a pretty solid PetaSAN cluster with the hardware we have on hand; we are a bit short on NVMe - the current capacity would cover 40 out of the 56 HDDs installed per server, but I hope it won't make a big difference.
Cheers!
admin
2,930 Posts
November 15, 2023, 10:26 am
I would recommend you use 40 of the 56 HDDs until you can get more journal devices. Having a mix will slow things down due to the bottleneck: any client I/O will hit the slower devices many times per second.