Untuned Ceph Setup quite slow
fx882
17 Posts
September 20, 2017, 2:38 pm
Hi,
I tested my PetaSAN setup, and its performance is not really impressive. It is built with entry-level used hardware, and no tuning has taken place (no jumbo frames, nothing).
Did I make any major error in my setup, or is this the way it normally is?
When I look at the Ceph cluster during the tests, it is idling: no CPU usage spikes, no network spikes, no HDD read/write spikes.
And yes, pure raw sequential I/O is not the main strength of Ceph; concurrency is.
Hardware:
- 4 nodes, Intel Core i7, 16 GB RAM, 5 x Intel e1000 NIC
- all networks on separate 1 Gbit/s VLANs
- OS disk separate from OSD disks
- 12 OSDs (3 x SATA 500 GB, 3 x SATA 600 GB, 6 x SATA 1 TB on an Adaptec ASR5405)
Testing environment:
- simple Citrix XenServer VMs running on Ceph and on local storage for comparison, one local SR with a single disk, the other with a 4-disk RAID-10
- the comparison machines on RAID-10 were running in production; the Ceph system is test-only, without load
- simple direct+sync sequential I/O (read + write)
These are the scripts I used:
#!/bin/bash
# SYNC WRITE TEST: repeatedly writes 1 GB of random data with sync+direct I/O
export LC_ALL=C
COPY_SIZE_MB=1000
# staging file lives in RAM (tmpfs), so only the write to /tmp/out is timed
TEMP=$(mktemp --tmpdir=/dev/shm)
trap "rm -f '$TEMP' ; rm -f /tmp/out ; exit 0" 1 2 3 15
while :; do
    # generate the random source data in RAM
    dd </dev/urandom bs=1M count=$COPY_SIZE_MB >"$TEMP" 2>/dev/null
    # time a synchronous, direct 1 GB write to the target filesystem
    OUT=$({ time dd <"$TEMP" bs=1M oflag=sync,direct >/tmp/out ; } 2>&1)
    rm -f "$TEMP"
    REAL="$(echo -e "$OUT" | grep real | awk '{print $NF}')"
    COPIED="$(echo -e "$OUT" | grep copied)"
    echo "$COPIED in $REAL"
done
#!/bin/bash
# SYNC READ TEST: repeatedly reads a 1 GB file back with direct I/O
export LC_ALL=C
COPY_SIZE_MB=1000
TEMP=$(mktemp --tmpdir=/dev/shm)
trap "rm -f '$TEMP' ; rm -f /tmp/out ; exit 0" 1 2 3 15
while :; do
    # generate 1 GB of random data in RAM and place it on the target filesystem
    dd </dev/urandom bs=1M count=$COPY_SIZE_MB >"$TEMP" 2>/dev/null
    cp "$TEMP" /tmp/out
    # time a direct 1 GB read back from the target (O_DIRECT bypasses the page cache)
    OUT=$({ time dd >/dev/null bs=1M iflag=sync,direct </tmp/out ; } 2>&1)
    rm -f "$TEMP"
    REAL="$(echo -e "$OUT" | grep real | awk '{print $NF}')"
    COPIED="$(echo -e "$OUT" | grep copied)"
    echo "$COPIED in $REAL"
done
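For reference, roughly the same workload can be reproduced with a single fio command per direction, assuming fio is available in the test VM; the options below are only meant to mirror the dd flags used in the scripts above and are an optional cross-check, not part of the runs shown below:
fio --name=seqwrite --filename=/tmp/out --rw=write --bs=1M --size=1G --direct=1 --sync=1 --numjobs=1
fio --name=seqread --filename=/tmp/out --rw=read --bs=1M --size=1G --direct=1 --numjobs=1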
These are the results:
***** 1 GB SYNC/DIRECT WRITE *****
VM1 SR-1 local RAID-10(4 Disks)
1045700608 bytes (1.0 GB, 997 MiB) copied, 6.96848 s, 150 MB/s in 0m6.970s
1045700608 bytes (1.0 GB, 997 MiB) copied, 6.32714 s, 165 MB/s in 0m6.328s
1045700608 bytes (1.0 GB, 997 MiB) copied, 6.63697 s, 158 MB/s in 0m6.639s
1045700608 bytes (1.0 GB, 997 MiB) copied, 5.0153 s, 209 MB/s in 0m5.017s
1045700608 bytes (1.0 GB, 997 MiB) copied, 4.89159 s, 214 MB/s in 0m4.894s
1045700608 bytes (1.0 GB, 997 MiB) copied, 5.3156 s, 197 MB/s in 0m5.317s
1045700608 bytes (1.0 GB, 997 MiB) copied, 4.2745 s, 245 MB/s in 0m4.276s
VM2 on local RAID-10(4 Disks)
1047875584 bytes (1.0 GB, 999 MiB) copied, 4.23788 s, 247 MB/s in 0m4.240s
1047875584 bytes (1.0 GB, 999 MiB) copied, 4.27559 s, 245 MB/s in 0m4.277s
1047875584 bytes (1.0 GB, 999 MiB) copied, 3.93707 s, 266 MB/s in 0m3.938s
1047875584 bytes (1.0 GB, 999 MiB) copied, 4.06792 s, 258 MB/s in 0m4.069s
1047875584 bytes (1.0 GB, 999 MiB) copied, 4.07228 s, 257 MB/s in 0m4.073s
1047875584 bytes (1.0 GB, 999 MiB) copied, 4.45442 s, 235 MB/s in 0m4.456s
VM3 on local Single Disk
1048576000 bytes (1.0 GB) copied, 11.0755 s, 94.7 MB/s in 0m11.087s
1048576000 bytes (1.0 GB) copied, 9.78886 s, 107 MB/s in 0m9.791s
1048576000 bytes (1.0 GB) copied, 13.2969 s, 78.9 MB/s in 0m13.308s
VM4 on ceph 12 x OSD / 1 GBit/s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 81.9041 s, 12.8 MB/s in 1m21.905s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 82.1762 s, 12.8 MB/s in 1m22.177s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 82.601 s, 12.7 MB/s in 1m22.602s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 84.0655 s, 12.5 MB/s in 1m24.066s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 85.9616 s, 12.2 MB/s in 1m25.962s
***** 1 GB SYNC/DIRECT READ *****
VM1 on local RAID-10(4 Disks)
1045700608 bytes (1.0 GB, 997 MiB) copied, 7.36673 s, 142 MB/s in 0m7.843s
1045700608 bytes (1.0 GB, 997 MiB) copied, 7.51898 s, 139 MB/s in 0m7.880s
1045700608 bytes (1.0 GB, 997 MiB) copied, 6.73566 s, 155 MB/s in 0m6.979s
1045700608 bytes (1.0 GB, 997 MiB) copied, 10.1356 s, 103 MB/s in 0m10.147s
1045700608 bytes (1.0 GB, 997 MiB) copied, 7.0072 s, 149 MB/s in 0m7.390s
VM2 on local RAID-10(4 Disks)
1047875584 bytes (1.0 GB, 999 MiB) copied, 5.22031 s, 201 MB/s in 0m5.484s
1047875584 bytes (1.0 GB, 999 MiB) copied, 5.31487 s, 197 MB/s in 0m5.404s
1047875584 bytes (1.0 GB, 999 MiB) copied, 5.0924 s, 206 MB/s in 0m5.201s
1047875584 bytes (1.0 GB, 999 MiB) copied, 5.17836 s, 202 MB/s in 0m5.246s
1047875584 bytes (1.0 GB, 999 MiB) copied, 5.03164 s, 208 MB/s in 0m5.098s
1047875584 bytes (1.0 GB, 999 MiB) copied, 5.37757 s, 195 MB/s in 0m5.471s
1047875584 bytes (1.0 GB, 999 MiB) copied, 5.16821 s, 203 MB/s in 0m5.300s
VM3 on local Single Disk
1048576000 bytes (1.0 GB) copied, 12.8235 s, 81.8 MB/s in 0m12.831s
1048576000 bytes (1.0 GB) copied, 12.0348 s, 87.1 MB/s in 0m12.040s
1048576000 bytes (1.0 GB) copied, 9.85168 s, 106 MB/s in 0m9.865s
1048576000 bytes (1.0 GB) copied, 15.1218 s, 69.3 MB/s in 0m15.126s
VM4 on ceph 12 x OSD / 1 GBit/s / 4 Nodes
1048576000 bytes (1.0 GB, 1000 MiB) copied, 53.4321 s, 19.6 MB/s in 0m53.433s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 21.9414 s, 47.8 MB/s in 0m21.942s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 23.038 s, 45.5 MB/s in 0m23.039s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 24.2718 s, 43.2 MB/s in 0m24.272s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 23.4865 s, 44.6 MB/s in 0m23.487s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 24.7262 s, 42.4 MB/s in 0m24.727s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 21.8496 s, 48.0 MB/s in 0m21.850s
Last edited on September 20, 2017, 3:18 pm by fx882 · #1
admin
2,930 Posts
September 20, 2017, 4:43 pm
Hi,
Can you please run the cluster benchmark in PetaSAN (from the left menu) and measure your 4M throughput for 1, 16 and 64 threads, plus your resource busy % for all 3 cases?
Last edited on September 20, 2017, 4:58 pm by admin · #2
fx882
17 Posts
September 20, 2017, 6:31 pm
Yes, will do, once I have reinstalled the test cluster, which I have destroyed by now 🙂
admin
2,930 Posts
September 21, 2017, 8:43 am
Yes, the benchmarks will show a lot.
The current tests also show your raw disk performance averages around 90 MB/s, which is low.
Ceph does act like a giant RAID using all disks when you have concurrent I/O (many clients/threads). The I/O path for a single client operation, however, uses single disks, and its performance will be less than raw disk performance, unlike RAID.
Theoretically, the max speed for a single thread is:
read speed = raw disk read speed
write speed = raw disk write speed / 4 for 2 replicas, or / 6 for 3 replicas
Theoretically, the max speed for concurrent I/O is:
read speed = raw disk read speed * number of disks in cluster
write speed = (raw disk write speed / 4 for 2 replicas, or / 6 for 3 replicas) * number of disks in cluster
For a single I/O stream, Ceph is slower than a single disk and of course slower than RAID, but it scales out in an almost unlimited way for concurrent I/O.
For a replica count of 2, we do 4 writes per client write due to Ceph's current two-phase commit approach, much like a transactional database. With the BlueStore engine, the speed will be doubled.
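As a rough sanity check of these formulas (purely illustrative, using the ~90 MB/s average raw disk speed mentioned above):
# back-of-the-envelope single-thread ceilings for a ~90 MB/s raw disk
RAW=90
echo "read  (single thread):             $RAW MB/s"
echo "write (single thread, 2 replicas): $((RAW / 4)) MB/s"   # ~22 MB/s
echo "write (single thread, 3 replicas): $((RAW / 6)) MB/s"   # ~15 MB/s
These ceilings are roughly in line with the 12-13 MB/s single-stream writes measured in the first post.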
Last edited on September 21, 2017, 8:59 am by admin · #4
fx882
17 Posts
September 21, 2017, 9:12 am
Is there a possibility to reset the cluster setup (destroying the current storage) without having to reinstall all nodes?
EDIT: I checked the read/write speed of every single disk. The speeds range up to 150 MB/s (used SAS disks), and one device is even down at 45 MB/s. I assume the slow device is slowing down the whole cluster. I'll drop all disks below 100 MB/s and test again. Since I'm using only used/junk hardware, I don't expect miracles.
Since I'm using replica count 3, it's even slower. (Replica count 2 does not seem like a safe option to me: two concurrent disk failures in a cluster of 20-40 disks is not such a rare case, and it's only a question of time until two disks holding replicas of the same data fail. Maybe using RAID-5 OSDs is better than increasing the replica count.)
EDIT 2: Regarding the test above: it was done with very different HDDs and controllers. Some old disks on 3ware controllers (9750-4i), backed by synchronous DRBD, gave me poor write speeds of 8-12 MB/s. The same setup with newer hard disks and write-cached LSI controllers (9271-4i) gave high write speeds of 600 MB/s.
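For reference, a quick raw read-speed check per disk can be done roughly like this (read-only, so it is safe on populated OSDs; a raw dd write test would destroy data, and the device range below is only a placeholder):
# sequential raw read speed per device, 1 GB each, bypassing the page cache
for DEV in /dev/sd{b..m}; do
    echo -n "$DEV: "
    dd if=$DEV of=/dev/null bs=1M count=1024 iflag=direct 2>&1 | grep copied
done
# hdparm -t /dev/sdb gives a similar buffered-read figure per disk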
Last edited on September 21, 2017, 10:22 am by fx882 · #5
admin
2,930 Posts
September 21, 2017, 12:15 pm
The creation of the cluster is done once, during deployment of the 3rd node. You can then delete/add OSDs and non-management nodes, but you cannot delete or change your management nodes, since their data is hard-coded in the Ceph monitors. To rebuild the cluster, it is probably easier to reinstall from scratch.
Re disk types: it is much better to use the same model/size of disks and not to mix them. A slow disk will slow down your entire cluster, and even a disk double the size of the others will be loaded twice as much and become a bottleneck. This is not a problem in large clusters, but in small clusters it has a large effect. The PetaSAN benchmark will also tell you which disk is slowing you down.
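To see how mixed disk sizes translate into load, the standard Ceph CLI on a management node can show per-OSD weight and utilization; a minimal sketch (generic Ceph commands, not PetaSAN-specific tooling):
# per-OSD size, CRUSH weight, utilization and PG count; an oversized or
# overloaded OSD stands out in this listing
ceph osd df tree
# per-OSD commit/apply latency, which usually exposes a single slow disk
ceph osd perf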
fx882
17 Posts
September 21, 2017, 4:07 pm
- Cluster hardware changed to 12 x OSD with 1 TB HDDs (SATA/SAS mixed) as single drives, all at least 140 MB/s
- Cluster reinstalled
- Benchmarks done for 4M with 1, 16 and 64 threads
Screenshots of Benchmarks
https://imgur.com/1rwHlyc
https://imgur.com/vDBwfAC
https://imgur.com/MtBDrKq
Note: I first reinstalled the cluster with jumbo frames enabled. Now that my switch is correctly configured for jumbo frames, the cluster setup completes correctly. Nevertheless, the cluster wasn't stable: OSDs failed and came back online, so I had a lot of PG failures that could not be fixed automatically. So I reinstalled again without jumbo frames, and now the cluster is stable again. Maybe a switch issue? (HP ProCurve J4904A Switch 2848, not exactly brand new.)
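One way to verify whether jumbo frames actually pass end to end through the switch is a do-not-fragment ping between backend interfaces; a quick sketch (the IP is a placeholder for a peer node's backend address):
# 8972 = 9000-byte MTU minus 28 bytes of IP/ICMP headers; -M do forbids fragmentation.
# If this fails while a normal ping works, some hop (NIC, VLAN or switch port)
# is not passing 9000-byte frames.
ping -M do -s 8972 -c 3 10.0.2.12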
EDIT: Regarding my own little benchmark: write rates are 2x what they were before; read rates are at the same level.
./script-read.sh
1048576000 bytes (1.0 GB, 1000 MiB) copied, 21.3062 s, 49.2 MB/s in 0m21.307s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 19.9498 s, 52.6 MB/s in 0m19.951s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 20.1992 s, 51.9 MB/s in 0m20.200s
./script-write.sh
1048576000 bytes (1.0 GB, 1000 MiB) copied, 40.2073 s, 26.1 MB/s in 0m40.208s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 40.2507 s, 26.1 MB/s in 0m40.252s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 33.9868 s, 30.9 MB/s in 0m33.988s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 37.7985 s, 27.7 MB/s in 0m37.799s
Last edited on September 21, 2017, 5:11 pm by fx882 · #7
admin
2,930 Posts
September 22, 2017, 5:57 am
The cluster does scale when you add more I/O clients, though it saturates quickly, since your 16- and 64-thread numbers are similar. However, even when it saturates, its resources (CPU/disks/net) are not fully utilized (max network was 38%), so your cluster as it is will probably give 2x the max numbers shown. The reason the numbers saturated while your fourth node is not fully utilized is that the first 3 nodes simulate client traffic on top of being server nodes; their resource usage increases for this simulation, and they probably saturated their backend-1 1G network due to the doubled client/server traffic. The blue info message at the top of the benchmark page recommends running the client simulation on non-OSD nodes to get accurate results, but the test you did gives us a good idea: you should get higher results (probably double) as is, and if you need more, since your disks/CPU are underutilized, you may want to add/bond your 1G backend NICs.
For your script test, it would be useful to also run it from outside the Xen VMs; it should be easy to set up a Windows client with MPIO and run a file copy, or use a tool like CrystalDiskMark or IOmeter, to see single-client performance on other platforms.
Last edited on September 22, 2017, 5:59 am by admin · #8
fx882
17 Posts
September 22, 2017, 8:15 am
"The blue info message at top of benchmark page recommends running client simulation on non OSD nodes to get accurate results"
Yes, I read that. Since I wanted to spare myself the extra effort of setting up a fifth node, I did it that way. And of course I misunderstood the test screen: I assumed the 3 nodes were the ones being tested.
Furthermore, I did not mention that I had changed from 3 to 2 replicas. I changed back to 3 replicas and ran my tests again. The write speed is a bit lower, as expected.
READ
1048576000 bytes (1.0 GB, 1000 MiB) copied, 20.9689 s, 50.0 MB/s in 0m20.970s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 20.5114 s, 51.1 MB/s in 0m20.512s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 20.5808 s, 50.9 MB/s in 0m20.581s
WRITE
1048576000 bytes (1.0 GB, 1000 MiB) copied, 46.8314 s, 22.4 MB/s in 0m46.832s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 48.0568 s, 21.8 MB/s in 0m48.058s
1048576000 bytes (1.0 GB, 1000 MiB) copied, 46.0883 s, 22.8 MB/s in 0m46.089s
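For reference, the pool replica count can be checked and switched from any management node with the standard Ceph CLI; a minimal sketch (the pool name "rbd" is an assumption here, list your pools first):
ceph osd lspools               # list pool names
ceph osd pool get rbd size     # show the current replica count of pool 'rbd'
ceph osd pool set rbd size 3   # set it back to 3 replicas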
Last edited on September 22, 2017, 8:42 am by fx882 · #9
admin
2,930 Posts
September 22, 2017, 9:10 am
Yes, the test you did gives a good enough idea. You should be able to get higher than these values with actual clients, and you can increase it further by doubling the NICs in your backend networks using NIC bonding, so you get 2 Gbps instead of 1 Gbps interfaces.
Once you know what you can get from the cluster, try running multiple clients, for example many instances of your test scripts at the same time, from different processes, VMs and preferably different physical machines, and see if they scale to reach the cluster's peak numbers (adding up the numbers seen by each client). Also, as I suggested earlier, it will be very beneficial to test clients on platforms other than Xen VMs, such as Windows, to see if there are differences between them, which could point to client iSCSI configuration issues.
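A minimal way to fan several of the existing test scripts out as concurrent clients (script names as used earlier in the thread; ideally each instance runs in a different VM or on a different physical machine):
# start 4 write-test clients in parallel, logging each one separately;
# summing the per-client MB/s gives the aggregate the cluster delivers
PIDS=""
for i in 1 2 3 4; do
    ./script-write.sh > write-client-$i.log 2>&1 &
    PIDS="$PIDS $!"
done
sleep 300      # let the clients run for a few minutes
kill $PIDS     # the test scripts loop forever, so stop them explicitly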
Last edited on September 22, 2017, 9:12 am by admin · #10