
Very slow write speed on low end 3 node cluster


Hello.

A 3-node PetaSAN cluster.

Number of replicas - 2.

Cluster hardware:

Supermicro 6015T:

  • 2x Xeon CPUs @ 2 GHz
  • 8 GB RAM
  • 2x 1 Gbit LAN
  • 4x HDD (one system disk - 250 GB, 500 GB, and 1 TB respectively - plus 3x 2 TB OSDs on each node), all SATA 7200 RPM

Network:

Mgmt, iSCSI1, Backend1 - eth0

iSCSI2, Backend2 - eth1

All links are 1Gbit.

 

Testing:

Direct copy via 'scp' from one node to another:

root@peta1:~# scp seq.txt root@10.10.2.2:/root/seq.txt 53% 766MB 98.0MB/s 00:06 ETA

root@peta1:~# scp seq.txt root@10.10.2.3:/root/seq.txt 60% 853MB 96.5MB/s 00:05 ETA

root@peta1:~# scp seq.txt root@peta2:/root/seq.txt 55% 793MB 100.3MB/s 00:06 ETA

root@peta1:~# scp seq.txt root@peta3:/root/seq.txt 34% 486MB 99.6MB/s 00:09 ETA

where 10.10.2.0/24 is the Backend2 network on eth1

and peta2 is on the Mgmt network on eth0.
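
For completeness, a raw TCP test would isolate the network from the SSH encryption and disk overhead of scp; a minimal sketch, assuming iperf3 is installed on the nodes:

# on peta2, start a listener
iperf3 -s

# on peta1, run a 10-second TCP throughput test against the Backend2 address
iperf3 -c 10.10.2.2 -t 10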

Next I created an iSCSI disk with 2 active paths on the iSCSI1 subnet (because the iSCSI2 subnet is not visible to clients).

Copying a large file from the iSCSI disk to the client (a physical Windows 7 host) - 90 MB/sec.

Copying a large file from the client to the iSCSI disk - 368 KB/sec !!! (the speed continuously decreased from 60 MB/sec and stabilized at this figure after about half an hour)

Commit latency - 100 ms average

Apply latency - 400 ms average !!

IOPS - 1-3 !!!!
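
For reference, the same commit/apply latencies can be sampled per OSD from any node's shell while the copy is running; ceph osd perf is a standard Ceph command:

# per-OSD commit_latency(ms) and apply_latency(ms)
ceph osd perf

# refresh it every 2 seconds during the copy
watch -n 2 ceph osd perf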

 

I tried another VM client - the speed is exactly the same.

I tried PetaSAN 1.5 and 2.0 - the results are similar.

I don't have RAID controllers with write cache, or SSD disks for journals.

The question is: if I buy SSDs so that the config becomes:

  • System - HDD SATA 7200, 500 GB
  • Journal - SSD SATA, 60 GB
  • OSD - HDD SATA 7200, 2 TB
  • OSD - HDD SATA 7200, 2 TB

will this config significantly increase write speed?

 

While doing your copy/write test, can you observe from the node stats the disk % utilization (%busy) as well as the disk IOPS on all OSDs? Are they all roughly the same? Are any or all maxed out?
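
If the node stats pages are hard to watch live, the same numbers can be sampled from the shell; a sketch assuming the sysstat package is installed:

# extended per-device stats (r/s, w/s, %util) refreshed every 2 seconds
iostat -x 2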

Can you also perform the following cluster benchmark: 4K IOPS, for both 1 and 64 threads?
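
The PetaSAN benchmark page drives this for you; roughly equivalent shell commands would be the following, assuming a pool named 'rbd' - substitute your actual pool name:

# 4K writes, 1 thread, 60 seconds
rados bench -p rbd 60 write -b 4096 -t 1 --no-cleanup

# 4K writes, 64 threads
rados bench -p rbd 60 write -b 4096 -t 64 --no-cleanup

# remove the benchmark objects afterwards
rados -p rbd cleanup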

Yes, an SSD journal will at least double your write speed, and adding more HDDs will scale performance linearly, but I would recommend we first find out why you are getting low numbers with the existing hardware.

Copying a large 11 GB file (from client to iSCSI disk) has been in progress for more than an hour... average speed - 365 KB/sec.

[node stats and benchmark test screenshots]

The system is very slow, but I am not able to see a specific bottleneck. Even the system disk is overly busy, probably just from the act of gathering node stats, which is unrelated to Ceph.

If you have other hardware, I would recommend you try it. Otherwise it may help to know that running multiple copy commands will give higher total speed (approx. 10 times, as per the 4K tests), adding an SSD journal will at least double the write speed, and adding more disks will linearly increase total write performance.

Quote from admin on April 5, 2018, 3:25 pm

Otherwise it may help to know that running multiple copy commands will give higher total speed (approx. 10 times, as per the 4K tests)

I'm not sure I understand what 'running multiple copy commands' means. How does one achieve this in production?

adding an SSD journal will at least double the write speed

If the speed is doubled, it will still be lower than 1 MB/sec.

How can you predict the SSD's effect at best - triple, 5x, 10x?

There is some uncertainty about the SSD: can it really increase write speed significantly, to at least 10 MB/sec?

 

P.S.

I am using Seagate Barracuda Compute ST2000DM006 drives for the OSDs.

CrystalDiskMark results:

CrystalDiskMark 6.0.0 x64 (C) 2007-2017 hiyohiyo
Crystal Dew World : https://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 220.823 MB/s
Sequential Write (Q= 32,T= 1) : 215.783 MB/s
Random Read 4KiB (Q= 8,T= 8) : 1.567 MB/s [ 382.6 IOPS]
Random Write 4KiB (Q= 8,T= 8) : 1.440 MB/s [ 351.6 IOPS]
Random Read 4KiB (Q= 32,T= 1) : 1.591 MB/s [ 388.4 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 1.336 MB/s [ 326.2 IOPS]
Random Read 4KiB (Q= 1,T= 1) : 0.639 MB/s [ 156.0 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 1.441 MB/s [ 351.8 IOPS]

Test : 1024 MiB [D: 0.0% (0.1/1863.0 GiB)] (x5) [Interval=5 sec]
Date : 2018/04/06 11:48:16
OS : Windows 7 Professional SP1 [6.1 Build 7601] (x64)

So basically my previous write speed of about 400 KB/sec agrees well with the Random Write result of 1440 KB/sec divided by the number of OSDs per node (1440 / 3 ≈ 480 KB/sec). Is my assumption right?

P.P.S.

I have now run the same test on a VM local disk (host with 4x 1 TB Seagate SATA drives on an LSI RS2BL080 with 512 MB battery-backed cache, RAID 10):

-----------------------------------------------------------------------
CrystalDiskMark 6.0.0 (C) 2007-2017 hiyohiyo
Crystal Dew World : https://crystalmark.info/
-----------------------------------------------------------------------
* MB/s = 1,000,000 bytes/s [SATA/600 = 600,000,000 bytes/s]
* KB = 1000 bytes, KiB = 1024 bytes

Sequential Read (Q= 32,T= 1) : 245.564 MB/s
Sequential Write (Q= 32,T= 1) : 171.507 MB/s
Random Read 4KiB (Q= 8,T= 8) : 1.801 MB/s [ 439.7 IOPS]
Random Write 4KiB (Q= 8,T= 8) : 4.457 MB/s [ 1088.1 IOPS]
Random Read 4KiB (Q= 32,T= 1) : 1.938 MB/s [ 473.1 IOPS]
Random Write 4KiB (Q= 32,T= 1) : 4.880 MB/s [ 1191.4 IOPS]
Random Read 4KiB (Q= 1,T= 1) : 0.217 MB/s [ 53.0 IOPS]
Random Write 4KiB (Q= 1,T= 1) : 3.806 MB/s [ 929.2 IOPS]

Test : 1024 MiB [C: 63.1% (63.1/99.9 GiB)] (x5) [Interval=5 sec]
Date : 2018/04/06 14:32:34
OS : Windows 7 Professional SP1 [6.1 Build 7601] (x86)

The random write speed is 4x faster (thanks to the write-back cache on the RAID controller),

so I hope to see similar results on the PetaSAN cluster after adding SSDs as journal disks.

 

 

Hi,

I am not familiar with the hardware you are using; a 400 ms apply latency is very high for a simple copy command, and I do not know if this indicates a hardware issue or not. I had suspected a slow disk, but that does not appear to be the case. Note that your RAM is too low: we recommend 16 GB for iSCSI nodes, counting 2 GB for the mon plus 2 GB per OSD, so with 3 OSDs that is already 2 + 3 x 2 = 8 GB for Ceph alone on an 8 GB node. Can you check your RAM and also your CPU load? If you have any power save settings in the BIOS, disable them.

Can you also check the network latency between nodes via:

# measure latency
ping IP_ADDRESS

# time to send 100K packets
ping -c 100000 -f IP_ADDRESS

Yes, a 4-5x increase from a write-back cache is typical; the SSD journal will give about 2x. Also, PetaSAN 1.5 with FileStore may be less stressful on your system.

I'm not sure I understand what 'running multiple copy commands' means. How does one achieve this in production?

If you copy/write several different files at the same time, your total combined speed will increase about 10x; this follows from the 4K benchmark results of your current setup, and the more disks you add, the more combined write speed you will get. Production loads typically have more than one operation running at the same time.
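
A minimal illustration from a Linux client - the mount point /mnt/iscsi is hypothetical - writing four files to the iSCSI disk in parallel and waiting for all of them:

# four concurrent 1 GB writers, bypassing the page cache
for i in 1 2 3 4; do
    dd if=/dev/zero of=/mnt/iscsi/test$i bs=1M count=1024 oflag=direct &
done
wait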

Ping latency is < 1 ms:

root@peta1:~# ping -c 100000 -f peta2
PING peta2 (192.168.120.231) 56(84) bytes of data.

--- peta2 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 16003ms
rtt min/avg/max/mdev = 0.031/0.140/0.440/0.036 ms, ipg/ewma 0.160/0.193 ms
root@peta1:~# ping -c 100000 -f peta3
PING peta3 (192.168.120.232) 56(84) bytes of data.

--- peta3 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 16563ms
rtt min/avg/max/mdev = 0.031/0.146/0.983/0.032 ms, ipg/ewma 0.165/0.131 ms
root@peta1:~# ping -c 100000 -f 10.10.2.2
PING 10.10.2.2 (10.10.2.2) 56(84) bytes of data.

--- 10.10.2.2 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 17046ms
rtt min/avg/max/mdev = 0.033/0.151/1.121/0.026 ms, ipg/ewma 0.170/0.181 ms
root@peta1:~# ping -c 100000 -f 10.10.2.3
PING 10.10.2.3 (10.10.2.3) 56(84) bytes of data.

--- 10.10.2.3 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 17124ms
rtt min/avg/max/mdev = 0.038/0.152/1.268/0.026 ms, ipg/ewma 0.171/0.131 ms

 

Now copying a 600 MB file from client to the iSCSI disk:

CPU load < 10%

RAM load ~ 50%

Peta1:

Tasks: 190 total, 1 running, 189 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.2 us, 0.2 sy, 0.0 ni, 98.4 id, 1.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8174052 total, 3228312 free, 1497108 used, 3448632 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 6315952 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1122 root 20 0 71100 37680 24084 S 1.0 0.5 39:04.22 consul
1478 ceph 20 0 1263660 239760 25316 S 0.7 2.9 10:49.24 ceph-osd
180 root 0 -20 0 0 0 S 0.3 0.0 0:03.19 kworker/2:1H
1296 root 20 0 2404292 43604 8856 S 0.3 0.5 10:45.50 ceph-mon
1388 root 20 0 514996 43868 16004 S 0.3 0.5 31:12.54 python
1685 ceph 20 0 1272128 243268 26880 S 0.3 3.0 11:08.88 ceph-osd
1875 ceph 20 0 1300100 213140 25068 S 0.3 2.6 10:50.16 ceph-osd

peta2:

Tasks: 185 total, 1 running, 184 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.8 us, 0.5 sy, 0.0 ni, 93.2 id, 4.5 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8174052 total, 3574384 free, 1135816 used, 3463852 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 6671324 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1361 root 20 0 902232 42676 6444 S 4.3 0.5 27:09.39 glusterf+
1393 root 20 0 2040748 59176 20428 S 1.7 0.7 44:24.17 python
1115 root 20 0 72156 40640 23828 S 1.3 0.5 61:45.75 consul
1283 root 20 0 1134300 46296 19432 S 0.7 0.6 10:35.14 iscsi_se+
1293 root 20 0 2671572 48732 17432 S 0.7 0.6 7:58.37 ceph-mon
167 root 0 -20 0 0 0 S 0.3 0.0 0:07.90 kworker/+
881 root 20 0 65784 26820 3760 S 0.3 0.3 14:46.05 collectl
1604 ceph 20 0 1192932 224080 25036 S 0.3 2.7 11:45.62 ceph-osd

peta3:

Tasks: 170 total, 1 running, 169 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 0.2 sy, 0.0 ni, 95.6 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 8174052 total, 5005044 free, 843396 used, 2325612 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 7016088 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
942 root 20 0 69528 37256 23512 S 1.0 0.5 29:05.36 consul
1493 root 20 0 902232 47356 6488 S 0.7 0.6 21:01.17 glusterfsd
420 root 20 0 0 0 0 S 0.3 0.0 0:22.38 jbd2/sda3-8
1271 root 20 0 509472 42480 16104 S 0.3 0.5 8:42.85 iscsi_service.p
1288 root 20 0 1608372 60460 14584 S 0.3 0.7 4:05.56 ceph-mon
1381 ceph 20 0 1196788 205648 24348 S 0.3 2.5 7:15.08 ceph-osd
1477 root 20 0 514996 42252 16012 S 0.3 0.5 22:03.03 python
1 root 20 0 37820 6008 4088 S 0.0 0.1 0:04.25 systemd

There is no significant CPU or RAM load. I don't think they are related to the low write speed.

OK, SSDs are on the way.

Meanwhile I want to check some parameters in my cluster.conf file.

For example, my current config:

# Generic Entry Level Hardware, use defaults
osd_op_threads=2
filestore_op_threads=2
filestore_queue_max_bytes=104857600
filestore_queue_max_ops=50
filestore_max_sync_interval=5
journal_max_write_entries=100
journal_max_write_bytes=10485760
ms_dispatch_throttle_bytes=104857600
objecter_infilght_op_bytes=104857600
objecter_inflight_ops=1024

If I use an SSD for the journal, what changes to these parameters are needed?

For example:

filestore min sync interval
filestore max sync interval
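
For reference, these are Ceph FileStore options; in ceph.conf syntax they would look like the lines below. The values are purely illustrative, not a recommendation:

[osd]
# how long FileStore may batch writes before syncing the backing disk
filestore min sync interval = 0.01
filestore max sync interval = 5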

 

I would not recommend changing them.

They can be changed from the cluster tuning page during deployment via the show details icon, or you can change them manually node by node. Again, this is not recommended, except for the few people who really need to change these.
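
For completeness, a manual node-by-node change would look roughly like this on a stock Ceph layout; PetaSAN may manage these files itself, so treat it as a sketch:

# edit the [osd] section on each node (the cluster name may differ from 'ceph')
nano /etc/ceph/ceph.conf

# restart all OSDs on that node so the change takes effect
systemctl restart ceph-osd.target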

Something weird.

The SSDs have had zero effect.

Large 7 GB file copy - 360 KB/sec average.

6000 small files copy (2 GB total) - 213 KB/sec.
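
One thing worth verifying first is that the OSD journals actually moved to the SSD. On a FileStore OSD the journal is a symlink; a quick check, assuming default OSD paths:

# each journal symlink should point at a partition on the SSD
ls -l /var/lib/ceph/osd/*/journal

# confirm which physical device those partitions belong to
lsblk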
