Performance Problems: NVMe as bottleneck
alienn
37 Posts
June 2, 2019, 6:36 pm
Hi,
I have a three node cluster. Each node has:
- 1x Intel Xeon Silver 4110
- 128GB of RAM
- 1x Intel 900P 280GB as NVMe journal
- 8x 10TB Nearline SAS drives as OSDs
The NVMe has 20GB journaling partitions.
When I run the benchmark test I get something around 4-5k IOPS read/write and about 1.5GB/s throughput. The OSDs are quite bored, while the journaling drive is always at 100% and is thus the bottleneck.
This is something I don't get: the 900P should be capable of reaching around 500k IOPS read/write and about 2.5GB/s throughput. How can it be the bottleneck in the IOPS test?
This behaviour can be seen on all three nodes.
I wonder if there are some tuning parameters that can be set to achieve better IOPS performance...
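A quick way to sanity-check the Optane's raw read performance outside of Ceph would be a read-only fio run against the journal device (a sketch only; it assumes fio is available on the node and that the journal device is /dev/nvme0n1, and the --readonly flag keeps the run non-destructive):
# read-only 4k random-read spot check against the raw NVMe device; --readonly refuses all writes
fio --name=optane-randread --filename=/dev/nvme0n1 --readonly \
    --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=30 --time_based --group_reporting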
Last edited on June 7, 2019, 12:17 pm by alienn · #1
admin
2,930 Posts
June 2, 2019, 8:48 pm
Generally you have 24 spinning disks and you are getting 4-5k random client IOPS, i.e. about 200 client IOPS per disk. Note the journal is not a cache: it speeds up RocksDB metadata lookups (to find the addresses of objects), but a read is still served from the spinning disk as a regular operation. For writes, data is saved to the WAL journal first and then flushed to the spinning disk, which gives the IO scheduler a chance to optimize. Also note that for every client write IOP there are several backend IOPs across your replicas plus database writes, so without a journal you would not be able to get 4-5k write IOPS.
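A rough back-of-the-envelope sketch of that arithmetic, with purely illustrative numbers (around 200 random IOPS per NL-SAS spindle and 2 backend ops per replica write are assumptions, not measurements from this cluster):
# hypothetical estimate, runnable in bash
disks=24; iops_per_disk=200          # ~200 random IOPS per 7.2K NL-SAS spindle (assumed)
replicas=3; backend_ops_per_write=2  # data write + rocksdb/WAL overhead per replica (assumed)
echo "raw backend IOPS across the cluster: $(( disks * iops_per_disk ))"                                  # ~4800
echo "est. client write IOPS without journal: $(( disks * iops_per_disk / (replicas * backend_ops_per_write) ))"  # ~800
The journal absorbing the WAL/DB writes, plus the batching the IO scheduler can do on the flushes, is what lets the observed client numbers climb above this naive spindle-only estimate.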
1) Are you using replicated or EC pools, and if so what size?
2) Can you run the IOPS benchmark using 2 client nodes at 64 threads for 5 min? What %cpu and %disk do you get?
3) The 5 min test will allow a couple of samples on the dashboard charts (one sample per minute). Do the cpu% and disk% agree with the benchmark page? What are the individual raw disk IOPS you see on the charts?
4) We use iostat/sar for stats measurements, and it is possible the Intel Optane 100% utilization is a wrong iostat reading. Can you measure using the following tools (installed with PetaSAN) while the benchmark is running:
atop
collectl -sD
sar -d -p 2 5
5) Can you run the IOPS test for 1 min using 1 thread only to determine your latency?
6) In an earlier post you mentioned you use Areca; do you have write-back cache (with BBU) enabled?
7) If you have spare unused disks (Optane + SAS) you can run the PetaSAN raw disk test (destroys data) from the blue console menu; it will help to know what the system is capable of.
8) The 20 GB journal size is very low with BlueStore; the default size in PetaSAN is 60 GB. How full are the OSDs (approx. TB)? A couple of commands that may help check this are sketched below.
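For point 8, two standard commands can report the journal partition sizes and how full the OSDs are (a sketch; the exact output layout depends on your lsblk and Ceph versions):
# block devices and partition sizes -- the 20GB journal partitions on the Optane should show up here
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# per-OSD size, raw use and %USE as tracked by Ceph, grouped by host
ceph osd df tree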
Last edited on June 2, 2019, 11:12 pm by admin · #2
alienn
37 Posts
June 3, 2019, 3:04 pm
Hi,
here is the first part of my reply. And thanks for the fast response. 🙂
- I'm using a replicated pool with three replicas. The size is about 100TB
- I'll send these infos later on
- I'll send these infos later on
- See here
- I'm using Areca with passthrough disks, BBU and write-back cache
- I'm sorry. There are no unused disks available
- The cluster is quite new and the OSDs are still quite empty (about 6% usage).
What really irks me is the utilization of the OSDs and the journal. I'll try to collect the other information as soon as possible. Here are some more screenshots of the dashboard: Link
Last edited on June 3, 2019, 3:15 pm by alienn · #3
admin
2,930 Posts
June 3, 2019, 4:16 pm
The positive thing is that the write latency of 0.7 ms (1384 IOPS for a single thread) is very good. 3 ms read latency for a spinning disk is also good. So single-threaded performance is quite good; something is not scaling well.
I suspect the reading of 100% busy on the Intel Optane is incorrect in iostat; it is not showing in atop. Please do run the collectl and iostat/sar disk stats commands from my previous post manually; maybe the journal is not the problem.
I am suspicious of the CPU % busy shown in atop, with ceph-osd processes at 100%. Can you re-run the 5 min test with 64 threads and re-check the atop readings? Also run the collectl and iostat/sar CPU stats:
collectl -sC
collectl -sc
sar 2 5
sar -P ALL 2 5
And also show the CPU % utilization from the dashboard charts.
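One optional cross-check, not mentioned above, is a per-thread view of the ceph-osd processes; it shows whether "100%" in atop means one saturated thread or the whole process being CPU bound (this assumes the sysstat pidstat utility is present on the node):
# per-thread CPU usage of all ceph-osd processes, sampled every 2 seconds, 3 samples
pidstat -t -p "$(pgrep -d, ceph-osd)" 2 3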
Last edited on June 3, 2019, 4:38 pm by admin · #4
admin
2,930 Posts
June 5, 2019, 9:51 pm
I think you are hitting this issue of incorrect %utilization on some NVMe devices on recent kernels, so the NVMe is not the bottleneck:
https://access.redhat.com/solutions/3901291
Note that 24 spinning disks should give a theoretical 3-4k raw device IOPS. For a read op you additionally need a database lookup op for the object location, and for writes you further multiply this by 3 for your replicas, so you would expect much lower net client IOPS from a pure spinning disk solution. Getting 7K writes and 11K reads using the journal + Areca is not that bad, but I would expect a bit more boost from the Areca. I do see similar setups giving 15K write / 25-30K read, but they usually have a higher number of spinning disks per host: 16 or 20. Typically you would have no more than 10 SSDs per host, but for magnetic disks it is usually higher.
Getting the info requested will give us a better picture of whether there is something we can optimize; please make sure you run the test using 2 clients with 64 threads each for 5 min as mentioned. One additional thing that can help is to monitor the 2 client nodes with atop: these nodes act in a dual role as both a PetaSAN server and a client simulator, and sometimes this additional client role saturates those nodes and hence does not give full cluster performance. The more client nodes the better, but in your case the maximum to use is only 2. We only report resource load on machines not running the client role, so it is possible in some cases to observe less than the actual performance, and loads that are not saturated.
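To double-check whether the Optane is genuinely saturated despite the misleading %util, the extended device stats can be watched during a benchmark run (a sketch; the device name nvme0n1 is an assumption, and the exact column names differ between sysstat versions):
# extended per-device stats every 2 seconds, 5 samples
# on NVMe, %util can show 100% even with plenty of headroom; short wait/service times and a
# small average queue size alongside a "100%" reading suggest the device is not the real bottleneck
iostat -xd nvme0n1 2 5
sar -d -p 2 5 | grep -i nvme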
Last edited on June 5, 2019, 9:59 pm by admin · #5
alienn
37 Posts
June 7, 2019, 12:17 pm
Thanks for all the input. I'll provide the missing data after my vacation. I'll be back in two weeks. Thanks for your patience.
Cheers,
Nicki