Cache disks and failure behaviour
jnickel
12 Posts
December 19, 2019, 9:01 pm
Hi,
My understanding is that if you have SSDs set up as Journal devices and they fail, all OSDs that are using those SSDs for Journaling also fail - so they would all go offline.
What is the behaviour for OSDs that are using an SSD as a write cache if the SSD fails? Do they continue to work, just slower? Do they go offline like the Journal behaviour above?
Thanks,
Jim
admin
2,930 Posts
December 19, 2019, 11:09 pm
Yes, if the cache disk fails, all OSDs connected to it will fail; this is because it is a writeback cache rather than a read cache. You can define 1 to 8 partitions to be created on the cache device, each serving an OSD, so if you want to be conservative you can create 1 partition on a small-capacity SSD serving a large HDD OSD. Note however that Ceph is built with errors in mind, so having multiple OSD failures when a journal/WAL-DB device fails is not a panic situation (within the same node or failure domain), and this also applies to write cache devices.
Last edited on December 19, 2019, 11:11 pm by admin · #2
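To make the failure impact concrete, here is a minimal sketch (my own illustration, not PetaSAN code) of the blast radius when one write-cache SSD serving several HDD OSDs dies; the node layout numbers are assumptions.

```python
# Hypothetical node layout (assumed numbers, not from the thread):
cache_partitions_per_ssd = 4   # PetaSAN allows 1 to 8 cache partitions per device
ssds_per_node = 2
hdd_osds_per_node = cache_partitions_per_ssd * ssds_per_node  # each OSD backed by one cache partition

# A writeback cache holds dirty data, so losing the SSD takes down every OSD it fronts.
osds_lost = cache_partitions_per_ssd
print(f"One cache SSD failure downs {osds_lost} of {hdd_osds_per_node} OSDs on this node.")

# With 3x replication and a per-host failure domain, those failures all sit inside one
# failure domain, so the pool stays available and Ceph recovers the lost copies.
```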
jnickel
12 Posts
December 19, 2019, 11:36 pm
So... where are you better off putting the SSDs? As Journals? As cache devices? Should you have some of each? What will give the best performance in a hybrid system where you have some spinning disk and SSDs to speed things up?
My understanding is a ratio of about 4:1 for HDD:SSD. Or are you better off sizing the SSD as a percentage of total HDD capacity, like 20% of the total HDD size? For example:
10 TB of spinning disk - should that be around 2 TB of SSD? 1 TB of SSD? Less? More?
Jim
admin
2,930 Posts
December 19, 2019, 11:50 pm
If you can have both, that is ideal. If you have to choose, it is difficult to say as it depends on your IO pattern; cache is good when you have small random writes with high concurrency.
jnickel
12 Posts
December 19, 2019, 11:53 pm
What about the question around SSD ratio?
admin
2,930 Posts
December 20, 2019, 12:10 am
If I have to give a number, it is 1:4 with SSDs and 1:8 with NVMes. It really depends on the model of the cache device and its relative speed (sync writes) compared to the HDDs; there is a wide difference between SSD models in sync write performance. You can test the sync write speed yourself in the PetaSAN blue console menu. You should also test a cache setup and look at the %util/busy of the cache device relative to the HDDs in the charts: if it is not busy while your HDDs are, it means you can add more to it.
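As a quick way to do that comparison outside the charts, here is a rough sketch (not a PetaSAN tool) that samples iostat's extended stats and compares the cache SSD's %util against the HDDs it fronts; the device names are assumptions for illustration.

```python
import subprocess

CACHE_DEV = "sdb"            # assumed cache SSD
HDD_DEVS = {"sdc", "sdd"}    # assumed HDD OSD data disks

# Two 1-second samples; the second overwrites the first (boot-average) values.
out = subprocess.run(["iostat", "-x", "1", "2"], capture_output=True, text=True).stdout
util = {}
for line in out.splitlines():
    parts = line.split()
    if parts and parts[0] in HDD_DEVS | {CACHE_DEV}:
        util[parts[0]] = float(parts[-1])   # %util is the last column of iostat -x

hdd_busy = max(util.get(d, 0.0) for d in HDD_DEVS)
print(f"cache %util = {util.get(CACHE_DEV, 0.0):.0f}, busiest HDD %util = {hdd_busy:.0f}")
if util.get(CACHE_DEV, 0.0) < hdd_busy:
    print("Cache device has headroom: it could likely serve more OSD partitions.")
```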
jnickel
12 Posts
December 20, 2019, 12:48 am
My test environment is some old gear I have at home... I have 3 Supermicro chassis with 12 GB of RAM each, 2 x 1 Gb Ethernet and 1 Mellanox InfiniHost 10 Gb InfiniBand card. The HBA is an older LSI card, and I have initially put in 2 x 3 TB hard drives and 1 x 4 TB hard drive. These are enterprise-class drives that came out of a VNX 5400. Then I have 2 x 200 GB SSDs - also enterprise class.
I set up PetaSAN 2.4 with the Mellanox cards for the backend connections and then used the 1 Gb Ethernet for management and iSCSI.
Connected to this are 3 VMware ESXi hosts. I set up my multipathing with Round Robin across all 3 PetaSAN hosts and set ESXi to use a policy of 1 IOP per path before switching.
My previous test on 2.3.1 from a Windows host, using either FileStore or BlueStore, was that the best I could do (with or without SSDs as Journals) was about 50 MBps read/write from the Windows guest.
Now with 2.4, using the SSDs as cache (no external Journal devices) and using BlueStore, I get a write speed of 70 MBps and a read speed of 117 MBps (I think I am hitting the max that 1 Gb Ethernet can do).
Much better! Really nice improvement. I assume it will only get better if I use more disks and add more OSDs, but this is the level of performance I was waiting for so I could really start to use PetaSAN.
NICE JOB!
Jim
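For context on that 117 MBps figure, here is a back-of-the-envelope check (my own numbers, not from the thread) of what a single 1 Gb Ethernet link can deliver for iSCSI traffic:

```python
# Rough throughput ceiling of one 1 Gb Ethernet link.
line_rate_bits = 1_000_000_000        # 1 Gb/s
raw_bytes_per_s = line_rate_bits / 8  # 125 MB/s before protocol overhead
overhead = 0.06                       # assumed ~6% for Ethernet/IP/TCP/iSCSI headers
print(f"practical ceiling ≈ {raw_bytes_per_s * (1 - overhead) / 1e6:.0f} MB/s")
# Prints roughly 117-118 MB/s, consistent with the observed read speed.
```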
admin
2,930 Posts
December 20, 2019, 8:57 am
Thanks for the nice feedback 🙂
How are you testing the speed: the UI benchmark or a file copy? How many threads, what block sizes, sequential or random? Also, is this from a VM or from a bare-metal Windows client? All of these make a difference in what you get...
Last edited on December 20, 2019, 8:59 am by admin · #8
jnickel
12 Posts
December 20, 2019, 3:54 pm
I am testing from a VM running Windows Server 2008 R2. I am using the ATTO benchmark, and I gave you the final numbers with the biggest block sizes. I just used the defaults for testing. Overall, it was better than the previous attempt. On my previous attempt with PetaSAN 2.3.1, I even had more disks, but didn't have the cache feature.
Transfer size goes from 0.5 KB to 8192 KB. Total size of the test was 256 MB. Direct I/O was checked. Overlapped I/O was checked. Queue depth was set at 4. I believe it uses sequential only.
I know that this is not a great test as it only tests throughput from a single VM and doesn't test random I/O, but I have found that I can gauge the overall speed I am likely to get from using this. If I can come close to maxing out a single 1 Gb Ethernet link, then the speed of the system under load usually gets me what I am looking for. I have ordered 10 Gb switches and NICs so I can set everything up in a bigger fashion. This is just for my homelab, but I still want bulletproof iSCSI for my VMware environment. I have been using StarWind until now with RAID cards, but I like this better.
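For anyone who wants a scriptable equivalent of those ATTO settings (sequential, direct I/O, 256 MB total, queue depth 4), here is a hedged sketch using fio; the target file and the few block-size points shown are assumptions, and ATTO itself sweeps 0.5 KB through 8192 KB.

```python
# Rough fio equivalent of the ATTO run described above. Assumes fio is installed
# in the guest; use --ioengine=windowsaio on Windows instead of libaio.
import subprocess

for bs in ["512", "64k", "8m"]:          # a few points from ATTO's block-size sweep
    for rw in ["write", "read"]:          # write first so the test file exists for the read pass
        subprocess.run([
            "fio", "--name=atto-like", "--filename=testfile",
            f"--rw={rw}", f"--bs={bs}", "--size=256m",
            "--direct=1", "--iodepth=4", "--ioengine=libaio",
        ], check=True)
```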