Optimal sizing for caches
stevehughes
10 Posts
May 14, 2020, 12:52 am
Hi, new user here. I have been 'playing' with Ceph in various incarnations with a view to using it as iSCSI production storage for our vSphere cluster. PetaSAN looks like it might be the solution we are looking for.
We run a mixture of workloads with a range of I/O requirements, from hosted Exchange and SQL servers down to online backup. About 90% of our data would be cold. Our current SAN handles this nicely through tiering (whereby the hot data ends up promoted to SSD and the cold data demoted to the HDDs). We have experimented in Ceph using a cache hot tier with mixed results. We note that Red Hat has deprecated cache tiering and that it is not supported in PetaSAN. Therefore I'm pondering the best use of the supported caching technologies (which I assume would be the WAL at OSD level along with dm-writecache) to achieve both fast acknowledgement for writes and a good level of caching for reads.
We have 3 Ceph nodes to start with. Our bulk storage is 8x 10TB spinners in each box. Each box also contains a pair of 2TB S4610 SATA SSDs and a single 4TB P4610 NVMe, but we will add whatever is needed for an optimal solution.
Our current experiments with Ceph have used the SATA SSDs for DB/WAL (carved into 440G per OSD with 4 OSDs per SSD), and the NVMe for a cache tier.
I understand that dm-writecache, in addition to caching writes, will also serve as a read cache for the data that has been written through it. I assume that the WAL handles only writes.
I'm wondering about the optimal sizing of caches. I think that if the dm-writecache is large enough it will also do a reasonable job of accelerating reads since usually the blocks with frequent reads are also those with frequent writes. Are there any general recommendations around sizing the dm-writecache? I'm also wondering whether increasing the size of the WAL beyond about 400G would provide useful benefit - we might just end up with two large write caches, which seems less than useful.
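For context, this is the back-of-envelope arithmetic I'm working from for one box (the figures are just our hardware, not a recommendation):

# Rough carve-up for one node: 8x 10TB HDD OSDs, 2x 2TB SATA SSDs
# (DB/WAL) and 1x 4TB NVMe (cache). Decimal GB, approximate.
HDD_OSDS = 8
SSD_COUNT, SSD_GB = 2, 2000      # S4610 "2TB"
NVME_GB = 4000                   # P4610 "4TB"

osds_per_ssd = HDD_OSDS // SSD_COUNT          # 4 OSDs per SSD
db_wal_per_osd = SSD_GB // osds_per_ssd       # ~500 GB raw; we carved 440 GB
cache_per_osd = NVME_GB // HDD_OSDS           # ~500 GB of NVMe per OSD

print(f"DB/WAL per OSD: up to ~{db_wal_per_osd} GB (we used 440 GB)")
print(f"NVMe cache per OSD: ~{cache_per_osd} GB")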
Either way, I think we want to avoid putting all the DB/WALs on a single SSD, even if it's NVMe. A failure of the device would kill all the OSDs in the box.
I would appreciate any and all thoughts.
Cheers,
Steve
Last edited on May 14, 2020, 12:54 am by stevehughes · #1
admin
2,930 Posts
May 14, 2020, 12:40 pm
Generally we recommend anywhere from 128 GB to 512 GB per partition. Larger is better, but for most workloads going much beyond that brings little extra benefit. Each active cache partition requires about 2% of its size as RAM, so a 128 GB partition needs roughly 3 GB of RAM. A cache drive can be configured by the admin to have 1 to 8 partitions; we recommend 2-4.
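To illustrate the RAM rule with a few example sizes (the 128 GB case is rounded up to 3 GB above):

# Each active cache partition needs roughly 2% of its size as RAM.
RAM_FRACTION = 0.02

for partition_gb in (128, 256, 512):
    print(f"{partition_gb} GB cache partition -> ~{partition_gb * RAM_FRACTION:.1f} GB RAM")
# 128 -> ~2.6 GB, 256 -> ~5.1 GB, 512 -> ~10.2 GB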
stevehughes
10 Posts
May 15, 2020, 12:57 am
Thanks. We've gone through the installation and setup of our three nodes.
We ran into a problem with the automatic creation of the OSDs. During the setup we chose the 8x spinners as OSD disks, the 2x 2TB SSDs as cache and the 1x 4TB NVMe as journal (not that I'm comfortable having 8 OSDs dependent on a single journal device, but it will do for our initial testing). The setup program created a 60G partition on the journal drive for each OSD, and a 600G partition on the cache drives for each OSD (three partitions on each SSD). It didn't ask us how large we wanted to make the partitions or how many partitions to put on each cache device. As a result it ran out of cache after creating 6 of the 8 OSDs.
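The arithmetic of why it stopped at 6 (using the partition size the installer actually chose, and an approximate usable capacity for the SSDs):

# Installer carved fixed 600G cache partitions per OSD, so each
# "2TB" SSD (~1863 GiB usable, approximate) only fits 3 of them.
CACHE_DEVICES = 2
CACHE_DEVICE_GIB = 1863
CACHE_PARTITION_GIB = 600
HDD_OSDS = 8

partitions_per_device = CACHE_DEVICE_GIB // CACHE_PARTITION_GIB   # 3
osds_with_cache = CACHE_DEVICES * partitions_per_device           # 6
print(f"OSDs that got a cache partition: {osds_with_cache} of {HDD_OSDS}")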
Since the setup process doesn't ask what size it should make the journal and cache for each OSD we assumed that it would carve up the space automatically to serve the required number of OSDs, but it hasn't done that. So my questions at this stage are:
- How is PetaSAN making decisions regarding the size of the journal and cache for each OSD, and can/should the default sizing be adjusted by us when creating the OSDs?
- Is a 60G journal per OSD optimal, and is there any benefit in making it larger?
- Is it possible to bypass the automatic creation of OSD at install time and do it manually later?
Thanks,
Steve
admin
2,930 Posts
May 15, 2020, 2:18 am
I am not sure of the exact details, but it should determine the count and size of the cache partitions based on the number of cache and HDD disks; it should have supported all 8 HDDs.
The 60 GB journal/DB partition is the correct size. We do not recommend changing it.
Yes, absolutely: you can bypass auto creation and then add the OSDs after you have created the cluster; you will have full control.
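As a rough sketch of the idea (illustrative only, this is not the actual PetaSAN code, which may size things differently): split the HDD OSDs across the cache devices and size each partition to fit.

# Illustrative only: divide the HDD OSDs across the cache devices
# and size each cache partition to fill the device.
def plan_cache_partitions(n_hdds, cache_device_gib, n_cache_devices,
                          max_partitions=8):
    per_device = -(-n_hdds // n_cache_devices)   # ceiling division
    per_device = min(per_device, max_partitions)
    partition_gib = cache_device_gib // per_device
    return per_device, partition_gib

# 8 HDDs across 2x ~1863 GiB SSDs -> 4 partitions of ~465 GiB each
print(plan_cache_partitions(8, 1863, 2))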
Last edited on May 15, 2020, 2:19 am by admin · #4