Support for SSD Affinity in a mixed media pool
marko
5 Posts
March 8, 2018, 5:00 am
Great post from ceph-users about assigning primary affinity to SSDs. Basically you build a pool whose primary OSDs are SSDs, while the 2nd and 3rd replicas go to (SSD-journaled or WAL'd) HDDs. Reads always hit the SSDs.
(edit: adding illustration for clarity)
So in your pool you have 18 1TB SSDs spread across 3 hosts (6 SSDs each, plus a RAID1 boot/OS volume). Then, for the 2nd and 3rd replicas, which are mostly write-only except when a primary SSD fails and needs a rebuild, perhaps 3 more hosts, each with 6 2TB HDDs and two decent SSDs in RAID1 for OS + WAL/RocksDB (so your writes are ACK'd right away, as if the array were all flash).
You would get 18TB of flash-backed storage for far less than half the cost of building the entire pool on SSD (plus the added theoretical write durability HDDs bring for your 2nd and 3rd replicas).
This is different from cache tiering, since all your data is on SSD (unless you lose an SSD, in which case it will backfill from the HDDs).
Explained in this Ceph-users post: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018487.html
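To make it concrete, here is a minimal sketch of the kind of CRUSH rule that post describes, assuming Luminous-style device classes (the rule name, id and the "default" root are placeholders, adjust to your own map):

rule ssd-primary {
    id 5
    type replicated
    min_size 1
    max_size 10
    step take default class ssd
    step chooseleaf firstn 1 type host
    step emit
    step take default class hdd
    step chooseleaf firstn -1 type host
    step emit
}

The first take/chooseleaf places replica 1 (the primary, so all reads) on an SSD host, and the second take fills the remaining replicas from HDD hosts. You'd decompile the map with crushtool, add the rule, recompile it, load it with ceph osd setcrushmap, and point the pool at it with ceph osd pool set <pool> crush_rule ssd-primary.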
It would be fantastic to have this kind of functionality accessible in the PetaSAN GUI, especially since I'm assuming many people's use case here is VMware or other hypervisor storage. It makes SSDs almost affordable, since you're really only paying for the SSDs once instead of 3x.
You'd still need to put an SSD or two in each HDD host for WAL/RocksDB, but that's pretty insignificant cost-wise.
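If you go the pure primary-affinity route instead of (or on top of) a custom rule, the gist is just to drop the affinity of the HDD OSDs to 0 so they are never elected primary, roughly like this (the OSD ids here are made up):

ceph osd primary-affinity osd.12 0      # repeat for each HDD OSD
ceph osd primary-affinity osd.3 1.0     # SSD OSDs keep the default of 1

Older releases may also need mon_osd_allow_primary_affinity = true before the OSD map accepts the setting.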
Last edited on March 8, 2018, 7:16 am by admin · #1
admin
2,921 Posts
March 8, 2018, 1:02 pm
Thanks for the suggestion, it does look interesting. It is not strictly part of the pool and CRUSH map customization work, but it is related, so I will see if we can also add it.
RafS
32 Posts
May 25, 2018, 7:34 am
It would be very nice indeed if we could have this in PetaSAN.