radosgw deduplication

f.cuseo
84 Posts
July 11, 2023, 4:58 pm
Hello.
Are you planning to test/include the ceph-dedup-tool and deduplication for S3 storage?
Thank you, Fabrizio

admin
2,967 Posts
July 12, 2023, 8:46 am
It is not something we have planned for the near future. We have not tested it, and from the docs it is an experimental feature. If we get more requests, we can test how stable it is.

f.cuseo
84 Posts
July 12, 2023, 12:35 pm
Thank you for your answer.
If you want, I can test the feature (which, I know, is not stable; ceph-dedup-tool is included in ceph-test) if you can add a petasan-test repository that includes it.

admin
2,967 Posts
July 13, 2023, 5:41 pm
Thanks a lot for your offer. Give us till next week; we will upload this to our repository and hope to get your feedback!

admin
2,967 Posts
July 15, 2023, 7:32 pm
wget https://www.petasan.org/misc/320/ceph-dedup-tool.gz
gunzip ceph-dedup-tool.gz
chmod +x ceph-dedup-tool
./ceph-dedup-tool
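For reference, an estimate run against an RGW data pool would be invoked roughly like this. This is a sketch based on the flags documented for ceph-dedup-tool in the Ceph manual; the pool name is a placeholder, and chunk size, fingerprint algorithm, and thread count will need tuning for your cluster:

```shell
# Hypothetical estimate invocation (flags per the Ceph ceph-dedup-tool docs);
# "default.rgw.buckets.data" is a placeholder pool name.
./ceph-dedup-tool --op estimate \
    --pool default.rgw.buckets.data \
    --chunk-algorithm fastcdc \
    --fingerprint-algorithm sha1 \
    --chunk-size 16384 \
    --max-thread 4
```

This requires a live cluster with credentials, so treat it as a starting point rather than a recipe.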

f.cuseo
84 Posts
July 17, 2023, 1:36 pm
Thank you very much. I will let you know if it works 🙂

f.cuseo
84 Posts
July 19, 2023, 7:43 am
Hello.
After 3 days on an experimental cluster (3 x Dell 2950 with 2 sockets, 32 GByte RAM, 6 x 1 TByte drives, 2 x 1 Gbit Ethernet), on a pool with 2 TByte used, I have this result (PS: I stopped the dedup estimate because, after 1 day at a very good speed, the check became really slow, at 1 object every 3-4 seconds).
If I am reading it the right way, I could save 45% of space with a chunk size of 65536.
I will try to dedup in the next days, when my production cluster is ready and I can free this one.
{
  "chunk_algo": "fastcdc",
  "chunk_sizes": [
    {
      "target_chunk_size": 8192,
      "dedup_bytes_ratio": 0.5140197701981345,
      "dedup_objects_ratio": 94.522847748071996,
      "chunk_size_average": 20646,
      "chunk_size_stddev": 11486
    },
    {
      "target_chunk_size": 16384,
      "dedup_bytes_ratio": 0.51966609698550759,
      "dedup_objects_ratio": 47.995579658742663,
      "chunk_size_average": 40661,
      "chunk_size_stddev": 22515
    },
    {
      "target_chunk_size": 32768,
      "dedup_bytes_ratio": 0.52849744319814296,
      "dedup_objects_ratio": 24.625850435220162,
      "chunk_size_average": 79249,
      "chunk_size_stddev": 43759
    },
    {
      "target_chunk_size": 65536,
      "dedup_bytes_ratio": 0.5404845229128048,
      "dedup_objects_ratio": 12.85525391628258,
      "chunk_size_average": 151811,
      "chunk_size_stddev": 83733
    },
    {
      "target_chunk_size": 131072,
      "dedup_bytes_ratio": 0.56045861870226754,
      "dedup_objects_ratio": 6.9452157231363802,
      "chunk_size_average": 280996,
      "chunk_size_stddev": 155568
    },
    {
      "target_chunk_size": 262144,
      "dedup_bytes_ratio": 0.59483509039284288,
      "dedup_objects_ratio": 3.9720278009763188,
      "chunk_size_average": 491331,
      "chunk_size_stddev": 277709
    },
    {
      "target_chunk_size": 524288,
      "dedup_bytes_ratio": 0.65517814092896842,
      "dedup_objects_ratio": 2.471225849736352,
      "chunk_size_average": 789721,
      "chunk_size_stddev": 476233
    },
    {
      "target_chunk_size": 1048576,
      "dedup_bytes_ratio": 0.75295096283891727,
      "dedup_objects_ratio": 1.6984593450906258,
      "chunk_size_average": 1149030,
      "chunk_size_stddev": 807422
    },
    {
      "target_chunk_size": 2097152,
      "dedup_bytes_ratio": 0.84600300186423216,
      "dedup_objects_ratio": 1.2549401244684233,
      "chunk_size_average": 1555118,
      "chunk_size_stddev": 1302362
    },
    {
      "target_chunk_size": 4194304,
      "dedup_bytes_ratio": 0.87509230134712745,
      "dedup_objects_ratio": 1.0254982056559403,
      "chunk_size_average": 1903056,
      "chunk_size_stddev": 1668604
    }
  ],
  "summary": {
    "examined_objects": 286177,
    "examined_bytes": 558497639572
  }
}
151092s : read 558501833876 bytes so far...
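My reading of the report (an interpretation, but it matches the ~45% figure above): dedup_bytes_ratio is the deduplicated size as a fraction of the original, so the space saved is 1 minus the ratio. For the target_chunk_size 65536 entry:

```shell
# dedup_bytes_ratio = deduplicated bytes / original bytes, so
# space saved = (1 - ratio). Ratio taken from the 65536 entry above.
awk -v r=0.5404845229128048 'BEGIN { printf "%.1f%% saved\n", (1 - r) * 100 }'
# -> 46.0% saved
```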

admin
2,967 Posts
July 19, 2023, 9:25 am
Thanks for the feedback. My concern is the slowness as you progress; maybe it could also happen during dedup. My first guess is that it could be related to memory, as you do not have enough RAM: if you have 6 OSDs per node, you would need 24 GB just for the OSDs. You could monitor the RAM usage of the app with atop -m
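The 24 GB figure assumes Ceph's default osd_memory_target of about 4 GiB per BlueStore OSD (an assumption about this cluster's config; the actual value can be checked with `ceph config get osd osd_memory_target` on a live cluster). The arithmetic:

```shell
# Rough per-node RAM budget for the OSD daemons alone, assuming the
# default osd_memory_target of 4 GiB each (verify on your own cluster).
osds=6
target_gib=4
awk -v n="$osds" -v t="$target_gib" 'BEGIN { printf "%d GiB for OSDs\n", n * t }'
# -> 24 GiB for OSDs
```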

f.cuseo
84 Posts
July 19, 2023, 9:55 am
MEM | tot 31.3G | free 11.6G | cache 1.3G | buff 5.9G | slab 535.5M | shmem 1.9M | vmbal 0.0M | hptot 0.0M |
SWP | tot 0.0M | free 0.0M | | | | | vmcom 21.2G | vmlim 15.7G |
I'm starting the dedup estimate again with a single chunk size (the one I think is the best trade-off between savings and fragmentation).
PS: I was wrong; I have 32 GByte of RAM and 5 x 1 TByte OSDs in each host.
It is not clear to me whether the deduplication process is always a batch process; if it is, then of course with millions of files and a big cluster (I am starting with 12 hosts, each with 12 x 8 TByte OSDs, and rados-gw), it is not an option 🙁

admin
2,967 Posts
July 19, 2023, 11:21 am
Yes, it is not clear, and probably it is a batch process. I also think the slowdown could be memory related, since the tool needs to store the chunk info, like the hashes, which is probably kept in RAM.