
radosgw deduplication

Hello.

Are you planning to test/include the ceph-dedup-tool and deduplication for S3 storage?
Thank you, Fabrizio

 

It is not something we have planned for the near future. We have not tested it, and from the docs it is an experimental feature. If we get more requests, we can test how stable it is.

Thank you for your answer.
If you want, I can test the feature (which, I know, is not stable; ceph-dedup-tool is included in the ceph-test package) if you can add a petasan-test repository that includes it.

 

Thanks a lot for your offer. Give us till next week; we will upload this to our repository and hope to get your feedback!

wget https://www.petasan.org/misc/320/ceph-dedup-tool.gz
gunzip ceph-dedup-tool.gz
chmod +x ceph-dedup-tool
./ceph-dedup-tool
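
For anyone trying this: running the binary without arguments prints its usage. A minimal estimate invocation, as a sketch based on the upstream Ceph deduplication docs (the pool name default.rgw.buckets.data and the thread count below are placeholders, and the exact flags can vary between builds), would look roughly like:

# estimate dedup savings on a pool (sketch; adjust pool name and thread count)
./ceph-dedup-tool --op estimate --pool default.rgw.buckets.data --chunk-algorithm fastcdc --fingerprint-algorithm sha1 --max-thread 4

Adding --chunk-size (e.g. --chunk-size 65536) should restrict the estimate to a single target chunk size, if supported by the build.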

Thank you very much. I will let you know if it works 🙂

 

Hello.

After 3 days on an experimental cluster (3 x Dell 2950 with 2 sockets, 32 GB RAM, 6 x 1 TB drives, 2 x 1 Gbit Ethernet), on a pool with 2 TB used, I have this result. (PS: I stopped the dedup estimate because, after 1 day at a very good speed, the check became really slow, about 1 object every 3-4 seconds.)

If I am reading this the right way, I could save about 45% of space with a chunk size of 65536 (see the note after the output below).

I will try the actual dedup in the next few days, when my production cluster is ready and I can free this one.

{
    "chunk_algo": "fastcdc",
    "chunk_sizes": [
        {
            "target_chunk_size": 8192,
            "dedup_bytes_ratio": 0.5140197701981345,
            "dedup_objects_ratio": 94.522847748071996,
            "chunk_size_average": 20646,
            "chunk_size_stddev": 11486
        },
        {
            "target_chunk_size": 16384,
            "dedup_bytes_ratio": 0.51966609698550759,
            "dedup_objects_ratio": 47.995579658742663,
            "chunk_size_average": 40661,
            "chunk_size_stddev": 22515
        },
        {
            "target_chunk_size": 32768,
            "dedup_bytes_ratio": 0.52849744319814296,
            "dedup_objects_ratio": 24.625850435220162,
            "chunk_size_average": 79249,
            "chunk_size_stddev": 43759
        },
        {
            "target_chunk_size": 65536,
            "dedup_bytes_ratio": 0.5404845229128048,
            "dedup_objects_ratio": 12.85525391628258,
            "chunk_size_average": 151811,
            "chunk_size_stddev": 83733
        },
        {
            "target_chunk_size": 131072,
            "dedup_bytes_ratio": 0.56045861870226754,
            "dedup_objects_ratio": 6.9452157231363802,
            "chunk_size_average": 280996,
            "chunk_size_stddev": 155568
        },
        {
            "target_chunk_size": 262144,
            "dedup_bytes_ratio": 0.59483509039284288,
            "dedup_objects_ratio": 3.9720278009763188,
            "chunk_size_average": 491331,
            "chunk_size_stddev": 277709
        },
        {
            "target_chunk_size": 524288,
            "dedup_bytes_ratio": 0.65517814092896842,
            "dedup_objects_ratio": 2.471225849736352,
            "chunk_size_average": 789721,
            "chunk_size_stddev": 476233
        },
        {
            "target_chunk_size": 1048576,
            "dedup_bytes_ratio": 0.75295096283891727,
            "dedup_objects_ratio": 1.6984593450906258,
            "chunk_size_average": 1149030,
            "chunk_size_stddev": 807422
        },
        {
            "target_chunk_size": 2097152,
            "dedup_bytes_ratio": 0.84600300186423216,
            "dedup_objects_ratio": 1.2549401244684233,
            "chunk_size_average": 1555118,
            "chunk_size_stddev": 1302362
        },
        {
            "target_chunk_size": 4194304,
            "dedup_bytes_ratio": 0.87509230134712745,
            "dedup_objects_ratio": 1.0254982056559403,
            "chunk_size_average": 1903056,
            "chunk_size_stddev": 1668604
        }
    ],
    "summary": {
        "examined_objects": 286177,
        "examined_bytes": 558497639572
    }
}
151092s : read 558501833876 bytes so far...
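
A note on reading the output: dedup_bytes_ratio appears to be the estimated post-dedup size divided by the examined size, so the saving is 1 - dedup_bytes_ratio; for target_chunk_size 65536 that gives 1 - 0.54 ≈ 46%, consistent with the ~45% reading above. Assuming the JSON is saved to a file, e.g. estimate.json (hypothetical name), jq can list the estimated saving per chunk size:

# print estimated space saving (%) for each target chunk size (sketch)
jq '.chunk_sizes[] | {chunk_size: .target_chunk_size, saving_pct: ((1 - .dedup_bytes_ratio) * 100)}' estimate.json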

Thanks for the feedback. My concern is the slowness as it progresses; it could also happen during the actual dedup. My first guess is that it is memory related, as you do not have enough RAM: if you have 6 OSDs per node, you would need 24 GB just for the OSDs. You could monitor the RAM usage of the tool with atop -m
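
For reference, the 24 GB figure corresponds to the default BlueStore osd_memory_target of about 4 GiB per OSD. Assuming a recent Ceph release, the configured value can be checked along these lines (a sketch; osd.0 is just an example daemon, and the second command is run on the node hosting it):

# show the configured OSD memory target (values in bytes)
ceph config get osd osd_memory_target
ceph daemon osd.0 config show | grep osd_memory_target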

MEM | tot    31.3G | free   11.6G | cache   1.3G | buff    5.9G | slab  535.5M | shmem   1.9M | vmbal   0.0M | hptot   0.0M |
SWP | tot     0.0M | free    0.0M |              |              |              |              | vmcom  21.2G | vmlim  15.7G |

I'm starting the dedup estimate again with a single chunk size (the one I think is the best trade-off between savings and fragmentation).

PS: I was wrong, I have 32 GB of RAM and 5 x 1 TB OSDs per host.

It is not clear to me whether the deduplication process is always a batch process; if so, then with millions of objects and a big cluster (I am starting with 12 hosts, each with 12 x 8 TB OSDs, and radosgw), it is of course not an option 🙁

 

Yes, it is not clear, and probably it is a batch process. I also think the slowdown could be memory related, since you need to store the chunk info, like the hashes, which is probably done in RAM.