1. A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication
- Author
Gregoriadis, Marcel; Balduf, Leonhard; Scheuermann, Björn; Pouwelse, Johan
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
Data deduplication emerged as a powerful solution for reducing storage and bandwidth costs in cloud settings by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state-of-the-art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
- Comment
Submitted to IEEE Transactions on Cloud Computing for possible publication
- Published
2024