Cluster and Single-Node Analysis of Long-Term Deduplication Patterns

Authors :
Vasily Tarasov
Geoff Kuenning
Sonam Mandal
Zhen 'Jason' Sun
Philip Shilane
Erez Zadok
Nong Xiao
Source :
ACM Transactions on Storage. 14:1-27
Publication Year :
2018
Publisher :
Association for Computing Machinery (ACM), 2018.

Abstract

Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either were of a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system; 33 users and over 4,000 snapshots are covered. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node deduplication analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunking sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw interesting conclusions from both single-node and cluster deduplication analysis and make recommendations for future deduplication systems design.

Details

ISSN :
1553-3093 and 1553-3077
Volume :
14
Database :
OpenAIRE
Journal :
ACM Transactions on Storage
Accession number :
edsair.doi...........d19cce5ee06fa18143f283336cc58a3a
Full Text :
https://doi.org/10.1145/3183890