Cluster and Single-Node Analysis of Long-Term Deduplication Patterns
- Source :
- ACM Transactions on Storage. 14:1-27
- Publication Year :
- 2018
- Publisher :
- Association for Computing Machinery (ACM), 2018.
Abstract
- Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either were of a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this article, we first collected 21 months of data from a shared user file system; 33 users and over 4,000 snapshots are covered. We then analyzed the dataset, examining a variety of essential characteristics across two dimensions: single-node deduplication and cluster deduplication. For single-node deduplication analysis, our primary focus was individual-user data. Despite apparently similar roles and behavior among all of our users, we found significant differences in their deduplication ratios. Moreover, the data that some users share with others had a much higher deduplication ratio than average. For cluster deduplication analysis, we implemented seven published data-routing algorithms and created a detailed comparison of their performance with respect to deduplication ratio, load distribution, and communication overhead. We found that per-file routing achieves a higher deduplication ratio than routing by super-chunk (multiple consecutive chunks), but it also leads to high data skew (imbalance of space usage across nodes). We also found that large chunking sizes are better for cluster deduplication, as they significantly reduce data-routing overhead, while their negative impact on deduplication ratios is small and acceptable. We draw interesting conclusions from both single-node and cluster deduplication analysis and make recommendations for future deduplication systems design.
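The metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fixed-size chunking, SHA-1 fingerprints, hash-modulo placement, and a skew measure defined as the fullest node's usage divided by the mean, none of which the abstract specifies. All function names are hypothetical.

```python
import hashlib


def chunk(data: bytes, size: int) -> list:
    """Split data into fixed-size chunks (an assumed simplification)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedup_ratio(chunks: list) -> float:
    """Logical bytes divided by the bytes left after storing each distinct chunk once."""
    logical = sum(len(c) for c in chunks)
    unique = {hashlib.sha1(c).digest(): len(c) for c in chunks}
    stored = sum(unique.values())
    return logical / stored if stored else 1.0


def route_and_store(units: list, num_nodes: int) -> list:
    """Route each unit to one node and return the bytes stored per node.

    `units` is a list of (routing_key, chunks) pairs. A unit may stand for a
    whole file (per-file routing) or a run of consecutive chunks (super-chunk
    routing). Each node deduplicates only against its own chunks.
    """
    nodes = [dict() for _ in range(num_nodes)]
    for key, chunks in units:
        digest = hashlib.sha1(key).digest()
        node = int.from_bytes(digest[:8], "big") % num_nodes
        for c in chunks:
            nodes[node][hashlib.sha1(c).digest()] = len(c)
    return [sum(n.values()) for n in nodes]


def skew(per_node_bytes: list) -> float:
    """Assumed skew measure: fullest node's usage divided by the mean usage."""
    mean = sum(per_node_bytes) / len(per_node_bytes)
    return max(per_node_bytes) / mean if mean else 1.0


if __name__ == "__main__":
    data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
    chunks = chunk(data, 4096)
    print(dedup_ratio(chunks))  # 1.5: three logical chunks, two distinct
    print(route_and_store([(b"file-1", chunks)], 1))  # [8192]
```

Passing one unit per file models per-file routing, while passing several units per file (one per run of consecutive chunks) models super-chunk routing; comparing `skew` and the summed stored bytes across the two gives a rough feel for the trade-off the abstract reports.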
- Subjects :
- File system
Computer science
Skew
020206 networking & telecommunications
020207 software engineering
Load distribution
02 engineering and technology
computer.software_genre
Single node
Hardware and Architecture
Backup
Data_FILES
0202 electrical engineering, electronic engineering, information engineering
Data deduplication
Snapshot (computer storage)
Systems design
Data mining
computer
Details
- ISSN :
- 1553-3093 and 1553-3077
- Volume :
- 14
- Database :
- OpenAIRE
- Journal :
- ACM Transactions on Storage
- Accession number :
- edsair.doi...........d19cce5ee06fa18143f283336cc58a3a
- Full Text :
- https://doi.org/10.1145/3183890