201. Crocus: Enabling Computing Resource Orchestration for Inline Cluster-Wide Deduplication on Scalable Storage Systems
- Author
- Prince Hamandawana, Youngjae Kim, Chang-Gyu Lee, Sungyong Park, and Awais Khan
- Subjects
- Computer science, Distributed computing, Computer data storage, Data deduplication, Hash function, Scheduling (computing), Scalability, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Crocus
- Abstract
Inline deduplication dramatically improves storage space utilization. However, it degrades I/O throughput due to compute-intensive deduplication operations in the I/O path, such as chunking, fingerprinting (hashing of chunk content), and redundant lookup I/Os over the network. In particular, fingerprint generation is computationally expensive and contributes most to the degraded I/O throughput. In this article, we propose Crocus, a framework that enables compute resource orchestration to enhance cluster-wide deduplication performance. Crocus takes into account all available compute resources, both local and remote {CPU, GPU}, by managing decentralized compute pools. An opportunistic Load-Aware Fingerprint Scheduler (LAFS) distributes and offloads compute-intensive deduplication operations to these compute pools in a load-aware fashion. Crocus is highly generic and can be adopted in both inline and offline deduplication with different storage-tier configurations. We implemented Crocus in the Ceph scale-out storage system. Our extensive evaluation shows that Crocus reduces fingerprinting overhead by 86 percent with a 4 KB chunk size compared to Ceph with baseline deduplication, while maintaining high disk-space savings. Under different internal and external contention scenarios, the proposed LAFS scheduler also showed a 54 percent improvement over a fixed (static) scheduling approach.
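The core idea of load-aware fingerprint offloading can be sketched as follows. This is a minimal illustrative sketch, not the Crocus implementation: the `ComputePool` class, the load metric, and the `lafs_schedule` helper are all hypothetical names invented here, and fingerprinting is stood in for by SHA-256 content hashing.

```python
import hashlib

class ComputePool:
    """Hypothetical stand-in for a local or remote {CPU, GPU} compute pool."""

    def __init__(self, name, load):
        self.name = name
        self.load = load  # assumed load metric: fraction of busy cycles, 0.0-1.0

    def fingerprint(self, chunk):
        # Fingerprinting is content hashing of a chunk (SHA-256 here).
        return hashlib.sha256(chunk).hexdigest()

def lafs_schedule(pools, chunk):
    """Offload fingerprinting of one chunk to the least-loaded pool."""
    target = min(pools, key=lambda p: p.load)
    return target.name, target.fingerprint(chunk)

pools = [
    ComputePool("local-cpu", 0.8),
    ComputePool("local-gpu", 0.3),
    ComputePool("remote-cpu", 0.5),
]
chosen, fp = lafs_schedule(pools, b"example 4KB chunk contents")
```

A real scheduler would refresh load estimates continuously and account for offload (network) cost to remote pools; the sketch only shows the least-loaded selection step.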
- Published
- 2020