1. ALLI: A High-Performance Approach to Data Deduplication in Hadoop using Enhanced Hashing and Two-Level Indexing Techniques.
- Author
Zakzouk, Ammar, Oumran, Bassim, and Hasan, Hasan
- Abstract
Many systems, such as Hadoop, have been developed to handle big data effectively. However, these systems face challenges related to duplicate files, which consume additional resources for both storage and processing. Several approaches have been developed to eliminate duplicate files using hash algorithms. However, these algorithms have struggled to balance execution speed against collision probability. Furthermore, the methods employed for storing hash values lead to lengthy match times and an elevated risk of collisions. In this paper, we propose ALLI, an approach designed to reduce execution time and collision probability during both the hashing and matching stages. ALLI combines the Arithmetic Logic Hash Algorithm (ALHA), which generates 1024-bit hash values, with Two-Level Indexing in HBase (TLI-HBase), which stores those hash values efficiently. Experiments conducted on four different datasets demonstrate that ALLI outperforms existing file-level deduplication techniques, executing twice as fast as other approaches. Moreover, the results indicate that ALHA is 2 to 3 times faster than other hash algorithms while also further reducing collision probability. Additionally, TLI-HBase improves performance during the matching stage by significantly reducing the number of hash value comparisons compared to other storage methods. [ABSTRACT FROM AUTHOR]
- Published
2024