Back to Search
Start Over
DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark
- Source :
- The Journal of Supercomputing. 78:2616-2650
- Publication Year :
- 2021
- Publisher :
- Springer Science and Business Media LLC, 2021.
-
Abstract
- One of the main steps in the data lifecycle is to publish it for data analysts to discover hidden patterns. But, data publishing may lead to unwanted disclosure of personal information and cause privacy problems. Data anonymization techniques preserve privacy models to prevent the disclosure of individuals’ private information in published data. In this paper, a distributed in-memory method is proposed on the Apache Spark framework to preserve the l-diversity privacy model. This method anonymizes large-scale data in a three-phase process, which includes, seed selection, data clustering for $$\ell$$ -diversity, and finalizing phase. In this method, a hierarchical kmeans-based data clustering algorithm has been designed for data anonymization. One of the major challenges of anonymization methods is to establish a better trade-off between data utility and privacy. Therefore, for calculating the distance between records and forming more cohesive ldiverse-clusters, the authors have designed two Manhattan-based and Euclidean-based distance functions to satisfy the requirements of the l-diversity model. Given the 100-fold speed of the Spark compared to MapReduce, the proposed method is presented using in-memory RDD programming in Apache Spark, to address the runtime, scalability, and performance in large-scale data anonymization as it exists in the previous MapReduce-based algorithms. Our method provides general knowledge to use parallel in-memory computation of Spark in big data anonymization. In experiments, this method has obtained lower information loss and loses about 1% to 2% accuracy and FMeasure criteria; therefore, it establishes a better trade-off than the state-of-the-art MapReduce-based Mondrian methods
- Subjects :
- Data anonymization
Computer science
business.industry
Big data
k-means clustering
Data publishing
computer.software_genre
Theoretical Computer Science
Hardware and Architecture
Spark (mathematics)
Scalability
Data mining
Cluster analysis
business
computer
Personally identifiable information
Software
Information Systems
Subjects
Details
- ISSN :
- 15730484 and 09208542
- Volume :
- 78
- Database :
- OpenAIRE
- Journal :
- The Journal of Supercomputing
- Accession number :
- edsair.doi...........45711140349e4d4f5a60e0cf9eab57a4