DHkmeans-ℓdiversity: distributed hierarchical K-means for satisfaction of the ℓ-diversity privacy model using Apache Spark

Authors :: Amir Sheikhahmadi
Hana Khamfroush
Farough Ashkouti
Keyhan Khamforoosh
Source :: The Journal of Supercomputing. 78:2616-2650
Publication Year :: 2021
Publisher :: Springer Science and Business Media LLC, 2021.
Abstract: One of the main steps in the data lifecycle is publishing data so that analysts can discover hidden patterns. However, data publishing may lead to unwanted disclosure of personal information and cause privacy problems. Data anonymization techniques enforce privacy models to prevent the disclosure of individuals’ private information in published data. In this paper, a distributed in-memory method is proposed on the Apache Spark framework to satisfy the ℓ-diversity privacy model. This method anonymizes large-scale data in a three-phase process: seed selection, data clustering for ℓ-diversity, and a finalizing phase. A hierarchical k-means-based data clustering algorithm has been designed for the anonymization. One of the major challenges of anonymization methods is to establish a better trade-off between data utility and privacy. Therefore, to calculate the distance between records and form more cohesive ℓ-diverse clusters, the authors have designed two distance functions, one Manhattan-based and one Euclidean-based, that satisfy the requirements of the ℓ-diversity model. Given that Spark can be up to 100 times faster than MapReduce, the proposed method is implemented with in-memory RDD programming in Apache Spark to address the runtime, scalability, and performance problems of previous MapReduce-based algorithms in large-scale data anonymization. The method also provides general guidance on using Spark's parallel in-memory computation for big data anonymization. In experiments, the method obtains lower information loss and loses only about 1% to 2% on the accuracy and F-measure criteria; therefore, it establishes a better trade-off than the state-of-the-art MapReduce-based Mondrian methods.
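
Illustrative note: the abstract names Manhattan-based and Euclidean-based distance functions and ℓ-diverse clusters without giving their definitions. The following is a minimal single-machine sketch in plain Python, not the authors' Spark algorithm. It assumes the plain textbook distances, the "distinct" variant of ℓ-diversity (a cluster needs at least ℓ different sensitive values), and fixed seeds; the function names and sample records are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def manhattan(a, b):
    """Plain Manhattan (L1) distance between two numeric tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Plain Euclidean (L2) distance between two numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_seeds(records, seeds, distance):
    """Group (quasi_identifiers, sensitive_value) records by nearest seed index."""
    clusters = defaultdict(list)
    for qi, sensitive in records:
        nearest = min(range(len(seeds)), key=lambda i: distance(qi, seeds[i]))
        clusters[nearest].append((qi, sensitive))
    return dict(clusters)


def is_l_diverse(cluster, l):
    """Distinct l-diversity (assumed variant): at least l different sensitive values."""
    return len({sensitive for _, sensitive in cluster}) >= l


# Hypothetical records: (age, weight) quasi-identifiers plus a sensitive diagnosis.
records = [
    ((25.0, 50.0), "flu"),
    ((27.0, 52.0), "asthma"),
    ((26.0, 51.0), "flu"),
    ((60.0, 90.0), "diabetes"),
    ((62.0, 88.0), "diabetes"),
]
seeds = [(25.0, 50.0), (60.0, 90.0)]

clusters = assign_to_seeds(records, seeds, manhattan)
diverse = {i: is_l_diverse(c, 2) for i, c in clusters.items()}
```

With these sample records, the first cluster holds two distinct diagnoses and passes the check for ℓ = 2, while the second holds only one and fails. A cluster that fails would need further handling before publication; the abstract does not say how the method's finalizing phase deals with this.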

Subjects :: Data anonymization
Computer science
Big data
k-means clustering
Data publishing
Theoretical Computer Science
Hardware and Architecture
Spark (mathematics)
Scalability
Data mining
Cluster analysis
Personally identifiable information
Software
Information Systems

Details

ISSN :: 1573-0484 and 0920-8542
Volume :: 78
Database :: OpenAIRE
Journal :: The Journal of Supercomputing
Accession number :: edsair.doi...........45711140349e4d4f5a60e0cf9eab57a4
