Back to Search Start Over

Towards privacy preserving unstructured big data publishing.

Authors :
Mehta, Brijesh
Rao, Udai Pratap
Gupta, Ruchika
Conti, Mauro
Source :
Journal of Intelligent & Fuzzy Systems; 2019, Vol. 36 Issue 4, p3471-3482, 12p
Publication Year :
2019

Abstract

Various sources and sophisticated tools are used to gather and process the comparatively large volume of data or big data that sometimes leads to privacy disclosure (at broader or finer level) for the data owner. Privacy preserving data publishing approaches such as k-anonymity, l-diversity, and t-closeness are very well used to de-identify data, however, chances of re-identification of attributes always exist as data is collected from multiple sources such as public web, social media, Internet whereabouts, and sensors that are highly prone to data linkages. In literature, k-anonymity stands out amongst the most popular mainstream data anonymization approaches that can also be used for large sized data. However, applying k-anonymization for variety of data (especially unstructured data) is difficult in the traditional way, due to the fact that it requires the given data to be classified into the personal data, the quasi identifiers, and the sensitive data. We identify existing approaches from the literature of Natural Language Processing(NLP) to convert the unstructured data to structured form in order to apply k-anonymization over the generated structured records. We adopt a two phase Conditional Random Field (CRF) based Named Entity Recognition (NER) approach to represent unstructured data into the structured form. Further, we propose an Improved Scalable k-Anonymization (ImSKA) to anonymize the well represented unstructured data that achieves privacy preserving unstructured big data publishing. We compare both of the propose approaches namely NER and ImSKA with existing approaches and the results show that our proposed solutions outperform the existing approaches in terms of F1 score and Normalized Cardinality Penalty (NCP), respectively. Since, NER approaches are widely used for bio-medical datasets, we have also used a well-known Bio-NER dataset called GENIA corpus for measuring the performance. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
10641246
Volume :
36
Issue :
4
Database :
Complementary Index
Journal :
Journal of Intelligent & Fuzzy Systems
Publication Type :
Academic Journal
Accession number :
135863839
Full Text :
https://doi.org/10.3233/JIFS-181231