150 results for "de-identification"
Search Results
2. De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation.
- Author
- Cardinal RN, Moore A, Burchell M, and Lewis JR
- Subjects
- Humans, Male, Bayes Theorem, State Medicine, Software, Privacy, Medical Record Linkage
- Abstract
Background: Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier.
Methods: We develop a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies, via fuzzy representations and complete mismatches, plus de-identified deterministic matching if required. We validate the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy. We report demographic factors associated with correct linkage.
Results: The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations, such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband's presence in the sample database with an area under the receiver operating curve of 0.997-0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold θ and a leader advantage threshold δ. Defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931-0.994), and the misidentification rate was 0.00249 (range 0.00123-0.00429). Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and presence of a pseudopostcode (e.g. indicating homelessness). Accuracy rates would be improved further if person-unique identifiers were also used, as supported by the software. Our two largest databases were linked in 44 min via an interpreted programming language.
Conclusions: Fully de-identified matching with high accuracy is feasible without a person-unique identifier, and appropriate software is freely available. (An illustrative sketch of the two-threshold decision rule follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
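The abstract above converts match log odds to a decision via a consideration threshold θ and a leader advantage threshold δ. Below is a minimal sketch of such a two-threshold rule; the function name and the default values are illustrative, not those of the authors' software.

```python
import math

def decide_match(log_odds_by_candidate, theta=5.0, delta=2.0):
    """Link a proband to a sample-database candidate, or decline to link.

    log_odds_by_candidate: dict mapping candidate id -> total match log odds.
    theta: consideration threshold the leader must reach.
    delta: margin by which the leader must beat the runner-up.
    """
    if not log_odds_by_candidate:
        return None
    ranked = sorted(log_odds_by_candidate.items(), key=lambda kv: kv[1], reverse=True)
    leader, leader_odds = ranked[0]
    runner_up_odds = ranked[1][1] if len(ranked) > 1 else -math.inf
    if leader_odds >= theta and leader_odds - runner_up_odds >= delta:
        return leader          # declare the proband present in the sample
    return None                # prefer linkage failure over misidentification

print(decide_match({"p1": 12.3, "p2": 3.1}))   # -> p1 (clear leader)
print(decide_match({"p1": 12.3, "p2": 11.5}))  # -> None (advantage below delta)
```

Raising θ and δ trades more linkage failures for fewer misidentifications, which is how the paper's defaults penalize misidentification relative to linkage failure.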
3. A scalable software solution for anonymizing high-dimensional biomedical data.
- Author
- Meurers T, Bild R, Do KM, and Prasser F
- Subjects
- Algorithms, Humans, Information Dissemination, Software, Data Anonymization, Privacy
- Abstract
Background: Data anonymization is an important building block for ensuring privacy and fosters the reuse of data. However, transforming the data in a way that preserves the privacy of subjects while maintaining a high degree of data quality is challenging, and particularly difficult when processing complex datasets that contain a high number of attributes. In this article we present how we extended the open source software ARX to improve its support for high-dimensional biomedical datasets.
Findings: To improve ARX's capability to find optimal transformations when processing high-dimensional data, we implemented 2 novel search algorithms. The first is a greedy top-down approach oriented on the previously implemented bottom-up search. The second is based on a genetic algorithm. We evaluated the algorithms with different datasets, transformation methods, and privacy models. The novel algorithms mostly outperformed the previously implemented bottom-up search. In addition, we extended the GUI to provide a high degree of usability and performance when working with high-dimensional datasets.
Conclusion: With our additions we have significantly enhanced ARX's ability to handle high-dimensional data in terms of processing performance as well as usability, and thus can further facilitate data sharing. (A toy greedy search over a generalization lattice follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
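To make the search strategies concrete: anonymization tools such as ARX search a lattice of generalization levels for the quasi-identifying attributes. The sketch below is a toy greedy top-down search under a k-anonymity check; the hierarchies and data are invented for illustration, and this is not ARX's actual algorithm.

```python
from collections import Counter

# Hypothetical generalization hierarchies: level 0 is the original value,
# higher levels are coarser.
HIERARCHIES = {
    "zip": [lambda z: z, lambda z: z[:3] + "**", lambda z: "*****"],
    "age": [lambda a: str(a), lambda a: f"{(a // 10) * 10}s", lambda a: "*"],
}
QIDS = list(HIERARCHIES)

def generalize(records, levels):
    return [tuple(HIERARCHIES[q][levels[i]](r[q]) for i, q in enumerate(QIDS))
            for r in records]

def is_k_anonymous(records, levels, k):
    return all(c >= k for c in Counter(generalize(records, levels)).values())

def greedy_top_down(records, k):
    """Start fully generalized, then greedily lower one attribute's level
    at a time as long as k-anonymity still holds."""
    levels = [len(HIERARCHIES[q]) - 1 for q in QIDS]
    improved = True
    while improved:
        improved = False
        for i in range(len(levels)):
            if levels[i] > 0:
                trial = levels[:i] + [levels[i] - 1] + levels[i + 1:]
                if is_k_anonymous(records, trial, k):
                    levels, improved = trial, True
    return dict(zip(QIDS, levels))

data = [{"zip": "12345", "age": 34}, {"zip": "12346", "age": 37},
        {"zip": "12345", "age": 31}, {"zip": "12347", "age": 38}]
print(greedy_top_down(data, k=2))  # -> {'zip': 1, 'age': 1}, i.e. "123**", "30s"
```

Real tools additionally score each lattice node with a data-quality model and keep the best transformation found; the genetic-algorithm variant mentioned above explores the same lattice with crossover and mutation instead of greedy steps.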
4. Security and Privacy when Applying FAIR Principles to Genomic Information.
- Author
- Delgado J and Llorente S
- Subjects
- Computer Security, Genomics, Privacy
- Abstract
Making data Findable, Accessible, Interoperable and Reusable (FAIR) is a good approach when data need to be shared. However, security and privacy remain critical aspects. In the FAIRification process, there is a need both for de-identification of data and for license attribution. The paper analyses some of the issues related to this process when the objective is sharing genomic information. The main results are the identification of already existing standards that could be used for this purpose, and of how to combine them. Nevertheless, the area is evolving quickly, and more specific standards may yet be defined.
- Published
- 2020
- Full Text
- View/download PDF
5. An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures.
- Author
- Braghin S, Bettencourt-Silva JH, Levacher K, and Antonatos S
- Subjects
- Confidentiality, Medical Records Systems, Computerized, Data Anonymization, Privacy
- Abstract
The volume of unstructured health records has increased exponentially across healthcare settings. Similarly, the number of healthcare providers that wish to exchange records has also increased and, as a result, de-identification and the preservation of privacy features have become increasingly important and necessary. Governance guidelines now require sensitive information to be masked or removed, yet this remains a difficult and often ad hoc task, particularly when dealing with unstructured text. Annotators are typically used to identify such sensitive information, but each may be effective only on certain text fragments. There is at present no hybrid, sustainable framework that aggregates different annotators. This paper proposes a novel framework that leverages a combination of state-of-the-art annotators in order to maximize the effectiveness of the de-identification of health information.
- Published
- 2019
- Full Text
- View/download PDF
6. Managing Privacy and Data Sharing Through the Use of Health Care Information Fiduciaries.
- Author
- Demuro PR and Petersen C
- Subjects
- Confidentiality, Data Anonymization, Humans, Information Dissemination, Privacy
- Abstract
Policy and regulation seldom keep up with advances in technology. Although data de-identification is seen as a key to protecting one's data, re-identification is often possible. Whether one's data is to be used for care, research, or commercial purposes, individuals are concerned about the use of their information. The authors propose the concept of an information fiduciary for holders of data, describe how it might be applied in a health care context, and outline considerations to determine whether a holder of health care-related information should be regarded as an information fiduciary.
- Published
- 2019
- Full Text
- View/download PDF
7. Identification and classification of DICOM files with burned-in text content.
- Author
- Vcelak P, Kryl M, Kratochvil M, and Kleckova J
- Subjects
- Algorithms, Confidentiality, Datasets as Topic, Electronic Health Records, Health Insurance Portability and Accountability Act, Humans, United States, Computer Security, Privacy
- Abstract
Background: Protected health information burned into pixel data is, for various reasons, not indicated in DICOM, which complicates the secondary use of such data. In recent years, there have been several attempts to anonymize or de-identify DICOM files. Existing approaches have different constraints, and no completely reliable solution exists. Especially for large datasets, it is necessary to quickly analyse and identify files potentially violating privacy.
Methods: Classification is based on an adaptive-iterative algorithm designed to identify one of three classes. Several image transformations, optical character recognition, and filters are applied before a local decision is made; a confirmed local decision is the final one. The classifier was trained on a dataset composed of 15,334 images of various modalities.
Results: The false positive rates are in all cases below 4.00%, and 1.81% in the mission-critical problem of detecting protected health information. The classifier's weighted average recall was 94.85%, the weighted average inverse recall was 97.42%, and Cohen's kappa coefficient was 0.920.
Conclusion: The proposed novel approach for classification of burned-in text is highly configurable and able to analyse images from different modalities with a noisy background. The solution was validated and is intended to identify DICOM files that need to have restricted access or be thoroughly de-identified due to privacy issues. Unlike with existing tools, the recognised text, including its coordinates, can be further used for de-identification. (A much-simplified OCR flagging sketch follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
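A heavily simplified version of the flagging idea above (run OCR over the pixel data and flag files whose pixels appear to contain text) can be written with pydicom and pytesseract. The threshold and normalization are illustrative, the code assumes a single-frame grayscale image, and it is nothing like the paper's adaptive-iterative classifier:

```python
import numpy as np
import pydicom
import pytesseract
from PIL import Image

def has_burned_in_text(path, min_chars=4):
    """Flag a DICOM file whose pixel data appears to contain burned-in text."""
    ds = pydicom.dcmread(path)
    pixels = ds.pixel_array.astype(np.float32)
    # Normalize to 8-bit grayscale so the OCR engine can read it.
    pixels = (255 * (pixels - pixels.min()) / max(np.ptp(pixels), 1)).astype(np.uint8)
    text = pytesseract.image_to_string(Image.fromarray(pixels))
    return len("".join(text.split())) >= min_chars

# Flagged files would then be routed to restricted access or to thorough
# de-identification, as the article proposes.
```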
8. Protecting patient privacy when sharing patient-level data from clinical trials.
- Author
- Tucker K, Branson J, Dilleen M, Hollis S, Loughlin P, Nixon MJ, and Williams Z
- Subjects
- Confidentiality, Drug Industry, Humans, Clinical Trials as Topic legislation & jurisprudence, Information Dissemination legislation & jurisprudence, Privacy legislation & jurisprudence
- Abstract
Background: Greater transparency and, in particular, sharing of patient-level data for further scientific research is an increasingly important topic for the pharmaceutical industry and other organisations who sponsor and conduct clinical trials, as well as generally in the interests of patients participating in studies. A concern remains, however, over how to appropriately prepare and share clinical trial data with third-party researchers whilst maintaining patient confidentiality. Clinical trial datasets contain very detailed information on each participant. Risk to patient privacy can be mitigated by data reduction techniques. However, retention of data utility is important in order to allow meaningful scientific research. In addition, for clinical trial data, an excessive application of such techniques may pose a public health risk if misleading results are produced. After considering existing guidance, this article makes recommendations with the aim of promoting an approach that balances data utility and privacy risk and is applicable across clinical trial data holders.
Discussion: Our key recommendations are as follows.
1. Data anonymisation/de-identification: data holders are responsible for generating de-identified datasets which are intended to offer increased protection for patient privacy through masking or generalisation of direct and some indirect identifiers (a toy example follows this entry).
2. Controlled access to data, including use of a data sharing agreement: a legally binding data sharing agreement should be in place, including agreements not to download or further share data and not to attempt to identify patients. Appropriate levels of security should be used for transferring data or providing access; one solution is use of a secure 'locked box' system which provides additional safeguards.
This article provides recommendations on best practices to de-identify/anonymise clinical trial data for sharing with third-party researchers, as well as on controlled access to data and data sharing agreements. The recommendations are applicable to all clinical trial data holders. Further work will be needed to identify and evaluate competing possibilities as regulations, attitudes to risk and technologies evolve.
- Published
- 2016
- Full Text
- View/download PDF
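Masking and generalisation of the kind recommendation 1 describes can be pictured with a few field-level rules. The record layout and rules below are invented for illustration, not taken from the article:

```python
from datetime import date

def deidentify(record):
    """Mask direct identifiers and generalise indirect ones (illustrative rules)."""
    return {
        # direct identifiers such as the name are simply omitted
        "birth_year": record["dob"].year,                     # full DOB -> year
        "age_band": f"{(record['age'] // 5) * 5}-{(record['age'] // 5) * 5 + 4}",
        "postcode_district": record["postcode"].split()[0],   # keep outward code only
    }

print(deidentify({"name": "Jane Doe", "dob": date(1980, 5, 17),
                  "age": 45, "postcode": "CB2 0QQ"}))
# -> {'birth_year': 1980, 'age_band': '45-49', 'postcode_district': 'CB2'}
```

In practice such rules are applied together with the controlled-access safeguards of recommendation 2, since de-identification alone rarely reduces re-identification risk to zero.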
9. Selling health data: de-identification, privacy, and speech.
- Author
- Kaplan B
- Subjects
- Humans, Personally Identifiable Information ethics, United Kingdom, United States, Commerce ethics, Confidentiality ethics, Data Mining ethics, Drug Industry ethics, Drug Prescriptions, Electronic Health Records ethics, Ownership ethics, Privacy
- Abstract
Two court cases that involve selling prescription data for pharmaceutical marketing affect biomedical informatics, patient and clinician privacy, and regulation. Sorrell v. IMS Health Inc. et al. in the United States and R v. Department of Health, Ex Parte Source Informatics Ltd. in the United Kingdom concern privacy and health data protection, data de-identification and reidentification, drug detailing (marketing), commercial benefit from the required disclosure of personal information, clinician privacy and the duty of confidentiality, beneficial and unsavory uses of health data, regulating health technologies, and considering data as speech. Individuals should, at the very least, be aware of how data about them are collected and used. Taking account of how those data are used is needed so societal norms and law evolve ethically as new technologies affect health data privacy and protection.
- Published
- 2015
- Full Text
- View/download PDF
10. Large Language Models for Electronic Health Record De-Identification in English and German.
- Author
- Sousa, Samuel, Jantscher, Michael, Kröll, Mark, and Kern, Roman
- Subjects
- Language models, Generative artificial intelligence, Electronic health records, German language, English language
- Abstract
Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating the patient's privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing data generation in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially data transmission to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare de-identification performance for in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning with a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs. (A sketch of one-shot prompt construction follows this entry.)
- Published
- 2025
- Full Text
- View/download PDF
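One-shot in-context learning here means prepending a single worked example to the instruction before the note to be de-identified. The prompt template, placeholder tags, and example note below are hypothetical; the study's actual prompts and models may differ.

```python
# One-shot prompt construction for LLM-based EHR de-identification.
ONE_SHOT_EXAMPLE = (
    "Note: Mr. John Smith (MRN 483920) was seen on 03/14/2024 at Mercy Clinic.\n"
    "De-identified: Mr. [NAME] (MRN [ID]) was seen on [DATE] at [HOSPITAL].\n"
)

def build_prompt(note: str) -> str:
    return (
        "Replace all protected health information in the note with placeholder "
        "tags such as [NAME], [ID], [DATE], [HOSPITAL].\n\n"
        + ONE_SHOT_EXAMPLE
        + f"\nNote: {note}\nDe-identified:"
    )

print(build_prompt("Frau Erika Mustermann wurde am 01.02.2023 aufgenommen."))
# The string would be sent to an LLM; as the abstract notes, transmitting such
# prompts to a cloud-hosted model is itself a privacy concern for EHRs.
```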
11. A survey on UK researchers' views regarding their experiences with the de-identification, anonymisation, release methods and re-identification risk estimation for clinical trial datasets.
- Author
- Rodriguez, Aryelly, Lewis, Steff C, Eldridge, Sandra, Jackson, Tracy, and Weir, Christopher J
- Subjects
- Work, Risk assessment, Cross-sectional method, Documentation, Database management, Research funding, Clinical trials, Privacy, Descriptive statistics, Research, Research methodology, Experiential learning, Medical ethics
- Abstract
Background: There are increasing pressures for anonymised datasets from clinical trials to be shared across the scientific community. However, there is no standardised set of recommendations on how to anonymise and prepare clinical trial datasets for sharing, while an ever-increasing number of anonymised datasets are becoming available for secondary research. Our aim was to explore the current views and experiences of researchers in the United Kingdom about de-identification, anonymisation, release methods and re-identification risk estimation for clinical trial datasets.
Methods: We used an online exploratory cross-sectional descriptive survey that consisted of both open-ended and closed questions.
Results: We received 38 responses to the invitation between June 2022 and October 2022. Of the respondents, 35 (92%) used internal documentation and published guidance to de-identify/anonymise clinical trial datasets. De-identification, followed by anonymisation and then fulfilling data holders' requirements before access was granted (controlled access), was the most common process for releasing the datasets, as reported by 18 (47%) participants. Eleven participants (29%) had previous knowledge of re-identification risk estimation, but they did not use any of the methodologies. Experiences in the process of de-identifying/anonymising the datasets and maintaining such datasets were mostly negative, and the main reported issues were lack of resources, guidance, and training.
Conclusion: The majority of responders reported using documented processes for de-identification and anonymisation. However, our survey results clearly indicate that there are still gaps in the areas of guidance, resources and training to fulfil sharing requests for de-identified/anonymised datasets, and that re-identification risk estimation is an underdeveloped area.
- Published
- 2025
- Full Text
- View/download PDF
12. Identifying protected health information by transformers-based deep learning approach in Chinese medical text.
- Author
- Xu, Kun, Song, Yang, and Ma, Jingdong
- Abstract
Purpose: In the context of Chinese clinical texts, this paper aims to propose a deep learning algorithm based on Bidirectional Encoder Representations from Transformers (BERT) to identify privacy information and to verify the feasibility of our method for privacy protection in the Chinese clinical context.
Methods: We collected and double-annotated 33,017 discharge summaries from 151 medical institutions on a municipal regional health information platform, developed a BERT-based Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Field (CRF) model, and tested the performance of privacy identification on the dataset. To explore the performance of different substructures of the neural network, we created five additional baseline models and evaluated the impact of different models on performance.
Results: Based on the annotated data, the BERT model pre-trained on the medical corpus brought a significant performance improvement to the BiLSTM-CRF model, with a micro-recall of 0.979 and an F1 value of 0.976, which indicates that the model has promising performance in identifying private information in Chinese clinical texts.
Conclusions: The BERT-based BiLSTM-CRF model excels in identifying privacy information in Chinese clinical texts, and the application of this model is very effective in protecting patient privacy and facilitating data sharing. (A skeleton of the architecture follows this entry.)
- Published
- 2025
- Full Text
- View/download PDF
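For readers unfamiliar with the architecture, a BERT-BiLSTM-CRF tagger stacks a bidirectional LSTM and a CRF decoding layer on top of BERT's token representations. The skeleton below assumes PyTorch, HuggingFace transformers, and the pytorch-crf package; the model name, tag count, and sizes are placeholders rather than the paper's configuration.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class BertBiLSTMCRF(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", num_tags=9, lstm_hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)         # sequence-level decoding

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emit(self.lstm(hidden)[0])
        mask = attention_mask.bool()
        if tags is not None:
            return -self.crf(emissions, tags, mask=mask)   # training loss (NLL)
        return self.crf.decode(emissions, mask=mask)       # best BIO tag paths
```

Tokens tagged as PHI categories (names, dates, institutions, and so on) are then masked or replaced downstream.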
13. PyFaceWipe: a new defacing tool for almost any MRI contrast.
- Author
- Mitew, Stanislaw, Yeow, Ling Yun, Ho, Chi Long, Bhanu, Prakash K. N., and Nickalls, Oliver James
- Subjects
- Older people, Young adults, Brain imaging, Brain research, Volume (cubic content)
- Abstract
Rationale and objectives: Defacing research MRI brain scans is often a mandatory step. With current defacing software, there are issues with Windows compatibility and researcher doubt regarding the adequacy of preservation of brain voxels in non-T1w scans. To address this, we developed PyFaceWipe, a multiplatform software for multiple MRI contrasts, which was evaluated based on its anonymisation ability and effect on downstream processing.
Materials and methods: Multiple MRI brain scan contrasts from the OASIS-3 dataset were defaced with PyFaceWipe and PyDeface and manually assessed for brain voxel preservation, remnant facial features and effect on automated face detection. Original and PyFaceWipe-defaced data from locally acquired T1w structural scans underwent volumetry with FastSurfer and brain atlas generation with ANTS.
Results: 214 MRI scans of several contrasts from OASIS-3 were successfully processed with both PyFaceWipe and PyDeface. PyFaceWipe maintained complete brain voxel preservation in all tested contrasts except ASL (45%) and DWI (90%), and PyDeface in all tested contrasts except ASL (95%), BOLD (25%), DWI (40%) and T2* (25%). Manual review of PyFaceWipe showed no failures of facial feature removal. Pinna removal was less successful (6% of T1 scans showed residual complete pinna). PyDeface achieved a 5.1% failure rate. Automated detection found no faces in PyFaceWipe-defaced scans and 19 faces in PyDeface-defaced scans, compared with 78 in the 224 original scans. Brain atlas generation showed no significant difference between atlases created from original and defaced data in both young adulthood and late elderly cohorts. Structural volumetry dice scores were ≥ 0.98 for all structures except grey matter, which had 0.93. PyFaceWipe output was identical across the tested operating systems.
Conclusion: PyFaceWipe is a promising multiplatform defacing tool, demonstrating excellent brain voxel preservation and competitive defacing in multiple MRI contrasts, performing favourably against PyDeface. ASL, BOLD, DWI and T2* scans did not produce recognisable 3D renders and hence should not require defacing. Structural volumetry dice scores (≥ 0.98) were higher than previously published FreeSurfer results, except for grey matter, which was comparable. The effect is measurable and care should be exercised during studies. ANTS atlas creation showed no significant effect from PyFaceWipe defacing.
- Published
- 2024
- Full Text
- View/download PDF
14. Is the De-identification of Health Data from Wearable Devices Secure Enough?
- Author
- DURMUŞ, Veli
- Subjects
- Data privacy, Wearable technology, Public health research, Personally identifiable information, Information sharing
- Published
- 2024
- Full Text
- View/download PDF
15. Face image de-identification based on feature embedding.
- Author
- Hanawa, Goki, Ito, Koichi, and Aoki, Takafumi
- Subjects
- Face perception, Social networks, Social media, Privacy, Biometry, Human facial recognition software
- Abstract
A large number of images are available on the Internet with the growth of social networking services, and many of them are face photos or contain faces. Protecting the privacy of face images is necessary to prevent their malicious use; face image de-identification techniques make face recognition difficult and thus prevent the collection of specific face images through face recognition. In this paper, we propose a face image de-identification method that generates a de-identified image from an input face image by embedding facial features extracted from another person's image into the input face image. We develop a novel framework for embedding facial features into a face image, together with loss functions based on images and features, to de-identify a face image while preserving its appearance. Through a set of experiments using public face image datasets, we demonstrate that the proposed method exhibits higher de-identification performance against unknown face recognition models than conventional methods while preserving the appearance of the input face images.
- Published
- 2024
- Full Text
- View/download PDF
16. End-to-end pseudonymization of fine-tuned clinical BERT models: Privacy preservation with maintained data utility.
- Author
- Vakili, Thomas, Henriksson, Aron, and Dalianis, Hercules
- Subjects
- Language models, Data privacy, Privacy, Natural language processing
- Abstract
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large numbers of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also show no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.
- Published
- 2024
- Full Text
- View/download PDF
17. Fast refacing of MR images with a generative neural network lowers re‐identification risk and preserves volumetric consistency.
- Author
- Molchanova, Nataliia, Maréchal, Bénédicte, Thiran, Jean-Philippe, Kober, Tobias, Huelnhagen, Till, and Richiardi, Jonas
- Subjects
- Generative adversarial networks, Magnetic resonance imaging, Computational complexity
- Abstract
With the rise of open data, identifiability of individuals based on 3D renderings obtained from routine structural magnetic resonance imaging (MRI) scans of the head has become a growing privacy concern. To protect subject privacy, several algorithms have been developed to de-identify imaging data using blurring, defacing or refacing. Completely removing facial structures provides the best re-identification protection but can significantly impact post-processing steps, like brain morphometry. As an alternative, refacing methods that replace individual facial structures with generic templates have a lower effect on the geometry and intensity distribution of original scans and are able to provide more consistent post-processing results, at the price of higher re-identification risk and computational complexity. In the current study, we propose a novel method for anonymized face generation for defaced 3D T1-weighted scans based on a 3D conditional generative adversarial network. To evaluate the performance of the proposed de-identification tool, a comparative study was conducted between several existing defacing and refacing tools, with two different segmentation algorithms (FAST and Morphobox). The aim was to evaluate (i) impact on brain morphometry reproducibility, (ii) re-identification risk, (iii) the balance between (i) and (ii), and (iv) processing time. The proposed method takes 9 s for face generation and is suitable for recovering consistent post-processing results after defacing.
- Published
- 2024
- Full Text
- View/download PDF
18. Anonymization and Pseudonymization of FHIR Resources for Secondary Use of Healthcare Data
- Author
- Emanuele Raso, Pierpaolo Loreti, Michele Ravaziol, and Lorenzo Bracciale
- Subjects
- Anonymisation, de-identification, FHIR, healthcare, pseudonymisation, privacy, Electrical engineering. Electronics. Nuclear engineering
- Abstract
Along with the creation of medical profiles of patients, Electronic Health Records have several secondary missions, such as health economics and research. The recent, increasing adoption of a common standard, the Fast Healthcare Interoperability Resources (FHIR) specification, makes it easier to exchange medical data among the several parties involved, for example, in an epidemiological research activity. However, this exchange process is hindered by regulatory frameworks due to privacy issues related to the presence of personal information, which allows patients to be identified directly (or indirectly) from their medical data. When properly used, de-identification techniques can provide crucial support in overcoming these problems. FHIR-DIET aims to bring flexibility and concreteness to the implementation of de-identification of health data, supporting many customised data-processing behaviours that can be easily configured and tailored to match specific use case requirements. Our solution enables faster and easier cooperation between legal and IT professionals to establish and implement de-identification rules. The performance evaluation demonstrates the viability of processing hundreds of FHIR patient information records per second using standard hardware. We believe FHIR-DIET can be a valuable tool to satisfy current regulatory requirements and help create added value for the secondary use of healthcare data. (A toy configuration-driven example follows this entry.)
- Published
- 2024
- Full Text
- View/download PDF
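Configurable, rule-driven de-identification of FHIR resources can be pictured as a mapping from field names to actions. The rule set, action names, and helper below are invented for illustration; they are not FHIR-DIET's actual configuration format.

```python
import copy
import json

RULES = {
    "name": "remove",           # direct identifier: drop outright
    "telecom": "remove",
    "birthDate": "year-only",   # generalise the date to a year
    "identifier": "pseudonymize",
}

def deidentify_patient(resource, pseudonym):
    out = copy.deepcopy(resource)
    for field, action in RULES.items():
        if field not in out:
            continue
        if action == "remove":
            del out[field]
        elif action == "year-only":
            out[field] = out[field][:4]
        elif action == "pseudonymize":
            out[field] = [{"system": "urn:example:pseudo", "value": pseudonym}]
    return out

patient = {"resourceType": "Patient",
           "name": [{"family": "Doe", "given": ["Jane"]}],
           "telecom": [{"value": "+39 06 000000"}],
           "birthDate": "1980-05-17",
           "identifier": [{"value": "SSN-123"}]}
print(json.dumps(deidentify_patient(patient, "PSN-0001"), indent=2))
```

Keeping the rules in configuration rather than code is what lets legal and IT professionals agree on de-identification behaviour without modifying the software.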
19. I Got Your Emotion: Emotion Preserving Face De-identification Using Injection-Based Generative Adversarial Networks
- Author
- Shopon, Md and Gavrilova, Marina L.
- Published
- 2023
- Full Text
- View/download PDF
20. ARTPHIL: Reversible De-identification of Free Text Using an Integrated Model
- Author
- Alabdullah, Bayan, Beloff, Natalia, and White, Martin
- Published
- 2022
- Full Text
- View/download PDF
21. Biometric System De-identification: Concepts, Applications, and Open Problems
- Author
- Shopon, Md., Hossain Bari, A. S. M., Bhatia, Yajurv, Narayanaswamy, Pavan Karkekoppa, Tumpa, Sanjida Nasreen, Sieu, Brandon, and Gavrilova, Marina
- Published
- 2022
- Full Text
- View/download PDF
22. Protection of gait data set for preserving its privacy in deep learning pipeline
- Author
- Anubha Parashar and Rajveer Singh Shekhawat
- Subjects
- de‐identification, gait anonymization, gait biometric, privacy, reversible deep learning pipeline, Electronic computers. Computer science
- Abstract
Human gait is a biometric that is used in security systems because it is unique for each individual and helps recognise a person from a distance without any intervention. To develop such a system, one needs a comprehensive data set specific to the application. If this data set somehow falls into the hands of rogue elements, they can easily access the secured system developed on the basis of the data set. Thus, the protection of the gait data set becomes essential. It has been learnt that systems using deep learning are easily prone to hacking. Hence, maintaining the privacy of gait data sets in the deep learning pipeline becomes more difficult due to adversarial attacks or unauthorised access to the data set. One of the popular techniques for stopping access to the data set is anonymisation. A reversible gait anonymisation pipeline is proposed that modifies gait geometry by morphing the images, that is, through texture modifications. Such modified data prevent hackers from making use of the data set for adversarial attacks. Nine layers were proposed to effect geometrical modifications, and a fixed gait texture template is used for morphing. Both of these modify the gait data set so that an authentic person cannot be identified, while maintaining the naturalness of the gait. The proposed method is evaluated using the similarity index as well as the recognition rate. The impact of various geometrical and texture modifications on silhouettes has been investigated to identify the modifications. Crowdsourcing and machine learning experiments were performed on the silhouettes for this purpose. The results of both types of experiments showed that texture modification has a stronger impact on the level of privacy protection than geometry shape modifications. In these experiments, the similarity index achieved is above 99%. These findings open new research directions regarding adversarial attacks and privacy protection related to gait recognition data sets.
- Published
- 2022
- Full Text
- View/download PDF
23. Data De-identification Framework.
- Author
- Junhyoung Oh and Kyungho Lee
- Subjects
- Big data, Risk perception, Situational awareness, Core competencies, Customer services, International organization
- Abstract
As technology develops, the amount of information being used has increased greatly. Companies analyze big data to provide customized services to their customers. Accordingly, collecting and analyzing data about data subjects has become one of companies' core competencies. However, when collecting and using such data, the rights of the data subject may be violated. Data often identify their subject by themselves, and even data that do not infringe an individual's rights on their own can, the moment they are connected, become important and sensitive personal information that we had never thought of. Therefore, recent privacy regulations such as the GDPR (General Data Protection Regulation) are changing to guarantee more rights to data subjects. To use data effectively without infringing on the rights of the data subject, the concept of de-identification was created. Researchers and companies can make personal information less identifiable through appropriate de-identification/pseudonymization and use the data for statistical research. De-identification/pseudonymization techniques have been studied extensively, but it is difficult for companies and researchers to know how to de-identify/pseudonymize data, and to clearly understand how, and to what extent, each organization should take de-identification measures. Currently, organizations do not systematically analyze their situation but only take minimal action while following the guidelines distributed by each country. We approached this problem from the perspective of risk management. Several steps are required to secure a dataset, from pre-processing to release: one can analyze the dataset, analyze the risk, evaluate the risk, and treat the risk appropriately. The outcomes of each step can then be used to take appropriate action on the dataset to eliminate or reduce its risk, after which the dataset can be released for its intended purpose. This series of processes was reconstructed to fit the current situation by analyzing various standards, such as ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) 20889, NIST IR (National Institute of Standards and Technology Interagency Reports) 8053, NIST SP (National Institute of Standards and Technology Special Publications) 800-188, and ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) X.1148. We propose an integrated framework based on a situational awareness model and a risk management model. We found that this framework can be specialized for multiple domains, and it is useful because it is grounded in a variety of cases and utility-based ROI calculations. (A toy risk-analysis sketch follows this entry.)
- Published
- 2023
- Full Text
- View/download PDF
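The "analyze the risk" step of such a framework is often made concrete as a prosecutor-model re-identification risk: each record's risk is the reciprocal of the size of its quasi-identifier equivalence class. The fields and data below are illustrative, not part of the proposed framework.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Per-record risk = 1 / size of the record's quasi-identifier class."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    risks = [1 / classes[tuple(r[q] for q in quasi_identifiers)] for r in records]
    return {"max_risk": max(risks), "avg_risk": sum(risks) / len(risks)}

data = [
    {"age_band": "40-49", "sex": "F", "zip3": "123"},
    {"age_band": "40-49", "sex": "F", "zip3": "123"},
    {"age_band": "30-39", "sex": "M", "zip3": "456"},
]
print(reidentification_risk(data, ["age_band", "sex", "zip3"]))
# -> max_risk 1.0 (one record is unique), avg_risk ~0.67
```

Risk evaluation then compares these numbers against a threshold chosen for the release context, and risk treatment (generalization, suppression, controlled access) is applied until the threshold is met.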
24. Pseudonymization and Anonymization of Radiology Data
- Author
- van Ooijen, Peter M. A. and Aryanto, Kadek Yota Ernanda
- Published
- 2021
- Full Text
- View/download PDF
25. Data Privacy and Security
- Author
- Fraser, Ross
- Published
- 2021
- Full Text
- View/download PDF
26. Voice Privacy with Smart Digital Assistants in Educational Settings
- Author
- Niknazar, Mohammad, Vempaty, Aditya, and Kokku, Ravi
- Published
- 2021
- Full Text
- View/download PDF
27. Impact of defacing on automated brain atrophy estimation
- Author
- Christian Rubbert, Luisa Wolf, Bernd Turowski, Dennis M. Hedderich, Christian Gaser, Robert Dahnke, Julian Caspers, and for the Alzheimer’s Disease Neuroimaging Initiative
- Subjects
- Magnetic resonance imaging, Brain, Atrophy, De-identification, Privacy, Medical physics. Medical radiology. Nuclear medicine
- Abstract
Background: Defacing has become mandatory for anonymization of brain MRI scans; however, concerns regarding data integrity were raised. Thus, we systematically evaluated the effect of different defacing procedures on automated brain atrophy estimation.
Methods: In total, 268 Alzheimer’s disease patients were included from ADNI, comprising unaccelerated (n = 154), within-session unaccelerated repeat (n = 67) and accelerated 3D T1 imaging (n = 114). Atrophy maps were computed using the open-source software veganbagel for every original, unmodified scan and after defacing using afni_refacer, fsl_deface, mri_deface, mri_reface, PyDeface or spm_deface, and the root-mean-square error (RMSE) between z-scores was calculated. RMSE values derived from unaccelerated and unaccelerated repeat imaging served as a benchmark. Outliers were defined as RMSE > 75th percentile and by using Grubbs’s test.
Results: Benchmark RMSE was 0.28 ± 0.1 (range 0.12–0.58, 75th percentile 0.33). Outliers were found for unaccelerated and accelerated T1 imaging using the 75th percentile cutoff: afni_refacer (unaccelerated: 18, accelerated: 16), fsl_deface (unaccelerated: 4, accelerated: 18), mri_deface (unaccelerated: 0, accelerated: 15), mri_reface (unaccelerated: 0, accelerated: 2) and spm_deface (unaccelerated: 0, accelerated: 7). PyDeface performed best with no outliers (unaccelerated mean RMSE 0.08 ± 0.05, accelerated mean RMSE 0.07 ± 0.05). The following outliers were found according to Grubbs’s test: afni_refacer (unaccelerated: 16, accelerated: 13), fsl_deface (unaccelerated: 10, accelerated: 21), mri_deface (unaccelerated: 7, accelerated: 20), mri_reface (unaccelerated: 7, accelerated: 6), PyDeface (unaccelerated: 5, accelerated: 8) and spm_deface (unaccelerated: 10, accelerated: 12).
Conclusion: Most defacing approaches have an impact on atrophy estimation, especially in accelerated 3D T1 imaging. Only PyDeface showed good results with negligible impact on atrophy estimation. (A minimal RMSE computation follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
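The comparison metric above, the RMSE between z-score atrophy maps from original and defaced scans, is straightforward to compute. The arrays below are random stand-ins for real atrophy maps:

```python
import numpy as np

def rmse(z_original, z_defaced):
    diff = np.asarray(z_original) - np.asarray(z_defaced)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
z_orig = rng.normal(size=(64, 64, 64))                       # placeholder z-map
z_def = z_orig + rng.normal(scale=0.05, size=z_orig.shape)   # small perturbation
print(rmse(z_orig, z_def))  # ~0.05: defacing barely changed the atrophy map
```

Values well above the benchmark RMSE from repeat imaging would indicate that defacing, rather than measurement noise, altered the atrophy estimate.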
28. Ensemble Approaches to Recognize Protected Health Information in Radiology Reports.
- Author
- Horng, Hannah, Steinkamp, Jackson, Kahn Jr., Charles E., and Cook, Tessa S.
- Subjects
- Privacy, Decision trees, Consensus (social sciences), Evaluation of medical care, Report writing, Radiography, Machine learning, Medical ethics, Health, Information resources, Descriptive statistics, Prediction models, Patient care, Algorithms
- Abstract
Natural language processing (NLP) techniques for electronic health records have shown great potential to improve the quality of medical care. The text of radiology reports frequently constitutes a large fraction of EHR data and can provide valuable information about patients' diagnoses, medical history, and imaging findings. The lack of a major public repository for radiology reports severely limits the development, testing, and application of new NLP tools. De-identification of protected health information (PHI) presents a major challenge to building such repositories, as many automated tools for de-identification were trained or designed for clinical notes and do not perform sufficiently well to build a public database of radiology reports. We developed and evaluated six ensemble models based on three publicly available de-identification tools: MIT de-id, NeuroNER, and Philter. A set of 1023 reports was set aside as the testing partition. Two individuals with medical training annotated the test set for PHI; differences were resolved by consensus. Ensemble methods included simple voting schemes (1-Vote, 2-Votes, and 3-Votes), a decision tree, a naïve Bayesian classifier, and AdaBoost boosting. The 1-Vote ensemble achieved recall of 998/1043 (95.7%); the 3-Votes ensemble had precision of 1035/1043 (99.2%). F1 scores were 93.4% for the decision tree, 71.2% for the naïve Bayesian classifier, and 87.5% for the boosting method. Basic voting algorithms and machine learning classifiers incorporating the predictions of multiple tools can outperform each tool acting alone in de-identifying radiology reports. Ensemble methods hold substantial potential to improve automated de-identification tools for radiology reports, making such reports more available for research use to improve patient care and outcomes. (A toy voting ensemble follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
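The N-Votes schemes above flag a token as PHI when at least N of the underlying tools tag it. Below is a toy per-token version; the token lists stand in for the outputs of MIT de-id, NeuroNER, and Philter:

```python
def n_votes(tool_predictions, n):
    """tool_predictions: list of per-tool boolean lists (True = tagged as PHI)."""
    return [sum(tool[i] for tool in tool_predictions) >= n
            for i in range(len(tool_predictions[0]))]

tokens = ["Patient", "John", "Smith", "seen", "on", "3/14/2024"]
tool_a = [False, True,  True,  False, False, True]
tool_b = [False, True,  False, False, False, True]
tool_c = [False, False, True,  False, False, False]

print(n_votes([tool_a, tool_b, tool_c], 1))  # favours recall: any tool suffices
print(n_votes([tool_a, tool_b, tool_c], 3))  # favours precision: unanimity needed
```

This is exactly the trade-off in the reported numbers: 1-Vote maximises recall, 3-Votes maximises precision, and the learned classifiers sit in between.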
29. Protection of gait data set for preserving its privacy in deep learning pipeline.
- Author
- Parashar, Anubha and Shekhawat, Rajveer Singh
- Subjects
- Data protection, Gait in humans, Databases, Deep learning, Privacy, Internet privacy, Machine learning
- Abstract
Human gait is a biometric that is used in security systems because it is unique for each individual and helps recognise a person from a distance without any intervention. To develop such a system, one needs a comprehensive data set specific to the application. If this data set somehow falls into the hands of rogue elements, they can easily access the secured system developed on the basis of the data set. Thus, the protection of the gait data set becomes essential. It has been learnt that systems using deep learning are easily prone to hacking. Hence, maintaining the privacy of gait data sets in the deep learning pipeline becomes more difficult due to adversarial attacks or unauthorised access to the data set. One of the popular techniques for stopping access to the data set is anonymisation. A reversible gait anonymisation pipeline is proposed that modifies gait geometry by morphing the images, that is, through texture modifications. Such modified data prevent hackers from making use of the data set for adversarial attacks. Nine layers were proposed to effect geometrical modifications, and a fixed gait texture template is used for morphing. Both of these modify the gait data set so that an authentic person cannot be identified, while maintaining the naturalness of the gait. The proposed method is evaluated using the similarity index as well as the recognition rate. The impact of various geometrical and texture modifications on silhouettes has been investigated to identify the modifications. Crowdsourcing and machine learning experiments were performed on the silhouettes for this purpose. The results of both types of experiments showed that texture modification has a stronger impact on the level of privacy protection than geometry shape modifications. In these experiments, the similarity index achieved is above 99%. These findings open new research directions regarding adversarial attacks and privacy protection related to gait recognition data sets.
- Published
- 2022
- Full Text
- View/download PDF
30. Ephemeral pseudonym based de-identification system to reduce impact of inference attacks in healthcare information system.
- Author
- Rai, Bipin Kumar
- Subjects
- Privacy, Databases, Data security, Medical ethics, Terms & phrases, Data analytics, Electronic health records, Risk management in business, Algorithms
- Abstract
As healthcare data are extremely sensitive, they pose a risk of invading individuals' privacy if stored or exported without proper security measures. De-identification entails pseudonymization or anonymization of data, which are methods for temporarily or permanently removing an individual's identity; these methods are well suited to keeping user healthcare data private. Inference attacks are a commonly overlooked weakness of de-identification techniques. In this paper, I discuss a method for de-identifying Electronic Healthcare Records (EHR) that uses chained hashing to generate short-lived pseudonyms, reducing the impact of inference attacks, as well as a mechanism for re-identification based on informational self-determination. The method also addresses the weaknesses of existing de-identification algorithms and resolves them with an appropriate real-time de-identification algorithm, the Ephemeral Pseudonym Generation Algorithm (EPGA). (A toy hash-chain sketch follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
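A hash chain is one way to derive short-lived pseudonyms: each epoch's pseudonym is a hash of the previous state, so the chain can be regenerated only by a party holding the seed. This is an illustrative sketch of the idea, not the paper's EPGA:

```python
import hashlib
import secrets

def pseudonym_chain(seed: bytes, epochs: int):
    """Derive one short-lived pseudonym per epoch from a per-patient seed."""
    chain, state = [], seed
    for _ in range(epochs):
        state = hashlib.sha256(state).digest()   # next link in the chain
        chain.append(state.hex()[:16])           # truncated, per-epoch pseudonym
    return chain

seed = secrets.token_bytes(32)  # held only by the trusted re-identification service
for epoch, pid in enumerate(pseudonym_chain(seed, 3)):
    print(epoch, pid)
# Whoever holds the seed can regenerate the chain and map any ephemeral
# pseudonym back to the patient; an attacker who sees one epoch's pseudonym
# cannot link it to other epochs, which limits inference across releases.
```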
31. A Data Utility-Driven Benchmark for De-identification Methods
- Author
- Tomashchuk, Oleksandr, Van Landuyt, Dimitri, Pletea, Daniel, Wuyts, Kim, and Joosen, Wouter
- Published
- 2019
- Full Text
- View/download PDF
32. Impact of defacing on automated brain atrophy estimation.
- Author
- Rubbert, Christian, Wolf, Luisa, Turowski, Bernd, Hedderich, Dennis M., Gaser, Christian, Dahnke, Robert, and Caspers, Julian
- Subjects
- Cerebral atrophy, Alzheimer's patients, Three-dimensional imaging, Magnetic resonance imaging
- Abstract
Background: Defacing has become mandatory for anonymization of brain MRI scans; however, concerns regarding data integrity were raised. Thus, we systematically evaluated the effect of different defacing procedures on automated brain atrophy estimation.
Methods: In total, 268 Alzheimer's disease patients were included from ADNI, which included unaccelerated (n = 154), within-session unaccelerated repeat (n = 67) and accelerated 3D T1 imaging (n = 114). Atrophy maps were computed using the open-source software veganbagel for every original, unmodified scan and after defacing using afni_refacer, fsl_deface, mri_deface, mri_reface, PyDeface or spm_deface, and the root-mean-square error (RMSE) between z-scores was calculated. RMSE values derived from unaccelerated and unaccelerated repeat imaging served as a benchmark. Outliers were defined as RMSE > 75th percentile and by using Grubbs's test.
Results: Benchmark RMSE was 0.28 ± 0.1 (range 0.12–0.58, 75th percentile 0.33). Outliers were found for unaccelerated and accelerated T1 imaging using the 75th percentile cutoff: afni_refacer (unaccelerated: 18, accelerated: 16), fsl_deface (unaccelerated: 4, accelerated: 18), mri_deface (unaccelerated: 0, accelerated: 15), mri_reface (unaccelerated: 0, accelerated: 2) and spm_deface (unaccelerated: 0, accelerated: 7). PyDeface performed best with no outliers (unaccelerated mean RMSE 0.08 ± 0.05, accelerated mean RMSE 0.07 ± 0.05). The following outliers were found according to Grubbs's test: afni_refacer (unaccelerated: 16, accelerated: 13), fsl_deface (unaccelerated: 10, accelerated: 21), mri_deface (unaccelerated: 7, accelerated: 20), mri_reface (unaccelerated: 7, accelerated: 6), PyDeface (unaccelerated: 5, accelerated: 8) and spm_deface (unaccelerated: 10, accelerated: 12).
Conclusion: Most defacing approaches have an impact on atrophy estimation, especially in accelerated 3D T1 imaging. Only PyDeface showed good results with negligible impact on atrophy estimation.
- Published
- 2022
- Full Text
- View/download PDF
33. VPP_AHA: Visual Privacy Protection via Adaptive Histogram Adjustment.
- Author
- Yao, Xiaoming
- Subjects
- Camouflage (biology), Privacy, Digital images
- Abstract
A novel visual privacy protection (VPP) algorithm is proposed for camouflaging visual identifiers in digital images. The challenge is to achieve de-identification while giving the image a different look. This issue has previously been addressed by two classes of algorithms, i.e. modifying target features with or without a reference. The former works by finding a surrogate among randomized hybrid features between the target and the reference, and the latter by blurring or hiding the details of the region. Inspired by camouflage in animals for self-concealment, this paper presents a simple and efficient algorithm that imitates such behavior via adaptive histogram shift adjustment without a reference. The image is first blurred and segmented into several flat surfaces, which are then adaptively re-joined or split by randomly chosen shifts to produce the de-identified output. It is found that the success of the surface-based feature redaction depends on the shift diversity and on the saliency and coverage of the segmented surfaces, which are used as adjusting parameters for the camouflage. Extensive examples on real and synthetic images demonstrate that our results compare favorably to those obtained by existing VPP methods, with the required security, robustness, and selective reversibility.
- Published
- 2022
- Full Text
- View/download PDF
34. Protecting and Utilizing Health and Medical Big Data: Policy Perspectives from Korea
- Author
- Dongjin Lee, Mijeong Park, Seungwon Chang, and Haksoo Ko
- Subjects
- big data, de-identification, data protection, privacy, research, Computer applications to medicine. Medical informatics
- Abstract
Objectives: We analyzed Korea's data privacy regime in the context of protecting and utilizing health and medical big data and tried to draw policy implications from the analyses.
Methods: We conducted comparative analyses of the legal and regulatory environments governing health and medical big data with a view to drawing policy implications for Korea. The legal and regulatory regimes considered include the following: the European Union, the United Kingdom, France, the United States, and Japan. We reviewed relevant statutory materials as well as various non-statutory materials and guidelines issued by public authorities. Where available, we also examined policy measures implemented by government agencies.
Results: We investigated how various jurisdictions deal with legal and regulatory issues that may arise from the use of health and medical information with regard to the protection of data subjects' rights and the protection of personal information. We compared and analyzed various forms of legislation in various jurisdictions and also considered technical methods, such as de-identification. The main findings include the following: there is a need to streamline the relationship between the general data privacy regime and the regulatory regime governing health and medical big data; the regulatory and institutional structure for data governance should be more clearly delineated; and regulation should encourage the development of suitable methodologies for the de-identification of data and, in doing so, a principle-based and risk-based approach should be taken.
Conclusions: Following our comparative legal analyses, implications were drawn. The main conclusion is that the relationship between the legal requirements imposed for purposes of personal information protection and the regulatory requirements governing the use of health and medical data is complicated and multi-faceted and, as such, should be more clearly streamlined and delineated.
- Published
- 2019
- Full Text
- View/download PDF
35. Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets for Public Use.
- Author
- Lee, Brian, Dupervil, Brandi, Deputy, Nicholas P., Duck, Wil, Soroka, Stephen, Bottichio, Lyndsay, Silk, Benjamin, Price, Jason, Sweeney, Patricia, Fuld, Jennifer, Weber, J. Todd, and Pollock, Dan
- Subjects
- Privacy, Public health surveillance, User-centered system design, COVID-19, Management of medical records, Identification, Patients, Surveys, Medical ethics, Access to information, Information retrieval, Algorithms
- Abstract
Objectives: Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently has more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data.
Methods: We included data elements based on usefulness, public request, and privacy implications; we suppressed some field values to reduce the risk of re-identification and exposure of confidential information. We created datasets and verified them for privacy and confidentiality by using data management platform analytic tools and R scripts.
Results: Unrestricted data are available to the public through Data.CDC.gov, and restricted data, with additional fields, are available with a data-use agreement through a private repository on GitHub.com.
Practice Implications: Enriched understanding of the available public data, the methods used to create these data, and the algorithms used to protect the privacy of de-identified people allow for improved data use. Automating data-generation procedures improves the volume and timeliness of sharing data. (A toy suppression rule follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
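Suppression of field values, as described in the methods above, typically replaces values whose frequency falls below a threshold. The field name and threshold here are illustrative, not CDC's actual rules:

```python
from collections import Counter

def suppress_rare_values(records, field, min_count=5, placeholder="Suppressed"):
    """Replace values of `field` that occur fewer than `min_count` times."""
    counts = Counter(r[field] for r in records)
    return [{**r, field: r[field] if counts[r[field]] >= min_count else placeholder}
            for r in records]

rows = [{"county": "A"}] * 6 + [{"county": "B"}] * 2
for row in suppress_rare_values(rows, "county", min_count=5):
    print(row)
# county "B" occurs only twice, so it is replaced with "Suppressed"
```

Small cells are suppressed because a person in a rare combination of values is far easier to re-identify, the same equivalence-class logic used in formal risk estimation.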
36. Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers.
- Author
- Carrell, David S, Malin, Bradley A, Cronkite, David J, Aberdeen, John S, Clark, Cheryl, Li, Muqun (Rachel), Bastakoty, Dikshya, Nyemba, Steve, and Hirschman, Lynette
- Abstract
Objective: Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII.
Materials and Methods: Using 2000 representative clinical documents from 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers.
Results: Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal and 22% (precision = 33%) for external readers.
Discussion and Conclusions: Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario—more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods. (A toy resynthesis step follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
37. Flexible data anonymization using ARX—Current status and challenges ahead.
- Author
-
Prasser, Fabian, Eicher, Johanna, Spengler, Helmut, Bild, Raffael, and Kuhn, Klaus A.
- Subjects
DATA protection, PERSONALLY identifiable information, DATA quality, SOCIAL networks, ARTIFICIAL intelligence - Abstract
Summary: The race for innovation has turned into a race for data. Rapid developments of new technologies, especially in the field of artificial intelligence, are accompanied by new ways of accessing, integrating, and analyzing sensitive personal data. Examples include financial transactions, social network activities, location traces, and medical records. As a consequence, adequate and careful privacy management has become a significant challenge. New data protection regulations, for example in the EU and China, are direct responses to these developments. Data anonymization is an important building block of data protection concepts, as it allows privacy risks to be reduced by altering data. The development of anonymization tools involves significant challenges, however. For instance, the effectiveness of different anonymization techniques depends on context, and thus tools need to support a large set of methods to ensure that the usefulness of data is not overly affected by risk-reducing transformations. In spite of these requirements, existing solutions typically only support a small set of methods. In this work, we describe how we have extended an open source data anonymization tool to support almost arbitrary combinations of a wide range of techniques in a scalable manner. We then review the spectrum of methods supported and discuss their compatibility within the novel framework. The results of an extensive experimental comparison show that our approach outperforms related solutions in terms of scalability and output data quality—while supporting a much broader range of techniques. Finally, we discuss practical experiences with ARX and present remaining issues and challenges ahead. [ABSTRACT FROM AUTHOR]
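ARX itself is a Java tool, so the following is only a language-neutral sketch of one transformation family such tools support: generalizing a quasi-identifier up a hierarchy until a k-anonymity condition holds. The column values and hierarchy are invented for illustration, and this is not ARX's API:

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at
    least k times in the dataset."""
    return bool((df.groupby(quasi_identifiers).size() >= k).all())

def generalize_zip(zip_code, level):
    """Toy generalization hierarchy: each level masks one more trailing
    digit, e.g. 13053 -> 1305* -> 130**."""
    return zip_code[: len(zip_code) - level] + "*" * level

df = pd.DataFrame({
    "zip": ["13053", "13068", "13053", "13067"],   # quasi-identifier
    "diagnosis": ["A", "B", "C", "A"],             # sensitive, left untouched
})

# Climb the hierarchy until the release satisfies k-anonymity.
for level in range(len("13053") + 1):
    trial = df.assign(zip=df["zip"].map(lambda z: generalize_zip(z, level)))
    if is_k_anonymous(trial, ["zip"], k=2):
        print(f"generalization level {level} suffices:")
        print(trial)
        break
```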
- Published
- 2020
- Full Text
- View/download PDF
38. De-Identification of Radiomics Data Retaining Longitudinal Temporal Information.
- Author
-
Kundu, Surajit, Chakraborty, Santam, Chatterjee, Sanjoy, Das, Syamantak, Achari, Rimpa Basu, Mukhopadhyay, Jayanta, Das, Partha Pratim, Mallick, Indranil, Arunsingh, Moses, Bhattacharyyaa, Tapesh, and Ray, Soumendranath
- Subjects
- *
COMPUTED tomography, *DIGITAL image processing, *MEDICAL ethics, *ONCOLOGY, *PRIVACY, *RADIOTHERAPY - Abstract
We propose a de-identification system which runs in a standalone mode. The system takes care of the de-identification of radiation oncology patients' clinical and annotated imaging data, including RTSTRUCT, RTPLAN, and RTDOSE. The clinical data consist of the patient's diagnosis, stages, outcome, and treatment information. The imaging data can be diagnostic, therapy planning, and verification images. Longitudinal radiation oncology verification images, such as cone beam CT scans, are archived along with the initial imaging and clinical data in the process. During de-identification, the system keeps a reference to the original data identity in encrypted form, which can be used for re-identification if necessary. [ABSTRACT FROM AUTHOR]
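A minimal sketch of the encrypted-reference idea the abstract describes, keeping a reversible link to the original identity for authorized re-identification. This assumes the third-party `cryptography` package and is not the authors' implementation:

```python
from cryptography.fernet import Fernet
import uuid

key = Fernet.generate_key()   # held only by the data custodian
cipher = Fernet(key)

def deidentify_patient(original_patient_id):
    """Return a random study ID for release, plus an encrypted copy of
    the original identifier that only the key holder can reverse."""
    study_id = uuid.uuid4().hex[:12]
    token = cipher.encrypt(original_patient_id.encode())
    return study_id, token

study_id, token = deidentify_patient("HOSP-00421-RT")  # hypothetical ID
print("released ID:", study_id)

# Later, authorized re-identification by the key holder:
print("recovered ID:", cipher.decrypt(token).decode())
```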
- Published
- 2020
- Full Text
- View/download PDF
39. A Precautionary Approach to Big Data Privacy
- Author
-
Narayanan, Arvind, Huey, Joanna, Felten, Edward W., Casanovas, Pompeu, Series editor, Sartor, Giovanni, Series editor, Gutwirth, Serge, Sub Series editor, Leenes, Ronald, editor, and De Hert, Paul, editor
- Published
- 2016
- Full Text
- View/download PDF
40. Solving Artificial Intelligence’s Privacy Problem
- Author
-
Yves-Alexandre de Montjoye, Ali Farzanehfar, Julien Hendrickx, and Luc Rocher
- Subjects
privacy, data anonymization, pseudonymization, de-identification, large data-sets, identity, Social Sciences - Abstract
Artificial Intelligence (AI) has the potential to fundamentally change the way we work, live, and interact. There is, however, no general AI yet, and the accuracy of current machine learning models largely depends on the data on which they have been trained. For the coming decades, the development of AI will depend on access to ever larger and richer medical and behavioral datasets. We now have strong evidence that the tool we have historically used to balance aggregate data use against the protection of people's privacy, de-identification, does not scale to big data. The development and deployment of modern privacy-enhancing technologies (PETs), allowing data controllers to make data available in a safe and transparent way, will be key to unlocking the great potential of AI.
- Published
- 2017
41. De-identification of Electronic Health Records Using Machine Learning Algorithms
- Author
-
Mostafa Langarizadeh and Azam Orooji
- Subjects
confidentiality, privacy, de-identification, machine learning, Computer applications to medicine. Medical informatics, R858-859.7, Medical technology, R855-855.5 - Abstract
Introduction: The Electronic Health Record (EHR) contains valuable clinical information that can be useful for activities such as public health surveillance, quality improvement, and research. However, EHRs often contain identifiable health information whose presence limits sharing and secondary use of the records. De-identification is one of the common methods for protecting the confidentiality of patient information. This systematic review focuses on recently published studies of de-identification methods based on Machine Learning (ML) approaches for removing all identifiable information from electronic health records. Methods: A systematic review was performed in electronic databases such as PubMed and ScienceDirect covering 2006 to 2016. Studies were assessed for adherence to the CASP checklists and reviewed independently by two investigators. Finally, 12 articles met the inclusion criteria. Results: The selected studies are discussed in terms of the methods and knowledge resources used, the types of identifiers detected, the types of clinical documents, challenges, and achieved results. The results showed that ML-based de-identification is a widely invoked approach to protecting patient privacy when disclosing clinical data for secondary purposes, such as research. Also, combining ML algorithms with techniques such as pattern matching and regular expression matching can decrease the need for training data. Conclusion: Medical records contain a great deal of identifiable information. This study showed that ML-based de-identification methods can substantially reduce the disclosure risk of that information.
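The rule-plus-ML combination the review highlights can be sketched as follows; the patterns are illustrative, and the ML pass is a stand-in for a trained model rather than any specific system from the review:

```python
import re

# Regular expressions catch well-structured identifiers cheaply, which
# can reduce how much labeled training data the ML component needs.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN:?\s*\d{6,10}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def rule_based_pass(text):
    """First pass: mask identifiers with predictable formats."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def ml_pass(text):
    """Placeholder for a trained NER model (e.g. a CRF or BiLSTM tagger)
    that handles free-form names the rules cannot enumerate."""
    return text.replace("John Doe", "[PATIENT]")  # stand-in only

note = "Patient John Doe, MRN: 0042317, seen 03/05/2015. Call 555-867-5309."
print(ml_pass(rule_based_pass(note)))
```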
- Published
- 2017
42. Privacy and Confidentiality in Service Science and Big Data Analytics
- Author
-
O’Keefe, Christine M., Rannenberg, Kai, Editor-in-chief, Sakarovitch, Jacques, Series editor, Goedicke, Michael, Series editor, Tatnall, Arthur, Series editor, Neuhold, Erich J., Series editor, Pras, Aiko, Series editor, Tröltzsch, Fredi, Series editor, Pries-Heje, Jan, Series editor, Whitehouse, Diane, Series editor, Reis, Ricardo, Series editor, Murayama, Yuko, Series editor, Dillon, Tharam, Series editor, Gulliksen, Jan, Series editor, Rauterberg, Matthias, Series editor, Camenisch, Jan, editor, Fischer-Hübner, Simone, editor, and Hansen, Marit, editor
- Published
- 2015
- Full Text
- View/download PDF
43. Data Privacy and Security
- Author
-
Fraser, Ross, Hannah, Kathryn J., editor, Hussey, Pamela, editor, Kennedy, Margaret A., editor, and Ball, Marion J., editor
- Published
- 2015
- Full Text
- View/download PDF
44. An Extensible De-Identification Framework for Privacy Protection of Unstructured Health Information: Creating Sustainable Privacy Infrastructures.
- Author
-
Braghin, Stefano, Bettencourt-Silva, Joao H, Levacher, Killian, and Antonatos, Spiros
- Subjects
MEDICAL informatics, NATURAL language processing, MEDICAL records, SUSTAINABLE development - Abstract
The volume of unstructured health records has increased exponentially across healthcare settings. Similarly, the number of healthcare providers that wish to exchange records has also increased and, as a result, de-identification and the preservation of privacy features have become increasingly important and necessary. Governance guidelines now require sensitive information to be masked or removed, yet this remains a difficult and often ad-hoc task, particularly when dealing with unstructured text. Annotators are typically used to identify such sensitive information, but each may be effective only on certain text fragments. There is at present no hybrid, sustainable framework that aggregates different annotators. This paper proposes a novel framework that leverages a combination of state-of-the-art annotators in order to maximize the effectiveness of the de-identification of health information. [ABSTRACT FROM AUTHOR]
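A minimal sketch of the aggregation idea, merging the character spans flagged by several annotators before redaction; the annotators and offsets below are invented for illustration, not the paper's framework:

```python
def merge_annotations(*annotator_outputs):
    """Union the PII character spans found by several annotators,
    merging overlaps, so a mention missed by one tool is still
    redacted if any other tool catches it."""
    spans = sorted(s for output in annotator_outputs for s in output)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def redact(text, spans, mask="[REDACTED]"):
    # Apply spans right-to-left so earlier offsets stay valid.
    for start, end in reversed(spans):
        text = text[:start] + mask + text[end:]
    return text

text = "Mr. Alan Turing was admitted to Mercy General on Monday."
annotator_a = [(4, 15)]            # caught the full name
annotator_b = [(4, 9), (32, 45)]   # caught part of the name + hospital
print(redact(text, merge_annotations(annotator_a, annotator_b)))
```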
- Published
- 2019
- Full Text
- View/download PDF
45. Misconceptions in Privacy Protection and Regulation.
- Author
-
Culnane, Chris and Leins, Kobi
- Subjects
- *
RIGHT of privacy, *ACCESS to information, *DATA protection, *INFORMATION policy - Abstract
Privacy protection legislation and policy is heavily dependent on the notion of de-identification. Repeated examples of its failure in real-world use have had little impact on the popularity of its usage in policy and legislation. In this paper we examine some of these misconceptions in an attempt to explain why, in spite of all the evidence, we continue to rely on a technique that has been shown not to work, and which is purported to protect privacy when it clearly does not. With a particular focus on Australia, we look at how misconceptions regarding de-identification are perpetuated. We highlight that continuing to discuss the fiction of de-identified data as a form of privacy actively undermines privacy and privacy norms. Further, we note that 'de-identification of data' should not be presented as a form of privacy protection by policy makers, and that greater legislative protections of privacy are urgently needed given the volumes of data being collected, connected, and mined. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
46. High Accuracy Open-Source Clinical Data De-Identification: The CliniDeID Solution.
- Author
-
MEYSTRE, Stéphane and HEIDER, Paul
- Subjects
COMPUTER software, PRIVACY, MEDICAL information storage & retrieval systems, CONFERENCES & conventions, ARTIFICIAL intelligence, MEDICAL ethics, DESCRIPTIVE statistics, RESEARCH funding, ELECTRONIC health records, ALGORITHMS - Abstract
Clinical data de-identification offers patient data privacy protection and eases reuse of clinical data. As an open-source solution to de-identify unstructured clinical text with high accuracy, CliniDeID applies an ensemble method combining deep and shallow machine learning with rule-based algorithms. It reached high recall and precision when recently evaluated with a selection of clinical text corpora. [ABSTRACT FROM AUTHOR]
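CliniDeID's exact ensemble is not reproduced here; the following sketch shows only the general idea of majority voting over per-token labels from deep, shallow, and rule-based components, with invented tokens and labels:

```python
from collections import Counter

def ensemble_vote(token_predictions):
    """Majority vote over per-token labels from several component
    systems (e.g. a deep model, a shallow model, and rules)."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*token_predictions)]

tokens = ["Seen", "by", "Dr.", "Smith", "on", "Monday"]
deep = ["O", "O", "O", "DOCTOR", "O", "DATE"]
shallow = ["O", "O", "DOCTOR", "DOCTOR", "O", "O"]
rules = ["O", "O", "O", "O", "O", "DATE"]

for token, label in zip(tokens, ensemble_vote([deep, shallow, rules])):
    print(f"{token:8s} {label}")
```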
- Published
- 2023
- Full Text
- View/download PDF
47. De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation
- Author
-
Cardinal, Rudolf N, Moore, Anna, Burchell, Martin, Lewis, Jonathan R, Cardinal, Rudolf N [0000-0002-8751-5167], Moore, Anna [0000-0001-9614-3812], Burchell, Martin [0000-0003-2447-8263], Lewis, Jonathan R [0000-0003-1821-3824], and Apollo - University of Cambridge Repository
- Subjects
Male, Psychiatry, Electronic health records, Health Policy, Research, Bayes Theorem, Health Informatics, Privacy-preserving record linkage, State Medicine, Open-source software, Computer Science Applications, Electronic medical records, Privacy, Electronic patient records, Bayesian probabilistic linkage, Humans, De-identification, Mental health, Medical Record Linkage, Pseudonymisation, Identity matching, Software - Abstract
Background Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier. Methods We develop a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies, via fuzzy representations and complete mismatches, plus de-identified deterministic matching if required. We validate the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy. We report demographic factors associated with correct linkage. Results The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations, such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband’s presence in the sample database with an area under the receiver operating curve of 0.997–0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold θ and a leader advantage threshold δ. Defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931–0.994), and the misidentification rate was 0.00249 (range 0.00123–0.00429). Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and presence of a pseudopostcode (e.g. indicating homelessness). Accuracy rates would be improved further if person-unique identifiers were also used, as supported by the software. Our two largest databases were linked in 44 min via an interpreted programming language. Conclusions Fully de-identified matching with high accuracy is feasible without a person-unique identifier and appropriate software is freely available.
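A sketch of the decision rule the abstract describes, applying a consideration threshold θ and a leader advantage threshold δ to candidate log odds. The scores and threshold values below are invented, and this is not the authors' software:

```python
import math

def decide_match(candidate_log_odds, theta, delta):
    """Declare a match only if the best candidate's log odds clear the
    consideration threshold theta AND beat the runner-up by at least
    delta (the leader advantage); otherwise decline to link."""
    if not candidate_log_odds:
        return None
    ranked = sorted(candidate_log_odds.items(),
                    key=lambda kv: kv[1], reverse=True)
    leader, leader_lo = ranked[0]
    runner_up_lo = ranked[1][1] if len(ranked) > 1 else -math.inf
    if leader_lo >= theta and leader_lo - runner_up_lo >= delta:
        return leader
    return None

# Hypothetical log odds for three sample-database candidates.
scores = {"rec_17": 12.4, "rec_03": 4.1, "rec_88": -2.0}
print(decide_match(scores, theta=5.0, delta=5.0))   # -> rec_17
print(decide_match(scores, theta=5.0, delta=10.0))  # -> None (lead too small)
```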
- Published
- 2023
48. Differentially Private Data and Data De- Identification
- Author
-
Kenney, Noah M.
- Subjects
data, de-identification, differential privacy, computer science, privacy - Abstract
This paper analyzes both differential privacy and data de-identification. While differential privacy seeks to create differentially private data through the use of mathematics, data de-identification seeks to anonymize data in such a way that it cannot be re-identified at a later date. In addition, we analyze the challenges of both approaches to privacy, including the possibility of data re-identification and the verification of privacy, before addressing possible methods of mitigating these challenges. Such methods include setting outer bounds on data, utilizing shared central databases with larger datasets, and grouping data into fewer category buckets. The merits of both methods are also discussed.
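To make the contrast concrete, here is a small sketch pairing the textbook Laplace mechanism for a differentially private count with the category-bucketing mitigation the paper mentions; all data and parameters are illustrative:

```python
import numpy as np

def laplace_count(true_count, epsilon):
    """Textbook Laplace mechanism for a counting query (sensitivity 1):
    add noise drawn from Laplace(scale = 1/epsilon)."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

def bucket_age(age):
    """Grouping values into fewer, coarser buckets, one of the
    mitigations discussed for de-identified releases."""
    lo = (age // 20) * 20
    return f"{lo}-{lo + 19}"

ages = [23, 27, 31, 45, 45, 62]
print("noisy count of records:", round(laplace_count(len(ages), epsilon=0.5), 1))
print("bucketed ages:", [bucket_age(a) for a in ages])
```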
- Published
- 2023
- Full Text
- View/download PDF
49. Names, Nicknames, and Spelling Errors: Protecting Participant Identity in Learning Analytics of Online Discussions
- Author
-
Elaine Farrow, Johanna D. Moore, and Dragan Gasevic
- Subjects
anonymisation, learning analytics, de-identification, pseudonymisation, ethical issues, personal name, redaction, privacy - Abstract
Messages exchanged between participants in online discussion forums often contain personal names and other details that need to be redacted before the data is used for research purposes in learning analytics. However, removing the names entirely makes it harder to track the exchange of ideas between individuals within a message thread and across threads, and thereby reduces the value of this type of conversational data. In contrast, the consistent use of pseudonyms allows contributions from individuals to be tracked across messages, while also hiding the real identities of the contributors. Several factors can make it difficult to identify all instances of personal names that refer to the same individual, including spelling errors and the use of shortened forms. We developed a semi-automated approach for replacing personal names with consistent pseudonyms. We evaluated our approach on a data set of over 1,700 messages exchanged during a distance-learning course, and compared it to a general-purpose pseudonymisation tool that used deep neural networks to identify names to be redacted. We found that our tailored approach outperformed the general-purpose tool in both precision and recall, correctly identifying all but 31 of 2,888 substitutions.
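A minimal sketch of consistent pseudonymisation across name variants; the variant dictionary and pseudonym scheme are invented and far simpler than the semi-automated approach the paper evaluates:

```python
import itertools

# Variant -> canonical name map; in practice this is built
# semi-automatically and reviewed by hand (nicknames, misspellings).
CANONICAL = {
    "katherine": "katherine", "kathy": "katherine", "katheryne": "katherine",
    "robert": "robert", "rob": "robert", "bob": "robert",
}

_pseudonyms = (f"Student{n}" for n in itertools.count(1))
_assigned = {}

def pseudonymize(name):
    """Map every variant of a participant's name to one stable pseudonym
    so conversational threads remain traceable after redaction."""
    canonical = CANONICAL.get(name.lower(), name.lower())
    if canonical not in _assigned:
        _assigned[canonical] = next(_pseudonyms)
    return _assigned[canonical]

for mention in ["Kathy", "Katherine", "Bob", "Katheryne", "Robert"]:
    print(f"{mention:10s} -> {pseudonymize(mention)}")
```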
- Published
- 2023
- Full Text
- View/download PDF
50. Improving privacy preservation policy in the modern information age.
- Author
-
Davis, John S. and Osoba, Osonde
- Abstract
Anonymization or de-identification techniques are methods for protecting the privacy of human subjects in sensitive data sets while preserving the utility of those data sets. In the case of health data, anonymization techniques may be used to remove or mask patient identities while allowing the health data content to be used by the medical and pharmaceutical research community. The efficacy of anonymization methods has come under repeated attack, and several researchers have shown that anonymized data can be re-identified to reveal the identity of the data subjects via approaches such as "linking." Nevertheless, even given these deficiencies, many government privacy policies depend on anonymization techniques as the primary approach to preserving privacy. In this report, we survey the anonymization landscape and consider the range of anonymization approaches that can be used to de-identify data containing personally identifiable information. We then review several notable government privacy policies that leverage anonymization. In particular, we review the European Union's General Data Protection Regulation (GDPR) and show that it takes a more goal-oriented approach to data privacy. It defines data privacy in terms of the desired outcome (i.e., as a defense against the risk of personal data disclosure) and is agnostic to the actual method of privacy preservation. The GDPR goes further, framing its privacy preservation regulations relative to the state of the art, the cost of implementation, the incurred risks, and the context of data processing. This has potential implications for the GDPR's robustness to future technological innovations, in marked contrast to privacy regulations that depend explicitly on more definite technical specifications. [ABSTRACT FROM AUTHOR]
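The "linking" attack mentioned above reduces to a join on shared quasi-identifiers; a toy illustration with entirely fabricated records:

```python
import pandas as pd

# An "anonymized" medical extract with direct identifiers removed...
medical = pd.DataFrame({
    "zip": ["13053", "13068", "14850"],
    "birth_year": [1965, 1972, 1988],
    "sex": ["F", "M", "F"],
    "diagnosis": ["hypertension", "asthma", "diabetes"],
})

# ...and a public record (e.g. a voter roll) carrying the same
# quasi-identifiers alongside names.
public = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "zip": ["13053", "14850"],
    "birth_year": [1965, 1988],
    "sex": ["F", "F"],
})

# The linking attack is a simple join on the shared quasi-identifiers.
reidentified = public.merge(medical, on=["zip", "birth_year", "sex"])
print(reidentified[["name", "diagnosis"]])
```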
- Published
- 2019
- Full Text
- View/download PDF