1,093 results for "de-identification"
Search Results
2. ‘It's not personal, it's strictly business’: Behavioural insurance and the impacts of non-personal data on individuals, groups and societies
- Author
-
Bednarz, Zofia, Lewis, Kelly, and Sadowski, Jathan
- Published
- 2025
- Full Text
- View/download PDF
3. Implementation and validation of face de-identification (de-facing) in ADNI4.
- Author
-
Schwarz, Christopher, Choe, Mark, Rossi, Stephanie, Das, Sandhitsu, Ittyerah, Ranjit, Fletcher, Evan, Maillard, Pauline, Singh, Baljeet, Harvey, Danielle, Malone, Ian, Prosser, Lloyd, Senjem, Matthew, Matoush, Leonard, Ward, Chadwick, Prakaashana, Carl, Landau, Susan, Koeppe, Robert, Lee, JiaQie, Decarli, Charles, Weiner, Michael, Jack, Clifford, Jagust, William, Yushkevich, Paul, and Tosun, Duygu
- Subjects
ADNI ,anonymization ,de‐facing ,de‐identification ,face recognition ,Humans ,Alzheimer Disease ,Magnetic Resonance Imaging ,Brain ,Neuroimaging ,Reproducibility of Results ,Face ,Algorithms - Abstract
INTRODUCTION: Recent technological advances have increased the risk that de-identified brain images could be re-identified from face imagery. The Alzheimer's Disease Neuroimaging Initiative (ADNI), a leading source of publicly available de-identified brain imaging, acted quickly to protect participants' privacy. METHODS: An independent expert committee evaluated 11 face de-identification (de-facing) methods and selected four for formal testing. RESULTS: Effects of de-facing on brain measurements were comparable across methods and sufficiently small to recommend de-facing in ADNI. The committee ultimately recommended mri_reface for its advantages in reliability and for some practical considerations. ADNI leadership approved the committee's recommendation, beginning in ADNI4. DISCUSSION: ADNI4 de-faces all applicable brain images before subsequent pre-processing, analyses, and public release. Trained analysts inspect de-faced images to confirm complete face removal and complete non-modification of the brain. This paper details the history of the algorithm selection process and extensive validation, then describes the production workflows for de-facing in ADNI. HIGHLIGHTS: ADNI is implementing de-facing of MRI and PET beginning in ADNI4. De-facing alters face imagery in brain images to help protect privacy. Four algorithms were extensively compared for ADNI and mri_reface was chosen. Validation confirms mri_reface is robust and effective for ADNI sequences. Validation confirms mri_reface negligibly affects ADNI brain measurements.
- Published
- 2024
4. A Comparative Study of GPT3.5 Fine Tuning and Rule-Based Approaches for De-identification and Normalization of Sensitive Health Information in Electronic Medical Record Notes
- Author
-
Zhao, Zi-Rui, Chou, Po-Chen, Hussain Mir, Tatheer, Dai, Hong-Jie, Li, Gang, Series Editor, Filipe, Joaquim, Series Editor, Xu, Zhiwei, Series Editor, Jonnagaddala, Jitendra, editor, Dai, Hong-Jie, editor, and Chen, Ching-Tai, editor
- Published
- 2025
- Full Text
- View/download PDF
5. Advancing Sensitive Health Data Recognition and Normalization Through Large Language Model Driven Data Augmentation
- Author
-
Chao, Chia-Yi, Lin, Cheng-Wei, Li, Gang, Series Editor, Filipe, Joaquim, Series Editor, Xu, Zhiwei, Series Editor, Jonnagaddala, Jitendra, editor, Dai, Hong-Jie, editor, and Chen, Ching-Tai, editor
- Published
- 2025
- Full Text
- View/download PDF
6. Applying Language Models for Recognizing and Normalizing Sensitive Information from Electronic Health Records Text Notes
- Author
-
Huang, Sheng-Xuan, Cheng, Hung-An, Li, Zheng-Hao, Li, Gang, Series Editor, Filipe, Joaquim, Series Editor, Xu, Zhiwei, Series Editor, Jonnagaddala, Jitendra, editor, Dai, Hong-Jie, editor, and Chen, Ching-Tai, editor
- Published
- 2025
- Full Text
- View/download PDF
7. Deidentification and Temporal Normalization of the Electronic Health Record Notes Using Large Language Models: The 2023 SREDH/AI-Cup Competition for Deidentification of Sensitive Health Information
- Author
-
Mir, Tatheer Hussain, Yang, Hao-Ping, Chou, Yi-Yun, Teng, Yu-Chin, Liao, Wei-Hsiang, Lin, Yu-Chuan, Gupta, Shalini, Panchal, Omkar, Jonnagaddala, Jitendra, Chen, Ching-Tai, Dai, Hong-Jie, Li, Gang, Series Editor, Filipe, Joaquim, Series Editor, Xu, Zhiwei, Series Editor, Jonnagaddala, Jitendra, editor, Dai, Hong-Jie, editor, and Chen, Ching-Tai, editor
- Published
- 2025
- Full Text
- View/download PDF
8. Privacy Protection and Standardization of Electronic Medical Records Using Large Language Model
- Author
-
Huang, Chao-Long, Rianto, Babam, Sun, Jun-Teng, Fu, Zheng-Xin, Lee, Chung-Hong, Li, Gang, Series Editor, Filipe, Joaquim, Series Editor, Xu, Zhiwei, Series Editor, Jonnagaddala, Jitendra, editor, Dai, Hong-Jie, editor, and Chen, Ching-Tai, editor
- Published
- 2025
- Full Text
- View/download PDF
9. Comprehensive Evaluation of Pythia Model Efficiency in De-identification and Normalization for Enhanced Medical Data Management
- Author
-
Cho, Yen-Cheng, Yang, Yu-Jie, Liu, Yu-De, Tsao, Tung-Sheng, Lee, Min-Jain, Li, Gang, Series Editor, Filipe, Joaquim, Series Editor, Xu, Zhiwei, Series Editor, Jonnagaddala, Jitendra, editor, Dai, Hong-Jie, editor, and Chen, Ching-Tai, editor
- Published
- 2025
- Full Text
- View/download PDF
10. Implementing tokenization in clinical research to expand real-world insights.
- Author
-
Walters, Chelsea, Langlais, Crystal S., Oakkar, Eva E., Hoogendoorn, Wilhelmina E., Coutcher, James B., and Van Zandt, Mui
- Abstract
Interest in leveraging real-world data (RWD) to support clinical research is increasing, including studies to further characterize the safety and effectiveness of new treatments. Such studies often require combining primary, study-specific data with secondary, existing RWD. These so-called enriched studies are becoming more common but require tailored methodologies that facilitate linkage across data sources. Tokenization has emerged as a key tool in the United States (US) to enable the linkage of secondary data with primary data, although key considerations for operationalizing tokenization are often overlooked during study set-up. This article aims to explore key aspects of implementing tokenization in the US and to define relevant terminology. Appropriate study designs and RWD sources that leverage this tool are also discussed, and advantages and considerations that help study stakeholders generate real-world evidence are highlighted. The article concludes with a description of case studies where tokenization is a suitable fit to fulfill study goals. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
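Tokenization in the sense this entry describes replaces direct identifiers with deterministic, irreversible tokens so the same person can be linked across data sources without exposing PII. A minimal sketch, assuming a shared salt and simple normalization rules (the functions and rules here are illustrative; commercial tokenization vendors use more elaborate, proprietary schemes):

```python
import hashlib
import unicodedata

def normalize(value: str) -> str:
    """Normalize an identifier: strip accents, collapse whitespace, lowercase."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())

def make_token(first: str, last: str, dob: str, salt: str) -> str:
    """Derive a deterministic, irreversible linkage token from identifiers."""
    material = "|".join([normalize(first), normalize(last), dob, salt])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because normalization runs before hashing, "José Smith" tokenized at one site matches "jose smith" tokenized at another, provided both sites share the same salt; a different salt yields unlinkable tokens.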
11. De-identification of clinical notes with pseudo-labeling using regular expression rules and pre-trained BERT.
- Author
-
An, Jiyong, Kim, Jiyun, Sunwoo, Leonard, Baek, Hyunyoung, Yoo, Sooyoung, and Lee, Seunggeun
- Subjects
NATURAL language processing ,LANGUAGE models ,MEDICAL sciences ,ARTIFICIAL intelligence ,TELECOMMUNICATION - Abstract
Background: De-identification of clinical notes is essential to utilize the rich information in unstructured text data in medical research. However, only limited work has been done on removing personal information from clinical notes in Korea. Methods: Our study utilized a comprehensive dataset stored in the Note table of the OMOP Common Data Model at Seoul National University Bundang Hospital. This dataset includes 11,181,617 radiology reports and 9,282,477 notes from various other departments (non-radiology reports). From this, 0.1% of the reports (11,182) were randomly selected for training and validation purposes. We used two de-identification strategies to improve performance with limited annotated data. First, a rule-based approach was used to construct regular expressions on the 1,112 notes annotated by domain experts. Second, using the regular expressions as a labeler, we applied a semi-supervised approach to fine-tune a pre-trained Korean BERT model with pseudo-labeled notes. Results: Validation was conducted using 342 radiology and 12 non-radiology notes labeled at the token level. Our rule-based approach achieved 97.2% precision, 93.7% recall, and a 96.2% F1 score on the radiology notes. For the machine learning approach, a KoBERT-NER model fine-tuned with 32,000 automatically pseudo-labeled notes achieved 96.5% precision, 97.6% recall, and a 97.1% F1 score. Conclusion: By combining a rule-based approach and machine learning in a semi-supervised way, our results show that de-identification performance can be improved. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
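The pseudo-labeling strategy in this entry uses regular expressions as a weak labeler that produces token-level training data for a BERT model. The idea can be illustrated as follows; the two patterns are hypothetical stand-ins for the expert-built rules (real rules would cover names, IDs, and Korean-specific formats):

```python
import re

# Hypothetical rules standing in for the expert-constructed expressions.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b0\d{1,2}-\d{3,4}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def pseudo_label(text: str):
    """Tokenize on whitespace and emit BIO tags from regex matches."""
    spans = []
    for label, pattern in PHI_PATTERNS.items():
        spans += [(m.start(), m.end(), label) for m in pattern.finditer(text)]
    tags, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for s, e, label in spans:
            if start >= s and end <= e:  # token falls inside a PHI match
                tag = ("B-" if start == s else "I-") + label
        tags.append((token, tag))
    return tags
```

The resulting (token, tag) pairs can then serve directly as pseudo-labels for fine-tuning a token-classification model.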
12. Large Language Models for Electronic Health Record De-Identification in English and German.
- Author
-
Sousa, Samuel, Jantscher, Michael, Kröll, Mark, and Kern, Roman
- Abstract
Electronic health record (EHR) de-identification is crucial for publishing or sharing medical data without violating the patient's privacy. Protected health information (PHI) is abundant in EHRs, and privacy regulations worldwide mandate de-identification before downstream tasks are performed. The ever-growing data generation in healthcare and the advent of generative artificial intelligence have increased the demand for de-identified EHRs and highlighted privacy issues with large language models (LLMs), especially data transmission to cloud-based LLMs. In this study, we benchmark ten LLMs for de-identifying EHRs in English and German. We then compare de-identification performance for in-context learning and full model fine-tuning and analyze the limitations of LLMs for this task. Our experimental evaluation shows that LLMs effectively de-identify EHRs in both languages. Moreover, in-context learning with a one-shot setting boosts de-identification performance without the costly full fine-tuning of the LLMs. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
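The one-shot in-context setting that this entry reports as a cheap alternative to full fine-tuning amounts to prepending a single worked example to the instruction. A minimal sketch of the prompt assembly, with a hypothetical example note and placeholder scheme (the model call itself is omitted; the function and constants below are not from the paper):

```python
# One worked example shown to the model before the real note (hypothetical).
ONE_SHOT_EXAMPLE = (
    "Note: Pt John Miller seen on 03/04/2021 at St. Mary's.\n"
    "De-identified: Pt [NAME] seen on [DATE] at [HOSPITAL].\n"
)

def build_one_shot_prompt(note: str) -> str:
    """Assemble a one-shot de-identification prompt for a chat LLM."""
    instruction = (
        "Replace all protected health information in the note with "
        "bracketed placeholders such as [NAME], [DATE], [HOSPITAL].\n\n"
    )
    return instruction + ONE_SHOT_EXAMPLE + f"\nNote: {note}\nDe-identified:"
```

The prompt ends with the "De-identified:" cue so the model's completion is the redacted note itself.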
13. A survey on UK researchers' views regarding their experiences with the de-identification, anonymisation, release methods and re-identification risk estimation for clinical trial datasets.
- Author
-
Rodriguez, Aryelly, Lewis, Steff C, Eldridge, Sandra, Jackson, Tracy, and Weir, Christopher J
- Subjects
WORK ,RISK assessment ,CROSS-sectional method ,DOCUMENTATION ,DATABASE management ,RESEARCH funding ,CLINICAL trials ,PRIVACY ,DESCRIPTIVE statistics ,RESEARCH ,RESEARCH methodology ,EXPERIENTIAL learning ,MEDICAL ethics - Abstract
Background: There are increasing pressures for anonymised datasets from clinical trials to be shared across the scientific community. However, there is no standardised set of recommendations on how to anonymise and prepare clinical trial datasets for sharing, while an ever-increasing number of anonymised datasets are becoming available for secondary research. Our aim was to explore the current views and experiences of researchers in the United Kingdom about de-identification, anonymisation, release methods and re-identification risk estimation for clinical trial datasets. Methods: We used an online exploratory cross-sectional descriptive survey that consisted of both open-ended and closed questions. Results: We had 38 responses to invitation from June 2022 to October 2022. However, 35 participants (92%) used internal documentation and published guidance to de-identify/anonymise clinical trial datasets. De-identification, followed by anonymisation and then fulfilling data holders' requirements before access was granted (controlled access), was the most common process for releasing the datasets as reported by 18 (47%) participants. However, 11 participants (29%) had previous knowledge of re-identification risk estimation, but they did not use any of the methodologies. Experiences in the process of de-identifying/anonymising the datasets and maintaining such datasets were mostly negative, and the main reported issues were lack of resources, guidance, and training. Conclusion: The majority of responders reported using documented processes for de-identification and anonymisation. However, our survey results clearly indicate that there are still gaps in the areas of guidance, resources and training to fulfil sharing requests of de-identified/anonymised datasets, and that re-identification risk estimation is an underdeveloped area. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
14. Identifying protected health information by transformers-based deep learning approach in Chinese medical text.
- Author
-
Xu, Kun, Song, Yang, and Ma, Jingdong
- Abstract
Purpose: In the context of Chinese clinical texts, this paper proposes a deep learning algorithm based on Bidirectional Encoder Representations from Transformers (BERT) to identify privacy information and verifies the feasibility of our method for privacy protection in the Chinese clinical context. Methods: We collected and double-annotated 33,017 discharge summaries from 151 medical institutions on a municipal regional health information platform, developed a BERT-based Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Field (CRF) model, and tested its privacy-identification performance on the dataset. To explore the performance of different substructures of the neural network, we created five additional baseline models and evaluated their impact on performance. Results: On the annotated data, the BERT model pre-trained on a medical corpus brought a significant performance improvement to the BiLSTM-CRF model, with a micro-recall of 0.979 and an F1 score of 0.976, which indicates that the model has promising performance in identifying private information in Chinese clinical texts. Conclusions: The BERT-based BiLSTM-CRF model excels at identifying privacy information in Chinese clinical texts, and applying this model is very effective for protecting patient privacy and facilitating data sharing. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
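Token-level taggers like the BiLSTM-CRF above emit BIO labels, which a downstream step must merge into entity spans before masking. A minimal, hypothetical sketch of that decoding step (illustrative only, not from the paper):

```python
def redact_bio(tokens, tags, mask="***"):
    """Replace each maximal B-/I- tagged run with a single mask token."""
    out, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            label = tags[i][2:]
            i += 1
            # Consume the continuation tokens of the same entity.
            while i < len(tags) and tags[i] == "I-" + label:
                i += 1
            out.append(mask)
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```

Merging runs before masking matters for evaluation too: counting whole entities rather than tokens is what distinguishes entity-level from token-level precision and recall.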
15. Deep Learning Framework for Advanced De-Identification of Protected Health Information.
- Author
-
Aloqaily, Ahmad, Abdallah, Emad E., Al-Zyoud, Rahaf, Abu Elsoud, Esraa, Al-Hassan, Malak, and Abdallah, Alaa E.
- Subjects
ELECTRONIC health records ,LONG short-term memory ,ARCHITECTURAL design ,DIGITAL learning ,MEDICAL records ,DEEP learning - Abstract
Electronic health records (EHRs) are widely used in healthcare institutions worldwide, containing vast amounts of unstructured textual data. However, the sensitive nature of Protected Health Information (PHI) embedded within these records presents significant privacy challenges, necessitating robust de-identification techniques. This paper introduces a novel approach, leveraging a Bi-LSTM-CRF model to achieve accurate and reliable PHI de-identification, using the i2b2 dataset sourced from Harvard University. Unlike prior studies that often unify Bi-LSTM and CRF layers, our approach focuses on the individual design, optimization, and hyperparameter tuning of both the Bi-LSTM and CRF components, allowing for precise model performance improvements. This rigorous approach to architectural design and hyperparameter tuning, often underexplored in the existing literature, significantly enhances the model's capacity for accurate PHI tag detection while preserving the essential clinical context. Comprehensive evaluations are conducted across 23 PHI categories, as defined by HIPAA, ensuring thorough security across critical domains. The optimized model achieves exceptional performance metrics, with a precision of 99%, recall of 98%, and F1-score of 98%, underscoring its effectiveness in balancing recall and precision. By enabling the de-identification of medical records, this research strengthens patient confidentiality, promotes compliance with privacy regulations, and facilitates safe data sharing for research and analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
16. Toward a Privacy-Preserving Face Recognition System: A Survey of Leakages and Solutions.
- Author
-
Laishram, Lamyanba, Shaheryar, Muhammad, Lee, Jong Taek, and Jung, Soon Ki
- Subjects
ARTIFICIAL neural networks ,ARTIFICIAL intelligence ,MACHINE learning ,CONVOLUTIONAL neural networks ,PATTERN recognition systems ,DEEP learning ,HUMAN facial recognition software - Published
- 2025
- Full Text
- View/download PDF
17. Automated redaction of names in adverse event reports using transformer-based neural networks
- Author
-
Eva-Lisa Meldau, Shachi Bista, Carlos Melgarejo-González, and G. Niklas Norén
- Subjects
De-identification ,Data anonymization ,Pharmacovigilance ,Domain adaptation ,Adverse drug reaction reporting systems ,Medical language processing ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Background: Automated recognition and redaction of personal identifiers in free text can enable organisations to share data while protecting privacy. This is important in the context of pharmacovigilance since relevant detailed information on the clinical course of events, differential diagnosis, and patient-reported reflections may often only be conveyed in narrative form. The aim of this study is to develop and evaluate a method for automated redaction of person names in English narrative text on adverse event reports. The target domain for this study was case narratives from the United Kingdom's Yellow Card scheme, which collects and monitors information on suspected side effects to medicines and vaccines. Methods: We finetuned BERT – a transformer-based neural network – for recognising names in case narratives. Training data consisted of newly annotated records from the Yellow Card data and of the i2b2 2014 deidentification challenge. Because the Yellow Card data contained few names, we used predictive models to select narratives for training. Performance was evaluated on a separate set of annotated narratives from the Yellow Card scheme. In-depth review determined whether (parts of) person names missed by the de-identification method could enable re-identification of the individual, and whether de-identification reduced the clinical utility of narratives by collaterally masking relevant information. Results: Recall on held-out Yellow Card data was 87% (155/179) at a precision of 55% (155/282) and a false-positive rate of 0.05% (127/263,451). Considering tokens longer than three characters separately, recall was 94% (102/108) and precision 58% (102/175). For 13 of the 5,042 narratives in Yellow Card test data (71 with person names), the method failed to flag at least one name token. According to in-depth review, the leaked information could enable direct identification for one narrative and indirect identification for two narratives. Clinically relevant information was removed in less than 1% of the 5,042 processed narratives; 97% of the narratives were completely untouched. Conclusions: Automated redaction of names in free-text narratives of adverse event reports can achieve sufficient recall, including shorter tokens like patient initials. In-depth review shows that the rare leaks that occur tend not to compromise patient confidentiality. Precision and false-positive rates are acceptable, with almost all clinically relevant information retained.
- Published
- 2024
- Full Text
- View/download PDF
18. Automated redaction of names in adverse event reports using transformer-based neural networks.
- Author
-
Meldau, Eva-Lisa, Bista, Shachi, Melgarejo-González, Carlos, and Norén, G. Niklas
- Subjects
DRUG side effects ,VACCINATION complications ,LEAKS (Disclosure of information) ,TRANSFORMER models ,MEDICAL language - Abstract
Background: Automated recognition and redaction of personal identifiers in free text can enable organisations to share data while protecting privacy. This is important in the context of pharmacovigilance since relevant detailed information on the clinical course of events, differential diagnosis, and patient-reported reflections may often only be conveyed in narrative form. The aim of this study is to develop and evaluate a method for automated redaction of person names in English narrative text on adverse event reports. The target domain for this study was case narratives from the United Kingdom's Yellow Card scheme, which collects and monitors information on suspected side effects to medicines and vaccines. Methods: We finetuned BERT – a transformer-based neural network – for recognising names in case narratives. Training data consisted of newly annotated records from the Yellow Card data and of the i2b2 2014 deidentification challenge. Because the Yellow Card data contained few names, we used predictive models to select narratives for training. Performance was evaluated on a separate set of annotated narratives from the Yellow Card scheme. In-depth review determined whether (parts of) person names missed by the de-identification method could enable re-identification of the individual, and whether de-identification reduced the clinical utility of narratives by collaterally masking relevant information. Results: Recall on held-out Yellow Card data was 87% (155/179) at a precision of 55% (155/282) and a false-positive rate of 0.05% (127/263,451). Considering tokens longer than three characters separately, recall was 94% (102/108) and precision 58% (102/175). For 13 of the 5,042 narratives in Yellow Card test data (71 with person names), the method failed to flag at least one name token. According to in-depth review, the leaked information could enable direct identification for one narrative and indirect identification for two narratives. 
Clinically relevant information was removed in less than 1% of the 5,042 processed narratives; 97% of the narratives were completely untouched. Conclusions: Automated redaction of names in free-text narratives of adverse event reports can achieve sufficient recall including shorter tokens like patient initials. In-depth review shows that the rare leaks that occur tend not to compromise patient confidentiality. Precision and false positive rates are acceptable with almost all clinically relevant information retained. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
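The headline figures in this abstract follow from the standard token-level definitions: recall = TP/(TP+FN), precision = TP/(TP+FP), false-positive rate = FP/(FP+TN). A small sketch reproducing them from the reported counts (TN is inferred here as 263,451 − 127, since the reported rate is 127/263,451):

```python
def ner_metrics(tp: int, fn: int, fp: int, tn: int):
    """Token-level recall, precision, and false-positive rate."""
    recall = tp / (tp + fn)        # share of true names flagged
    precision = tp / (tp + fp)     # share of flags that were names
    fpr = fp / (fp + tn)           # share of non-name tokens flagged
    return recall, precision, fpr

# Counts from the abstract: 155 true positives out of 179 names,
# 282 total flags, and a 127/263,451 false-positive rate.
recall, precision, fpr = ner_metrics(tp=155, fn=24, fp=127, tn=263_324)
```

The asymmetry is the point of the paper's design: for redaction, a low precision (55%) is tolerable as long as recall is high and the false-positive rate is tiny, because over-redaction rarely touches clinically relevant text.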
19. Is the De-identification of Health Data from Wearable Devices Secure Enough?
- Author
-
DURMUŞ, Veli
- Subjects
DATA privacy ,WEARABLE technology ,PUBLIC health research ,PERSONALLY identifiable information ,INFORMATION sharing - Abstract
Copyright of Istanbul Gelisim University Journal of Health Sciences / İstanbul Gelişim Üniversitesi Sağlık Bilimleri Dergisi is the property of Istanbul Gelisim Universitesi Saglik Bilimleri Yuksekokulu and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
20. PyFaceWipe: a new defacing tool for almost any MRI contrast.
- Author
-
Mitew, Stanislaw, Yeow, Ling Yun, Ho, Chi Long, Bhanu, Prakash K. N., and Nickalls, Oliver James
- Subjects
OLDER people ,YOUNG adults ,BRAIN imaging ,BRAIN research ,VOLUME (Cubic content) - Abstract
Rationale and objectives: Defacing research MRI brain scans is often a mandatory step. With current defacing software, there are issues with Windows compatibility and researcher doubt regarding the adequacy of preservation of brain voxels in non-T1w scans. To address this, we developed PyFaceWipe, a multiplatform software for multiple MRI contrasts, which was evaluated based on its anonymisation ability and effect on downstream processing. Materials and methods: Multiple MRI brain scan contrasts from the OASIS-3 dataset were defaced with PyFaceWipe and PyDeface and manually assessed for brain voxel preservation, remnant facial features and effect on automated face detection. Original and PyFaceWipe-defaced data from locally acquired T1w structural scans underwent volumetry with FastSurfer and brain atlas generation with ANTS. Results: 214 MRI scans of several contrasts from OASIS-3 were successfully processed with both PyFaceWipe and PyDeface. PyFaceWipe maintained complete brain voxel preservation in all tested contrasts except ASL (45%) and DWI (90%), and PyDeface in all tested contrasts except ASL (95%), BOLD (25%), DWI (40%) and T2* (25%). Manual review of PyFaceWipe showed no failures of facial feature removal. Pinna removal was less successful (6% of T1 scans showed residual complete pinna). PyDeface achieved 5.1% failure rate. Automated detection found no faces in PyFaceWipe-defaced scans, 19 faces in PyDeface scans compared with 78 from the 224 original scans. Brain atlas generation showed no significant difference between atlases created from original and defaced data in both young adulthood and late elderly cohorts. Structural volumetry dice scores were ≥ 0.98 for all structures except for grey matter which had 0.93. PyFaceWipe output was identical across the tested operating systems. 
Conclusion: PyFaceWipe is a promising multiplatform defacing tool, demonstrating excellent brain voxel preservation and competitive defacing in multiple MRI contrasts, performing favourably against PyDeface. ASL, BOLD, DWI and T2* scans did not produce recognisable 3D renders and hence should not require defacing. Structural volumetry dice scores (≥ 0.98) were higher than previously published FreeSurfer results, except for grey matter which were comparable. The effect is measurable and care should be exercised during studies. ANTS atlas creation showed no significant effect from PyFaceWipe defacing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
21. THE MEDIATING ROLE OF A SIBLING IN IDENTITY DEVELOPMENT: CONTEMPORARY PSYCHOANALYTIC PERSPECTIVE.
- Author
-
Joksimović, Teodora Vuletić
- Abstract
The main aim of this paper was to open the space for discussing siblings’ mutual effects on their identity development, fostering a specific contemporary psychoanalytic perspective – Lacan’s identity topology. By exploring the mother’s and the father’s ‘phallic functions’ in the subject’s identity development, I tripped over the same stone as classical psychoanalytic theory, which nudged me to challenge the sole relevance of the parental roles and pose a question of what the sibling-function in the process of developing identity would be. I adapted Lacan’s R-schema to honour these delicate family relationships and understand their underlying structure. Lacan emphasised the importance of siblings through the intrusion complex in his earliest work but ceased to deal with this topic afterward. However, he tended to preserve the foundations he had laid with the mirror stage and Oedipus complex – concepts based on identification. Therefore, in this work I tried to look at the co-constructing relationship between siblings in the context of the (de)identification phenomenon while taking into account Lacan’s latter works. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
22. PROSurvival: A Technical Case Report on Creating and Publishing a Dataset for Federated Learning on Survival Prediction of Prostate Cancer Patients.
- Author
-
XU, Tingyan, WOLTERS, Timo, LOTZ, Johannes, BISSON, Tom, KIEHL, Tim-Rasmus, FLINNER, Nadine, ZERBE, Norman, and EICHELBERG, Marco
- Abstract
The PROSurvival project aims to improve the prediction of recurrence-free survival in prostate cancer by applying federated machine learning to whole slide images combined with selected clinical data. Both the image and clinical data will be aggregated into an anonymized dataset compliant with the General Data Protection Regulation and published under the principles of findable, accessible, interoperable, and reusable data. The DICOM standard will be used for the image data. For the accompanying clinical data, a human-readable, compact and flexible standard is yet to be defined. From the set of existing standards, most of which are extendable with varying degrees of modification, we chose oBDS as a starting point and modified it to include missing data points and to remove mandatory items not applicable to our dataset. Clinical and survival data from clinic-specific spreadsheets were converted into this modified standard, ensuring on-site data privacy during processing. For publication of the dataset, both image and clinical data are anonymized using established methods. The key challenges arose during the clinical data anonymization and in identifying research repositories meeting all of our requirements. Each clinic had to coordinate the publication with their responsible data protection officers, requiring different approval processes due to the individual states' differing interpretations of the legal regulations. The newly established German Health Data Utilization Act is expected to simplify future data sharing in a responsible and powerful way. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. Asserting the public interest in health data: On the ethics of data governance for biobanks and insurers.
- Author
-
Metcalf, Kathryne and Sadowski, Jathan
- Subjects
INSURANCE companies ,ACTUARIAL risk ,BREACH of trust ,BUSINESS insurance ,ACTUARIAL science - Abstract
Recent reporting has revealed that the UK Biobank (UKB)—a large, publicly-funded research database containing highly-sensitive health records of over half a million participants—has shared its data with private insurance companies seeking to develop actuarial AI systems for analyzing risk and predicting health. While news reports have characterized this as a significant breach of public trust, the UKB contends that insurance research is "in the public interest," and that all research participants are adequately protected from the possibility of insurance discrimination via data de-identification. Here, we contest both of these claims. Insurers use population data to identify novel categories of risk, which become fodder in the production of black-boxed actuarial algorithms. The deployment of these algorithms, as we argue, has the potential to increase inequality in health and decrease access to insurance. Importantly, these types of harms are not limited just to UKB participants: instead, they are likely to proliferate unevenly across various populations within global insurance markets via practices of profiling and sorting based on the synthesis of multiple data sources, alongside advances in data analysis capabilities, over space/time. This necessitates a significantly expanded understanding of the publics who must be involved in biobank governance and data-sharing decisions involving insurers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. A Method for Efficient De-identification of DICOM Metadata and Burned-in Pixel Text.
- Author
-
Macdonald, Jacob A., Morgan, Katelyn R., Konkel, Brandon, Abdullah, Kulsoom, Martin, Mark, Ennis, Cory, Lo, Joseph Y., Stroo, Marissa, Snyder, Denise C., and Bashir, Mustafa R.
- Subjects
MEDICAL information storage & retrieval systems ,DATA security ,CLINICAL medicine ,DIAGNOSTIC imaging ,COMPUTER software ,HEALTH ,INFORMATION resources ,DICOM (Computer network protocol) ,METADATA ,MEDICAL radiology ,MANAGEMENT of medical records ,QUALITY assurance - Abstract
De-identification of DICOM images is an essential component of medical image research. While many established methods exist for the safe removal of protected health information (PHI) in DICOM metadata, approaches for the removal of PHI "burned-in" to image pixel data are typically manual, and automated high-throughput approaches are not well validated. Emerging optical character recognition (OCR) models can potentially detect and remove PHI-bearing text from medical images but are very time-consuming to run on the high volume of images found in typical research studies. We present a data processing method that performs metadata de-identification for all images combined with a targeted approach to only apply OCR to images with a high likelihood of burned-in text. The method was validated on a dataset of 415,182 images across ten modalities representative of the de-identification requests submitted at our institution over a 20-year span. Of the 12,578 images in this dataset with burned-in text of any kind, only 10 passed undetected with the method. OCR was only required for 6050 images (1.5% of the dataset). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
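The two-stage idea in the abstract above, scrubbing metadata for every image but routing only high-risk images to the expensive OCR step, can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: a plain dict stands in for a real DICOM header, and the tag list and modality heuristic are assumptions (real de-identification follows the full DICOM PS3.15 confidentiality profile).

```python
# Sketch of two-stage DICOM de-identification: metadata scrubbing for all
# images, OCR routed only to images likely to carry burned-in text.
# A plain dict stands in for a DICOM header; tag list and modality set
# are illustrative assumptions only.

# Small subset of PHI-bearing attributes.
PHI_TAGS = {"PatientName", "PatientID", "PatientBirthDate",
            "InstitutionName", "ReferringPhysicianName"}

# Modalities where burned-in text is common (hypothetical heuristic).
HIGH_RISK_MODALITIES = {"US", "SC", "XA", "OT"}

def deidentify_metadata(header: dict) -> dict:
    """Blank every PHI tag; keep all other metadata untouched."""
    return {k: ("" if k in PHI_TAGS else v) for k, v in header.items()}

def needs_ocr(header: dict) -> bool:
    """Send an image to OCR only if it is likely to contain burned-in text."""
    if header.get("BurnedInAnnotation") == "YES":
        return True
    return header.get("Modality") in HIGH_RISK_MODALITIES
```

Run over a whole study, only the images for which `needs_ocr` returns true incur the OCR cost, which is the mechanism behind the 1.5% figure reported above.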
25. A certified de-identification system for all clinical text documents for information extraction at scale.
- Author
-
Radhakrishnan, Lakshmi, Schenk, Gundolf, Muenzen, Kathleen, Oskotsky, Boris, Ashouri Choshali, Habibeh, Plunkett, Thomas, Israni, Sharat, and Butte, Atul J
- Subjects
Patient Safety ,clinical note text ,de-identification ,unstructured data ,Philter - Abstract
Objectives: Clinical notes are a veritable treasure trove of information on a patient's disease progression, medical history, and treatment plans, yet are locked in secured databases accessible for research only after extensive ethics review. Removing personally identifying and protected health information (PII/PHI) from the records can reduce the need for additional Institutional Review Board (IRB) reviews. In this project, our goals were to: (1) develop a robust and scalable clinical text de-identification pipeline that is compliant with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule for de-identification standards and (2) share routinely updated de-identified clinical notes with researchers. Materials and Methods: Building on our open-source de-identification software called Philter, we added features to: (1) make the algorithm and the de-identified data HIPAA compliant, which also implies type 2 error-free redaction, as certified via external audit; (2) reduce over-redaction errors; and (3) normalize and shift date PHI. We also established a streamlined de-identification pipeline using MongoDB to automatically extract clinical notes and provide truly de-identified notes to researchers with periodic monthly refreshes at our institution. Results: To the best of our knowledge, the Philter V1.0 pipeline is currently the first and only certified, de-identified redaction pipeline that makes clinical notes available to researchers for nonhuman subjects' research, without further IRB approval needed. To date, we have made over 130 million certified de-identified clinical notes available to over 600 UCSF researchers. These notes were collected over the past 40 years, and represent data from 2,757,016 UCSF patients.
- Published
- 2023
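The abstract above mentions normalizing and shifting date PHI. A common way to do this (sketched here under the assumption of a keyed-hash design; the paper does not specify its exact scheme) is to shift every date in a patient's record by the same pseudo-random, patient-specific offset, so intervals between events survive de-identification:

```python
import hashlib
from datetime import date, timedelta

def shift_date(d: date, patient_id: str, secret: str = "local-secret") -> date:
    """Shift a date by a per-patient offset of 1-365 days into the past.

    The offset is derived from a keyed hash, so every date in one patient's
    record moves by the same amount (intervals are preserved), offsets differ
    between patients, and the shift cannot be reversed without the secret.
    """
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % 365 + 1
    return d - timedelta(days=offset)
```

Because the offset depends only on the patient ID and the secret, monthly refreshes of the de-identified corpus shift new notes consistently with old ones.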
26. Summary of the National Cancer Institute 2023 Virtual Workshop on Medical Image De-identification—Part 1: Report of the MIDI Task Group - Best Practices and Recommendations, Tools for Conventional Approaches to De-identification, International Approaches to De-identification, and Industry Panel on Image De-identification
- Author
-
Clunie, David, Prior, Fred, Rutherford, Michael, Moore, Stephen, Parker, William, Kondylakis, Haridimos, Ludwigs, Christian, Klenk, Juergen, Lou, Bob, O’Sullivan, Lawrence (Tony), Marcus, Dan, Dobes, Jiri, Gutman, Abraham, and Farahani, Keyvan
- Published
- 2025
- Full Text
- View/download PDF
27. Summary of the National Cancer Institute 2023 Virtual Workshop on Medical Image De-identification—Part 2: Pathology Whole Slide Image De-identification, De-facing, the Role of AI in Image De-identification, and the NCI MIDI Datasets and Pipeline
- Author
-
Clunie, David, Taylor, Adam, Bisson, Tom, Gutman, David, Xiao, Ying, Schwarz, Christopher G., Greve, Douglas, Gichoya, Judy, Shih, George, Kline, Adrienne, Kopchick, Ben, and Farahani, Keyvan
- Published
- 2025
- Full Text
- View/download PDF
28. Face image de-identification based on feature embedding
- Author
-
Goki Hanawa, Koichi Ito, and Takafumi Aoki
- Subjects
De-identification ,Face recognition ,Biometrics ,Privacy protection ,Electronics ,TK7800-8360 - Abstract
A large number of images are available on the Internet with the growth of social networking services, and many of them are face photos or contain faces. The privacy of face images must be protected against malicious use; face image de-identification techniques address this by making face recognition difficult, thereby preventing the collection of specific face images through face recognition. In this paper, we propose a face image de-identification method that generates a de-identified image from an input face image by embedding facial features extracted from another person's face into the input image. We develop a novel framework for embedding facial features into a face image, together with image- and feature-based loss functions that de-identify a face image while preserving its appearance. Through a set of experiments using public face image datasets, we demonstrate that the proposed method exhibits higher de-identification performance against unknown face recognition models than conventional methods while preserving the appearance of the input face images.
- Published
- 2024
- Full Text
- View/download PDF
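The abstract above describes loss functions "based on images and features" that de-identify while preserving appearance. The general shape of such an objective can be sketched in plain Python (a schematic only: the paper's actual losses, networks, and weighting are not reproduced here, and `lam` is a hypothetical trade-off weight):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def deid_loss(pix_in, pix_out, feat_in, feat_out, lam=1.0):
    """Appearance term (keep the output image close to the input pixels)
    plus an identity term (penalize embedding similarity to the original
    face, so recognition models cannot match the two)."""
    appearance = sum((a - b) ** 2 for a, b in zip(pix_in, pix_out)) / len(pix_in)
    identity = max(0.0, cosine(feat_in, feat_out))  # want features dissimilar
    return appearance + lam * identity
```

Minimizing this pulls the output toward the input image visually while pushing its identity embedding away from the original, which is the tension the paper's framework balances.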
29. Face image de-identification based on feature embedding.
- Author
-
Hanawa, Goki, Ito, Koichi, and Aoki, Takafumi
- Subjects
FACE perception ,SOCIAL networks ,SOCIAL media ,PRIVACY ,BIOMETRY ,HUMAN facial recognition software - Abstract
A large number of images are available on the Internet with the growth of social networking services, and many of them are face photos or contain faces. The privacy of face images must be protected against malicious use; face image de-identification techniques address this by making face recognition difficult, thereby preventing the collection of specific face images through face recognition. In this paper, we propose a face image de-identification method that generates a de-identified image from an input face image by embedding facial features extracted from another person's face into the input image. We develop a novel framework for embedding facial features into a face image, together with image- and feature-based loss functions that de-identify a face image while preserving its appearance. Through a set of experiments using public face image datasets, we demonstrate that the proposed method exhibits higher de-identification performance against unknown face recognition models than conventional methods while preserving the appearance of the input face images. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. LPDi GAN: A License Plate De-Identification Method to Preserve Strong Data Utility.
- Author
-
Li, Xiying, Liu, Heng, Lin, Qunxiong, Sun, Quanzhong, Jiang, Qianyin, and Su, Shuyan
- Subjects
- *
GENERATIVE adversarial networks , *PATTERN recognition systems , *FEATURE extraction , *AUTOMOBILE license plates , *DEEP learning - Abstract
License plate (LP) information is an important part of personal privacy, which is protected by law. However, in some publicly available transportation datasets, the LP areas in the images have not been processed. Other datasets have applied simple de-identification operations such as blurring and masking. Such crude operations lead to a reduction in data utility. In this paper, we propose a method of LP de-identification based on a generative adversarial network (LPDi GAN) to transform an original image into a synthetic one with a generated LP. To maintain the original LP attributes, background features are extracted from the image to generate LPs that are similar to the originals. The LP template and LP style are also fed into the network to obtain synthetic LPs with controllable characters and higher quality. The results show that LPDi GAN can perceive changes in environmental conditions and LP tilt angles, and control the LP characters through the LP templates. The perceptual similarity metric, Learned Perceptual Image Patch Similarity (LPIPS), reaches 0.25 while character recognition on the de-identified images remains effective, demonstrating that LPDi GAN can achieve outstanding de-identification while preserving strong data utility. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
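The "crude operations" baseline that the abstract above contrasts with generative replacement is simply overwriting the plate region. A minimal sketch on a 2D grayscale array (illustrative only; the paper replaces rather than destroys this region) shows why utility drops: every character, tilt, and lighting cue inside the rectangle is erased.

```python
def mask_region(img, y0, y1, x0, x1, fill=0):
    """Crude license-plate masking: overwrite a rectangle with a constant.

    This is the baseline LPDi GAN improves on; it destroys all plate
    information instead of substituting a synthetic but realistic plate.
    """
    return [
        [fill if y0 <= r < y1 and x0 <= c < x1 else v
         for c, v in enumerate(row)]
        for r, row in enumerate(img)
    ]
```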
31. Masketeer: An Ensemble-Based Pseudonymization Tool with Entity Recognition for German Unstructured Medical Free Text.
- Author
-
Baumgartner, Martin, Kreiner, Karl, Wiesmüller, Fabian, Hayn, Dieter, Puelacher, Christian, and Schreier, Günter
- Subjects
LANGUAGE models ,COMPUTATIONAL linguistics ,SAFE harbor ,SENSITIVITY & specificity (Statistics) ,HEALTH Insurance Portability & Accountability Act - Abstract
Background: The recent rise of large language models has triggered renewed interest in medical free text data, which holds critical information about patients and diseases. However, medical free text is also highly sensitive. Therefore, de-identification is typically required but is complicated since medical free text is mostly unstructured. With the Masketeer algorithm, we present an effective tool to de-identify German medical text. Methods: We used an ensemble of different masking classes to remove references to identifiable data from over 35,000 clinical notes in accordance with the HIPAA Safe Harbor Guidelines. To retain additional context for readers, we implemented an entity recognition scheme and corpus-wide pseudonymization. Results: The algorithm performed with a sensitivity of 0.943 and specificity of 0.933. Further performance analyses showed linear runtime complexity (O(n)) with both increasing text length and corpus size. Conclusions: In the future, large language models will likely be able to de-identify medical free text more effectively and thoroughly than handcrafted rules. However, such gold-standard de-identification tools based on large language models are yet to emerge. In the current absence of such, we hope to provide best practices for a robust rule-based algorithm designed with expert domain knowledge. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
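The corpus-wide pseudonymization step described in the abstract above (the ensemble entity recognizer itself is not shown) can be sketched as a stable surrogate map: every occurrence of the same entity, anywhere in the corpus, receives the same placeholder, which preserves context for readers. The bracketed tag format is a hypothetical choice, not Masketeer's actual output format.

```python
class Pseudonymizer:
    """Corpus-wide pseudonymization: each unique entity maps to one stable
    surrogate, so cross-references across a document set stay readable."""

    def __init__(self):
        self._surrogates = {}
        self._counters = {}

    def surrogate(self, entity_text: str, entity_type: str) -> str:
        key = (entity_type, entity_text)
        if key not in self._surrogates:
            n = self._counters.get(entity_type, 0) + 1
            self._counters[entity_type] = n
            self._surrogates[key] = f"[{entity_type}-{n}]"
        return self._surrogates[key]
```

A reader of the masked corpus can still tell that `[PATIENT-1]` in one note is the same person as `[PATIENT-1]` in another, without learning who that is.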
32. Comprehensive and Comparative Analysis of Compliances and Data Privacy Techniques for protecting Data Leakages in Cloud Computing.
- Author
-
Srinivasan, B.
- Subjects
- *
DATABASES , *SMART devices , *COMPUTER software installation , *INTERNET of things , *COMPUTERS - Abstract
In the present era, vast amounts of data and operations are processed online. An individual or a business organization must be able to access data and processes from wherever they work, using computers or smart devices. Most notably, the Internet of Things (IoT) allows devices such as sensors to transfer data over the Internet. Consequently, accessing data and processes from anywhere has become an essential requirement. In the early days of computing and the Internet, data were stored in centralized databases and processed either on a local machine or on the server side, and users had to spend large sums to store data and install software products. After the genesis of Cloud Computing (CC), however, the costs of storage and software installation fell sharply. Nowadays, cloud companies offer various services to their users with essential security and authentication methods. Nevertheless, they offer little clarity regarding data privacy and make no promises about data leakage in the cloud. Currently, a variety of data privacy practices, such as data anonymization, pseudonymization, scrambling, and masking, are implemented in CC to protect data privacy. This paper presents a deep analysis of the merits and demerits of these methods, along with case studies and applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
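Two of the privacy practices named in the abstract above, masking and scrambling, are simple enough to sketch directly (illustrative implementations, not taken from the paper):

```python
import random

def mask_value(s: str, keep_last: int = 4, char: str = "*") -> str:
    """Masking: hide all but the trailing characters (e.g. card numbers)."""
    if len(s) <= keep_last:
        return char * len(s)
    return char * (len(s) - keep_last) + s[-keep_last:]

def scramble(s: str, seed: int) -> str:
    """Scrambling: deterministically shuffle characters. Unlike masking,
    it preserves the value's length and character distribution, which is
    useful for testing but weaker against guessing attacks."""
    chars = list(s)
    random.Random(seed).shuffle(chars)
    return "".join(chars)
```

The trade-off the paper analyzes is visible even here: masking destroys more information (stronger privacy, lower utility), while scrambling keeps statistical properties of the value.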
33. SAIC: Integration of Speech Anonymization and Identity Classification
- Author
-
Cheng, Ming, Diao, Xingjian, Cheng, Shitong, Liu, Wenjun, Kacprzyk, Janusz, Series Editor, Shaban-Nejad, Arash, editor, Michalowski, Martin, editor, and Bianco, Simone, editor
- Published
- 2024
- Full Text
- View/download PDF
34. De-Identification Challenges in Real-World Portuguese Clinical Texts
- Author
-
Prado, Carolina Braun, Gumiel, Yohan Bonescki, Schneider, Elisa Terumi Rubel, Cintho, Lilian Mie Mukai, de Souza, João Vitor Andrioli, Oliveira, Lucas Emanuel Silva e, Paraiso, Emerson Cabrera, Rebelo, Marina Sa, Gutierrez, Marco Antonio, Pires, Fabio Antero, Krieger, José Eduardo, Moro, Claudia, Magjarević, Ratko, Series Editor, Ładyżyński, Piotr, Associate Editor, Ibrahim, Fatimah, Associate Editor, Lackovic, Igor, Associate Editor, Rock, Emilio Sacristan, Associate Editor, Marques, Jefferson Luiz Brum, editor, Rodrigues, Cesar Ramos, editor, Suzuki, Daniela Ota Hisayasu, editor, Marino Neto, José, editor, and García Ojeda, Renato, editor
- Published
- 2024
- Full Text
- View/download PDF
35. Understanding how to identify and manage personal identifying information (PII) to further data interoperability
- Author
-
Zixin Nie
- Subjects
personal identifying information ,de-identification ,data interoperability ,data privacy ,HIPAA ,HIPAA Safe Harbor ,Bibliography. Library science. Information resources - Abstract
Respect for research participant rights is a key aspect for consideration when creating and utilizing interoperable data. From that perspective, requirements for sharing research data often call for the data to be de-identified, i.e., the removal of all personal identifying information (PII) prior to data sharing, to ensure that the participant's data privacy rights are not infringed upon. However, what constitutes PII is often a point of confusion amongst researchers who are not familiar with privacy laws and regulations. This paper hopes to provide some clarity around what makes research data identifiable by presenting the topic from a perspective different from the one most researchers are familiar with. It also provides a framework to help researchers determine where PII could exist within their data, which they can use to support privacy impact evaluations. The goal is to empower researchers to share their data with greater confidence that the privacy rights of their research subjects have been sufficiently protected, enabling access to greater amounts of data for research use.
- Published
- 2024
- Full Text
- View/download PDF
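A first practical step in the kind of privacy impact evaluation the abstract above describes is scanning data for identifier patterns. A minimal sketch with illustrative regular expressions (a tiny subset; a real evaluation covers all 18 HIPAA Safe Harbor categories, and pattern matching alone misses names and other free-text identifiers):

```python
import re

# Illustrative patterns for a few common US-style identifier formats.
PII_PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def find_pii(text: str):
    """Return (type, match) pairs for every identifier found in the text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, m) for m in pattern.findall(text))
    return hits
```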
36. End-to-end pseudonymization of fine-tuned clinical BERT models
- Author
-
Thomas Vakili, Aron Henriksson, and Hercules Dalianis
- Subjects
Natural language processing ,Language models ,BERT ,Electronic health records ,Clinical text ,De-identification ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.
- Published
- 2024
- Full Text
- View/download PDF
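The "end-to-end" point in the abstract above is that pre-training and fine-tuning corpora must be pseudonymized consistently. One way to get that consistency (a sketch under the assumption of deterministic, hash-based surrogate selection; the study's actual surrogate generator is not reproduced here, and the pools are toy examples) is to derive the surrogate from the entity itself:

```python
import hashlib

# Tiny illustrative surrogate pools (a real system uses large name lists).
SURROGATE_POOLS = {
    "FIRST_NAME": ["Alva", "Birgit", "Carl", "Doris"],
    "CITY": ["Uppsala", "Lund", "Umea"],
}

def realistic_surrogate(entity: str, entity_type: str, secret: str = "s") -> str:
    """Pick a realistic, non-sensitive surrogate deterministically, so the
    same entity is replaced identically in the pre-training corpus and in
    every fine-tuning dataset processed later."""
    pool = SURROGATE_POOLS[entity_type]
    h = hashlib.sha256(f"{secret}:{entity_type}:{entity}".encode()).digest()
    return pool[int.from_bytes(h[:4], "big") % len(pool)]
```

Replacing entities with realistic surrogates of the same type, rather than placeholder tags, is what lets the model keep learning natural language statistics from the pseudonymized text.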
37. Deep Learning Framework for Advanced De-Identification of Protected Health Information
- Author
-
Ahmad Aloqaily, Emad E. Abdallah, Rahaf Al-Zyoud, Esraa Abu Elsoud, Malak Al-Hassan, and Alaa E. Abdallah
- Subjects
protected health information ,electronic health record ,deep learning ,de-identification ,Bi-LSTM-CRF ,Information technology ,T58.5-58.64 - Abstract
Electronic health records (EHRs) are widely used in healthcare institutions worldwide, containing vast amounts of unstructured textual data. However, the sensitive nature of Protected Health Information (PHI) embedded within these records presents significant privacy challenges, necessitating robust de-identification techniques. This paper introduces a novel approach, leveraging a Bi-LSTM-CRF model to achieve accurate and reliable PHI de-identification, using the i2b2 dataset sourced from Harvard University. Unlike prior studies that often unify Bi-LSTM and CRF layers, our approach focuses on the individual design, optimization, and hyperparameter tuning of both the Bi-LSTM and CRF components, allowing for precise model performance improvements. This rigorous approach to architectural design and hyperparameter tuning, often underexplored in the existing literature, significantly enhances the model’s capacity for accurate PHI tag detection while preserving the essential clinical context. Comprehensive evaluations are conducted across 23 PHI categories, as defined by HIPAA, ensuring thorough security across critical domains. The optimized model achieves exceptional performance metrics, with a precision of 99%, recall of 98%, and F1-score of 98%, underscoring its effectiveness in balancing recall and precision. By enabling the de-identification of medical records, this research strengthens patient confidentiality, promotes compliance with privacy regulations, and facilitates safe data sharing for research and analysis.
- Published
- 2025
- Full Text
- View/download PDF
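A sequence labeller like the Bi-LSTM-CRF in the abstract above emits one tag per token, typically in BIO format (B- begins a PHI span, I- continues it, O is outside). Turning those tags back into redactable spans is a small, standard decoding step; the sketch below assumes BIO tagging, which the paper's i2b2-style setup commonly uses but does not spell out here:

```python
def bio_decode(tokens, tags):
    """Collapse per-token BIO tags into (text, label) PHI spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                    # new span begins
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)                   # span continues
        else:                                       # O tag or broken sequence
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans
```

Each decoded span can then be removed or replaced with a type-appropriate surrogate, one of the 23 HIPAA-defined PHI categories the paper evaluates.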
38. Fully generated mammogram patch dataset using CycleGAN with de-identification texture analysis
- Author
-
Richmond, Luke, Trivedi, Hari, and Deshpande, Priya
- Published
- 2024
- Full Text
- View/download PDF
39. End-to-end pseudonymization of fine-tuned clinical BERT models: Privacy preservation with maintained data utility.
- Author
-
Vakili, Thomas, Henriksson, Aron, and Dalianis, Hercules
- Subjects
LANGUAGE models ,DATA privacy ,PRIVACY ,NATURAL language processing - Abstract
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. Fast refacing of MR images with a generative neural network lowers re‐identification risk and preserves volumetric consistency.
- Author
-
Molchanova, Nataliia, Maréchal, Bénédicte, Thiran, Jean‐Philippe, Kober, Tobias, Huelnhagen, Till, and Richiardi, Jonas
- Subjects
- *
GENERATIVE adversarial networks , *MAGNETIC resonance imaging , *COMPUTATIONAL complexity - Abstract
With the rise of open data, identifiability of individuals based on 3D renderings obtained from routine structural magnetic resonance imaging (MRI) scans of the head has become a growing privacy concern. To protect subject privacy, several algorithms have been developed to de-identify imaging data using blurring, defacing or refacing. Completely removing facial structures provides the best re-identification protection but can significantly impact post-processing steps, like brain morphometry. As an alternative, refacing methods that replace individual facial structures with generic templates have a lower effect on the geometry and intensity distribution of original scans, and are able to provide more consistent post-processing results at the price of higher re-identification risk and computational complexity. In the current study, we propose a novel method for anonymized face generation for defaced 3D T1-weighted scans based on a 3D conditional generative adversarial network. To evaluate the performance of the proposed de-identification tool, a comparative study was conducted between several existing defacing and refacing tools, with two different segmentation algorithms (FAST and Morphobox). The aim was to evaluate (i) impact on brain morphometry reproducibility, (ii) re-identification risk, (iii) balance between (i) and (ii), and (iv) the processing time. The proposed method takes 9 s for face generation and is suitable for recovering consistent post-processing results after defacing. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. De-Identifying GRASCCO -- A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus.
- Author
-
LOHR, Christina, MATTHIES, Franz, FALLER, Jakob, MODERSOHN, Luise, RIEDEL, Andrea, HAHN, Udo, KISER, Rebekka, BOEKER, Martin, and MEINEKE, Frank
- Abstract
Introduction: The German Medical Text Project (GeMTeX) is one of the largest infrastructure efforts targeting German-language clinical documents. We here introduce the architecture of the de-identification pipeline of GeMTeX. Methods: This pipeline comprises the export of raw clinical documents from the local hospital information system, the import into the annotation platform INCEpTION, fully automatic pre-tagging with protected health information (PHI) items by the Averbis Health Discovery pipeline, a manual curation step of these preannotated data, and, finally, the automatic replacement of PHI items with type-conformant substitutes. This design was implemented in a pilot study involving six annotators and two curators each at the Data Integration Centers of the University Hospitals Leipzig and Erlangen. Results: As a proof of concept, the publicly available Graz Synthetic Text Clinical Corpus (GRASCCO) was enhanced with PHI annotations in an annotation campaign for which reasonable inter-annotator agreement values of Krippendorff's α ≈ 0.97 can be reported. Conclusion: These curated 1.4K PHI annotations are released as open-source data constituting the first publicly available German clinical language text corpus with PHI metadata. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
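The abstract above reports agreement as Krippendorff's α ≈ 0.97. For nominal labels, two coders, and no missing data, α reduces to a short computation over the coincidence matrix (1 minus observed over expected disagreement); a minimal sketch, not the tooling the project itself used:

```python
from collections import Counter

def krippendorff_alpha_nominal(coder1, coder2):
    """Krippendorff's alpha for nominal data, two coders, no missing values."""
    coincidences = Counter()
    for x, y in zip(coder1, coder2):
        coincidences[(x, y)] += 1   # each unit contributes both orderings
        coincidences[(y, x)] += 1
    n_c = Counter()                  # marginal totals per label
    for (x, _), cnt in coincidences.items():
        n_c[x] += cnt
    n = sum(n_c.values())
    observed = sum(cnt for (x, y), cnt in coincidences.items() if x != y) / n
    expected = (n * n - sum(v * v for v in n_c.values())) / (n * (n - 1))
    return 1.0 if expected == 0 else 1 - observed / expected
```

Unlike raw percent agreement, α corrects for chance agreement via the expected-disagreement term, which is why it is the usual report for annotation campaigns like this one.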
42. Health Data Re-Identification: Assessing Adversaries and Potential Harms.
- Author
-
MEURERS, Thierry, BAUM, Lena, HABER, Anna Christine, HALILOVIC, Mehmed, HEINZ, Birgit, MILICEVIC, Vladimir, NEVES, Diogo Telmo, OTTE, Karen, PASQUIER, Anna, POIKELA, Maija, SHEYKHOLESLAMI, Maryam, WIRTH, Felix, and PRASSER, Fabian
- Abstract
Sharing biomedical data for research can help to improve disease understanding and support the development of preventive, diagnostic, and therapeutic methods. However, it is vital to balance the amount of data shared and the sharing mechanism chosen with the privacy protection provided. This requires a detailed understanding of potential adversaries who might attempt to re-identify data and the consequences of their actions. The aim of this paper is to present a comprehensive list of potential types of adversaries, motivations, and harms to targeted individuals. A group of 13 researchers performed a three-step process in a one-day workshop, involving the identification of adversaries, the categorization by motivation, and the deduction of potential harms. The group collected 28 suggestions and categorized them into six types, each associated with several of six distinct harms. The findings align with previous efforts in structuring threat actors and outcomes and we believe that they provide a robust foundation for evaluating re-identification risks and developing protection measures in health data sharing scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. Differentially private de-identifying textual medical document is compliant with challenging NLP analyses: Example of privacy-preserving ICD-10 code association
- Author
-
Yakini Tchouka, Jean-François Couchot, David Laiymani, Philippe Selles, and Azzedine Rahmani
- Subjects
De-identification ,Clinical data ,Local differential privacy ,Metric-privacy ,Natural language processing ,ICD-10 code association ,Cybernetics ,Q300-390 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Medical research plays a crucial role within scientific research. Technological advancements, especially those related to the rise of machine learning, pave the way for the exploration of medical issues that were once beyond reach. Unstructured textual data, such as correspondence between doctors, operative reports, etc., often serve as a starting point for many medical applications. However, for obvious privacy reasons, researchers do not legally have the right to access these documents as long as they contain sensitive data, as defined by regulations like GDPR (General Data Protection Regulation) or HIPAA (Health Insurance Portability and Accountability Act). De-identification, meaning the detection, removal or substitution of all sensitive information, is therefore a necessary step to facilitate the sharing of these data between the medical field and research. Over the past decade, various approaches have been proposed to de-identify medical textual data. However, while entity detection is a well-known task in the natural language processing field, it presents some specific challenges in the medical context. Moreover, existing substitution methods proposed in the literature often pay little attention to the medical relevance of de-identified data or are not very resilient to attacks. This paper addresses these challenges. Firstly, an efficient system for detecting sensitive entities in French medical data and then accurately substituting them was implemented. Secondly, robust substitute-generation strategies were provided that incorporate the medical utility of the data, minimizing the difference in utility between the original and de-identified data, and that mathematically ensure privacy protection. Thirdly, the utility of the de-identification system in a context of ICD-10 code association was evaluated. Finally, various systems developed to tackle ICD-10 code association were presented while providing a state-of-the-art model in French.
- Published
- 2024
- Full Text
- View/download PDF
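The mathematical privacy guarantee mentioned in the abstract above typically comes from a differentially private selection rule. A common building block for metric-privacy text substitution (sketched here as a generic exponential mechanism; the paper's actual mechanism and distance metric are not reproduced) samples a replacement token with probability decaying in its semantic distance from the original:

```python
import math
import random

def private_substitute(candidates, distances, epsilon, rng=None):
    """Exponential-mechanism sketch: sample a replacement token with
    probability proportional to exp(-epsilon * distance / 2). Nearby
    (semantically similar) substitutes are favoured, but any candidate
    can be chosen, which is what provides plausible deniability."""
    rng = rng or random.Random()
    weights = [math.exp(-epsilon * d / 2) for d in distances]
    total = sum(weights)
    r = rng.random() * total
    for token, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return token
    return candidates[-1]
```

Smaller `epsilon` flattens the distribution (more privacy, less utility); larger `epsilon` nearly always returns the closest candidate. This is the privacy/utility dial the paper tunes against the ICD-10 coding task.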
44. Anonymization and Pseudonymization of FHIR Resources for Secondary Use of Healthcare Data
- Author
-
Emanuele Raso, Pierpaolo Loreti, Michele Ravaziol, and Lorenzo Bracciale
- Subjects
Anonymisation ,de-identification ,FHIR ,healthcare ,pseudonymisation ,privacy ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Along with the creation of medical profiles of patients, Electronic Health Records have several secondary missions, such as health economy and research. The recent, increasing adoption of a common standard, i.e., the Fast Healthcare Interoperability Resources (FHIR), makes it easier to exchange medical data among the several parties involved, for example, in an epidemiological research activity. However, this exchange process is hindered by regulatory frameworks due to privacy issues related to the presence of personal information, which allows patients to be identified directly (or indirectly) from their medical data. When properly used, de-identification techniques can provide crucial support in overcoming these problems. FHIR-DIET aims to bring flexibility and concreteness to the implementation of de-identification of health data, supporting many customised data-processing behaviours that can be easily configured and tailored to match specific use case requirements. Our solution enables faster and easier cooperation between legal and IT professionals to establish and implement de-identification rules. The performance evaluation demonstrates the viability of processing hundreds of FHIR patient information data per second using standard hardware. We believe FHIR-DIET can be a valuable tool to satisfy current regulatory requirements and help create added value for the secondary use of healthcare data.
- Published
- 2024
- Full Text
- View/download PDF
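FHIR resources are exchanged as JSON, so the configurable rules the abstract above describes boil down to dropping or generalizing fields in a resource tree. A minimal sketch on a FHIR Patient resource represented as a dict (the rule set here is hypothetical and much simpler than FHIR-DIET's actual configuration format):

```python
import copy

# Hypothetical rule set: Patient fields to drop entirely.
DROP_FIELDS = {"name", "telecom", "address", "identifier"}

def deidentify_patient(resource: dict) -> dict:
    """Return a de-identified copy of a FHIR Patient resource (as a dict)."""
    out = copy.deepcopy(resource)
    for field in DROP_FIELDS:
        out.pop(field, None)
    if "birthDate" in out:              # generalize full date to year only
        out["birthDate"] = out["birthDate"][:4]
    return out
```

A real pipeline must also pseudonymize the resource `id` and every cross-resource reference consistently, or linked resources would re-identify the patient; that consistency requirement is exactly why a dedicated tool is useful.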
45. Face Swapping for Low-Resolution and Occluded Images In-the-Wild
- Author
-
Jaehyun Park, Wonjun Kang, Hyung Il Koo, and Nam Ik Cho
- Subjects
Deep learning ,de-identification ,face swapping ,in-the-wild ,low-resolution ,occlusion ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Safeguarding personal identity in various surveillance videos, dashcams, and on-street videos is crucial. One way to do this is to detect faces and blur them, but a better solution is to replace them with non-existent ones to maintain the naturalness of the videos. While face swapping methods have already been used in the media industry with high-quality faces, it is challenging to apply them for identity protection to faces in-the-wild where faces are often occluded and of low-resolution. Therefore, we propose a new framework for face swapping specifically designed to work with face images taken in real-world scenarios, making it useful as a privacy protection method. To tackle the issue of low-resolution images, we introduce a Cross-Resolution Contrastive Loss (CRCL) technique, which allows our neural network model to be trained using triplets of varying resolutions. This enables the model to learn and use identity information across different resolutions, thereby improving its accuracy. We also propose a plug-and-play framework that can be easily applied to existing face swapping models to handle occlusions. By explicit swapping of facial features and filling of occluded regions, our framework provides a more seamless blend. To demonstrate the effectiveness of our method in handling faces in-the-wild, we create an occluded VGGFace2 dataset consisting of face images augmented with various facial masks and hand occlusions. Through quantitative and qualitative assessments on this dataset, our proposed method demonstrates robust performance under low-resolution or occluded scenarios. Significant improvements are made in the quality of swapped faces while preserving their identity and attributes, highlighting the effectiveness of our framework in advancing face swapping as a reliable privacy protection measure.
- Published
- 2024
- Full Text
- View/download PDF
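The Cross-Resolution Contrastive Loss in the abstract above trains on triplets spanning resolutions. Its margin-based core can be sketched in plain Python (a schematic only: the distance function, margin value, and exact CRCL formulation are assumptions, with the anchor taken from a low-resolution crop and positive/negative samples from other resolutions):

```python
import math

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cross_res_triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull same-identity embeddings together across resolutions; push
    different identities apart by at least the margin."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

Training with anchors and positives at mismatched resolutions is what forces the identity embedding to become resolution-invariant, the property the paper needs for faces in-the-wild.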
46. Zero and Few Short Learning Using Large Language Models for De-Identification of Medical Records
- Author
-
Y. S. Yashwanth and Rajashree Shettar
- Subjects
De-identification ,Bard ,fine-tuning ,GPT-3.5 ,GPT-4 ,PaLM ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
The paper aims to evaluate and provide a comparative analysis of the performance and fine-tuning cost of various Large Language Models (LLMs) such as GPT-3.5, GPT-4, PaLM, Bard, and Llama in automating the de-identification of Protected Health Information (PHI) from medical records, ensuring patient and healthcare professional privacy. Zero-shot learning was utilized initially to assess the capabilities of these LLMs in de-identifying medical data. Subsequently, each model was fine-tuned with varying training set sizes to observe changes in performance. The study also investigates the impact of the specificity of prompts on the accuracy of de-identification tasks. Fine-tuning LLMs with specific examples significantly enhanced the accuracy of the de-identification process, surpassing the zero-shot learning accuracy of pre-trained counterparts. Notably, a fine-tuned GPT-3.5 model with a few-shot learning technique was able to exceed the performance of a zero-shot learning GPT-4 model, with 99% accuracy. Detailed prompts resulted in higher task accuracy across all models, yet fine-tuned models with brief instructions still outperformed pre-trained models given detailed prompts. Also, the fine-tuned models were more resilient to medical record format change than the zero-shot models. Code, calculations, and comparisons are available at https://github.com/YashwanthYS/De-Identification-of-medical-Records. The findings underscore the potential of LLMs, particularly when fine-tuned, to effectively automate the de-identification of PHI in medical records. The study highlights the importance of model training and prompt specificity in achieving high accuracy in de-identification tasks.
- Published
- 2024
- Full Text
- View/download PDF
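The zero-shot setup this abstract describes can be sketched roughly as follows. The prompt wording and the regex fallback are illustrative assumptions for this entry, not the authors' code (their code is linked in the abstract above).

```python
import re

# Hypothetical zero-shot de-identification sketch. A production system would
# send the built prompt to an LLM (GPT-3.5, GPT-4, PaLM, etc.); the regex
# baseline below only illustrates the kind of PHI patterns being targeted.

ZERO_SHOT_PROMPT = (
    "Replace every piece of Protected Health Information (PHI) in the medical "
    "record below with the tag [REDACTED]. PHI includes names, dates, phone "
    "numbers, and record IDs. Return only the de-identified text.\n\n"
    "Record:\n{record}"
)

def build_prompt(record: str) -> str:
    """Format a zero-shot de-identification prompt for an LLM."""
    return ZERO_SHOT_PROMPT.format(record=record)

def regex_baseline(record: str) -> str:
    """Toy non-LLM baseline: redact only dates and US-style phone numbers."""
    record = re.sub(r"\b\d{2}/\d{2}/\d{4}\b", "[REDACTED]", record)  # dates
    record = re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "[REDACTED]", record)  # phones
    return record
```

The abstract's key finding is that fine-tuning with even a few examples (few-shot) beats such zero-shot prompting, and that brief instructions suffice once a model is fine-tuned.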
47. OBIA: An Open Biomedical Imaging Archive
- Author
- Enhui Jin, Dongli Zhao, Gangao Wu, Junwei Zhu, Zhonghuang Wang, Zhiyao Wei, Sisi Zhang, Anke Wang, Bixia Tang, Xu Chen, Yanling Sun, Zhe Zhang, Wenming Zhao, and Yuanguang Meng
- Subjects
Open Biomedical Imaging Archive ,Database ,Biomedical imaging ,De-identification ,Quality control ,Biology (General) ,QH301-705.5 ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
With the development of artificial intelligence (AI) technologies, biomedical imaging data play an important role in scientific research and clinical application, but the available resources are limited. Here we present Open Biomedical Imaging Archive (OBIA), a repository for archiving biomedical imaging and related clinical data. OBIA adopts five data objects (Collection, Individual, Study, Series, and Image) for data organization, and accepts the submission of biomedical images of multiple modalities, organs, and diseases. In order to protect personal privacy, OBIA has formulated a unified de-identification and quality control process. In addition, OBIA provides friendly and intuitive web interfaces for data submission, browsing, and retrieval, as well as image retrieval. As of September 2023, OBIA has housed data for a total of 937 individuals, 4136 studies, 24,701 series, and 1,938,309 images covering 9 modalities and 30 anatomical sites. Collectively, OBIA provides a reliable platform for biomedical imaging data management and offers free open access to all publicly available data to support research activities throughout the world. OBIA can be accessed at https://ngdc.cncb.ac.cn/obia.
- Published
- 2023
- Full Text
- View/download PDF
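The five-level data organization named in this abstract (Collection, Individual, Study, Series, Image) can be sketched as nested records. The field names beyond the five object names are illustrative assumptions, not OBIA's actual schema.

```python
from dataclasses import dataclass, field

# Sketch of OBIA's five data objects. Only the object names come from the
# abstract; all attributes here are hypothetical placeholders.

@dataclass
class Image:
    file_name: str

@dataclass
class Series:
    modality: str                      # e.g., "MRI", "CT"
    images: list = field(default_factory=list)

@dataclass
class Study:
    anatomical_site: str
    series: list = field(default_factory=list)

@dataclass
class Individual:
    individual_id: str                 # de-identified accession, never a name
    studies: list = field(default_factory=list)

@dataclass
class Collection:
    title: str
    individuals: list = field(default_factory=list)
```

The reported September 2023 counts (937 individuals, 4136 studies, 24,701 series, 1,938,309 images) map directly onto these four nested levels under Collection.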
48. Face De-Identification Using Convolutional Neural Network (CNN) Models for Visual-Copy Detection.
- Author
- Song, Jinha, Kim, Juntae, and Nang, Jongho
- Subjects
CONVOLUTIONAL neural networks ,DEEP learning ,GENERATIVE adversarial networks - Abstract
The proliferation of media-sharing platforms has led to issues with illegally edited content and the distribution of pornography. To protect personal information, de-identification technologies are being developed to prevent facial identification. Existing de-identification methods directly alter the pixel values in the face region, leading to reduced feature representation and identification accuracy. This study aims to develop a method that minimizes the possibility of personal identification while effectively preserving important features for image- and video-copy-detection tasks, proposing a new deep-learning-based de-identification approach that surpasses traditional pixel-based alteration methods. We introduce two de-identification models using different approaches: one emphasizing the contours of the original face through feature inversion and the other generating a blurred version of the face using D2GAN (Dual Discriminator Generative Adversarial Network). Both models were evaluated on their performance in image- and video-copy-detection tasks before and after de-identification, demonstrating effective feature preservation. This research presents new possibilities for personal-information protection and digital-content security, contributing to digital-rights management and law enforcement. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
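The "existing de-identification methods" this abstract contrasts against — direct alteration of pixel values in the face region — can be sketched as a simple box blur over a face bounding box. The coordinates and kernel size are illustrative; the paper's own approach (feature inversion and D2GAN) is a learned replacement for this baseline, not shown here.

```python
import numpy as np

# Traditional pixel-alteration baseline: box-blur a face bounding box.
# box = (y0, y1, x0, x1) in pixel coordinates; k is the blur kernel width.

def blur_face(image: np.ndarray, box, k: int = 5) -> np.ndarray:
    """Return a copy of a grayscale image with the boxed region box-blurred."""
    y0, y1, x0, x1 = box
    region = image[y0:y1, x0:x1].astype(float)
    pad = k // 2
    padded = np.pad(region, pad, mode="edge")
    out = np.zeros_like(region)
    for dy in range(k):                # accumulate the k*k shifted windows
        for dx in range(k):
            out += padded[dy:dy + region.shape[0], dx:dx + region.shape[1]]
    result = image.copy()
    result[y0:y1, x0:x1] = (out / (k * k)).astype(image.dtype)
    return result
```

Such pixel-level destruction is exactly what reduces feature representation for copy-detection, motivating the generative alternatives the paper proposes.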
49. Expanding the Role of Justice in Secondary Research Using Digital Psychological Data.
- Author
- Herington, Jonathan, Li, Kevin, and Pisani, Anthony R.
- Subjects
- *PSYCHIATRY laws , *DATA quality , *HUMAN research subjects , *PATIENT participation , *PARTICIPANT-researcher relationships , *DIGITAL technology , *STAKEHOLDER analysis , *SOCIAL values , *SOCIAL justice , *COOPERATIVENESS , *CONFIDENTIAL communications , *SOCIAL stigma , *RESEARCH ethics , *RESPONSIBILITY , *INFORMED consent (Medical law) , *INTELLECT , *DATA security , *RESEARCH bias , *STATISTICAL models , *SECONDARY analysis - Abstract
Secondary analysis of digital psychological data (DPD) is an increasingly popular method for behavioral health research. Under current practices, secondary research does not require human subjects research review so long as data are de-identified. We argue that this standard approach to the ethics of secondary research (i.e., de-identification) does not address a range of ethical risks and that greater emphasis should be placed on the ethical principle of justice. We outline the inadequacy of an individually focused research ethic for DPD and describe unaddressed "social risks" generated by secondary research of DPD. These risks exist in the "circumstances of justice": that is, a circumstance where individuals must cooperate to create a public good (e.g., research knowledge), and where it is impractical to individually exempt individuals. This requires researchers to emphasize the just allocation of benefits and burdens against a background of social cooperation. We explore six considerations for researchers who wish to conduct research with DPD without explicit consent: (a) create socially valuable knowledge, (b) fairly share the benefits and burdens of research, (c) be transparent about data use, (d) create mechanisms for withdrawal of data, (e) ensure that stakeholders can provide input into the design and implementation of the research, and (f) responsibly report results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. An Extensible Evaluation Framework Applied to Clinical Text Deidentification Natural Language Processing Tools: Multisystem and Multicorpus Study.
- Author
- Heider, Paul M and Meystre, Stéphane M
- Subjects
NATURAL language processing ,PERSONALLY identifiable information ,TEST systems ,EVALUATION methodology ,CORPORA - Abstract
Background: Clinical natural language processing (NLP) researchers need access to directly comparable evaluation results for applications such as text deidentification across a range of corpus types and the means to easily test new systems or corpora within the same framework. Current systems, reported metrics, and the personally identifiable information (PII) categories evaluated are not easily comparable. Objective: This study presents an open-source and extensible end-to-end framework for comparing clinical NLP system performance across corpora even when the annotation categories do not align. Methods: As a use case for this framework, we use 6 off-the-shelf text deidentification systems (ie, CliniDeID, deid from PhysioNet, MITRE Identity Scrubber Toolkit [MIST], NeuroNER, National Library of Medicine [NLM] Scrubber, and Philter) across 3 standard clinical text corpora for the task (2 of which are publicly available) and 1 private corpus (all in English), with annotation categories that are not directly analogous. The framework is built on shell scripts that can be extended to include new systems, corpora, and performance metrics. We present this open tool, multiple means for aligning PII categories during evaluation, and our initial timing and performance metric findings. Code for running this framework with all settings needed to run all pairs are available via Codeberg and GitHub. Results: From this case study, we found large differences in processing speed between systems. The fastest system (ie, MIST) processed an average of 24.57 (SD 26.23) notes per second, while the slowest (ie, CliniDeID) processed an average of 1.00 notes per second. No system uniformly outperformed the others at identifying PII across corpora and categories. Instead, a rich tapestry of performance trade-offs emerged for PII categories. 
CliniDeID and Philter prioritize recall over precision (with an average recall 6.9 and 11.2 points higher, respectively, for partially matching spans of text matching any PII category), while the other 4 systems consistently have higher precision (with MIST's precision scoring 20.2 points higher, NLM Scrubber scoring 4.4 points higher, NeuroNER scoring 7.2 points higher, and deid scoring 17.1 points higher). The macroaverage recall across corpora for identifying names, one of the more sensitive PII categories, included deid (48.8%) and MIST (66.9%) at the low end and NeuroNER (84.1%), NLM Scrubber (88.1%), and CliniDeID (95.9%) at the high end. A variety of metrics across categories and corpora are reported with a wider variety (eg, F2-score) available via the tool. Conclusions: NLP systems in general and deidentification systems and corpora in our use case tend to be evaluated in stand-alone research articles that only include a limited set of comparators. We hold that a single evaluation pipeline across multiple systems and corpora allows for more nuanced comparisons. Our open pipeline should reduce barriers to evaluation and system advancement. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
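The span-level precision/recall trade-offs this abstract reports can be illustrated with a minimal metric sketch. "Partial match" here means any character overlap between a predicted and a gold PII span; the real framework supports several alignment strategies, and this simplified scoring is an assumption for illustration.

```python
# Sketch of partial-match span metrics for PII de-identification evaluation.
# F-beta with beta=2 (the F2-score the abstract mentions) weights recall
# more heavily than precision, which matters for privacy-sensitive tasks.

def overlaps(a, b):
    """True if half-open character spans (start, end) share any characters."""
    return a[0] < b[1] and b[0] < a[1]

def span_metrics(gold, predicted, beta=2.0):
    """Partial-match precision, recall, and F-beta over PII spans."""
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, fbeta
```

Under a recall-weighted score like F2, a system such as CliniDeID that trades precision for recall ranks higher than the raw precision numbers alone would suggest.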