Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics.

Authors :
Seuss H
Dankerl P
Ihle M
Grandjean A
Hammon R
Kaestle N
Fasching PA
Maier C
Christoph J
Sedlmayr M
Uder M
Cavallaro A
Hammon M
Source :
RoFo : Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin [Rofo] 2017 Jul; Vol. 189 (7), pp. 661-671. Date of Electronic Publication: 2017 Mar 23.
Publication Year :
2017

Abstract

Purpose: Projects involving collaborations between different institutions require data security via selective de-identification of words or phrases. A semi-automated de-identification tool was developed and evaluated on different types of medical reports natively and after adapting the algorithm to the text structure.

Materials and Methods: A semi-automated de-identification tool was developed and evaluated for its sensitivity and specificity in detecting sensitive content in written reports. Data from 4671 pathology reports (4105 + 566 in two different formats), 2804 medical reports, 1008 operation reports, and 6223 radiology reports of 1167 patients suffering from breast cancer were de-identified. The content was itemized into four categories: direct identifiers (name, address), indirect identifiers (date of birth/operation, medical ID, etc.), medical terms, and filler words. The software was first tested natively (without training) to establish a baseline. The reports were then manually edited and the model was re-trained for the next test set. Re-training was applied after manually editing 25, 50, 100, 250, 500 and, where applicable, 1000 reports of each type.

Results: In the native test, 61.3 % of direct and 80.8 % of indirect identifiers were detected. With training, the performance (P) increased to 91.4 % (P25), 96.7 % (P50), 99.5 % (P100), 99.6 % (P250), 99.7 % (P500) and 100 % (P1000) for direct identifiers and to 93.2 % (P25), 97.9 % (P50), 97.2 % (P100), 98.9 % (P250), 99.0 % (P500) and 99.3 % (P1000) for indirect identifiers. Without training, 5.3 % of medical terms were falsely flagged as critical data. After training, this rate was 4.0 % (P25), 3.6 % (P50), 4.0 % (P100), 3.7 % (P250), 4.3 % (P500), and 3.1 % (P1000). Roughly 0.1 % of filler words were falsely flagged.

Conclusion: Training of the developed de-identification tool continuously improved its performance. Training with roughly 100 edited reports enables reliable detection and labeling of sensitive data in different types of medical reports.

Key Points:
· Collaborations between different institutions require de-identification of patients' data.
· Software-based de-identification of content-sensitive reports grows in importance as a result of 'Big data'.
· De-identification software was developed and tested natively and after training.
· The proposed de-identification software worked quite reliably following training with roughly 100 edited reports.
· A final check of the texts by an authorized person remains necessary.

Citation Format:
· Seuss H, Dankerl P, Ihle M et al. Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics. Fortschr Röntgenstr 2017; 189: 661 - 671.

Compliance with ethical standards: This retrospective study was conducted in accordance with the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of the University Hospital Erlangen. The requirement for written informed consent was waived by the ethics committee.

Conflict of interest: Matthias Ihle and Andrea Grandjean are employees of the Averbis GmbH, Freiburg, Germany. They provided the software within the context of the Smart Data Program in the KDI project of the Federal Ministry for Economic Affairs and Energy, Germany (01MT14 001E). They did not participate in the design, conduct and evaluation of the study. Therefore, the authors declare that no competing interests exist.

Funding: This research has been supported by the Smart Data Program in the KDI project of the Federal Ministry for Economic Affairs and Energy, Germany (01MT14 001E).

(© Georg Thieme Verlag KG Stuttgart · New York.)
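Note on the reported percentages: the abstract gives, per content category, the share of tokens the tool flagged as sensitive. For direct and indirect identifiers this share corresponds to a detection rate (sensitivity); for medical terms and filler words it corresponds to a false-flag rate, whose complement is the specificity. The record does not state the underlying token counts or the exact formulas used, so the following is only a minimal sketch of how such rates could be computed. The class name `CategoryCounts`, the function `flagged_rate`, and all counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryCounts:
    """Token counts for one content category in one test set (hypothetical)."""
    flagged: int  # tokens of this category that the tool marked as sensitive
    total: int    # all tokens of this category in the manually edited reference


def flagged_rate(counts: CategoryCounts) -> float:
    """Return the percentage of tokens in the category that were flagged."""
    if counts.total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= counts.flagged <= counts.total:
        raise ValueError("flagged must lie between 0 and total")
    return 100.0 * counts.flagged / counts.total


# Made-up counts for illustration only; they are NOT data from the study.
direct_identifiers = CategoryCounts(flagged=90, total=100)
medical_terms = CategoryCounts(flagged=5, total=200)

# For identifiers, the flagged share is the detection rate (sensitivity).
sensitivity = flagged_rate(direct_identifiers)

# For non-sensitive categories, the flagged share is the false-flag rate,
# and specificity is its complement.
false_flag_rate = flagged_rate(medical_terms)
specificity = 100.0 - false_flag_rate

print(f"sensitivity: {sensitivity:.1f} %")
print(f"false-flag rate: {false_flag_rate:.1f} %")
print(f"specificity: {specificity:.1f} %")
```

Read this way, the baseline figure of 5.3 % falsely flagged medical terms would correspond to a specificity of 94.7 % for that category, assuming the complement relationship sketched here matches the authors' definition.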

Details

Language :
English
ISSN :
1438-9010
Volume :
189
Issue :
7
Database :
MEDLINE
Journal :
RoFo : Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin
Publication Type :
Academic Journal
Accession number :
28335044
Full Text :
https://doi.org/10.1055/s-0043-102939