Towards automated generation of curated datasets in radiology: Application of natural language processing to unstructured reports exemplified on CT for pulmonary embolism
- Source :
- European journal of radiology. 125
- Publication Year :
- 2019
Abstract
- Purpose: To design and evaluate a self-trainable natural language processing (NLP)-based procedure to classify unstructured radiology reports. The method enabling the generation of curated datasets is exemplified on CT pulmonary angiogram (CTPA) reports.
- Method: We extracted the impressions of CTPA reports created at our institution from 2016 to 2018 (n = 4397; language: German). The status (pulmonary embolism: yes/no) was manually labelled for all exams. Data from 2016/2017 (n = 2801) served as a ground truth to train three NLP architectures that only require a subset of reference datasets for training to be operative. The three architectures were as follows: a convolutional neural network (CNN), a support vector machine (SVM) and a random forest (RF) classifier. Impressions of 2018 (n = 1377) were kept aside and used for general performance measurements. Furthermore, we investigated the dependence of classification performance on the amount of training data with multiple simulations.
- Results: The classification performance of all three models was excellent (accuracies: 97 %–99 %; F1 scores: 0.88–0.97; AUCs: 0.993–0.997). Highest accuracy was reached by the CNN with 99.1 % (95 % CI 98.5–99.6 %). Training with 470 labelled impressions was sufficient to reach an accuracy of > 93 % with all three NLP architectures.
- Conclusion: Our NLP-based approaches allow for an automated and highly accurate retrospective classification of CTPA reports with manageable effort solely using unstructured impression sections. We demonstrated that this approach is useful for the classification of radiology reports not written in English. Moreover, excellent classification performance is achieved at relatively small training set sizes.
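The accuracy, F1 score and confidence interval quoted in the abstract are standard quantities for a binary classifier. The following is a minimal sketch of how such figures can be computed from reference and predicted labels. It is not the authors' code: the function names `binary_metrics` and `wilson_interval`, the toy labels, and the choice of the Wilson score interval are all assumptions made for illustration, since the abstract does not state which interval method was used.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, 1 = embolism, 0 = no embolism."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95 %).

    Assumption: the abstract does not say how its CI was derived; Wilson is
    one common choice for proportions close to 0 or 1.
    """
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Toy labels, invented for illustration only (not data from the study).
reference = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0]

accuracy, f1 = binary_metrics(reference, predicted)
n_correct = sum(1 for t, p in zip(reference, predicted) if t == p)
low, high = wilson_interval(n_correct, len(reference))
print(f"accuracy={accuracy:.3f} F1={f1:.3f} CI=({low:.3f}, {high:.3f})")
```

With only a handful of labels the interval is wide; it narrows as the number of held-out reports grows, which is why a test set of more than a thousand impressions can support an interval of about one percentage point.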
- Subjects :
- Male
Support Vector Machine
Pulmonary angiogram
Datasets as Topic
Pulmonary Artery
Convolutional neural network
030218 nuclear medicine & medical imaging
03 medical and health sciences
0302 clinical medicine
Classifier (linguistics)
Image Interpretation, Computer-Assisted
Medicine
Humans
Radiology, Nuclear Medicine and imaging
Aged
Natural Language Processing
Retrospective Studies
Ground truth
Training set
Data curation
General Medicine
Random forest
030220 oncology & carcinogenesis
Area Under Curve
Female
Radiology
Artificial intelligence
Neural Networks, Computer
Pulmonary Embolism
Tomography, X-Ray Computed
Details
- ISSN :
- 18727727
- Volume :
- 125
- Database :
- OpenAIRE
- Journal :
- European journal of radiology
- Accession number :
- edsair.doi.dedup.....86d777dcd2b384a3849d789f61db8778