1. Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients.
- Author
Paul, Tanmoy; Islam, Humayera; Singh, Nitesh; Jampani, Yaswitha; Kotapati, Teja Venkat Pavan; Tautam, Preethi Aishwarya; Rana, Md Kamruz Zaman; Mandhadi, Vasanthi; Sharma, Vishakha; Barnes, Michael; Hammer, Richard D.; and Mosa, Abu Saleh Mohammad
- Subjects
FEATURE extraction, NATURAL language processing, FEATURE selection, MACHINE learning, NON-small-cell lung carcinoma, MACHINE performance
- Abstract
The de-identification of clinical reports is essential to protect the confidentiality of patients. The natural-language-processing-based named entity recognition (NER) model is a widely used technique of automatic clinical de-identification. The performance of such a machine learning model relies largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) from a total of 10,239 radiology reports that were divided into seven types. Multiple features were extracted by the toolkit and the NER models were built using these features and their combinations. A total of 10 features were extracted and the performance of the models was evaluated based on their precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed others across all types of reports. The dataset we used was large in volume and divided into multiple types of reports. Such a diverse dataset made sure that the results were not subject to a small number of structured texts from where a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify the best-performing features for building an NER model for automatic de-identification from a wide array of features mentioned in the literature. [ABSTRACT FROM AUTHOR]
- Published
2022