Classification of Noisy Free-Text Prostate Cancer Pathology Reports Using Natural Language Processing

Authors :
Henning Müller
Manfredo Atzori
Anjani Dhrangadhariya
Sebastian Otálora
Source :
Pattern Recognition. ICPR International Workshops and Challenges ISBN: 9783030687625, ICPR Workshops (1)
Publication Year :
2021
Publisher :
Springer Science and Business Media Deutschland GmbH, 2021.

Abstract

Free-text reporting has been the main approach in clinical pathology practice for decades. Pathology reports are an essential information source for guiding the treatment of cancer patients and for cancer registries, which process high volumes of free-text reports annually. Information coding and extraction are usually performed manually, which is expensive and time-consuming because reports vary widely between institutions, usually contain noise, and do not follow a standard structure. This paper presents strategies based on natural language processing (NLP) models to classify noisy free-text pathology reports of high- and low-grade prostate cancer from the open-source repository TCGA (The Cancer Genome Atlas). We used paragraph vectors to encode the reports and compared them with n-gram and TF-IDF representations. The best representation, based on the distributed bag-of-words variant of paragraph vectors, obtained an \(F_{1}\)-score of 0.858 and an AUC of 0.854 using a logistic regression classifier. We investigated the words most relevant to the classifier in each case using the LIME interpretability tool, confirming the classifier's usefulness for selecting relevant diagnostic words. Our results show the feasibility of using paragraph embeddings to represent and classify pathology reports.
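
To make the kind of pipeline named in the abstract concrete, the following is a minimal sketch of the n-gram/TF-IDF baseline feeding a logistic regression classifier, scored with F1. It is not the authors' implementation: it uses only the Python standard library, it does not cover the paragraph-vector representation or the LIME analysis, and the four training strings are invented toy examples, not TCGA reports. The label convention (1 for high-grade, 0 for low-grade), the function names, and the learning-rate and epoch settings are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_tfidf(docs):
    """Learn a vocabulary index and smoothed IDF weights from training texts."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc)))
    vocab = {term: j for j, term in enumerate(sorted(doc_freq))}
    idf = {t: math.log((1 + len(docs)) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return vocab, idf

def vectorize(doc, vocab, idf):
    """Encode a text as an L2-normalised TF-IDF vector; unseen terms are dropped."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(ngrams(doc)).items():
        if term in vocab:
            vec[vocab[term]] = count * idf[term]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def predict_proba(x, w, b):
    """Logistic model: estimated probability that x belongs to class 1."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean logistic loss (no regularisation)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def f1_score(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Invented toy strings (NOT real reports); 1 = high-grade, 0 = low-grade.
train_docs = [
    "gleason score 4+4 adenocarcinoma high grade",
    "gleason score 4+5 perineural invasion high grade",
    "gleason score 3+3 adenocarcinoma low grade",
    "gleason score 3+3 confined to prostate low grade",
]
train_labels = [1, 1, 0, 0]

vocab, idf = fit_tfidf(train_docs)
X_train = [vectorize(d, vocab, idf) for d in train_docs]
w, b = train_logreg(X_train, train_labels)

def classify(text):
    """Threshold the predicted probability at 0.5."""
    return int(predict_proba(vectorize(text, vocab, idf), w, b) >= 0.5)

train_pred = [classify(d) for d in train_docs]
print("training F1:", f1_score(train_labels, train_pred))
```

A score computed on the training strings only shows that the pieces fit together; it says nothing about generalisation. A realistic evaluation would hold out reports for testing, as the reported \(F_{1}\) and AUC figures presumably do.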

Details

Language :
English
ISBN :
978-3-030-68762-5
Database :
OpenAIRE
Journal :
Pattern Recognition. ICPR International Workshops and Challenges ISBN: 9783030687625, ICPR Workshops (1)
Accession number :
edsair.doi.dedup.....178701c566a5892b95968f66d0aecf4b