Classification of Noisy Free-Text Prostate Cancer Pathology Reports Using Natural Language Processing
- Source :
- Pattern Recognition. ICPR International Workshops and Challenges ISBN: 9783030687625, ICPR Workshops (1)
- Publication Year :
- 2021
- Publisher :
- Springer Science and Business Media Deutschland GmbH, 2021.
Abstract
- Free-text reporting has been the main approach in clinical pathology practice for decades. Pathology reports are an essential information source for guiding the treatment of cancer patients and for cancer registries, which process high volumes of free-text reports annually. Information coding and extraction are usually performed manually, which is expensive and time-consuming because reports vary widely between institutions, usually contain noise, and lack a standard structure. This paper presents strategies based on natural language processing (NLP) models to classify noisy free-text pathology reports of high- and low-grade prostate cancer from the open-source repository TCGA (The Cancer Genome Atlas). We used paragraph vectors to encode the reports and compared them with n-gram and TF-IDF representations. The best representation, based on the distributed bag of words variant of paragraph vectors, obtained an \(f_{1}\)-score of 0.858 and an AUC of 0.854 using a logistic regression classifier. We investigated the classifier's most relevant words in each case using the LIME interpretability tool, confirming the classifier's usefulness for selecting relevant diagnostic words. Our results show the feasibility of using paragraph embeddings to represent and classify pathology reports.
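The n-gram and TF-IDF baseline and the \(f_{1}\)-score named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the helper names (`ngrams`, `tfidf`, `f1_score`), the weighting `tf * log(N / df)` (one common TF-IDF variant), and the two toy report snippets are all assumptions made for illustration, and the snippets are invented rather than taken from TCGA.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Return word n-grams (n = 1..n_max) of a lowercased report."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(reports, n_max=2):
    """Map each report to a {term: weight} dict using tf * log(N / df)."""
    grams = [Counter(ngrams(r, n_max)) for r in reports]
    # Iterating a Counter yields each distinct term once per report,
    # so df counts the number of reports containing the term.
    df = Counter(term for g in grams for term in g)
    n_docs = len(reports)
    return [{t: (c / sum(g.values())) * math.log(n_docs / df[t])
             for t, c in g.items()} for g in grams]

def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall), positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy snippets, invented for illustration (not TCGA data).
reports = [
    "gleason score 4+5 high grade adenocarcinoma",
    "gleason score 3+3 low grade adenocarcinoma",
]
vectors = tfidf(reports)
```

In this sketch, terms shared by every report (such as "gleason") receive zero weight, while terms that distinguish the reports (such as "high" or "low") receive positive weight. A classifier such as logistic regression would then be trained on these vectors; that step is omitted here.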
- Subjects :
- Pathology
Computer science
Paragraph embeddings
ENCODE
Logistic regression
01 natural sciences
010309 optics
03 medical and health sciences
0302 clinical medicine
Natural language processing
Pathology reports
0103 physical sciences
Classifier (linguistics)
Interpretability
Structure (mathematical logic)
Clinical pathology
Bag-of-words model
030220 oncology & carcinogenesis
Artificial intelligence
Paragraph
Details
- Language :
- English
- ISBN :
- 978-3-030-68762-5
- Database :
- OpenAIRE
- Accession number :
- edsair.doi.dedup.....178701c566a5892b95968f66d0aecf4b