105 results on '"clinical text"'
Search Results
2. ICH-PRNet: a cross-modal intracerebral haemorrhage prognostic prediction method using joint-attention interaction mechanism
- Author
Yu, Xinlei, Elazab, Ahmed, Ge, Ruiquan, Zhu, Jichao, Zhang, Lingyan, Jia, Gangyong, Wu, Qing, Wan, Xiang, Li, Lihua, and Wang, Changmiao
- Published
- 2025
- Full Text
- View/download PDF
3. Reshaping free-text radiology notes into structured reports with generative question answering transformers
- Author
Bergomi, Laura, Buonocore, Tommaso M., Antonazzo, Paolo, Alberghi, Lorenzo, Bellazzi, Riccardo, Preda, Lorenzo, Bortolotto, Chandra, and Parimbelli, Enea
- Published
- 2024
- Full Text
- View/download PDF
4. End-to-end pseudonymization of fine-tuned clinical BERT models
- Author
Thomas Vakili, Aron Henriksson, and Hercules Dalianis
- Subjects
Natural language processing, Language models, BERT, Electronic health records, Clinical text, De-identification, Computer applications to medicine. Medical informatics, R858-859.7
- Abstract
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs.
- Published
- 2024
- Full Text
- View/download PDF
5. End-to-end pseudonymization of fine-tuned clinical BERT models: Privacy preservation with maintained data utility.
- Author
Vakili, Thomas, Henriksson, Aron, and Dalianis, Hercules
- Subjects
LANGUAGE models, DATA privacy, PRIVACY, NATURAL language processing
- Abstract
Many state-of-the-art results in natural language processing (NLP) rely on large pre-trained language models (PLMs). These models consist of large amounts of parameters that are tuned using vast amounts of training data. These factors cause the models to memorize parts of their training data, making them vulnerable to various privacy attacks. This is cause for concern, especially when these models are applied in the clinical domain, where data are very sensitive. Training data pseudonymization is a privacy-preserving technique that aims to mitigate these problems. This technique automatically identifies and replaces sensitive entities with realistic but non-sensitive surrogates. Pseudonymization has yielded promising results in previous studies. However, no previous study has applied pseudonymization to both the pre-training data of PLMs and the fine-tuning data used to solve clinical NLP tasks. This study evaluates the effects on the predictive performance of end-to-end pseudonymization of Swedish clinical BERT models fine-tuned for five clinical NLP tasks. A large number of statistical tests are performed, revealing minimal harm to performance when using pseudonymized fine-tuning data. The results also find no deterioration from end-to-end pseudonymization of pre-training and fine-tuning data. These results demonstrate that pseudonymizing training data to reduce privacy risks can be done without harming data utility for training PLMs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Modeling disagreement in automatic data labeling for semi-supervised learning in Clinical Natural Language Processing
- Author
Hongshu Liu, Nabeel Seedat, and Julia Ive
- Subjects
automated labeling, clinical text, Natural Language Processing, radiology, semi-supervised learning, uncertainty, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Introduction: Computational models providing accurate estimates of their uncertainty are crucial for risk management associated with decision-making in healthcare contexts. This is especially true since many state-of-the-art systems are trained using the data which have been labeled automatically (self-supervised mode) and tend to overfit. Methods: In this study, we investigate the quality of uncertainty estimates from a range of current state-of-the-art predictive models applied to the problem of observation detection in radiology reports. This problem remains understudied for Natural Language Processing in the healthcare domain. Results: We demonstrate that Gaussian Processes (GPs) provide superior performance in quantifying the risks of three uncertainty labels based on the negative log predictive probability (NLPP) evaluation metric and mean maximum predicted confidence levels (MMPCL), whilst retaining strong predictive performance. Discussion: Our conclusions highlight the utility of probabilistic models applied to “noisy” labels and that similar methods could provide utility for Natural Language Processing (NLP) based automated labeling tasks.
- Published
- 2024
- Full Text
- View/download PDF
7. Semantic Web Techniques for Clinical Topic Detection in Health Care.
- Author
RAMAN, R., SAHAYARAJ, Kishore Anthuvan, SONI, Mukesh, NAYAK, Nihar Ranjan, GOVINDARAJ, Ramya, and SINGH, Nikhil Kumar
- Subjects
MEDICAL technology, SEMANTICS, MEDICAL care, MICROBLOGS, TIME series analysis
- Abstract
The scope of this paper is that it investigates and proposes a new clustering method that takes into account the timing characteristics of frequently used feature words and the semantic similarity of microblog short texts as well as designing and implementing microblog topic detection and detection based on clustering results. The aim of the proposed research is to provide a new cluster overlap reduction method based on the divisions of semantic memberships to solve limited semantic expression and diversify short microblog contents. First, by defining the time-series frequent word set of the microblog text, a feature word selection method for hot topics is given; then, for the existence of initial clusters, according to the time-series recurring feature word set, to obtain the initial clustering of the microblog. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
8. Entity normalization in a Spanish medical corpus using a UMLS-based lexicon: findings and limitations
- Author
Báez, Pablo, Campillos-Llanos, Leonardo, Núñez, Fredy, and Dunstan, Jocelyn
- Published
- 2024
- Full Text
- View/download PDF
9. Clinical Text Classification with Word Representation Features and Machine Learning Algorithms.
- Author
Almazaydeh, Laiali, Abuhelaleh, Mohammed, Al Tawil, Arar, and Elleithy, Khaled
- Subjects
MACHINE learning, NAIVE Bayes classification, K-nearest neighbor classification, SUPPORT vector machines, ELECTRONIC health records, MEDICAL transcription, MEDICAL coding
- Abstract
Clinical text classification of electronic medical records is a challenging task. Existing electronic records suffer from irrelevant text, misspellings, semantic ambiguity, and abbreviations. The approach reported in this paper elaborates on machine learning techniques to develop an intelligent framework for classification of the medical transcription dataset. The proposed approach is based on four main phases: the text preprocessing phase, word representation phase, features reduction phase and classification phase. We have used four machine learning algorithms, support vector machines, naïve bayes, logistic regression and k-nearest neighbors in combination with different word representation models. We have applied the four algorithms to the bag of words, to TF-IDF, to word2vec. Experimental results were evaluated based on precision, recall, accuracy and F1 score. The best results were obtained with the combination of the k-NN classifier, and the word represented by Word2vec achieving an accuracy of 92% to correctly classify the medical specialties based on the transcription text. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
10. Automatic Diagnosis of COVID-19 Patients from Unstructured Data Based on a Novel Weighting Scheme.
- Author
Mahdi, Amir Yasseen and Yuhaniz, Siti Sophiayati
- Subjects
COVID-19, COVID-19 testing, VIRUS diseases, PUBLIC hospitals, FEATURE extraction
- Abstract
The extraction of features from unstructured clinical data of Covid-19 patients is critical for guiding clinical decision-making and diagnosing this viral disease. Furthermore, an early and accurate diagnosis of COVID-19 can reduce the burden on healthcare systems. In this paper, an improved Term Weighting technique combined with Parts-Of-Speech (POS) Tagging is proposed to reduce dimensions for automatic and effective classification of clinical text related to Covid-19 disease. Term Frequency-Inverse Document Frequency (TF-IDF) is the most often used term weighting scheme (TWS). However, TF-IDF has several developments to improve its drawbacks, in particular, it is not efficient enough to classify text by assigning effective weights to the terms in unstructured data. In this research, we proposed a modification term weighting scheme: RTF-C-IEF and compare the proposed model with four extraction methods: TF, TF-IDF, TF-IHF, and TF-IEF. The experiment was conducted on two new datasets for COVID-19 patients. The first dataset was collected from government hospitals in Iraq with 3053 clinical records, and the second dataset with 1446 clinical reports, was collected from several different websites. Based on the experimental results using several popular classifiers applied to the datasets of Covid-19, we observe that the proposed scheme RTF-C-IEF achieves is a consistent performer with the best scores in most of the experiments. Further, the modified RTF-C-IEF proposed in the study outperformed the original scheme and other employed term weighting methods in most experiments. Thus, the proper selection of term weighting scheme among the different methods improves the performance of the classifier and helps to find the informative term. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
11. Bi-LSTM-CRF Network for Clinical Event Extraction With Medical Knowledge Features
- Author
Shunli Zhang, Yancui Li, Shiyong Li, and Fang Yan
- Subjects
Clinical text, entity recognition, deep learning, natural language processing, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Extracting clinical event expressions and their types from clinical text is a fundamental task for many applications in clinical NLP. State-of-the-art systems need handcraft features and do not take into account the representation of the low-frequency words. To address these issues, a Bi-LSTM-CRF neural network architecture based on medical knowledge features is proposed. First, we employ convolutional neural networks (CNNs) to encode character-level information of a word and extract medical knowledge features from an open-source clinical knowledge system. Then, we concatenate character-level and word-level embedding and the medical knowledge features of words together, and feed them into bi-directional long short-term memory (Bi-LSTM) to build context information of each word. Finally, we jointly use a conditional random field (CRF) to decode labels for the whole sentence. We evaluate our model on two publicly available clinical datasets, namely THYME corpus and 2012 i2b2 dataset. Experimental results show that our model outperforms previous state-of-the-art systems with different methodologies, including machine learning-based methods, deep learning-based methods, and Bert-based methods.
- Published
- 2022
- Full Text
- View/download PDF
12. Temporal disambiguation of relative temporal expressions in clinical texts
- Author
Amy L. Olex and Bridget T. McInnes
- Subjects
natural language processing, temporal reasoning, temporal expression recognition and normalization, clinical text, relative temporal expression, error analysis, Bibliography. Library science. Information resources
- Abstract
Temporal expression recognition and normalization (TERN) is the foundation for all higher-level temporal reasoning tasks in natural language processing, such as timeline extraction, so it must be performed well to limit error propagation. Achieving new heights in state-of-the-art performance for TERN in clinical texts requires knowledge of where current systems struggle. In this work, we summarize the results of a detailed error analysis for three top performing state-of-the-art TERN systems that participated in the 2012 i2b2 Clinical Temporal Relation Challenge, and compare our own home-grown system Chrono to identify specific areas in need of improvement. Performance metrics and an error analysis reveal that all systems have reduced performance in normalization of relative temporal expressions, specifically in disambiguating temporal types and in the identification of the correct anchor time. To address the issue of temporal disambiguation we developed and integrated a module into Chrono that utilizes temporally fine-tuned contextual word embeddings to disambiguate relative temporal expressions. Chrono now achieves state-of-the-art performance for temporal disambiguation of relative temporal expressions in clinical text, and is the only TERN system to output dual annotations into both TimeML and SCATE schemes.
- Published
- 2022
- Full Text
- View/download PDF
13. Transformer-based active learning for multi-class text annotation and classification.
- Author
Afzal M, Hussain J, Abbas A, Hussain M, Attique M, and Lee S
- Abstract
Objective: Data-driven methodologies in healthcare necessitate labeled data for effective decision-making. However, medical data, particularly in unstructured formats, such as clinical notes, often lack explicit labels, making manual annotation challenging and tedious. Methods: This paper introduces a novel deep active learning framework designed to facilitate the annotation process for multiclass text classification, specifically using the SOAP (subjective, objective, assessment, plan) framework, a widely recognized medical protocol. Our methodology leverages transformer-based deep learning techniques to automatically annotate clinical notes, significantly easing the manual labor involved and enhancing classification performance. Transformer-based deep learning models, with their ability to capture complex patterns in large datasets, represent a cutting-edge approach for advancing natural language processing tasks. Results: We validate our approach through experiments on a diverse set of clinical notes from publicly available datasets, comprising over 426 documents. Our model demonstrates superior classification accuracy, with an F1 score improvement of 4.8% over existing methods but also provides a practical tool for healthcare professionals, potentially improving clinical documentation practices and patient care. Conclusions: The research underscores the synergy between active learning and advanced deep learning, paving the way for future exploration of automatic text annotation and its implications for clinical informatics. Future studies will aim to integrate multimodal data and large language models to enhance the richness and accuracy of clinical text analysis, opening new pathways for comprehensive healthcare insights. Competing Interests: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. (© The Author(s) 2024.)
- Published
- 2024
- Full Text
- View/download PDF
14. A Survey on Recent Named Entity Recognition and Relationship Extraction Techniques on Clinical Texts.
- Author
Bose, Priyankar, Srinivasan, Sriram, Sleeman IV, William C., Palta, Jatinder, Kapoor, Rishabh, and Ghosh, Preetam
- Subjects
DATA mining, ELECTRONIC health records, EXTRACTION techniques, TASK performance, NATURAL language processing, TEXT messages
- Abstract
Significant growth in Electronic Health Records (EHR) over the last decade has provided an abundance of clinical text that is mostly unstructured and untapped. This huge amount of clinical text data has motivated the development of new information extraction and text mining techniques. Named Entity Recognition (NER) and Relationship Extraction (RE) are key components of information extraction tasks in the clinical domain. In this paper, we highlight the present status of clinical NER and RE techniques in detail by discussing the existing proposed NLP models for the two tasks and their performances and discuss the current challenges. Our comprehensive survey on clinical NER and RE encompass current challenges, state-of-the-art practices, and future directions in information extraction from clinical text. This is the first attempt to discuss both of these interrelated topics together in the clinical context. We identified many research articles published based on different approaches and looked at applications of these tasks. We also discuss the evaluation metrics that are used in the literature to measure the effectiveness of the two these NLP methods and future research directions. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
15. Limitations of Transformers on Clinical Text Classification.
- Author
Gao, Shang, Alawad, Mohammed, Young, M. Todd, Gounley, John, Schaefferkoetter, Noah, Yoon, Hong Jun, Wu, Xiao-Cheng, Durbin, Eric B., Doherty, Jennifer, Stroup, Antoinette, Coyle, Linda, and Tourassi, Georgia
- Subjects
CONVOLUTIONAL neural networks, NATURAL language processing, CLASSIFICATION, DEEP learning, DEFAULT (Finance)
- Abstract
Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures – a word-level convolutional neural network and a hierarchical self-attention network – and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT – pretraining and WordPiece tokenization – may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases rather than understanding the contextual meaning of sequences of text. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
16. Automatic Extraction and Decryption of Abbreviations from Domain-Specific Texts.
- Author
EGOROV, Michil and FUNKNER, Anastasia
- Abstract
This paper explores the problems of extraction and decryption of abbreviations from domain-specific texts in Russian. The main focus are unstructured electronic medical records which pose specific preprocessing problems. The major challenge is that there is no uniform way to write medical histories. The aim of the paper is to generalize the way of decrypting abbreviations from any variant of text. A dataset of nearly three million medical records was collected. A classifier model was trained in order to extract and decrypt abbreviations. After testing the proposed method with 224,307 records, the model showed an F1 score of 93.7% on a valid dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
17. An efficient prototype method to identify and correct misspellings in clinical text
- Author
T. Elizabeth Workman, Yijun Shao, Guy Divita, and Qing Zeng-Treitler
- Subjects
Spelling analysis, Spelling correction, Clinical text, Word embeddings, Word2Vec, Medicine, Biology (General), QH301-705.5, Science (General), Q1-390
- Abstract
Objective: Misspellings in clinical free text present challenges to natural language processing. With an objective to identify misspellings and their corrections, we developed a prototype spelling analysis method that implements Word2Vec, Levenshtein edit distance constraints, a lexical resource, and corpus term frequencies. We used the prototype method to process two different corpora, surgical pathology reports, and emergency department progress and visit notes, extracted from Veterans Health Administration resources. We evaluated performance by measuring positive predictive value and performing an error analysis of false positive output, using four classifications. We also performed an analysis of spelling errors in each corpus, using common error classifications. Results: In this small-scale study utilizing a total of 76,786 clinical notes, the prototype method achieved positive predictive values of 0.9057 and 0.8979, respectively, for the surgical pathology reports, and emergency department progress and visit notes, in identifying and correcting misspelled words. False positives varied by corpus. Spelling error types were similar among the two corpora, however, the authors of emergency department progress and visit notes made over four times as many errors. Overall, the results of this study suggest that this method could also perform sufficiently in identifying misspellings in other clinical document types.
- Published
- 2019
- Full Text
- View/download PDF
18. Incorporating Domain Knowledge into Natural Language Inference on Clinical Texts
- Author
Mingming Lu, Yu Fang, Fengqi Yan, and Maozhen Li
- Subjects
Attention mechanism, clinical text, medical domain knowledge, natural language inference, word representation, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Making inference on clinical texts is a task which has not been fully studied. With the newly released, expert annotated MedNLI dataset, this task is being boosted. Compared with open domain data, clinical texts present unique linguistic phenomena, e.g., a large number of medical terms and abbreviations, different written forms for the same medical concept, which make inference much harder. Incorporating domain-specific knowledge is a way to eliminate this problem, in this paper, we assemble a new incorporating medical concept definitions module on the classic enhanced sequential inference model (ESIM), which first extracts the most relevant medical concept for each word, if it exists, then encodes the definition of this medical concept with a bidirectional long short-term network (BiLSTM) to obtain domain-specific definition representations, and attends these definition representations over vanilla word embeddings. The empirical evaluations are conducted to demonstrate that our model improves the prediction performance and achieves a high level of accuracy on the MedNLI dataset. Specifically, the knowledge enhanced word representations contribute significantly to entailment class.
- Published
- 2019
- Full Text
- View/download PDF
19. ContextMEL: Classifying Contextual Modifiers in Clinical Text.
- Author
Chocrón, Paula, Abella, Álvaro, and de Maeztu, Gabriel
- Subjects
NATURAL language processing, COMPUTATIONAL linguistics, ELECTRONIC health records, DEEP learning, ALGORITHMS, MEDICAL records
- Abstract
Taking advantage of electronic health records in clinical research requires the development of natural language processing tools to extract data from unstructured text in different languages. A key task is the detection of contextual modifiers, such as understanding whether a concept is negated or if it belongs to the past. We present ContextMEL, a method to build classifiers for contextual modifiers that is independent of the specific task and the language, allowing for a fast model development cycle. ContextMEL uses annotation by experts to build a curated dataset, and state-of-the-art deep learning architectures to train models with it. We discuss the application of ContextMEL for three modifiers, namely Negation, Temporality and Certainty, on Spanish and Catalan medical text. The metrics we obtain show our models are suitable for industrial use, outperforming commonly used rule-based approaches such as the NegEx algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
20. De-Identifying Swedish EHR Text Using Public Resources in the General Domain.
- Author
CHOMUTARE, Taridzo, YIGZAW, Kassaye Yitbarek, BUDRIONIS, Andrius, MAKHLYSHEVA, Alexandra, GODTLIEBSEN, Fred, and DALIANIS, Hercules
- Abstract
Sensitive data is normally required to develop rule-based or train machine learning-based models for de-identifying electronic health record (EHR) clinical notes; and this presents important problems for patient privacy. In this study, we add non-sensitive public datasets to EHR training data; (i) scientific medical text and (ii) Wikipedia word vectors. The data, all in Swedish, is used to train a deep learning model using recurrent neural networks. Tests on pseudonymized Swedish EHR clinical notes showed improved precision and recall from 55.62% and 80.02% with the base EHR embedding layer, to 85.01% and 87.15% when Wikipedia word vectors are added. These results suggest that non-sensitive text from the general domain can be used to train robust models for de-identifying Swedish clinical text; and this could be useful in cases where the data is both sensitive and in low-resource languages. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
21. Deep learning in clinical natural language processing: a methodical review.
- Author
Wu, Stephen, Roberts, Kirk, Datta, Surabhi, Du, Jingcheng, Ji, Zongcheng, Si, Yuqi, Soni, Sarvesh, Wang, Qiong, Wei, Qiang, Xiang, Yang, Zhao, Bo, and Xu, Hua
- Abstract
Objective: This article methodically reviews the literature on deep learning (DL) for natural language processing (NLP) in the clinical domain, providing quantitative analysis to answer 3 research questions concerning methods, scope, and context of current research. Materials and Methods: We searched MEDLINE, EMBASE, Scopus, the Association for Computing Machinery Digital Library, and the Association for Computational Linguistics Anthology for articles using DL-based approaches to NLP problems in electronic health records. After screening 1,737 articles, we collected data on 25 variables across 212 papers. Results: DL in clinical NLP publications more than doubled each year, through 2018. Recurrent neural networks (60.8%) and word2vec embeddings (74.1%) were the most popular methods; the information extraction tasks of text classification, named entity recognition, and relation extraction were dominant (89.2%). However, there was a "long tail" of other methods and specific tasks. Most contributions were methodological variants or applications, but 20.8% were new methods of some kind. The earliest adopters were in the NLP community, but the medical informatics community was the most prolific. Discussion: Our analysis shows growing acceptance of deep learning as a baseline for NLP research, and of DL-based NLP in the medical community. A number of common associations were substantiated (eg, the preference of recurrent neural networks for sequence-labeling named entity recognition), while others were surprisingly nuanced (eg, the scarcity of French language clinical NLP with deep learning). Conclusion: Deep learning has not yet fully penetrated clinical NLP and is growing rapidly. This review highlighted both the popular and unique trends in this active field. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
22. A Survey on Recent Named Entity Recognition and Relationship Extraction Techniques on Clinical Texts
- Author
Priyankar Bose, Sriram Srinivasan, William C. Sleeman, Jatinder Palta, Rishabh Kapoor, and Preetam Ghosh
- Subjects
electronic health records, clinical text, natural language processing, named entity recognition, relationship extraction, machine learning, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
- Abstract
Significant growth in Electronic Health Records (EHR) over the last decade has provided an abundance of clinical text that is mostly unstructured and untapped. This huge amount of clinical text data has motivated the development of new information extraction and text mining techniques. Named Entity Recognition (NER) and Relationship Extraction (RE) are key components of information extraction tasks in the clinical domain. In this paper, we highlight the present status of clinical NER and RE techniques in detail by discussing the existing proposed NLP models for the two tasks and their performances and discuss the current challenges. Our comprehensive survey on clinical NER and RE encompass current challenges, state-of-the-art practices, and future directions in information extraction from clinical text. This is the first attempt to discuss both of these interrelated topics together in the clinical context. We identified many research articles published based on different approaches and looked at applications of these tasks. We also discuss the evaluation metrics that are used in the literature to measure the effectiveness of the two these NLP methods and future research directions.
- Published
- 2021
- Full Text
- View/download PDF
23. Zero-shot Learning with Minimum Instruction to Extract Social Determinants and Family History from Clinical Notes using GPT Model.
- Author
Bhate NJ, Mittal A, He Z, and Luo X
- Abstract
Demographics, social determinants of health, and family history documented in the unstructured text within the electronic health records are increasingly being studied to understand how this information can be utilized with the structured data to improve healthcare outcomes. After the GPT models were released, many studies have applied GPT models to extract this information from the narrative clinical notes. Different from the existing work, our research focuses on investigating the zero-shot learning on extracting this information together by providing minimum information to the GPT model. We utilize de-identified real-world clinical notes annotated for demographics, various social determinants, and family history information. Given that the GPT model might provide text different from the text in the original data, we explore two sets of evaluation metrics, including the traditional NER evaluation metrics and semantic similarity evaluation metrics, to completely understand the performance. Our results show that the GPT-3.5 method achieved an average of 0.975 F1 on demographics extraction, 0.615 F1 on social determinants extraction, and 0.722 F1 on family history extraction. We believe these results can be further improved through model fine-tuning or few-shots learning. Through the case studies, we also identified the limitations of the GPT models, which need to be addressed in future research.
- Published
- 2023
- Full Text
- View/download PDF
24. Phenotero: Annotate as you write.
- Author
-
Hombach, Daniela, Schwarz, Jana M., Knierim, Ellen, Schuelke, Markus, Seelow, Dominik, and Köhler, Sebastian
- Subjects
*PHENOTYPES , *GENETIC disorders , *GENE ontology , *PATIENT education , *MOLECULAR diagnosis
- Abstract
In clinical genetics, the Human Phenotype Ontology as well as disease ontologies are often used for deep phenotyping of patients and coding of clinical diagnoses. However, assigning ontology classes to patient descriptions is often disconnected from writing patient reports or manuscripts in word processing software. This additional workload and the requirement to install dedicated software may discourage usage of ontologies for parts of the target audience. Here we present Phenotero, a freely available and simple solution to annotate patient phenotypes and diseases at the time of writing clinical reports or manuscripts. We adopt Zotero, a citation management software, to create a tool that allows users to reference classes from ontologies within text at the time of writing. We expect this approach to decrease the additional workload to a minimum while ensuring high-quality associations with ontology classes. Standardized collection of phenotypic information at the time of describing the patient allows for streamlining the clinic workflow and efficient data entry. It will subsequently promote clinical and molecular diagnosis with the ultimate goal of better understanding genetic diseases. Thus, we believe that Phenotero eases the usage of ontologies and controlled vocabularies in the field of clinical genetics. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
25. Classifying medical relations in clinical text via convolutional neural networks.
- Author
-
He, Bin, Guan, Yi, and Dai, Rui
- Subjects
*ARTIFICIAL neural networks , *MEDICAL records , *CONSTRAINT satisfaction , *POOLINGS of interest , *ECONOMIC competition
- Abstract
Deep learning research on relation classification has achieved solid performance in the general domain. This study proposes a convolutional neural network (CNN) architecture with a multi-pooling operation for medical relation classification on clinical records and explores a loss function with a category-level constraint matrix. Experiments using the 2010 i2b2/VA relation corpus demonstrate these models, which do not depend on any external features, outperform previous single-model methods and our best model is competitive with the existing ensemble-based method. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
26. Medical Text Classification Using Convolutional Neural Networks.
- Author
-
HUGHES, Mark, LI, Irene, KOTOULAS, Spyros, and SUZUMURA, Toyotaro
- Abstract
We present an approach to automatically classify clinical text at a sentence level. We are using deep convolutional neural networks to represent complex features. We train the network on a dataset providing a broad categorization of health information. Through a detailed evaluation, we demonstrate that our method outperforms several approaches widely used in natural language processing tasks by about 15%. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
27. Improving Terminology Mapping in Clinical Text with Context-Sensitive Spelling Correction.
- Author
-
DZIADEK, Juliusz, HENRIKSSON, Aron, and DUNELD, Martin
- Abstract
The mapping of unstructured clinical text to an ontology facilitates meaningful secondary use of health records but is non-trivial due to lexical variation and the abundance of misspellings in hurriedly produced notes. Here, we apply several spelling correction methods to Swedish medical text and evaluate their impact on SNOMED CT mapping; first in a controlled evaluation using medical literature text with induced errors, followed by a partial evaluation on clinical notes. It is shown that the best-performing method is context-sensitive, taking into account trigram frequencies and utilizing a corpus-based dictionary. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
28. Clinical Prediction Models for Hospital-Induced Delirium Using Structured and Unstructured Electronic Health Record Data: Protocol for a Development and Validation Study.
- Author
-
Ser SE, Shear K, Snigurska UA, Prosperi M, Wu Y, Magoc T, Bjarnadottir RI, and Lucero RJ
- Abstract
Background: Hospital-induced delirium is one of the most common and costly iatrogenic conditions, and its incidence is predicted to increase as the population of the United States ages. An academic and clinical interdisciplinary systems approach is needed to reduce the frequency and impact of hospital-induced delirium. Objective: The long-term goal of our research is to enhance the safety of hospitalized older adults by reducing iatrogenic conditions through an effective learning health system. In this study, we will develop models for predicting hospital-induced delirium. In order to accomplish this objective, we will create a computable phenotype for our outcome (hospital-induced delirium), design an expert-based traditional logistic regression model, leverage machine learning techniques to generate a model using structured data, and use machine learning and natural language processing to produce an integrated model with components from both structured data and text data. Methods: This study will explore text-based data, such as nursing notes, to improve the predictive capability of prognostic models for hospital-induced delirium. By using supervised and unsupervised text mining in addition to structured data, we will examine multiple types of information in electronic health record data to predict medical-surgical patient risk of developing delirium. Development and validation will be compliant with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. Results: Work on this project will take place through March 2024. For this study, we will use data from approximately 332,230 encounters that occurred between January 2012 and May 2021. Findings from this project will be disseminated at scientific conferences and in peer-reviewed journals. Conclusions: Success in this study will yield a durable, high-performing research-data infrastructure that will process, extract, and analyze clinical text data in near real time. This model has the potential to be integrated into the electronic health record and provide point-of-care decision support to prevent harm and improve quality of care. International Registered Report Identifier (IRRID): DERR1-10.2196/48521. (©Sarah E Ser, Kristen Shear, Urszula A Snigurska, Mattia Prosperi, Yonghui Wu, Tanja Magoc, Ragnhildur I Bjarnadottir, Robert J Lucero. Originally published in JMIR Research Protocols (https://www.researchprotocols.org), 09.11.2023.)
- Published
- 2023
- Full Text
- View/download PDF
29. Disorder recognition in clinical texts using multi-label structured SVM.
- Author
-
Wutao Lin, Donghong Ji, and Yanan Lu
- Subjects
*CLINICAL trials , *MEDICAL personnel , *ALLIED health personnel , *MEDICAL care , *PUBLIC health
- Abstract
Background: Information extraction from clinical texts helps medical workers identify patients' problems faster and makes intelligent diagnosis possible in the future. There has been a lot of work on disorder mention recognition in clinical narratives, but recognition of more complicated disorder mentions, such as overlapping ones, is still an open issue. This paper proposes a multi-label structured Support Vector Machine (SVM) based method for disorder mention recognition. We present a multi-label scheme which could be used in complicated entity recognition tasks. Results: We performed three sets of experiments to evaluate our model. Our best F1-Score on the 2013 Conference and Labs of the Evaluation Forum data set is 0.7343. There are six types of labels in our multi-label scheme, all of which are represented by 24-bit binary numbers. The binary digits of each label contain information about different disorder mentions. Our multi-label method can recognize not only disorder mentions in the form of contiguous or discontiguous words but also mentions whose spans overlap with each other. The experiments indicate that our multi-label structured SVM model outperforms the conditional random field (CRF) model for this disorder mention recognition task. The experiments also show that our multi-label scheme surpasses the baseline. For overlapping disorder mentions in particular, the F1-Score of our multi-label scheme is 0.1428 higher than that of the baseline BIOHD1234 scheme. Conclusions: This multi-label structured SVM based approach is demonstrated to work well for this disorder recognition task. The novel multi-label scheme we present is superior to the baseline and can be used in other models to solve various types of complicated entity recognition tasks as well. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
30. Extraction of Temporal Information from Clinical Narratives
- Author
-
Moharasan, Gandhimathi and Ho, Tu-Bao
- Published
- 2019
- Full Text
- View/download PDF
31. Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus.
- Author
-
Savkov, Aleksandar, Carroll, John, Koeling, Rob, and Cassell, Jackie
- Subjects
*ANNOTATIONS , *MEDICAL records , *CORPORA , *SPELLING errors , *LANGUAGE & languages , *MACHINE learning
- Abstract
The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
32. Swedification patterns of Latin and Greek affixes in clinical text.
- Author
-
Grigonytė, Gintarė, Kvist, Maria, Wirén, Mats, Velupillai, Sumithra, and Henriksson, Aron
- Subjects
*SWEDISH language , *AFFIXES (Grammar) , *CORPORA , *MEDICAL records , *MEDICAL terminology
- Abstract
Swedish medical language is rich with Latin and Greek terminology which has undergone a Swedification since the 1980s. However, many original expressions are still used by clinical professionals. The goal of this study is to obtain precise quantitative measures of how the foreign terminology is manifested in Swedish clinical text. To this end, we explore the use of Latin and Greek affixes in Swedish medical texts in three genres: clinical text, scientific medical text and online medical information for laypersons. More specifically, we use frequency lists derived from tokenised Swedish medical corpora in the three domains, and extract word pairs belonging to types that display both the original and Swedified spellings. We describe six distinct patterns explaining the variation in the usage of Latin and Greek affixes in clinical text. The results show that to a large extent affixes in clinical text are Swedified and that prefixes are used more conservatively than suffixes. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
33. Extracting Clinical Relations in Electronic Health Records Using Enriched Parse Trees.
- Author
-
Kim, Jisung, Choe, Yoonsuck, and Mueller, Klaus
- Subjects
NATURAL language processing ,ELECTRONIC health records ,DATA quality ,BIG data ,COMPUTER science
- Abstract
Integrating semantic features into parse trees is an active research topic in open-domain natural language processing (NLP). We study six different parse tree structures enriched with various semantic features for determining entity relations in clinical notes using a tree kernel-based relation extraction system. We used the relation extraction task definition and the dataset from the popular 2010 i2b2/VA challenge for our evaluation. We found that the parse tree structure enriched with entity type suffixes resulted in the highest F1 score of 0.7725 and was the fastest. In terms of reducing the number of feature vectors in trained models, the entity type feature was the most effective among the semantic features, while adding a semantic feature node was better than adding feature suffixes to the labels. Our study demonstrates that parse tree enhancements with semantic features are effective for clinical relation extraction. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
34. On the creation of a clinical gold standard corpus in Spanish: Mining adverse drug reactions.
- Author
-
Oronoz, Maite, Gojenola, Koldo, Pérez, Alicia, de Ilarraza, Arantza Díaz, and Casillas, Arantza
- Abstract
The advances achieved in Natural Language Processing make it possible to automatically mine information from electronically created documents. Many Natural Language Processing methods that extract information from texts make use of annotated corpora, but these are scarce in the clinical domain due to legal and ethical issues. In this paper we present the creation of the IxaMed-GS gold standard composed of real electronic health records written in Spanish and manually annotated by experts in pharmacology and pharmacovigilance. The experts mainly annotated entities related to diseases and drugs, but also relationships between entities indicating adverse drug reaction events. To help the experts in the annotation task, we adapted a general corpus linguistic analyzer to the medical domain. The quality of the annotation process in the IxaMed-GS corpus has been assessed by measuring the inter-annotator agreement, which was 90.53% for entities and 82.86% for events. In addition, the corpus has been used for the automatic extraction of adverse drug reaction events using machine learning. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
35. Developing a Clinical Language Model for Swedish : Continued Pretraining of Generic BERT with In-Domain Data
- Author
-
Anastasios Lamproudis, Aron Henriksson, and Hercules Dalianis
- Subjects
Vocabulary ,Computer and Information Sciences ,Downstream (software development) ,Computer science ,business.industry ,media_common.quotation_subject ,Data- och informationsvetenskap ,computer.software_genre ,Task (project management) ,Domain (software engineering) ,clinical text ,language models ,Added value ,Language model ,Diagnosis code ,Artificial intelligence ,natural language processing ,business ,computer ,Natural language processing ,Protected health information ,media_common
- Abstract
The use of pretrained language models, fine-tuned to perform a specific downstream task, has become widespread in NLP. Using a generic language model in specialized domains may, however, be sub-optimal due to differences in language use and vocabulary. In this paper, it is investigated whether an existing, generic language model for Swedish can be improved for the clinical domain through continued pretraining with clinical text. The generic and domain-specific language models are fine-tuned and evaluated on three representative clinical NLP tasks: (i) identifying protected health information, (ii) assigning ICD-10 diagnosis codes to discharge summaries, and (iii) sentence-level uncertainty prediction. The results show that continued pretraining on in-domain data leads to improved performance on all three downstream tasks, indicating that there is a potential added value of domain-specific language models for clinical NLP.
- Published
- 2021
36. Professional language in Swedish clinical text: Linguistic characterization and comparative studies.
- Author
-
Smith, Kelly, Megyesi, Beata, Velupillai, Sumithra, Kvist, Maria, Andersen, Gisle, and Hardt, Daniel
- Subjects
*LINGUISTICS research , *SWEDISH language , *COMPARATIVE studies , *ELECTRONIC health records , *MEDICAL terminology , *RADIOLOGICAL research
- Abstract
This study investigates the linguistic characteristics of Swedish clinical text in radiology reports and doctor's daily notes from electronic health records (EHRs) in comparison to general Swedish and biomedical journal text. We quantify linguistic features through a comparative register analysis to determine how the free text of EHRs differs from general and biomedical Swedish text in terms of lexical complexity, word and sentence composition, and common sentence structures. The linguistic features are extracted using state-of-the-art computational tools: a tokenizer, a part-of-speech tagger, and scripts for statistical analysis. Results show that technical terms and abbreviations are more frequent in clinical text, and lexical variance is low. Moreover, clinical text frequently omits subjects, verbs, and function words, resulting in shorter sentences. Clinical text not only differs from general Swedish, but also internally, across its sub-domains, e.g. sentences lacking verbs are significantly more frequent in radiology reports. These results provide a foundation for future development of automatic methods for EHR simplification or clarification. [ABSTRACT FROM PUBLISHER]
- Published
- 2014
- Full Text
- View/download PDF
37. Using large clinical corpora for query expansion in text-based cohort identification.
- Author
-
Zhu, Dongqing, Wu, Stephen, Carterette, Ben, and Liu, Hongfang
- Abstract
Highlights: • Demonstrated utility of an in-domain collection (clinical text) for query expansion. • Analyzed effect of external collection size on a mixture of relevance models. • Any existing query expansion configuration can benefit from an in-domain collection. [ABSTRACT FROM AUTHOR]
- Published
- 2014
- Full Text
- View/download PDF
38. ContextMEL: un Clasificador de Modificadores Contextuales en Texto Clínico
- Author
-
Chocrón, Paula, Abella, Álvaro, and Maeztu, Gabriel de
- Subjects
Temporality ,Annotation ,Aprendizaje profundo ,Lenguajes y Sistemas Informáticos ,Texto clínico ,Anotación ,Deep learning ,Certainty ,Clasificación ,Clinical text ,Negation
- Abstract
Taking advantage of electronic health records in clinical research requires the development of natural language processing tools to extract data from unstructured text in different languages. A key task is the detection of contextual modifiers, such as understanding whether a concept is negated or if it belongs to the past. We present ContextMEL, a method to build classifiers for contextual modifiers that is independent of the specific task and the language, allowing for a fast model development cycle. ContextMEL uses annotation by experts to build a curated dataset, and state-of-the-art deep learning architectures to train models with it. We discuss the application of ContextMEL for three modifiers, namely Negation, Temporality and Certainty, on Spanish and Catalan medical text. The metrics we obtain show our models are suitable for industrial use, outperforming commonly used rule-based approaches such as the NegEx algorithm.
- Published
- 2020
39. Incorporating Domain Knowledge into Natural Language Inference on Clinical Texts
- Author
-
Yu Fang, Fengqi Yan, Maozhen Li, and Mingming Lu
- Subjects
General Computer Science ,Computer science ,Inference ,Attention mechanism ,02 engineering and technology ,computer.software_genre ,Logical consequence ,Task (project management) ,medical domain knowledge ,Natural language inference ,0202 electrical engineering, electronic engineering, information engineering ,Open domain ,General Materials Science ,word representation ,Class (computer programming) ,business.industry ,General Engineering ,020206 networking & telecommunications ,clinical text ,natural language inference ,Domain knowledge ,020201 artificial intelligence & image processing ,Artificial intelligence ,lcsh:Electrical engineering. Electronics. Nuclear engineering ,business ,computer ,lcsh:TK1-9971 ,Natural language processing ,Word (computer architecture)
- Abstract
Making inference on clinical texts is a task which has not been fully studied. With the newly released, expert annotated MedNLI dataset, this task is being boosted. Compared with open domain data, clinical texts present unique linguistic phenomena, e.g., a large number of medical terms and abbreviations, and different written forms for the same medical concept, which make inference much harder. Incorporating domain-specific knowledge is a way to address this problem. In this paper, we add a new module that incorporates medical concept definitions to the classic enhanced sequential inference model (ESIM). The module first extracts the most relevant medical concept for each word, if it exists, then encodes the definition of this medical concept with a bidirectional long short-term memory network (BiLSTM) to obtain domain-specific definition representations, and attends these definition representations over vanilla word embeddings. Empirical evaluations demonstrate that our model improves the prediction performance and achieves a high level of accuracy on the MedNLI dataset. Specifically, the knowledge-enhanced word representations contribute significantly to the entailment class. Institute of Electrical and Electronics Engineers
- Published
- 2019
40. A la Recherche du Temps Perdu: extracting temporal relations from medical text in the 2012 i2b2 NLP challenge.
- Author
-
Cherry, Colin, Zhu, Xiaodan, Martin, Joel, and de Bruijn, Berry
- Abstract
Objective: An analysis of the timing of events is critical for a deeper understanding of the course of events within a patient record. The 2012 i2b2 NLP challenge focused on the extraction of temporal relationships between concepts within textual hospital discharge summaries. Materials and Methods: The team from the National Research Council Canada (NRC) submitted three system runs to the second track of the challenge: typifying the time-relationship between pre-annotated entities. The NRC system was designed around four specialist modules containing statistical machine learning classifiers. Each specialist targeted distinct sets of relationships: local relationships, 'sectime'-type relationships, non-local overlap-type relationships, and non-local causal relationships. Results: The best NRC submission achieved a precision of 0.7499, a recall of 0.6431, and an F1 score of 0.6924, resulting in a statistical tie for first place. Post hoc improvements led to a precision of 0.7537, a recall of 0.6455, and an F1 score of 0.6954, giving the highest scores reported on this task to date. Discussion and Conclusions: Methods for general relation extraction extended well to temporal relations, and gave top-ranked state-of-the-art results. Careful ordering of predictions within result sets proved critical to this success. [ABSTRACT FROM AUTHOR]
- Published
- 2013
- Full Text
- View/download PDF
41. A controlled greedy supervised approach for co-reference resolution on clinical text.
- Author
-
Chowdhury, Md. Faisal Mahbub and Zweigenbaum, Pierre
- Abstract
Identification of co-referent entity mentions inside text has significant importance for other natural language processing (NLP) tasks (e.g. event linking). However, this task, known as co-reference resolution, remains a complex problem, partly because of the confusion over different evaluation metrics and partly because the well-researched existing methodologies do not perform well on new domains such as clinical records. This paper presents a variant of the influential mention-pair model for co-reference resolution. Using a series of linguistically and semantically motivated constraints, the proposed approach controls generation of less-informative/sub-optimal training and test instances. Additionally, the approach also introduces some aggressive greedy strategies in chain clustering. The proposed approach has been tested on the official test corpus of the recently held i2b2/VA 2011 challenge. It achieves an unweighted average F1 score of 0.895, calculated from multiple evaluation metrics (MUC, B³ and CEAF scores). These results are comparable to the best systems of the challenge. What makes our proposed system distinct is that it also achieves high average F1 scores for each individual chain type (Test: 0.897, Person: 0.852, Problem: 0.855, Treatment: 0.884). Unlike other works, it obtains good scores for each of the individual metrics rather than being biased towards a particular metric. [Copyright © Elsevier]
- Published
- 2013
- Full Text
- View/download PDF
42. Factuality Levels of Diagnoses in Swedish Clinical Text.
- Author
-
Moen, Anne, Andersen, Stig Kjær, Aarts, Jos, Hurlen, Petter, Velupillai, Sumithra, Dalianis, Hercules, and Kvist, Maria
- Abstract
Different levels of knowledge certainty, or factuality levels, are expressed in clinical health record documentation. This information is currently not fully exploited, as the subtleties expressed in natural language cannot easily be machine analyzed. Extracting relevant information from knowledge-intensive resources such as electronic health records can be used for improving health care in general by e.g. building automated information access systems. We present an annotation model of six factuality levels linked to diagnoses in Swedish clinical assessments from an emergency ward. Our main findings are that overall agreement is fairly high (0.7/0.58 F-measure, 0.73/0.6 Cohen's κ, Intra/Inter). These distinctions are important for knowledge models, since only approx. 50% of the diagnoses are affirmed with certainty. Moreover, our results indicate that there are patterns inherent in the diagnosis expressions themselves conveying factuality levels, showing that certainty is not only dependent on context cues. [ABSTRACT FROM AUTHOR]
- Published
- 2011
43. UMLS content views appropriate for NLP processing of the biomedical literature vs. clinical text.
- Author
-
Demner-Fushman, Dina, Mork, James G., Shooshan, Sonya E., and Aronson, Alan R.
- Abstract
Identification of medical terms in free text is a first step in such Natural Language Processing (NLP) tasks as automatic indexing of biomedical literature and extraction of patients’ problem lists from the text of clinical notes. Many tools developed to perform these tasks use biomedical knowledge encoded in the Unified Medical Language System (UMLS) Metathesaurus. We continue our exploration of automatic approaches to creation of subsets (UMLS content views) which can support NLP processing of either the biomedical literature or clinical text. We found that suppression of highly ambiguous terms in the conservative AutoFilter content view can partially replace manual filtering for literature applications, and suppression of two-character mappings in the same content view achieves 89.5% precision at 78.6% recall for clinical applications. [Copyright © Elsevier]
- Published
- 2010
- Full Text
- View/download PDF
44. Building a semantically annotated corpus of clinical texts.
- Author
-
Roberts, Angus, Gaizauskas, Robert, Hepple, Mark, Demetriou, George, Guo, Yikun, Roberts, Ian, and Setzer, Andrea
- Abstract
In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, whose value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains. [Copyright © Elsevier]
- Published
- 2009
- Full Text
- View/download PDF
45. Preparing Clinical Text for Use in Biomedical Research.
- Author
-
Pestian, John P., Itert, Lukasz, Andersen, Charlotte, and Duch, Wlodzislaw
- Subjects
MEDICAL research ,ELECTRONIC health records ,ANNOTATIONS ,HOSPITAL records ,MEDICAL centers ,AUTOMATIC data collection systems ,HOSPITAL admission & discharge ,CHILDREN'S hospitals
- Abstract
Approximately 57 different types of clinical annotations make up a patient's medical record. These annotations include radiology reports, discharge summaries, and surgical and nursing notes. Hospitals typically produce millions of text-based medical records over the course of a year. These records are essential for the delivery of care, but many are underutilized or not utilized at all for clinical research. The textual data found in these annotations is a rich source of insights into aspects of clinical care and the clinical delivery system. Recent regulatory actions, however, require that, in many cases, data not obtained through informed consent or data not related to the delivery of care must be made anonymous (referred to by regulators as harmless) before they can be used. This article describes a practical approach with which Cincinnati Children's Hospital Medical Center (CCHMC), a large pediatric academic medical center with more than 761,000 annual patient encounters, developed open source software for making pediatric clinical text harmless without losing its rich meaning. Development of the software dealt with many of the issues that often arise in natural language processing, such as data collection, disambiguation, and data scrubbing. [ABSTRACT FROM AUTHOR]
- Published
- 2006
- Full Text
- View/download PDF
46. Attention-based bidirectional long short-term memory networks for extracting temporal relationships from clinical discharge summaries.
- Author
-
Alfattni, Ghada, Peek, Niels, and Nenadic, Goran
- Abstract
Temporal relation extraction between health-related events is a widely studied task in clinical Natural Language Processing (NLP). The current state-of-the-art methods mostly rely on engineered features (i.e., rule-based modelling) and sequence modelling, which often encodes a source sentence into a single fixed-length context. An obvious disadvantage of this fixed-length context design is its incapability to model longer sentences, as important temporal information in the clinical text may appear at different positions. To address this issue, we propose an Attention-based Bidirectional Long Short-Term Memory (Att-BiLSTM) model to enable learning the important semantic information in long source text segments and to better determine which parts of the text are most important. We experimented with two embeddings and compared the performances to traditional state-of-the-art methods that require elaborate linguistic pre-processing and hand-engineered features. The experimental results on the i2b2 2012 temporal relation test corpus show that the proposed method achieves a significant improvement with an F-score of 0.811, which is at least 10% better than state-of-the-art in the field. We show that the model can be remarkably effective at classifying temporal relations when provided with word embeddings trained on corpora in a general domain. Finally, we perform an error analysis to gain insight into the common errors made by the model. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
47. An efficient prototype method to identify and correct misspellings in clinical text
- Author
-
Yijun Shao, T. Elizabeth Workman, Qing Zeng-Treitler, and Guy Divita
- Subjects
0301 basic medicine ,Research Report ,Medical Records Systems, Computerized ,Computer science ,Pathology, Surgical ,lcsh:Medicine ,Dictionaries as Topic ,computer.software_genre ,General Biochemistry, Genetics and Molecular Biology ,03 medical and health sciences ,Spelling analysis ,0302 clinical medicine ,Error analysis ,Text messaging ,False positive paradox ,Humans ,Word2vec ,Word2Vec ,030212 general & internal medicine ,lcsh:Science (General) ,lcsh:QH301-705.5 ,Language ,Natural Language Processing ,business.industry ,lcsh:R ,Reproducibility of Results ,General Medicine ,Emergency department ,Clinical text ,Unified Medical Language System ,Spelling ,Term (time) ,Research Note ,030104 developmental biology ,lcsh:Biology (General) ,Vocabulary, Controlled ,Word embeddings ,Edit distance ,Artificial intelligence ,business ,computer ,Spelling correction ,Natural language processing ,Algorithms ,Medical Informatics ,lcsh:Q1-390
- Abstract
Objective: Misspellings in clinical free text present challenges to natural language processing. With an objective to identify misspellings and their corrections, we developed a prototype spelling analysis method that implements Word2Vec, Levenshtein edit distance constraints, a lexical resource, and corpus term frequencies. We used the prototype method to process two different corpora, surgical pathology reports, and emergency department progress and visit notes, extracted from Veterans Health Administration resources. We evaluated performance by measuring positive predictive value and performing an error analysis of false positive output, using four classifications. We also performed an analysis of spelling errors in each corpus, using common error classifications. Results: In this small-scale study utilizing a total of 76,786 clinical notes, the prototype method achieved positive predictive values of 0.9057 and 0.8979, respectively, for the surgical pathology reports, and emergency department progress and visit notes, in identifying and correcting misspelled words. False positives varied by corpus. Spelling error types were similar between the two corpora; however, the authors of emergency department progress and visit notes made over four times as many errors. Overall, the results of this study suggest that this method could also perform sufficiently in identifying misspellings in other clinical document types. Electronic supplementary material: The online version of this article (10.1186/s13104-019-4073-y) contains supplementary material, which is available to authorized users.
- Published
- 2019
48. Extracting and classifying diagnosis dates from clinical notes: A case study.
- Author
-
Fu, Julia T., Sholle, Evan, Krichevsky, Spencer, Scandura, Joseph, and Campion, Thomas R.
- Abstract
Myeloproliferative neoplasms (MPNs) are chronic hematologic malignancies that may progress over long disease courses. The original date of diagnosis is an important piece of information for patient care and research, but is not consistently documented. We describe an attempt to build a pipeline for extracting dates with natural language processing (NLP) tools and techniques and classifying them as relevant diagnoses or not. Inaccurate and incomplete date extraction and interpretation impacted the performance of the overall pipeline. Existing lightweight Python packages tended to have low specificity for identifying and interpreting partial and relative dates in clinical text. A rules-based regular expression (regex) approach achieved recall of 83.0% on dates manually annotated as diagnosis dates, and 77.4% on all annotated dates. With only 3.8% of annotated dates representing initial MPN diagnoses, additional methods of targeting candidate date instances may alleviate noise and class imbalance. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
49. Modern Clinical Text Mining: A Guide and Review.
- Author
-
Percha B
- Subjects
- Electronic Health Records, Humans, Machine Learning, Time, Data Mining, Physicians
- Abstract
Electronic health records (EHRs) are becoming a vital source of data for healthcare quality improvement, research, and operations. However, much of the most valuable information contained in EHRs remains buried in unstructured text. The field of clinical text mining has advanced rapidly in recent years, transitioning from rule-based approaches to machine learning and, more recently, deep learning. With new methods come new challenges, however, especially for those new to the field. This review provides an overview of clinical text mining for those who are encountering it for the first time (e.g., physician researchers, operational analytics teams, machine learning scientists from other domains). While not a comprehensive survey, this review describes the state of the art, with a particular focus on new tasks and methods developed over the past few years. It also identifies key barriers between these remarkable technical advances and the practical realities of implementation in health systems and in industry.
- Published
- 2021
- Full Text
- View/download PDF
50. Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis.
- Author
-
Ormerod M, Martínez Del Rincón J, and Devereux B
- Abstract
Background: Semantic textual similarity (STS) is a natural language processing (NLP) task that involves assigning a similarity score to 2 snippets of text based on their meaning. This task is particularly difficult in the domain of clinical text, which often features specialized language and the frequent use of abbreviations. Objective: We created an NLP system to predict similarity scores for sentence pairs as part of the Clinical Semantic Textual Similarity track in the 2019 n2c2/OHNLP Shared Task on Challenges in Natural Language Processing for Clinical Data. We subsequently sought to analyze the intermediary token vectors extracted from our models while processing a pair of clinical sentences to identify where and how representations of semantic similarity are built in transformer models. Methods: Given a clinical sentence pair, we take the average predicted similarity score across several independently fine-tuned transformers. In our model analysis we investigated the relationship between the final model's loss and surface features of the sentence pairs and assessed the decodability and representational similarity of the token vectors generated by each model. Results: Our model achieved a correlation of 0.87 with the ground-truth similarity score, reaching 6th place out of 33 teams (with a first-place score of 0.90). In detailed qualitative and quantitative analyses of the model's loss, we identified the system's failure to correctly model semantic similarity when both sentence pairs contain details of medical prescriptions, as well as its general tendency to overpredict semantic similarity given significant token overlap. The token vector analysis revealed divergent representational strategies for predicting textual similarity between bidirectional encoder representations from transformers (BERT)-style models and XLNet. We also found that a large amount of information relevant to predicting STS can be captured using a combination of a classification token and the cosine distance between sentence-pair representations in the first layer of a transformer model that did not produce the best predictions on the test set. Conclusions: We designed and trained a system that uses state-of-the-art NLP models to achieve very competitive results on a new clinical STS data set. As our approach uses no hand-crafted rules, it serves as a strong deep learning baseline for this task. Our key contribution is a detailed analysis of the model's outputs and an investigation of the heuristic biases learned by transformer models. We suggest future improvements based on these findings. In our representational analysis we explore how different transformer models converge or diverge in their representation of semantic signals as the tokens of the sentences are augmented by successive layers. This analysis sheds light on how these "black box" models integrate semantic similarity information in intermediate layers, and points to new research directions in model distillation and sentence embedding extraction for applications in clinical NLP. (©Mark Ormerod, Jesús Martínez del Rincón, Barry Devereux. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 26.05.2021.)
- Published
- 2021
- Full Text
- View/download PDF