19 results on "Özlem Uzuner"
Search Results
2. Fake Document Generation for Cyber Deception by Manipulating Text Comprehensibility
- Author
-
Hemant Purohit, Prakruthi Karuna, Sushil Jajodia, Rajesh Ganesan, and Özlem Uzuner
- Subjects
Class (computer programming), Information retrieval, Computer Networks and Communications, Computer science, Access control, Intellectual property, Deception, Psycholinguistics, Computer Science Applications, Reading comprehension, Control and Systems Engineering, Genetic algorithm, Electrical and Electronic Engineering, Set (psychology), Information Systems - Abstract
Advanced cyber attackers can penetrate enterprise networks and steal critical documents containing intellectual property despite all access control measures. Cyber deception is one of many solutions to protect critical documents after an attacker penetrates the network. It requires the generation and deployment of decoys such as fake text. The comprehensibility of a fake text document can affect the time and effort required for an attack to succeed. However, existing cybersecurity research has given limited attention to exploring the comprehensibility features of text for fake document generation. This article presents a novel method to generate believable fake text documents by measuring and manipulating the comprehensibility of legit text within a genetic algorithm (GA) framework. For measuring text comprehensibility, we adopt a set of quantitative measures based on qualitative principles of psycholinguistics and reading comprehension: connectivity, dispersion, and sequentiality. Our user-study analysis indicates that the quantitative comprehensibility measures can approximate the degree of human effort required to comprehend a fake text document in contrast to a legit text. For manipulating text comprehensibility, we develop a multiobjective, multimutation GA that modifies a legit document to Pareto-optimally alter its comprehensibility measures and generate hard-to-comprehend, believable fake documents. Our experiments show that the proposed algorithm successfully generates fake documents for a broader class of legit documents with varied text characteristics when compared to baselines from previous research. Hence, the application of our method can help improve cyber deception systems by providing more believable yet hard-to-comprehend fake documents to mislead cyber attackers.
- Published
- 2021
- Full Text
- View/download PDF
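The GA framework sketched in the abstract above lends itself to a compact illustration. Below is a minimal Python sketch of a multiobjective, multimutation GA over a document's sentences. The comprehensibility proxies (`connectivity`, `dispersion`), the mutation operators, and the selection rule are all illustrative assumptions, not the authors' implementation; the paper's third measure, sequentiality, is omitted for brevity.

```python
import random

def connectivity(doc):
    # Proxy: fraction of adjacent sentence pairs sharing at least one word.
    pairs = list(zip(doc, doc[1:]))
    if not pairs:
        return 0.0
    shared = sum(bool(set(a.split()) & set(b.split())) for a, b in pairs)
    return shared / len(pairs)

def dispersion(doc):
    # Proxy: lexical diversity (unique words / total words).
    words = " ".join(doc).split()
    return len(set(words)) / max(len(words), 1)

def mutate(doc):
    # Multiple mutation operators: swap, drop, or duplicate a sentence.
    doc, i = doc[:], random.randrange(len(doc))
    op = random.choice(["swap", "drop", "dup"])
    if op == "swap" and len(doc) > 1:
        j = random.randrange(len(doc))
        doc[i], doc[j] = doc[j], doc[i]
    elif op == "drop" and len(doc) > 2:
        del doc[i]
    else:
        doc.insert(i, doc[i])
    return doc

def dominates(a, b):
    # Pareto dominance: no worse on every objective, better on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def ga(legit, generations=20, pop_size=10):
    pop = [mutate(legit) for _ in range(pop_size)]
    for _ in range(generations):
        pop += [mutate(random.choice(pop)) for _ in range(pop_size)]
        # Objectives to maximize: incomprehensibility (1 - connectivity)
        # and dispersion; keep the Pareto-optimal candidates.
        scored = [(d, (1 - connectivity(d), dispersion(d))) for d in pop]
        front = [d for d, s in scored
                 if not any(dominates(t, s) for _, t in scored if t != s)]
        pop = front[:pop_size] or pop[:pop_size]
    return pop[0]

doc = ["the merger closed in march",
       "profits rose sharply after the merger",
       "the board approved a new budget"]
print(ga(doc))
```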
3. An Exploratory Study on Pseudo-Data Generation in Prescription and Adverse Drug Reaction Extraction
- Author
-
Carson Tao, Kahyun Lee, Michele Filannino, and Özlem Uzuner
- Subjects
Drug-Related Side Effects and Adverse Reactions, Knowledge Bases, Information Storage and Retrieval, Drug Prescriptions, Natural Language Processing, Semantics - Abstract
Prescription information and adverse drug reactions (ADR) are two components of detailed medication instructions that can benefit many aspects of clinical research. Automatic extraction of this information from free-text narratives via Information Extraction (IE) can open it up to downstream uses. IE is commonly tackled by supervised Natural Language Processing (NLP) systems which rely on annotated training data. However, training data generation is manual, time-consuming, and labor-intensive. It is desirable to develop automatic methods for augmenting manually labeled data. We propose pseudo-data generation as one such automatic method. Pseudo-data are synthetic data generated by combining elements of existing labeled data. We propose and evaluate two sets of pseudo-data generation methods: knowledge-driven methods based on gazetteers and data-driven methods based on deep learning. We use the resulting pseudo-data to improve medication and ADR extraction. Data-driven pseudo-data are suitable for concept categories with high semantic regularities and short textual spans. Knowledge-driven pseudo-data are effective for concept categories with longer textual spans, assuming the knowledge base offers good coverage of these concepts. Combining the knowledge- and data-driven pseudo-data achieves significant performance improvement on medication names and ADRs over baselines limited to the use of available labeled data.
- Published
- 2019
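A minimal sketch of the knowledge-driven variant described in the abstract above: pseudo-sentences are synthesized by swapping an annotated medication span with other entries from a gazetteer. The gazetteer contents, the labeled example, and the offsets are illustrative placeholders, not the paper's data.

```python
import random

drug_gazetteer = ["lisinopril", "metformin", "warfarin", "atorvastatin"]

def make_pseudo(sentence, span, n=3):
    """sentence: raw text; span: (start, end) character offsets of the
    labeled medication mention. Returns n pseudo-sentences with new spans."""
    start, end = span
    out = []
    for drug in random.sample(drug_gazetteer, n):
        text = sentence[:start] + drug + sentence[end:]
        out.append((text, (start, start + len(drug))))
    return out

# Toy labeled example: "aspirin" occupies characters 23-30.
labeled = ("Patient was started on aspirin 81 mg daily.", (23, 30))
for text, new_span in make_pseudo(*labeled):
    print(text, "->", text[new_span[0]:new_span[1]])
```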
4. An Empirical Test of GRUs and Deep Contextualized Word Representations on De-Identification
- Author
-
Kahyun Lee, Michele Filannino, and Özlem Uzuner
- Subjects
Data Anonymization, Electronic Health Records - Abstract
De-identification aims to remove 18 categories of protected health information from electronic health records. Ideally, de-identification systems should be reliable and generalizable. Previous research has focused on improving performance but has not examined generalizability. This paper investigates both performance and generalizability. To improve on the current state-of-the-art performance, which is based on long short-term memory (LSTM) units, we introduce a system that uses gated recurrent units (GRUs) and deep contextualized word representations, neither of which had previously been applied to de-identification. We measure the performance and generalizability of each system using the 2014 i2b2/UTHealth and 2016 CEGS N-GRID de-identification datasets. We show that deep contextualized word representations improve state-of-the-art performance, while the benefit of replacing LSTM units with GRUs is not significant. The generalizability of the de-identification systems improved significantly with deep contextualized word representations; in addition, the LSTM-based system is more generalizable than the GRU-based system.
- Published
- 2019
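A minimal PyTorch sketch of the GRU-based tagging idea in the abstract above: a bidirectional GRU over precomputed (possibly contextualized) word vectors with a per-token projection onto de-identification labels. The dimensions, the label count, and the use of randomly generated embeddings are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GRUTagger(nn.Module):
    def __init__(self, emb_dim=1024, hidden=128, n_labels=9):
        super().__init__()
        # Inputs would be pretrained contextualized word vectors;
        # here they are just random floats of the right shape.
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, embeddings):            # (batch, seq_len, emb_dim)
        states, _ = self.gru(embeddings)      # (batch, seq_len, 2 * hidden)
        return self.out(states)               # per-token label logits

model = GRUTagger()
dummy = torch.randn(2, 30, 1024)              # 2 sentences, 30 tokens each
print(model(dummy).shape)                     # torch.Size([2, 30, 9])
```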
5. E-petition popularity: Do linguistic and semantic factors matter?
- Author
-
Loni Hagen, Tim Fake, Teresa M. Harrison, Will May, Satya Katragadda, and Özlem Uzuner
- Subjects
Topic model, Wicked problem, Sociology and Political Science, Repetition (rhetorical device), Computer science, Process (engineering), Library and Information Sciences, Popularity, Linguistics, World Wide Web, Variation (linguistics), Named-entity recognition, Public participation, Law - Abstract
E-petitioning technology platforms elicit the participation of citizens in the policy-making process but at the same time create large volumes of unstructured textual data that are difficult to analyze. Fortunately, computational tools can assist policy analysts in uncovering latent patterns from these large textual datasets. This study uses such computational tools to explore e-petitions, viewing them as persuasive texts with linguistic and semantic features that may be related to the popularity of petitions, as indexed by the number of signatures they attract. Using We the People website data, we analyzed linguistic features, such as extremity and repetition, and semantic features, such as named entities and topics, to determine whether and to what extent they are related to petition popularity. The results show that each block of variables independently explains statistically significant variation in signature accumulation, and that 1) language extremity is persistently and negatively associated with petition popularity, 2) petitions with many names tend not to become popular, and 3) petition popularity is associated with petitions that include topics familiar to the public or about important social events. We believe explorations along these lines will yield useful strategies to address the wicked problem of too much text data and to facilitate the enhancement of public participation in policy-making.
- Published
- 2016
- Full Text
- View/download PDF
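A minimal sketch of the blockwise analysis in the abstract above: regress (log) signature counts on successive blocks of linguistic and semantic features and compare explained variance. The feature matrices below are random placeholders standing in for real extracted features, so the numbers are meaningless; only the procedure is illustrated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
linguistic = rng.normal(size=(n, 4))   # e.g., extremity, repetition, ...
semantic = rng.normal(size=(n, 6))     # e.g., named-entity counts, topics
y = rng.normal(size=n)                 # stand-in for log signature counts

def r2(X):
    # Variance in signature accumulation explained by this feature set.
    return LinearRegression().fit(X, y).score(X, y)

print("linguistic only      :", r2(linguistic))
print("linguistic + semantic:", r2(np.hstack([linguistic, semantic])))
```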
6. Do Sentiments Matter in Fraud Detection? Estimating Semantic Orientation of Annual Reports
- Author
-
Sunita Goel and Özlem Uzuner
- Subjects
Subjectivity, Orientation (computer vision), Contrast (statistics), Advertising, Adverb, General Business, Management and Accounting, Frequent use, Artificial intelligence, Psychology, Adjective, Finance, Natural language processing - Abstract
We present a novel approach for analyzing the qualitative content of annual reports. Using natural language processing techniques, we determine whether the sentiment expressed in the text matters in fraud detection. We focus on the Management Discussion and Analysis (MD&A) section of annual reports because of the nonfactual content present in this section, unlike other components of the annual reports. We measure the sentiment expressed in the text on the dimensions of polarity, subjectivity, and intensity, and investigate in depth whether truthful and fraudulent MD&As differ in terms of sentiment polarity, sentiment subjectivity, and sentiment intensity. Our results show that fraudulent MD&As on average contain three times more positive sentiment and four times more negative sentiment than truthful MD&As. This suggests that the use of both positive and negative sentiment is more pronounced in fraudulent MD&As. We further find that, compared with truthful MD&As, fraudulent MD&As contain a greater proportion of subjective content than objective content. This suggests that the use of subjectivity clues, such as the presence of many adjectives and adverbs, could be an indicator of fraud. Clear cases of fraud show a higher intensity of sentiment, exhibited through greater use of adverbs in the "adverb modifying adjective" pattern. Based on the results of this study, frequent use of intensifiers, particularly in this pattern, could be another indicator of fraud. Moreover, the dimensions of subjectivity and intensity help in accurately classifying borderline examples of MD&As that are equal in sentiment polarity into fraudulent and truthful categories. Taken together, these findings suggest that fraudulent MD&As contain higher sentiment content than truthful MD&As.
- Published
- 2016
- Full Text
- View/download PDF
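A minimal sketch of the three sentiment dimensions in the abstract above, using TextBlob as a stand-in pipeline (an assumption; the paper does not name this library). Polarity and subjectivity map directly to TextBlob's outputs; intensity is approximated by counting the "adverb modifying adjective" POS pattern.

```python
# Requires: pip install textblob && python -m textblob.download_corpora
from textblob import TextBlob

def mdna_sentiment(text):
    blob = TextBlob(text)
    tags = blob.tags  # [(word, Penn Treebank POS tag), ...]
    # Count the "adverb modifying adjective" pattern (RB* followed by JJ*)
    # as a crude proxy for sentiment intensity.
    intensifiers = sum(1 for (_, t1), (_, t2) in zip(tags, tags[1:])
                       if t1.startswith("RB") and t2.startswith("JJ"))
    return {"polarity": blob.sentiment.polarity,          # -1 .. 1
            "subjectivity": blob.sentiment.subjectivity,  # 0 .. 1
            "intensity": intensifiers}

print(mdna_sentiment("Results were extremely strong and truly remarkable."))
```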
7. Advancing the State of the Art in Clinical Natural Language Processing through Shared Tasks
- Author
-
Michele Filannino and Özlem Uzuner
- Subjects
Computer science, MEDLINE, scientific challenges, Named-entity recognition, Clinical natural language processing, Health care, Adverse Drug Reaction Reporting Systems, Electronic Health Records, Confidentiality, Social media, shared tasks, Survey, Drug labeling, General Medicine, Relationship extraction, Artificial intelligence, Natural Language Processing - Abstract
Objectives: To review the latest scientific challenges organized in clinical Natural Language Processing (NLP) by highlighting the tasks, the most effective methodologies used, the data, and the sharing strategies. Methods: We harvested the literature by using Google Scholar and PubMed Central to retrieve all shared tasks organized since 2015 on clinical NLP problems on English data. Results: We surveyed 17 shared tasks. We grouped the data into four types (synthetic, drug labels, social data, and clinical data) which are correlated with size and sensitivity. We found named entity recognition and classification to be the most common tasks. Most of the methods used to tackle the shared tasks have been data-driven. There is homogeneity in the methods used to tackle the named entity recognition tasks, while more diverse solutions are investigated for relation extraction, multi-class classification, and information retrieval problems. Conclusions: There is a clear trend in using data-driven methods to tackle problems in clinical NLP. The availability of more and varied data from different institutions will undoubtedly lead to bigger advances in the field, for the benefit of healthcare as a whole.
- Published
- 2018
8. Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text
- Author
-
Ephraim P. Hochberg, Yu Xin, Rohit Joshi, Peter Szolovits, Özlem Uzuner, and Yuan Luo
- Subjects
Narration, Health Informatics, Pattern recognition, Hodgkin Disease, Matrix decomposition, Non-negative matrix factorization, Tensor (intrinsic definition), Feature (machine learning), Selection (linguistics), Data Mining, Electronic Health Records, Humans, Unsupervised learning, Artificial intelligence, Cluster analysis, Natural Language Processing, Unsupervised Machine Learning, Mathematics, Interpretability - Abstract
Objective Extracting medical knowledge from electronic medical records requires automated approaches to combat scalability limitations and selection biases. However, existing machine learning approaches are often regarded by clinicians as black boxes. Moreover, training data for these automated approaches are often sparsely annotated at best. The authors target unsupervised learning for modeling clinical narrative text, aiming at improving both accuracy and interpretability. Methods The authors introduce a novel framework named subgraph augmented non-negative tensor factorization (SANTF). In addition to relying on atomic features (e.g., words in clinical narrative text), SANTF automatically mines higher-order features (e.g., relations of lymphoid cells expressing antigens) from clinical narrative text by converting sentences into a graph representation and identifying important subgraphs. The authors compose a tensor using patients, higher-order features, and atomic features as its respective modes. They then apply non-negative tensor factorization to cluster patients, and simultaneously identify latent groups of higher-order features that link to patient clusters, as in clinical guidelines where a panel of immunophenotypic features and laboratory results is used to specify diagnostic criteria. Results and Conclusion SANTF demonstrated over 10% improvement in averaged F-measure on patient clustering compared to widely used non-negative matrix factorization (NMF) and k-means clustering methods. Multiple baselines were established by modeling patient data using patient-by-features matrices with different feature configurations and then performing NMF or k-means to cluster patients. Feature analysis identified latent groups of higher-order features that lead to medical insights. The authors also found that the latent groups of atomic features help to better correlate the latent groups of higher-order features.
- Published
- 2015
- Full Text
- View/download PDF
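A minimal sketch of the tensor factorization step in the abstract above, using TensorLy. The patients × higher-order-features × atomic-features tensor is random here; in SANTF it would hold feature counts mined from the narratives. The rank and shapes are arbitrary assumptions.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

# Patients x higher-order features x atomic features (random stand-in).
tensor = tl.tensor(np.random.rand(50, 40, 30))
weights, factors = non_negative_parafac(tensor, rank=5)
patients, higher_order, atomic = factors

# Assign each patient to the latent component with the largest loading.
clusters = patients.argmax(axis=1)
print(clusters[:10])
```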
9. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge
- Author
-
Anna Rumshisky, Weiyi Sun, and Özlem Uzuner
- Subjects
Normalization (statistics), Relation (database), Event (computing), Computer science, Patient Discharge Summaries, Health Informatics, Timeline, Rule-based system, Review, Time, Translational Research, Biomedical, Information extraction, Artificial Intelligence, Informatics, Electronic Health Records, Humans, Heuristics, Natural Language Processing - Abstract
Background The Sixth Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenge for Clinical Records focused on temporal relations in clinical narratives. The organizers provided the research community with a corpus of discharge summaries annotated with temporal information, to be used for the development and evaluation of temporal reasoning systems. Eighteen teams from around the world participated in the challenge. During the workshop, participating teams presented comprehensive reviews and analyses of their systems, and outlined future research directions suggested by the challenge contributions. Methods The challenge evaluated systems on information extraction tasks that targeted: (1) clinically significant events, including both clinical concepts, such as problems, tests, treatments, and clinical departments, and events relevant to the patient's clinical timeline, such as admissions, transfers between departments, etc.; (2) temporal expressions, referring to phrases that denote dates, times, durations, or frequencies in the clinical text, whose extracted values had to be normalized to an ISO specification standard; and (3) temporal relations between the clinical events and temporal expressions. Participants determined pairs of events and temporal expressions that exhibited a temporal relation, and identified the temporal relation between them. Results For event detection, statistical machine learning (ML) methods consistently showed superior performance. While ML and rule-based methods seemed to detect temporal expressions equally well, the best systems overwhelmingly adopted a rule-based approach for value normalization. For temporal relation classification, the systems using hybrid approaches that combined ML and heuristics-based methods produced the best results.
- Published
- 2013
- Full Text
- View/download PDF
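A minimal sketch of the value-normalization subtask in the abstract above, in the rule-based spirit the best systems adopted: a few regular-expression rules that map common clinical temporal expressions to ISO 8601-style values. The rule table is a tiny illustrative fragment, and the frequency value format is an assumption, not the challenge's exact specification.

```python
import re

RULES = [
    # Dates like 3/14/2012 -> 2012-03-14.
    (re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})"),
     lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"),
    # Durations like "for 5 days" -> ISO 8601 duration P5D.
    (re.compile(r"\bfor (\d+) days?\b", re.I),
     lambda m: f"P{m.group(1)}D"),
    # Frequencies; the output format here is an illustrative assumption.
    (re.compile(r"\btwice daily\b|\bb\.?i\.?d\.?\b", re.I),
     lambda m: "RPT12H"),
]

def normalize(expr):
    for pattern, rule in RULES:
        m = pattern.search(expr)
        if m:
            return rule(m)
    return None

for e in ["admitted on 3/14/2012", "for 5 days", "aspirin b.i.d."]:
    print(e, "->", normalize(e))
```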
10. MCORES: a system for noun phrase coreference resolution for clinical records
- Author
-
Özlem Uzuner, Andreea Bodnari, and Peter Szolovits
- Subjects
Computer science, Health Informatics, Research and Applications, Semantics, Pattern Recognition, Automated, Set (abstract data type), Artificial Intelligence, Data Mining, Electronic Health Records, Humans, Natural Language Processing, Coreference, Anaphora (linguistics), Resolution (logic), Classification, Patient Discharge, United States, Noun phrase, Binary classification - Abstract
Objective Narratives of electronic medical records contain information that can be useful for clinical practice and multi-purpose research. This information needs to be put into a structured form before it can be used by automated systems. Coreference resolution is a step in the transformation of narratives into a structured form. Methods This study presents a medical coreference resolution system (MCORES) for noun phrases in four frequently used clinical semantic categories: persons, problems, treatments, and tests. MCORES treats coreference resolution as a binary classification task. Given a pair of concepts from a semantic category, it determines coreferent pairs and clusters them into chains. MCORES uses an enhanced set of lexical, syntactic, and semantic features. Some MCORES features measure the distance between various representations of the concepts in a pair and can be asymmetric. Results and Conclusion MCORES was compared with an in-house baseline that uses only single-perspective ‘token overlap’ and ‘number agreement’ features. MCORES was shown to outperform the baseline; its enhanced features contribute significantly to performance. In addition to the baseline, MCORES was compared against two available third-party, open-domain systems, RECONCILEACL09 and the Beautiful Anaphora Resolution Toolkit (BART). MCORES was shown to outperform both of these systems on clinical records.
- Published
- 2012
- Full Text
- View/download PDF
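A minimal sketch of the pairwise setup in the abstract above: each pair of mentions from a semantic category gets a feature vector (token overlap, number agreement, and a distance feature) and a binary coreferent/not-coreferent label. The features echo the baseline features named in the abstract; the mention encoding and classifier choice are assumptions.

```python
from sklearn.linear_model import LogisticRegression

def pair_features(m1, m2):
    # Single-perspective features in the spirit of the baseline.
    t1, t2 = set(m1["tokens"]), set(m2["tokens"])
    overlap = len(t1 & t2) / max(len(t1 | t2), 1)
    number_agree = float(m1["plural"] == m2["plural"])
    distance = abs(m1["sent"] - m2["sent"])   # sentence distance
    return [overlap, number_agree, distance]

mentions = [  # toy "problem" mentions with sentence indices
    {"tokens": ["the", "chest", "pain"], "plural": False, "sent": 1},
    {"tokens": ["pain"], "plural": False, "sent": 3},
    {"tokens": ["antibiotics"], "plural": True, "sent": 4},
]
X = [pair_features(a, b) for i, a in enumerate(mentions)
     for b in mentions[i + 1:]]
y = [1, 0, 0]  # gold: only mentions 0 and 1 corefer

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))  # coreferent pairs would then be clustered into chains
```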
11. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text
- Author
-
Brett R. South, Shuying Shen, Scott L. DuVall, and Özlem Uzuner
- Subjects
Training set, Relation (database), Computer science, Assertion, Health Informatics, Decision Support Systems, Clinical, Task (project management), Editorial, Ensembles of classifiers, Relation classification, Perspective, Data Mining, Electronic Health Records, Humans, Artificial intelligence, Reference standards, Clinical record, Natural Language Processing - Abstract
The 2010 i2b2/VA Workshop on Natural Language Processing Challenges for Clinical Records presented three tasks: a concept extraction task focused on the extraction of medical concepts from patient reports; an assertion classification task focused on assigning assertion types for medical problem concepts; and a relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. i2b2 and the VA provided an annotated reference standard corpus for the three tasks. Using this reference standard, 22 systems were developed for concept extraction, 21 for assertion classification, and 16 for relation classification. These systems showed that machine learning approaches could be augmented with rule-based systems to determine concepts, assertions, and relations. Depending on the task, the rule-based systems can either provide input for machine learning or post-process the output of machine learning. Ensembles of classifiers, information from unlabeled data, and external knowledge sources can help when the training data are inadequate.
- Published
- 2011
- Full Text
- View/download PDF
12. Semantic relations for problem-oriented medical records
- Author
-
Özlem Uzuner, Tawanda C. Sibanda, Russell Ryan, and Jonathan P. Mailoa
- Subjects
Patient Discharge, Information retrieval, Medical Records Systems, Computerized, Computer science, Medical record, Search engine indexing, Medicine (miscellaneous), Article, Patient Care Planning, Semantics, Support vector machine, Relation classification, Artificial Intelligence, Medical Records, Problem-Oriented, Humans, Classifier (UML), Natural language processing, Sentence, Semantic relation - Abstract
Objective: We describe semantic relation (SR) classification on medical discharge summaries. We focus on relations targeted to the creation of problem-oriented records. Thus, we define relations that involve the medical problems of patients. Methods and materials: We represent patients' medical problems with their diseases and symptoms. We study the relations of patients' problems with each other and with concepts that are identified as tests and treatments. We present an SR classifier that studies a corpus of patient records one sentence at a time. For all pairs of concepts that appear in a sentence, this SR classifier determines the relations between them. In doing so, the SR classifier takes advantage of surface, lexical, and syntactic features and uses these features as input to a support vector machine. We apply our SR classifier to two sets of medical discharge summaries, one obtained from the Beth Israel Deaconess Medical Center (BIDMC), Boston, MA, and the other from Partners Healthcare, Boston, MA. Results: On the BIDMC corpus, our SR classifier achieves micro-averaged F-measures that range from 74% to 95% on the various relation types. On the Partners corpus, the micro-averaged F-measures on the various relation types range from 68% to 91%. Our experiments show that lexical features (in particular, tokens that occur between candidate concepts, which we refer to as inter-concept tokens) are very informative for relation classification in medical discharge summaries. Using only the inter-concept tokens in the corpus, our SR classifier can recognize 84% of the relations in the BIDMC corpus and 72% of the relations in the Partners corpus. Conclusion: These results are promising for semantic indexing of medical records. They imply that we can take advantage of lexical patterns in discharge summaries for relation classification at a sentence level.
- Published
- 2010
- Full Text
- View/download PDF
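A minimal sketch of the inter-concept-token finding in the abstract above: a linear SVM trained only on the words that occur between the two candidate concepts in a sentence. The training pairs and labels are illustrative toys, not the BIDMC or Partners data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each example is only the text between two candidate concepts;
# the label is the relation that holds between them.
between = ["was prescribed for", "revealed", "was given to treat",
           "showed no evidence of"]
labels = ["treats", "reveals", "treats", "reveals"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(between, labels)
print(clf.predict(["ordered to treat"]))
```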
13. Community annotation experiment for ground truth generation for the i2b2 medication challenge
- Author
-
Fei Xia, Özlem Uzuner, Imre Solti, and Eithon Cadag
- Subjects
Ground truth, Information retrieval, Computer science, Information Storage and Retrieval, Health Informatics, Context (language use), Patient Discharge, Annotation, Pharmaceutical Preparations, Electronic Health Records, Humans, Quality (business), Discharge summary, Clinical record, Research Paper, Natural Language Processing - Abstract
Objective: Within the context of the Third i2b2 Workshop on Natural Language Processing Challenges for Clinical Records, the authors (also referred to as ‘the i2b2 medication challenge team’ or ‘the i2b2 team’ for short) organized a community annotation experiment. Design: For this experiment, the authors released annotation guidelines and a small set of annotated discharge summaries. They asked the participants of the Third i2b2 Workshop to annotate 10 discharge summaries per person; each discharge summary was annotated by two annotators from two different teams, and a third annotator from a third team resolved disagreements. Measurements: In order to evaluate the reliability of the annotations thus produced, the authors measured community inter-annotator agreement and compared it with the inter-annotator agreement of expert annotators when both the community and the expert annotators generated ground truth based on pooled system outputs. For this purpose, the pool consisted of the three most densely populated automatic annotations of each record. The authors also compared the community inter-annotator agreement with expert inter-annotator agreement when the experts annotated raw records without using the pool. Finally, they measured the quality of the community ground truth by comparing it with the expert ground truth. Results and conclusions: The authors found that the community annotators achieved comparable inter-annotator agreement to expert annotators, regardless of whether the experts annotated from the pool. Furthermore, the ground truth generated by the community obtained F-measures above 0.90 against the ground truth of the experts, indicating the value of the community as a source of high-quality ground truth even on intricate and domain-specific annotation tasks.
- Published
- 2010
- Full Text
- View/download PDF
14. Can Linguistic Predictors Detect Fraudulent Financial Filings?
- Author
-
Özlem Uzuner, Jagdish S. Gangolly, Sunita Goel, and Sue R. Faerman
- Subjects
Finance, Basic premise, Empirical examination, Computer science, Bag-of-words model, Accounting, Baseline (configuration management), Linguistics, Computer Science Applications, Style (sociolinguistics) - Abstract
Extensive research has been done on the analytical and empirical examination of financial data in annual reports to detect fraud; however, there is scant research on the analysis of text in annual reports to detect fraud. The basic premise of this research is that there are clues hidden in the text that can be detected to determine the likelihood of fraud. In this research, we examine both the verbal content and the presentation style of the qualitative portion of the annual reports using natural language processing tools and explore linguistic features that distinguish fraudulent annual reports from nonfraudulent annual reports. Our results indicate that employment of linguistic features is an effective means for detecting fraud. We were able to improve the prediction accuracy of our fraud detection model from initial baseline results of 56.75 percent accuracy, using a “bag of words” approach, to 89.51 percent accuracy when we incorporated linguistically motivated features inspired by our informed reasoning and domain knowledge.
- Published
- 2010
- Full Text
- View/download PDF
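A minimal sketch of the comparison in the abstract above: a bag-of-words baseline versus the same model with appended linguistically motivated features. The two extra features here (first-person-pronoun rate and document length) and the toy labels are illustrative stand-ins for the paper's feature set and corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["We believe results will improve next year.",
        "Revenue was overstated and later restated."]
labels = [0, 1]  # 0 = nonfraudulent, 1 = fraudulent (toy labels)

bow = CountVectorizer().fit_transform(docs).toarray()

def linguistic(doc):
    # Two toy linguistically motivated features.
    words = doc.lower().split()
    pronoun_rate = sum(w in {"we", "our", "us", "i"} for w in words) / len(words)
    return [pronoun_rate, len(words)]

X = np.hstack([bow, [linguistic(d) for d in docs]])
print(LogisticRegression().fit(X, labels).predict(X))
```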
15. A de-identifier for medical discharge summaries
- Author
-
Peter Szolovits, Yuan Luo, Tawanda C. Sibanda, and Özlem Uzuner
- Subjects
Conditional random field, Biomedical Research, Medical Records Systems, Computerized, Computer science, Medicine (miscellaneous), Context (language use), Semantics, Article, Pattern Recognition, Automated, Task (project management), Named-entity recognition, Artificial Intelligence, Humans, Confidentiality, Patient Discharge, Identifier, Heuristics, Natural Language Processing - Abstract
Objective: Clinical records contain significant medical information that can be useful to researchers in various disciplines. However, these records also contain personal health information (PHI) whose presence limits the use of the records outside of hospitals. The goal of de-identification is to remove all PHI from clinical records. This is a challenging task because many records contain foreign and misspelled PHI; they also contain PHI that are ambiguous with non-PHI. These complications are compounded by the linguistic characteristics of clinical records. For example, medical discharge summaries, which are studied in this paper, are characterized by fragmented, incomplete utterances and domain-specific language; they cannot be fully processed by tools designed for lay language. Methods and results: In this paper, we show that we can de-identify medical discharge summaries using a de-identifier, Stat De-id, based on support vector machines and local context (F-measure=97% on PHI). Our representation of local context aids de-identification even when PHI include out-of-vocabulary words and even when PHI are ambiguous with non-PHI within the same corpus. Comparison of Stat De-id with a rule-based approach shows that local context contributes more to de-identification than dictionaries combined with hand-tailored heuristics (F-measure=85%). Comparison with two well-known named entity recognition (NER) systems, SNoW (F-measure=94%) and IdentiFinder (F-measure=36%), on five representative corpora shows that when the language of documents is fragmented, a system with a relatively thorough representation of local context can be a more effective de-identifier than systems that combine (relatively simpler) local context with global context. Comparison with a Conditional Random Field De-identifier (CRFD), which utilizes global context in addition to the local context of Stat De-id, confirms this finding (F-measure=88%) and establishes that strengthening the representation of local context may be more beneficial for de-identification than complementing local with global context.
- Published
- 2008
- Full Text
- View/download PDF
16. Identifying Patient Smoking Status from Medical Discharge Records
- Author
-
Ira Goldstein, Isaac S. Kohane, Yuan Luo, and Özlem Uzuner
- Subjects
Medical education, Medical Records Systems, Computerized, Recall, Smoking, Unified Medical Language System, Health Informatics, Classification, Variety (linguistics), Data science, Medical Records, Patient Discharge, Identifier, Viewpoint Paper, Annotation, Identification (information), Social history (medicine), Informatics, Humans, Medicine, Natural Language Processing - Abstract
The authors organized a Natural Language Processing (NLP) challenge on automatically determining the smoking status of patients from information found in their discharge records. This challenge was issued as a part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, to survey, facilitate, and examine studies in medical language understanding for clinical narratives. This article describes the smoking challenge, details the data and the annotation process, explains the evaluation metrics, discusses the characteristics of the systems developed for the challenge, presents an analysis of the results of received system runs, draws conclusions about the state of the art, and identifies directions for future research. A total of 11 teams participated in the smoking challenge. Each team submitted up to three system runs, providing a total of 23 submissions. The submitted system runs were evaluated with microaveraged and macroaveraged precision, recall, and F-measure. The systems submitted to the smoking challenge represented a variety of machine learning and rule-based algorithms. Despite the differences in their approaches to smoking status identification, many of these systems provided good results. There were 12 system runs with microaveraged F-measures above 0.84. Analysis of the results highlighted the fact that discharge summaries express smoking status using a limited number of textual features (e.g., "smok", "tobac", "cigar", Social History, etc.). Many of the effective smoking status identifiers benefit from these features.
- Published
- 2008
- Full Text
- View/download PDF
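A minimal sketch of the evaluation protocol in the abstract above: microaveraged versus macroaveraged precision, recall, and F-measure over smoking-status classes, computed with scikit-learn. The class set and predictions are illustrative toys.

```python
from sklearn.metrics import precision_recall_fscore_support

classes = ["current", "past", "never", "unknown"]
gold = ["current", "never", "never", "past", "unknown", "never"]
pred = ["current", "never", "past", "past", "never", "never"]

for avg in ("micro", "macro"):
    p, r, f, _ = precision_recall_fscore_support(
        gold, pred, labels=classes, average=avg, zero_division=0)
    print(f"{avg}averaged: P={p:.2f} R={r:.2f} F={f:.2f}")
```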
17. Evaluating the State-of-the-Art in Automatic De-identification
- Author
-
Yuan Luo, Peter Szolovits, and Özlem Uzuner
- Subjects
Health Insurance Portability and Accountability Act, Medical Records Systems, Computerized, Process (engineering), Computer science, De-identification, Health Informatics, Data science, Patient Discharge, United States, Set (abstract data type), Viewpoint Paper, Annotation, Evaluation Studies as Topic, Terminology as Topic, Informatics, Humans, Confidentiality, Natural Language Processing, Test data - Abstract
To facilitate and survey studies in automatic de-identification, as a part of the i2b2 (Informatics for Integrating Biology to the Bedside) project, the authors organized a Natural Language Processing (NLP) challenge on automatically removing private health information (PHI) from medical discharge records. This manuscript provides an overview of this de-identification challenge, describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that addressed the challenge, analyzes the results of received system runs, and identifies directions for future research. The de-identification challenge data consisted of discharge summaries drawn from the Partners Healthcare system. The authors prepared this data for the challenge by replacing authentic PHI with synthesized surrogates. To focus the challenge on non-dictionary-based de-identification methods, the data was enriched with out-of-vocabulary PHI surrogates, i.e., made-up names. The data also included some PHI surrogates that were ambiguous with medical non-PHI terms. A total of seven teams participated in the challenge. Each team submitted up to three system runs, for a total of sixteen submissions. The authors used precision, recall, and F-measure to evaluate the submitted system runs based on their token-level and instance-level performance on the ground truth. The systems with the best performance scored above 98% in F-measure for all categories of PHI. Most out-of-vocabulary PHI could be identified accurately. However, identifying ambiguous PHI proved challenging. The performance of systems on the test data set is encouraging. Future evaluations of these systems will involve larger data sets from more heterogeneous sources.
- Published
- 2007
- Full Text
- View/download PDF
18. Editorial: The second international workshop on health natural language processing (HealthNLP 2019)
- Author
-
Yanshan Wang, Hua Xu, and Ozlem Uzuner
- Subjects
Natural language processing ,NLP ,Healthcare ,Electronic health records ,EHR ,Artificial intelligence ,Computer applications to medicine. Medical informatics ,R858-859.7 - Published
- 2019
- Full Text
- View/download PDF
19. Specializing for predicting obesity and its co-morbidities
- Author
-
Ira Goldstein and Özlem Uzuner
- Subjects
Databases, Factual, Computer science, Decision tree, Information Storage and Retrieval, Health Informatics, Comorbidity, Machine learning, Article, Bayes' theorem, Predictive Value of Tests, Humans, Disease, Obesity, Discharge summary, Models, Statistical, Natural language processing, Unified Medical Language System, Decision Trees, Combination of classifiers, Classification, Patient Discharge, Computer Science Applications, Semantics, Artificial intelligence, Classifier (UML), Medical Informatics - Abstract
We present specializing, a method for combining classifiers for multi-class classification. Specializing trains one specialist classifier per class and utilizes each specialist to distinguish that class from all others in a one-versus-all manner. It then supplements the specialist classifiers with a catch-all classifier that performs multi-class classification across all classes. We refer to the resulting combined classifier as a specializing classifier. We develop specializing to classify 16 diseases based on discharge summaries. For each discharge summary, we aim to predict whether each disease is present, absent, or questionable in the patient, or unmentioned in the discharge summary. We treat the classification of each disease as an independent multi-class classification task. For each disease, we develop one specialist classifier for each of the present, absent, questionable, and unmentioned classes; we supplement these specialist classifiers with a catch-all classifier that encompasses all of the classes for that disease. We evaluate specializing on each of the 16 diseases and show that it improves significantly over voting and stacking when used for multi-class classification on our data.
- Full Text
- View/download PDF
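A minimal sketch of the specializing scheme in the abstract above: one one-versus-all specialist per class plus a catch-all multi-class classifier. The combination rule shown (trust a specialist when exactly one fires, otherwise defer to the catch-all) is one plausible reading of the abstract, not necessarily the paper's exact rule, and the classifier choice is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

CLASSES = ["present", "absent", "questionable", "unmentioned"]

class SpecializingClassifier:
    def fit(self, X, y):
        # One one-versus-all specialist per class...
        self.specialists = {
            c: LogisticRegression().fit(X, [int(label == c) for label in y])
            for c in CLASSES}
        # ...plus a catch-all classifier over all classes.
        self.catch_all = LogisticRegression().fit(X, y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            x = x.reshape(1, -1)
            firing = [c for c in CLASSES
                      if self.specialists[c].predict(x)[0] == 1]
            # Trust a lone firing specialist; otherwise fall back.
            preds.append(firing[0] if len(firing) == 1
                         else self.catch_all.predict(x)[0])
        return preds

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                  # toy feature vectors
y = [CLASSES[i % 4] for i in range(40)]       # toy per-disease labels
print(SpecializingClassifier().fit(X, y).predict(X[:5]))
```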