Author: "Michael Bada" / Topic: molecular biology - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Michael Bada"' showing total 12 results

1. Concept recognition as a machine translation problem

Author: Negacy D. Hailu, Michael Bada, William A. Baumgartner, Lawrence Hunter, and Mayla Boguslav
Subjects: Normalization (statistics), Machine translation, Computer science, QH301-705.5, Computer applications to medicine. Medical informatics, R858-859.7, Ontology (information science), computer.software_genre, Machine learning, Biochemistry, Named entity normalization, Robustness (computer science), Structural Biology, Computational resources, Biology (General), Molecular Biology, Transformer (machine learning model), Hyperparameter, Training set, business.industry, Research, Applied Mathematics, Biomedical text mining, Computer Science Applications, Named entity recognition, Artificial intelligence, business, Concept recognition, Encoder, computer
Abstract: Background Automated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data has impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models have the potential to outperform multi-class classification approaches. Methods We systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning through extensive studies of alternative methods and hyperparameter selections. We not only identify the best-performing systems and parameters across a wide variety of ontologies but also provide insights into the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggests promising avenues for future improvements as well as design choices that can increase computational efficiency with small costs in performance. Results Bidirectional encoder representations from transformers for biomedical text mining (BioBERT) for span detection along with the open-source toolkit for neural machine translation (OpenNMT) for concept normalization achieve state-of-the-art performance for most ontologies annotated in the CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time, than several alternative approaches. Conclusions Machine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT shared task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation.
Published: 2021
Full Text: View/download PDF

2. Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles

Author: Miji Joo-young Choi, Karin Verspoor, Martha Palmer, William A. Baumgartner, Lawrence Hunter, Arrick Lanfranchi, Natalya Panteleyeva, Michael Bada, and K. Bretonnel Cohen
Subjects: 0301 basic medicine, Computer science, Annotation, 02 engineering and technology, lcsh:Computer applications to medicine. Medical informatics, Corpus, computer.software_genre, Semantics, Referent, Biochemistry, Domain (software engineering), 03 medical and health sciences, Coreference, Structural Biology, Anaphora, 0202 electrical engineering, electronic engineering, information engineering, Data Mining, lcsh:QH301-705.5, Molecular Biology, Information retrieval, business.industry, Applied Mathematics, Noun phrase, Computer Science Applications, Benchmarking, Information extraction, 030104 developmental biology, lcsh:Biology (General), Identity (object-oriented programming), lcsh:R858-859.7, 020201 artificial intelligence & image processing, Artificial intelligence, Periodicals as Topic, Resolution, business, computer, Natural language processing, Research Article
Abstract: Background Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations. Results The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached F of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, for an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus. Conclusions The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. 
The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain due to their referents to specific classes in domain-specific ontologies. The comparison of the performance of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and also suggest that the baseline performance difference is quite large.
Published: 2017
Full Text: View/download PDF

3. Identification of OBO nonalignments and its implications for OBO enrichment

Author: Lawrence Hunter and Michael Bada
Subjects: Statistics and Probability, Computer science, Databases and Ontologies, Information Storage and Retrieval, Ontology (information science), Biochemistry, Open Biomedical Ontologies, 03 medical and health sciences, Artificial Intelligence, Databases, Genetic, Controlled vocabulary, Molecular Biology, Natural Language Processing, 030304 developmental biology, 0303 health sciences, Information retrieval, Point (typography), 030306 microbiology, business.industry, Subject (documents), Object (computer science), Original Papers, Computer Science Applications, Systems Integration, Computational Mathematics, Identification (information), Vocabulary, Controlled, Computational Theory and Mathematics, Ontology, Database Management Systems, System integration, business, Algorithms
Abstract: Motivation: Existing projects that focus on the semiautomatic addition of links between existing terms in the Open Biomedical Ontologies can take advantage of reasoners that can make new inferences between terms that are based on the added formal definitions and that reflect nonalignments between the linked terms. However, these projects require that these definitions be necessary and sufficient, a strong requirement that often does not hold. If such definitions cannot be added, the reasoners cannot point to the nonalignments through the suggestion of new inferences. Results: We describe a methodology by which we have identified over 1900 instances of nonredundant nonalignments between terms from the Gene Ontology (GO) biological process (BP), cellular component (CC) and molecular function (MF) ontologies, Chemical Entities of Biological Interest (ChEBI) and the Cell Type Ontology (CL). Many of the 39.8% of these nonalignments whose object terms are more atomic than the subject terms are not currently examined in other ontology-enrichment projects due to the fact that the necessary and sufficient conditions required for the inferences are not currently examined. Analysis of the ratios of nonalignments to assertions from which the nonalignments were identified suggests that BP–MF, BP–BP, BP–CL and CC–CC terms are relatively well-aligned, while ChEBI–MF, BP–ChEBI and CC–MF terms are relatively not aligned well. We propose four ways to resolve an identified nonalignment and recommend an analogous implementation of our methodology in ontology-enrichment tools to identify types of nonalignments that are currently not detected. Availability: The nonalignments discussed in this article may be viewed at http://compbio.uchsc.edu/Hunter_lab/Bada/nonalignments_2008_03_06.html. Code for the generation of these nonalignments is available upon request. Contact: mike.bada@uchsc.edu
Published: 2008
Full Text: View/download PDF

4. Ribosomal dynamics inferred from variations in experimental measurements

Author: Michelle Whirl-Carrillo, D. Rey Banatao, Irene S. Gabashvili, Russ B. Altman, and Michael Bada
Subjects: Models, Molecular, Plane parallel, Bioinformatics, Dynamics (mechanics), Ribosomal RNA, Biology, Crystallography, X-Ray, Ribosome, RNA, Ribosomal, Path (graph theory), Transfer RNA, Nucleic Acid Conformation, 30S, Biological system, Ribosomes, Molecular Biology, 50S
Abstract: The crystal structures of the ribosome reveal remarkable complexity and provide a starting set of snapshots with which to understand the dynamics of translation. To augment the static crystallographic models with dynamic information present in crosslink, footprint, and cleavage data, we examined 2691 proximity measurements and focused on the subset that was apparently incompatible with >40 published crystal structures. The measurements from this subset generally involve regions of the structure that are functionally conserved and structurally flexible. Local movements in the crystallographic states of the ribosome that would satisfy biochemical proximity measurements show coherent patterns suggesting alternative conformations of the ribosome. Three different types of data obtained for the two subunits display similar “mismatching” patterns, suggesting that the signals are robust and real. In particular, there is an indication of coherent motion in the decoding region within the 30S subunit and central protuberance and surrounding areas of the 50S subunit. Directions of rearrangements fluctuate around the proposed path of tRNA translocation and the plane parallel to the interface of the two subunits. Our results demonstrate that systematic combination and analysis of noisy, apparently incompatible data sources can provide biologically useful signals about structural dynamics.
Published: 2003
Full Text: View/download PDF

5. Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters

Author: Karin Verspoor, Michael Bada, Benjamin J. Garcia, William A. Baumgartner, Lawrence Hunter, Christopher S. Funk, Christophe Roeder, and K. Bretonnel Cohen
Subjects: Current (mathematics), Databases, Factual, Computer science, Ontology (information science), computer.software_genre, Biochemistry, Concept recognition, Open Biomedical Ontologies, Structural Biology, Data Mining, Molecular Biology, business.industry, Applied Mathematics, Scale (chemistry), Reproducibility of Results, Biological Ontologies, Computer Science Applications, Range (mathematics), Ontology, Artificial intelligence, Data mining, business, computer, Natural language processing, Research Article
Abstract: Background Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem. Results Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented. Conclusions Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14–0.83). Out of the three systems we tested, ConceptMapper is generally the best-performing system; it produces the highest F-measure of seven out of eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters F-measure can be increased by up to 0.4. Not only are best performing parameters presented, but suggestions for choosing the best parameters based on ontology characteristics are presented.
Published: 2014
Full Text: View/download PDF

6. Solution structural studies and low-resolution model of the Schizosaccharomyces pombe sap1 protein

Author: Dirk Walther, Marc Delarue, Benoît Arcangioli, Sebastian Doniach, Michael Bada, Stanford University, Incyte Pharmaceuticals, Virus oncogènes, Institut Pasteur [Paris] (IP)-Centre National de la Recherche Scientifique (CNRS), Biochimie Structurale, and Institut Pasteur [Paris] (IP)
Subjects: Models, Molecular, MESH: Protein Structure, Quaternary, Light, [SDV]Life Sciences [q-bio], Protein Data Bank (RCSB PDB), MESH: Amino Acid Sequence, DNA-protein interactions, Diffusion, MESH: Protein Structure, Tertiary, chemistry.chemical_compound, MESH: Structure-Activity Relationship, Structural Biology, Scattering, Radiation, biology, Chemistry, Small-angle X-ray scattering, MESH: Molecular Weight, Resolution (electron density), MESH: DNA, MESH: Diffusion, dynamic light-scattering, DNA-Binding Proteins, Solutions, MESH: Schizosaccharomyces, MESH: Protein Biosynthesis, Dimerization, MESH: Models, Molecular, Algorithms, Protein Binding, Sequence analysis, Recombinant Fusion Proteins, small angle X-ray scattering, MESH: Algorithms, MESH: Solutions, Structure-Activity Relationship, MESH: Computer Simulation, Schizosaccharomyces, MESH: Recombinant Fusion Proteins, MESH: Protein Binding, Computer Simulation, Amino Acid Sequence, MESH: Scattering, Radiation, Protein Structure, Quaternary, Bifunctional, Molecular Biology, low-resolution modelling, Binding Sites, C-terminus, MESH: Ultracentrifugation, MESH: Schizosaccharomyces pombe Proteins, DNA, biology.organism_classification, MESH: Light, Protein Structure, Tertiary, ultracentrifugation analysis, Molecular Weight, Crystallography, MESH: Binding Sites, MESH: Dimerization, Protein Biosynthesis, Intramolecular force, Schizosaccharomyces pombe, Biophysics, Schizosaccharomyces pombe Proteins, Ultracentrifugation, MESH: DNA-Binding Proteins
Abstract: Sap1 is a DNA-binding protein involved in controlling the mating type switch in fission yeast Schizosaccharomyces pombe. In the absence of any significant sequence similarity with any structurally known protein, a variety of biophysical techniques has been used to probe the solution low-resolution structure of the sap1 protein. First, sap1 is demonstrated to be an unusually elongated dimer in solution by measuring the translational diffusion coefficient with two independent techniques: dynamic light-scattering and ultracentrifugation. Second, sequence analysis revealed the existence of a long coiled-coil region, which is responsible for dimerization. The length of the predicted coiled-coil matches estimates drawn from the hydrodynamic experimental behaviour of the molecule. In addition, the same measurements done on a shorter construct with a coiled-coil region shortened by roughly one-half confirmed the localization of the long coiled-coil region. A crude T-shape model incorporating all this information was built. Third, small-angle X-ray scattering (SAXS) of the free molecule provided additional evidence for the model. In particular, the P(r) curve strikingly demonstrates the existence of long intramolecular distances. Using a novel 3D reconstruction algorithm, a low-resolution 3D model of the protein has been independently constructed that matches the SAXS experimental data. It also fits the translational diffusion coefficient measurements and agrees with the first T-shaped model. This low-resolution model clearly has biologically relevant new functional implications, suggesting that sap1 is a bifunctional protein, with the two active sites being separated by as much as 120 Å; a tetrapeptide repeated four times at the C terminus of the molecule is postulated to be of utmost functional importance.
Published: 2000
Full Text: View/download PDF

7. trans-2-Phenylcyclopropylamine is a substrate for and inactivator of horseradish peroxidase

Author: Molly E. Klein, Robert T. Naismith, Lawrence M. Sayre, Wen-Shan Li, Michael Bada, and Mark D. Tennant
Subjects: Free Radicals, Kinetics, Biophysics, Biochemistry, Medicinal chemistry, Horseradish peroxidase, Cinnamaldehyde, Substrate Specificity, chemistry.chemical_compound, Trans 2 Phenylcyclopropylamine, Structural Biology, Organic chemistry, Phenols, Amines, Enzyme Inhibitors, Molecular Biology, Horseradish Peroxidase, chemistry.chemical_classification, ABTS, biology, Substrate (chemistry), Enzyme, chemistry, biology.protein, Tranylcypromine, Oxidation-Reduction
Abstract: Horseradish peroxidase (HRP) is well known for mediating the electron-transfer oxidation of electron-rich aromatic 'donors' such as phenols and anilines, but has not been described to oxidize aliphatic amines. We here confirm the inability of HRP to oxidize typical aliphatic amines, even those which would exist significantly as free bases at the operative pH. In contrast, trans-2-phenylcyclopropylamine (2-PCPA) is both a substrate (turnover product is cinnamaldehyde) and a time-dependent inactivator of HRP. These activities of 2-PCPA are consistent with either a concerted or rapid sequential one-electron-oxidation/ring-opening to give an intermediate capable of covalent binding to the enzyme. 2-PCPA is the first known example of a simple aliphatic amine which serves as a substrate for HRP under turnover conditions.
Published: 1996
Full Text: View/download PDF

8. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Author: Jinho D. Choi, Miriam R. Eckert, Colin Warner, Christophe Roeder, Michael Bada, Nianwen Xue, Helen L. Johnson, William A. Baumgartner, Lawrence Hunter, Yuriy Malenkiy, Christopher S. Funk, Kevin Bretonnel Cohen, Arrick Lanfranchi, Martha Palmer, and Karin Verspoor
Subjects: Text corpus, Computer science, 02 engineering and technology, lcsh:Computer applications to medicine. Medical informatics, computer.software_genre, Biochemistry, 03 medical and health sciences, Annotation, Resource (project management), Named-entity recognition, Structural Biology, 0202 electrical engineering, electronic engineering, information engineering, Data Mining, lcsh:QH301-705.5, Molecular Biology, Natural Language Processing, 030304 developmental biology, 0303 health sciences, Parsing, business.industry, Applied Mathematics, Lexical analysis, Computer Science Applications, Information extraction, Tokenization (data security), lcsh:Biology (General), lcsh:R858-859.7, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Software, Natural language processing, Sentence, Research Article
Abstract: Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
Published: 2012
Full Text: View/download PDF

9. Mining biochemical information: lessons taught by the ribosome

Author: D. Rey Banatao, Michelle Whirl-Carrillo, Irene S. Gabashvili, Russ B. Altman, and Michael Bada
Subjects: Genetics, Models, Molecular, Databases, Factual, Bacterial ribosome, DNA Footprinting, Experimental data, Proteins, Biology, Data Interpretation, Statistical, Protein Biosynthesis, Data analysis, Humans, RNA, Experimental methods, Biological system, Molecular Biology, Ribosomes, Algorithms, Research Article
Abstract: The publication of the crystal structures of the ribosome offers an opportunity to retrospectively evaluate the information content of hundreds of qualitative biochemical and biophysical studies of these structures. We assessed the correspondence between more than 2,500 experimental proximity measurements and the distances observed in the ribosomal crystals. Although detailed experimental procedures and protocols are unique in almost each analyzed paper, the data can be grouped into subsets with similar patterns and analyzed in an integrative fashion. We found that, for crosslinking, footprinting, and cleavage data, the corresponding distances observed in crystal structures generally did not exceed the maximum values expected (from the estimated length of the agent and maximal anticipated deviations from the conformations found in crystals). However, the distribution of distances had heavier tails than those typically assumed when building three-dimensional models, and the fraction of incompatible distances was greater than expected. Some of these incompatibilities can be attributed to the experimental methods used. In addition, the accuracy of these procedures appears to be sensitive to the different reactivities, flexibilities, and interactions among the components. These findings demonstrate the necessity of a very careful analysis of data used for structural modeling and consideration of all possible parameters that could potentially influence the quality of measurements. We conclude that experimental proximity measurements can provide useful distance information for structural modeling, but with a broad distribution of inferred distance ranges. We also conclude that development of automated modeling approaches would benefit from better annotations of experimental data for detection and interpretation of their significance.
Published: 2002

10. Cross-product extensions of the Gene Ontology

Author: Michael Bada, Tanya Z. Berardini, Jane Lomax, Midori A. Harris, Jennifer I. Deegan, Amelia Ireland, Christopher J. Mungall, and David P. Hill
Subjects: Computer science, Process ontology, Cross product, Ontology (information science), computer.software_genre, Mutually exclusive events, Genetics & Genomics, Gene, 0302 clinical medicine, Ontology components, Databases, Genetic, General Materials Science, 030212 general & internal medicine, 0303 health sciences, Hierarchy (mathematics), Gene ontology, Ontology, Ontology-based data integration, Computer Science Applications, Vocabulary, Controlled, GO, Data mining, Term enrichment, Anatomy, Data integration, Bioinformatics, Logic, Cells, Data_MISCELLANEOUS, Health Informatics, Article, Open Biomedical Ontologies, Cross-products, 03 medical and health sciences, OBO Foundry, Controlled vocabulary, Genetics, Upper ontology, Animals, Humans, Pathways, Molecular Biology, CHEBI, 030304 developmental biology, OWL, Information retrieval, Cell Biology, Reasoning, Genes, Database Management Systems, ComputingMethodologies_GENERAL, Gene expression, computer, 030217 neurology & neurosurgery
Abstract: The Gene Ontology (GO) consists of nearly 30,000 classes for describing the activities and locations of gene products. Manual maintenance of an ontology of this size is a considerable effort, and errors and inconsistencies inevitably arise. Reasoners can be used to assist with ontology development, automatically placing classes in a subsumption hierarchy based on their properties. However, the historic lack of computable definitions within the GO has prevented the use of these tools. In this paper, we present preliminary results of an ongoing effort to normalize the GO by explicitly stating the definitions of compositional classes in a form that can be used by reasoners. These definitions are partitioned into mutually exclusive cross-product sets, many of which reference other OBO Foundry candidate ontologies for chemical entities, proteins, biological qualities and anatomical entities. Using these logical definitions we are gradually beginning to automate many aspects of ontology development, detecting errors and filling in missing relationships. These definitions also enhance the GO by weaving it into the fabric of a wider collection of interoperating ontologies, increasing opportunities for data integration and enhancing genomic analyses.
Full Text: View/download PDF

11. KaBOB: ontology-based semantic integration of biomedical databases

Author: Michael Bada, William A. Baumgartner, Lawrence Hunter, and Kevin Livingston
Subjects: PubMed, Biomedical Research, Databases, Factual, Biomedical, Knowledge representation and reasoning, Computer science, Knowledge Bases, Information Storage and Retrieval, Ontology (information science), computer.software_genre, Semantic data model, Biochemistry, RDF, Open Biomedical Ontologies, Databases, Text mining, Structural Biology, Humans, Semantic integration, Open biomedical ontologies, Semantic Web, Molecular Biology, OWL, Internet, Semantic data integration, Information retrieval, Database, business.industry, Data Collection, Applied Mathematics, Ontology-based data integration, Computational Biology, Biological Ontologies, computer.file_format, Data science, Semantics, Computer Science Applications, Ontology, business, computer, Semantic web, Research Article, Data integration
Abstract: Background The ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Despite ongoing work moving toward shared data formats and linked identifiers, significant problems persist in semantic data integration in order to establish shared identity and shared meaning across heterogeneous biomedical data sources. Results We present five processes for semantic data integration that, when applied collectively, solve seven key problems. These processes include making explicit the differences between biomedical concepts and database records, aggregating sets of identifiers denoting the same biomedical concepts across data sources, and using declaratively represented forward-chaining rules to take information that is variably represented in source databases and integrating it into a consistent biomedical representation. We demonstrate these processes and solutions by presenting KaBOB (the Knowledge Base Of Biomedicine), a knowledge base of semantically integrated data from 18 prominent biomedical databases using common representations grounded in Open Biomedical Ontologies. An instance of KaBOB with data about humans and seven major model organisms can be built using on the order of 500 million RDF triples. All source code for building KaBOB is available under an open-source license. Conclusions KaBOB is an integrated knowledge base of biomedical data representationally based in prominent, actively maintained Open Biomedical Ontologies, thus enabling queries of the underlying data in terms of biomedical concepts (e.g., genes and gene products, interactions and processes) rather than features of source-specific data schemas or file formats. 
KaBOB resolves many of the issues that routinely plague biomedical researchers intending to work with data from multiple data sources and provides a platform for ongoing data integration and development and for formal reasoning over a wealth of integrated biomedical data. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0559-3) contains supplementary material, which is available to authorized users.
Full Text: View/download PDF

12. Concept annotation in the CRAFT corpus

Author: Donald Evans, Dmitry Sitnikov, William A. Baumgartner, Lawrence Hunter, Michael Bada, Judith A. Blake, Kristin Garcia, Miriam R. Eckert, Krista Shipley, K. Bretonnel Cohen, and Karin Verspoor
Subjects: Markup language, Databases, Factual, Computer science, 0206 medical engineering, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Information Storage and Retrieval, 02 engineering and technology, Ontology (information science), lcsh:Computer applications to medicine. Medical informatics, computer.software_genre, Biochemistry, Open Biomedical Ontologies, 03 medical and health sciences, Annotation, Structural Biology, Controlled vocabulary, Data Mining, Sequence Ontology, lcsh:QH301-705.5, Molecular Biology, Natural Language Processing, 030304 developmental biology, 0303 health sciences, Information retrieval, Applied Mathematics, Entrez Gene, Computational Biology, Biomedical text mining, Semantics, Computer Science Applications, Information extraction, ComputingMethodologies_PATTERNRECOGNITION, lcsh:Biology (General), Vocabulary, Controlled, Ontology, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, lcsh:R858-859.7, computer, 020602 bioinformatics, Research Article
Abstract: Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Full Text: View/download PDF
