Descriptor: "Sequence Ontology" / Search Limiters: Peer Reviewed - Searchworks@Jio Institute Digital Library Search Results

Your search for "Sequence Ontology" returned 15 results

1. GAD: A Python Script for Dividing Genome Annotation Files into Feature-Based Files

Author: Norhan Yasser and Ahmed Karam
Subjects: Untranslated region, Computer science, Big data, Health Informatics, Genomics, Computational biology, Genome, DNA sequencing, General Biochemistry, Genetics and Molecular Biology, Exon, 03 medical and health sciences, Annotation, Intergenic region, Humans, Sequence Ontology, Gene, 030304 developmental biology, computer.programming_language, Whole genome sequencing, 0303 health sciences, Information retrieval, Genome, Human, business.industry, 030302 biochemistry & molecular biology, Intron, Computational Biology, Molecular Sequence Annotation, Genome project, Gene Annotation, Python (programming language), File format, Computer Science Applications, ComputingMethodologies_PATTERNRECOGNITION, business, computer, Software
Abstract: Nowadays, manipulating and analyzing publicly available genomic datasets has become a daily task in bioinformatics and genomics laboratories. The release of several genome sequencing projects prompts bioinformaticians to develop automated scripts and pipelines that analyze genomic datasets, in particular gene annotation pipelines. Handling genome annotation files with fully featured programs usable by non-developers is necessary; furthermore, accelerating genomic data analysis by reducing genome annotation and sequence files based on specific features is required. Consequently, to extract genome features from GTF or GFF3 files in a precise manner, the GAD script (https://github.com/bio-projects/GAD) provides a simple graphical user interface that can be interpreted by all Python versions installed on different operating systems. The GAD script contains unique entry widgets that can analyze multiple genome sequence and annotation files with a single click. With highly influential coded functions, genome features such as upstream genes, downstream genes, intergenic regions, genes, transcripts, exons, introns, coding sequences, five prime untranslated regions, three prime untranslated regions, and other ambiguous sequence ontology terms will be extracted. The GAD script outputs the results in diverse file formats, such as BED, GTF/GFF3, and FASTA files, which are supported by other bioinformatics programs. Our script could be incorporated into various pipelines in all genomics laboratories with the aim of accelerating data analysis.
Published: 2020
Full Text: View/download PDF

2. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation.

Author: Bolleman, Jerven T., Mungall, Christopher J., Strozzi, Francesco, Baran, Joachim, Dumontier, Michel, Bonnal, Raoul J. P., Buels, Robert, Hoehndorf, Robert, Fujisawa, Takatomo, Katayama, Toshiaki, and Cock, Peter J. A.
Subjects: *NUCLEOTIDE sequence, *AMINO acid sequence, *GLYCANS
Abstract: Background: Nucleotide and protein sequence feature annotations are essential to understand biology on the genomic, transcriptomic, and proteomic level. Using Semantic Web technologies to query biological annotations, there was no standard that described this potentially complex location information as subject-predicate-object triples. Description: We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned "omics" areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations. Conclusions: Our ontology allows users to uniformly describe - and potentially merge - sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federalised SPARQL queries against public SPARQL endpoints and/or local private triple stores. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

3. Evolution of the Sequence Ontology terms and relationships.

Author: Mungall, Christopher J., Batchelor, Colin, and Eilbeck, Karen
Abstract: The Sequence Ontology is an established ontology, with a large user community, for the purpose of genomic annotation. We are reforming the ontology to provide better terms and relationships to describe the features of biological sequence, for both genomic and derived sequence. The SO is working within the guidelines of the OBO Foundry to provide interoperability between SO and the other related OBO ontologies. Here, we report changes and improvements made to SO including new relationships to better define the mereological, spatial and temporal aspects of biological sequence. [Copyright Elsevier]
Published: 2011
Full Text: View/download PDF

4. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome.

Author: Reese, Justin T, Childers, Christopher P, Sundaram, Jaideep P, Dickens, C Michael, Childs, Kevin L, Vile, Donald C, and Elsik, Christine G
Subjects: *CATTLE, *EUKARYOTIC genomes, *COMMUNITY support, *ANNOTATIONS, *GENOMES, *BIOLOGICAL databases, *MULTIAGENT systems
Abstract: Background: A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion: BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions: We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. [ABSTRACT FROM AUTHOR]
Published: 2010
Full Text: View/download PDF

5. Using semantic web rules to reason on an ontology of pseudogenes

Author: Ekta Khurana, Matthew E. Holford, Mark Gerstein, and Kei-Hoi Cheung
Subjects: Statistics and Probability, Databases, Factual, Computer science, Databases and Ontologies, Information Storage and Retrieval, 02 engineering and technology, Ontology (information science), Biochemistry, World Wide Web, Open Biomedical Ontologies, 03 medical and health sciences, Text mining, 0202 electrical engineering, electronic engineering, information engineering, SPARQL, Sequence Ontology, Molecular Biology, Semantic Web, 030304 developmental biology, Internet, 0303 health sciences, Hierarchy, Information retrieval, Hierarchy (mathematics), business.industry, computer.file_format, Semantic reasoner, Ismb 2010 Conference Proceedings July 11 to July 13, 2010, Boston, Ma, Usa, Original Papers, Semantics, Computer Science Applications, Computational Mathematics, Vocabulary, Controlled, Computational Theory and Mathematics, Knowledge base, Ontology, 020201 artificial intelligence & image processing, business, computer, Pseudogenes
Abstract: Motivation: Recent years have seen the development of a wide range of biomedical ontologies. Notable among these is Sequence Ontology (SO) which offers a rich hierarchy of terms and relationships that can be used to annotate genomic data. Well-designed formal ontologies allow data to be reasoned upon in a consistent and logically sound way and can lead to the discovery of new relationships. The Semantic Web Rules Language (SWRL) augments the capabilities of a reasoner by allowing the creation of conditional rules. To date, however, formal reasoning, especially the use of SWRL rules, has not been widely used in biomedicine. Results: We have built a knowledge base of human pseudogenes, extending the existing SO framework to incorporate additional attributes. In particular, we have defined the relationships between pseudogenes and segmental duplications. We then created a series of logical rules using SWRL to answer research questions and to annotate our pseudogenes appropriately. Finally, we were left with a knowledge base which could be queried to discover information about human pseudogene evolution. Availability: The fully populated knowledge base described in this document is available for download from http://ontology.pseudogene.org. A SPARQL endpoint from which to query the dataset is also available at this location. Contact: matthew.holford@yale.edu; mark.gerstein@yale.edu
Published: 2010
Full Text: View/download PDF

6. SOBA: sequence ontology bioinformatics analysis

Author: Karen Eilbeck, Guozhen Fan, and Barry Moore
Subjects: Genomics, Biology, Bioinformatics, Domain (software engineering), Terminology, World Wide Web, User-Computer Interface, 03 medical and health sciences, Annotation, Consistency (database systems), 0302 clinical medicine, Software, Genetics, Sequence Ontology, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries), 030304 developmental biology, Internet, 0303 health sciences, business.industry, Software development, Articles, Data Interpretation, Statistical, business, Sequence Analysis, 030217 neurology & neurosurgery
Abstract: The advent of cheaper, faster sequencing technologies has pushed the task of sequence annotation from the exclusive domain of large-scale multi-national sequencing projects to that of research laboratories and small consortia. The bioinformatics burden placed on these laboratories, some with very little programming experience, can be daunting. Fortunately, there exist software libraries and pipelines designed with these groups in mind, to ease the transition from an assembled genome to an annotated and accessible genome resource. We have developed the Sequence Ontology Bioinformatics Analysis (SOBA) tool to provide a simple statistical and graphical summary of an annotated genome. We envisage its use during annotation jamborees, genome comparison and for use by developers for rapid feedback during annotation software development and testing. SOBA also provides annotation consistency feedback to ensure correct use of terminology within annotations, and guides users to add new terms to the Sequence Ontology when required. SOBA is available at http://www.sequenceontology.org/cgi-bin/soba.cgi.
Published: 2010
Full Text: View/download PDF

7. Using Semantic Web Technologies to Annotate and Align Microarray Designs

Author: Sebastian Szpakowski, Michael Krauthammer, and James P. McCusker
Subjects: Cancer Research, Genomics, Biology, Ontology (information science), computer.software_genre, lcsh:RC254-282, Annotation, semantic web, genomics, SPARQL, ontology, Sequence Ontology, Semantic Web, data integration, computer.programming_language, Information retrieval, Methodology, computer.file_format, lcsh:Neoplasms. Tumors. Oncology. Including cancer and carcinogens, Data science, ComputingMethodologies_PATTERNRECOGNITION, Oncology, annotation, computer, Data integration, RDF query language
Abstract: In this paper, we annotate and align two different gene expression microarray designs using the Genomic ELement Ontology (GELO). GELO is a new ontology that leverages an existing community resource, Sequence Ontology (SO), to create views of genomically-aligned data in a semantic web environment. We start the process by mapping array probes to genomic coordinates. The coordinates represent an implicit link between the probes and multiple genomic elements, such as genes, transcripts, miRNA, and repetitive elements, which are represented using concepts in SO. We then use the RDF Query Language (SPARQL) to create explicit links between the probes and the elements. We show how the approach allows us to easily determine the element coverage and genomic overlap of the two array designs. We believe that the method will ultimately be useful for integration of cancer data across multiple omic studies. The ontology and other materials described in this paper are available at http://krauthammerlab.med.yale.edu/wiki/Gelo.
Published: 2009

8. The Protein Feature Ontology: a tool for the unification of protein feature annotations

Author: Gabrielle A. Reeves, Michele Magrane, Janet M. Thornton, Karen Eilbeck, Luisa Montecchi-Palazzi, Henning Hermjakob, Claire O'Donovan, Andreas Prlić, Midori A. Harris, Rafael C. Jimenez, Sandra Orchard, and Tim Hubbard
Subjects: Statistics and Probability, Proteome, Computer science, computer.internet_protocol, Process ontology, Ontology (information science), Biochemistry, Article, OWL-S, Structural genomics, Open Biomedical Ontologies, World Wide Web, Protein structure, Upper ontology, Databases, Protein, Sequence Ontology, Molecular Biology, Internet, Ontology-based data integration, Suggested Upper Merged Ontology, Computational Biology, Proteins, Computer Science Applications, Computational Mathematics, Vocabulary, Controlled, Computational Theory and Mathematics, Posttranslational modification, Ontology, computer, Ontology alignment, Software
Abstract: The advent of sequencing and structural genomics projects has provided a dramatic boost in the number of protein structures and sequences. Due to the high-throughput nature of these projects, many of the molecules are uncharacterised and their functions unknown. This, in turn, has led to the need for a greater number and diversity of tools and databases providing annotation through transfer based on homology and prediction methods. Though many such tools to annotate protein sequence and structure exist, they are spread throughout the world, often with dedicated individual web pages. This situation does not provide a consensus view of the data and hinders comparison between methods. Integration of these methods is needed. So far this has not been possible since there was no common vocabulary available that could be used as a standard language. A variety of terms could be used to describe any particular feature ranging from different spellings to completely different terms. The Protein Feature Ontology (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS) is a structured controlled vocabulary for features of a protein sequence or structure. It provides a common language for tools and methods to use, so that integration and comparison of their annotations is possible. The Protein Feature Ontology comprises approximately 100 positional terms (located in a particular region of the sequence), which have been integrated into the Sequence Ontology (SO). 40 non-positional terms which describe general protein properties have also been defined and, in addition, post-translational modifications are described by using an already existing ontology, the Protein Modification Ontology (MOD). The Protein Feature Ontology has been used by the BioSapiens Network of Excellence, a consortium comprising 19 partner sites in 14 European countries generating over 150 distinct annotation types for protein sequences and structures.
Published: 2008
Full Text: View/download PDF

9. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration

Author: Susanna-Assunta Sansone, Louis J. Goldberg, Suzanna E. Lewis, Karen Eilbeck, Jonathan Bard, Michael Ashburner, Nigam H. Shah, Alan Ruttenberg, Amelia Ireland, Christopher J. Mungall, Patricia L. Whetzel, Philippe Rocca-Serra, Werner Ceusters, Neocles B. Leontis, Barry Smith, Richard H. Scheuermann, Cornelius Rosse, and William J. Bug
Subjects: Ontology for Biomedical Investigations, Biomedical Engineering, Information Storage and Retrieval, Bioengineering, Biological Ontologies, Ontology (information science), Bioinformatics, Nervous System, Applied Microbiology and Biotechnology, Data science, Basic Formal Ontology, Article, Open Biomedical Ontologies, Vocabulary, Controlled, Terminology as Topic, OBO Foundry, Humans, Molecular Medicine, Nervous System Physiological Phenomena, IDEF5, Sequence Ontology, Biotechnology
Abstract: The value of any kind of data is greatly enhanced when it exists in a form that allows it to be integrated with other data. One approach to integration is through the annotation of multiple bodies of data using common controlled vocabularies or ‘ontologies’. Unfortunately, the very success of this approach has led to a proliferation of ontologies, which itself creates obstacles to integration. The Open Biomedical Ontologies (OBO) consortium is pursuing a strategy to overcome this problem. Existing OBO ontologies, including the Gene Ontology, are undergoing coordinated reform, and new ontologies are being created on the basis of an evolving set of shared principles governing ontology development. The result is an expanding family of ontologies designed to be interoperable and logically well formed and to incorporate accurate representations of biological reality. We describe this OBO Foundry initiative and provide guidelines for those who might wish to become involved.
Published: 2007
Full Text: View/download PDF

10. Sequence Ontology Annotation Guide

Author: Karen Eilbeck and Suzanna E. Lewis
Subjects: Sequence, Information retrieval, Article Subject, lcsh:QH426-470, Computer science, Construct (python library), Ontology (information science), Visualization, Annotation, lcsh:Genetics, lcsh:Biology (General), Controlled vocabulary, Genetics, lcsh:Q, Line (text file), Sequence Ontology, lcsh:Science, Molecular Biology, lcsh:QH301-705.5, Research Article, Biotechnology
Abstract: The Sequence Ontology (SO) [13] aims to unify the way in which we describe sequence annotations, by providing a controlled vocabulary of terms and the relationships between them. Using SO terms to label the parts of sequence annotations greatly facilitates downstream analyses of their contents, as it ensures that annotations produced by different groups conform to a single standard. This greatly facilitates analyses of annotation contents and characteristics, e.g. comparisons of UTRs, alternative splicing, etc. Because SO also specifies the relationships between features, e.g. part_of, kind_of, annotations described with SO terms are also better substrates for validation and visualization software. This document provides a step-by-step guide to producing an SO-compliant file describing a sequence annotation. We illustrate this by using an annotated gene as an example. First we show where the terms needed to describe the gene's features are located in SO and their relationships to one another. We then show line by line how to format the file to construct an SO-compliant annotation of this gene.
Published: 2004

11. Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome

Author: Justin T. Reese, Kevin L. Childs, Christopher P. Childers, Donald C. Vile, Jaideep P. Sundaram, Christine G. Elsik, and C. Michael Dickens
Subjects: lcsh:QH426-470, lcsh:Biotechnology, Statistics as Topic, Vertebrate and Genome Annotation Project, Biology, computer.software_genre, Proteomics, Genome, Database, Annotation, lcsh:TP248.13-248.65, Databases, Genetic, Genetics, Animals, Sequence Ontology, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries), Internet, Molecular Sequence Annotation, Bovine genome, lcsh:Genetics, Cattle, DNA microarray, computer, Biotechnology
Abstract: Background: A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion: BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions: We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence.
Published: 2010

12. Protein Ontology and Community Curation

Author: Arighi, Cecilia
Published: 2009
Full Text: View/download PDF

13. Improving the Sequence Ontology terminology for genomic variant annotation

Author: Fiona Cunningham, Graham R. S. Ritchie, Nicole Ruiz-Schultz, Karen Eilbeck, and Barry Moore
Subjects: Structure (mathematical logic), 0303 health sciences, Computer Networks and Communications, Computer science, Short Report, Health Informatics, Computational biology, computer.software_genre, Genome, Computer Science Applications, Terminology, 03 medical and health sciences, Annotation, 0302 clinical medicine, 030220 oncology & carcinogenesis, Data mining, Sequence variation, Sequence Alteration, Sequence Ontology, computer, 030304 developmental biology, Information Systems
Abstract: Background: The Genome Variant Format (GVF) uses the Sequence Ontology (SO) to enable detailed annotation of sequence variation. The annotation includes SO terms for the type of sequence alteration, the genomic features that are changed and the effect of the alteration. The SO maintains and updates the specification and provides the underlying ontological structure. Methods: A requirements analysis was undertaken to gather terms missing in the SO release at the time, but needed to adequately describe the effects of sequence alteration on a set of variant genomic annotations. We have extended and remodeled the SO to include and define all terms that describe the effect of variation upon reference genomic features in the Ensembl variation databases. Results: The new terminology was used to annotate the human reference genome with a set of variants from both COSMIC and dbSNP. A GVF file containing 170,853 sequence alterations was generated using the SO terminology to annotate the kinds of alteration, the effect of the alteration and the reference feature changed. There are four kinds of alteration and 24 kinds of effect seen in this dataset. (Ensembl Variation annotates 34 different SO consequence terms: http://www.ensembl.org/info/docs/variation/predicted_data.html). Conclusions: We explain the updates to the Sequence Ontology to describe the effect of variation on existing reference features. We have provided a set of annotations using this terminology, and the well-defined GVF specification. We have also provided a provisional exploration of this large annotation dataset.
Full Text: View/download PDF

14. Semantic integration of gene expression analysis tools and data sources using software connectors

Author: Flávia A Miyazaki, Ricardo Z. N. Vêncio, Gabriela D A Guardia, and Cléver Ricardo Guareis de Farias
Subjects: Biology, Ontology (information science), computer.software_genre, Semantics, Task (project management), User-Computer Interface, Software, Databases, Genetic, Genetics, Animals, Humans, Semantic integration, Sequence Ontology, Oligonucleotide Array Sequence Analysis, business.industry, Research, Ontology-based data integration, Computational Biology, Sequence Analysis, DNA, Gene Ontology, Gene Expression Regulation, Data exchange, Data mining, SEQUÊNCIA DO DNA, Software engineering, business, computer, Biotechnology
Abstract: Background: The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated with the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heterogeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. Results: We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data. Conclusions: The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.
Full Text: View/download PDF

15. Concept annotation in the CRAFT corpus

Author: Donald Evans, Dmitry Sitnikov, William A. Baumgartner, Lawrence Hunter, Michael Bada, Judith A. Blake, Kristin Garcia, Miriam R. Eckert, Krista Shipley, K. Bretonnel Cohen, and Karin Verspoor
Subjects: Markup language, Databases, Factual, Computer science, 0206 medical engineering, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Information Storage and Retrieval, 02 engineering and technology, Ontology (information science), lcsh:Computer applications to medicine. Medical informatics, computer.software_genre, Biochemistry, Open Biomedical Ontologies, 03 medical and health sciences, Annotation, Structural Biology, Controlled vocabulary, Data Mining, Sequence Ontology, lcsh:QH301-705.5, Molecular Biology, Natural Language Processing, 030304 developmental biology, 0303 health sciences, Information retrieval, Applied Mathematics, Entrez Gene, Computational Biology, Biomedical text mining, Semantics, Computer Science Applications, Information extraction, ComputingMethodologies_PATTERNRECOGNITION, lcsh:Biology (General), Vocabulary, Controlled, Ontology, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, lcsh:R858-859.7, computer, 020602 bioinformatics, Research Article
Abstract: Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Full Text: View/download PDF
