Annotating genes and genomes with DNA sequences extracted from biomedical articles
- Source :
- Bioinformatics, Bioinformatics; Vol 27
- Publication Year :
- 2011
- Publisher :
- Oxford University Press, 2011.
Abstract
- Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study. Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments. Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data. Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org. Contact: maximilianh@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
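The extraction step mentioned in the abstract can be illustrated with a minimal sketch. This record does not describe how text2genome itself detects sequences, so the following is not the authors' method: it assumes a simple regular-expression filter, and the function name `extract_dna_candidates` and the 18-base minimum length are hypothetical choices made only for illustration. Mapping the candidates to a genome would be a separate step and is not shown.

```python
import re
from typing import List

# Assumption: a run of at least MIN_LEN unambiguous bases is treated as a
# candidate DNA sequence. The threshold is illustrative, not from the paper.
MIN_LEN = 18
DNA_RUN = re.compile(r"[ACGTacgt]{%d,}" % MIN_LEN)


def extract_dna_candidates(text: str) -> List[str]:
    """Return upper-cased DNA-like runs found in free text, in order."""
    return [match.group(0).upper() for match in DNA_RUN.finditer(text)]


sample = (
    "The forward primer 5'-ATGGCCAAGTTGACCAGTGCCGTT-3' was used "
    "to amplify the tag upstream of the cat gene."
)
candidates = extract_dna_candidates(sample)
print(candidates)
```

Short words made only of the letters A, C, G and T (such as "tag" or "cat") fall below the length threshold and are ignored, which is why a minimum length is used in this sketch.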
- Subjects :
- Statistics and Probability
Chromatin Immunoprecipitation
PubMed
Context (language use)
Computational biology
Biology
Biochemistry
Genome
DNA sequencing
03 medical and health sciences
Data Mining
Molecular Biology
Gene
030304 developmental biology
Genetics
0303 health sciences
Base Sequence
Reverse Transcriptase Polymerase Chain Reaction
030302 biochemistry & molecular biology
Molecular Sequence Annotation
Genome project
DNA
Sequence Analysis, DNA
Original Papers
Computer Science Applications
Identifier
Computational Mathematics
Gene nomenclature
Computational Theory and Mathematics
Genes
Data and Text Mining
Databases, Nucleic Acid
Software
Details
- Language :
- English
- ISSN :
- 1367-4811 and 1367-4803
- Volume :
- 27
- Issue :
- 7
- Database :
- OpenAIRE
- Journal :
- Bioinformatics
- Accession number :
- edsair.doi.dedup.....82e1352669b23b2422f3318cadf97176