Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies
- Source :
- PLoS Computational Biology, Vol 5, Iss 12, p e1000605 (2009), PLoS Computational Biology
- Publication Year :
- 2009
- Publisher :
- Public Library of Science (PLoS), 2009.
Abstract
- Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.
Author Summary
- One of the core elements of modern biological scientific investigation is the universal availability of millions of protein sequences from thousands of different organisms, allowing for exciting new investigations into biological questions. These sequences, found in large primary sequence databases such as GenBank NR or UniProt/TrEMBL, in secondary databases such as the valuable pathways database KEGG, or in highly curated databases such as UniProt/Swiss-Prot, are often annotated by computationally predicted protein functions. The scale of the available predicted function information is enormous but the accuracy of these predictions is essentially unknown. We investigate the critical question of the accuracy of functional predictions in these four public databases. We used 37 well-characterized enzyme families as a gold standard for comparing the accuracy of functional annotations in these databases. We find that function prediction error (i.e., misannotation) is a serious problem in all but the manually curated database Swiss-Prot. We discuss several approaches for mitigating the consequences of these high levels of misannotation.
- Subjects :
- QH301-705.5
Genomics
Biology
computer.software_genre
DNA sequencing
03 medical and health sciences
Cellular and Molecular Neuroscience
Annotation
Protein sequencing
Genetics
Biology (General)
KEGG
Databases, Protein
Molecular Biology
Ecology, Evolution, Behavior and Systematics
030304 developmental biology
0303 health sciences
Ecology
Database
030302 biochemistry & molecular biology
Computational Biology
Genome project
Genetics and Genomics/Bioinformatics
Genetics and Genomics/Gene Function
Biochemistry/Bioinformatics
Computational Theory and Mathematics
Modeling and Simulation
GenBank
Biocatalysis
Database Management Systems
UniProt
computer
Research Article
Details
- ISSN :
- 15537358
- Volume :
- 5
- Database :
- OpenAIRE
- Journal :
- PLoS Computational Biology
- Accession number :
- edsair.doi.dedup.....98f3a1926b7b7646d59757d78709b789
- Full Text :
- https://doi.org/10.1371/journal.pcbi.1000605