Gene Unprediction with Spurio: A tool to identify spurious protein sequences.

Authors :
Höps W
Jeffryes M
Bateman A
Source :
F1000Research [F1000Res] 2018 Mar 02; Vol. 7, pp. 261. Date of Electronic Publication: 2018 Mar 02 (Print Publication: 2018).
Publication Year :
2018

Abstract

We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental evidence supporting their translation. Despite the increasing accuracy of gene prediction tools, there likely exists a large number of spurious protein predictions in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code are available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.

Competing Interests: No competing interests were disclosed.
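
The stop codon feature described in the abstract can be illustrated with a small sketch. This is not Spurio's code and does not reproduce its Gaussian process model; it only shows, under stated assumptions, how stop codons in translated homologous DNA might be summarised. The `Hit` class, both function names, and the example sequences are hypothetical, and the sketch assumes each tblastn match is available as an aligned translated subject string in which `*` marks a stop codon and `-` marks an alignment gap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One hypothetical tblastn match (illustrative, not Spurio's data model)."""
    subject_id: str
    translated_subject: str  # aligned translation; '*' = stop codon, '-' = gap


def stops_per_100_residues(hit: Hit) -> float:
    """Stop codons per 100 aligned positions, ignoring alignment gaps."""
    residues = [c for c in hit.translated_subject if c != "-"]
    if not residues:
        return 0.0
    return 100.0 * residues.count("*") / len(residues)


def fraction_of_hits_with_stops(hits: List[Hit]) -> float:
    """Share of homologous matches whose translation contains a stop codon."""
    if not hits:
        return 0.0
    with_stops = sum(1 for h in hits if "*" in h.translated_subject)
    return with_stops / len(hits)


# Made-up example matches for a single query protein.
hits = [
    Hit("contig_A", "MKT*LLVAG-SA"),
    Hit("contig_B", "MKTALLVAGQSA"),
    Hit("contig_C", "MK*ALL*AGQSA"),
    Hit("contig_D", "MKTALLVAGQSA"),
]

for h in hits:
    print(h.subject_id, round(stops_per_100_residues(h), 2))
print("fraction of hits with stops:", fraction_of_hits_with_stops(hits))
```

A query whose homologous regions frequently translate through stop codons is a plausible candidate for a spurious prediction, whereas a conserved open reading frame across homologs points the other way. In the tool itself, features of this kind feed a trained model rather than a fixed threshold.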

Details

Language :
English
ISSN :
2046-1402
Volume :
7
Database :
MEDLINE
Journal :
F1000Research
Publication Type :
Academic Journal
Accession number :
29721311
Full Text :
https://doi.org/10.12688/f1000research.14050.1