Elimination of Foreign Sequences in Eukaryotic Viral Reference Genomes Improves the Accuracy of Virome Analysis

Authors :
Junjie Chen
Yue Sun
Xiaomin Yan
Zilin Ren
Guoshuai Wang
Yuhang Liu
Zihan Zhao
Le Yi
Changchun Tu
Biao He
Source :
mSystems. 7
Publication Year :
2022
Publisher :
American Society for Microbiology, 2022.

Abstract

Foreign contaminant sequences are widespread in public databases and pose a substantial obstacle to genomic analyses. Contamination in viral genome databases is equally notorious but more complicated, and it often leads to questionable results in various applications, particularly virome-based virus detection. Here, we comprehensively screened the largest eukaryotic viral genome collections of GenBank and UniProt for hidden foreign sequences using a scrutiny pipeline that rigorously detects problematic viral sequences (PVSs) originating from hosts, vectors, and laboratory components. In total, 766 nucleotide PVSs and 276 amino acid PVSs with lengths up to 6,605 bp were identified. They were widely distributed across 39 families, many of which include viruses of high public health concern, such as hepatitis C virus, Crimean-Congo hemorrhagic fever virus, and filovirus. The majority of these PVSs are genomic fragments of hosts, including humans and bacteria. However, they cannot simply be regarded as foreign contaminants, since some of them result from natural occurrence or artificial engineering of viruses. Nevertheless, they severely disturb sequence-based analyses such as genome annotation, taxonomic assignment, and virome profiling. We therefore provide a clean version of the eukaryotic viral reference data set with these PVSs removed, which allows more accurate virome analysis in less time than with other comprehensive databases.

Details

ISSN :
2379-5077
Volume :
7
Database :
OpenAIRE
Journal :
mSystems
Accession number :
edsair.doi.dedup.....708769c31b5a344917ee8fa1a030e1f3
Full Text :
https://doi.org/10.1128/msystems.00907-22