1. iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria
- Author
Roux, Simon, Camargo, Antonio Pedro, Coutinho, Felipe H., Dabdoub, Shareef M., Dutilh, Bas E., Nayfach, Stephen, Tritt, Andrew, Arumugam, Manimozhiyan, European Research Council, Deutsche Forschungsgemeinschaft, Alexander von Humboldt Foundation, European Commission, Agencia Estatal de Investigación (España), and Department of Energy (US)
- Subjects
Genome; Bacteria; Agricultural and Veterinary Sciences; General Immunology and Microbiology; General Neuroscience; Bioengineering; Biological Sciences; Archaea; Medical and Health Sciences; General Biochemistry, Genetics and Molecular Biology; Machine Learning; Networking and Information Technology R&D (NITRD); Viruses; Metagenome; 2.2 Factors relating to the physical environment; Viral; Metagenomics; Aetiology; Conserve and sustainably use the oceans, seas and marine resources for sustainable development; Infection; General Agricultural and Biological Sciences; Developmental Biology
- Abstract
26 pages, 5 figures, supporting information: https://doi.org/10.1371/journal.pbio.3002083
Data Availability: All directly relevant data are within the paper and its Supporting Information files. Benchmarking analyses were based on the publicly available IMG/VR v3 database (doi 10.1093/nar/gkaa946; https://genome.jgi.doe.gov/portal/IMG_VR/IMG_VR.home.html).
The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived sequences lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here, we describe iPHoP, a two-step framework that integrates multiple methods to reliably predict host taxonomy at the genus rank for a broad range of viruses infecting bacteria and archaea, while retaining a low false discovery rate. Based on a large dataset of metagenome-derived virus genomes from the IMG/VR database, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses.
BED was supported by the European Research Council (ERC) Consolidator grant 865694: DiversiPHI, the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2051 – Project-ID 390713860, the Alexander von Humboldt Foundation in the context of an Alexander von Humboldt Professorship funded by the German Federal Ministry of Education and Research, and the European Union’s Horizon 2020 research and innovation program, under the Marie Skłodowska-Curie Actions Innovative Training Networks grant agreement no. 955974 (VIROINF).
FHC was supported by a Juan de la Cierva - Incorporación fellowship (Grant IJC2019-039859-I), and had the institutional support of the “Severo Ochoa Centre of Excellence” accreditation (CEX2019-000928-S). This work was supported by the U.S. Department of Energy, Office of Science, Biological and Environmental Research, Early Career Research Program (SR) awarded under UC-DOE Prime Contract DE-AC02-05CH11231. The work conducted by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, is supported by the Office of Science of the U.S. Department of Energy operated under Contract No. DE-AC02-05CH11231 (SR, APC, SN).
- Published
- 2023