1. Metaviromic identification of discriminative genomic features in SARS-CoV-2 using machine learning
- Author
- Jonathan J. Park and Sidi Chen
- Subjects
Coronavirus disease 2019 (COVID-19), Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), viruses, coronavirus, General Decision Sciences, metavirome, Biology, Machine learning, computer.software_genre, medicine.disease_cause, Genome, Epitope, Article, Discriminative model, vaccine, medicine, pathogenicity, genome, Sequence (medicine), Coronavirus, business.industry, SARS-CoV-2, virus diseases, COVID-19, machine learning, Identification (biology), Artificial intelligence, business, computer
- Abstract
The COVID-19 pandemic caused by SARS-CoV-2 has become a major threat across the globe. Here, we developed machine learning approaches to identify key pathogenic regions in coronavirus genomes. We trained and evaluated 7,562,625 models on 3,665 genomes including SARS-CoV-2, MERS-CoV, SARS-CoV, and other coronaviruses of human and animal origins to return quantitative and biologically interpretable signatures at nucleotide and amino acid resolutions. We identified hotspots across the SARS-CoV-2 genome, including previously unappreciated features in spike, RdRp, and other proteins. Finally, we integrated pathogenicity genomic profiles with B cell and T cell epitope predictions for enrichment of sequence targets to help guide vaccine development. These results provide a systematic map of predicted pathogenicity in SARS-CoV-2 that incorporates sequence, structural, and immunologic features, providing an unbiased collection of genetic elements for functional studies. This metavirome-based framework can also be applied for rapid characterization of new coronavirus strains or emerging pathogenic viruses.

The bigger picture: Identifying which genomic regions of the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus are pathogenic remains a major challenge in COVID-19 research. However, there is currently a lack of systematic and unbiased methods for such functional characterization. In this study, we set up a machine learning-based approach to identify which genomic regions distinguish SARS-CoV-2 and other high case fatality rate coronaviruses from other coronaviruses. Discriminative scores were obtained for every nucleotide in the SARS-CoV-2 genome. We then performed a series of evolutionary and structural analyses of candidate hotspots, as well as integrative analyses with predicted B cell and T cell epitopes and emerging variants of concern. Our approach can be extended to other viral genomes or microbial pathogens to gain insights into which sequence features are pathogenic or immunogenic.

To identify key pathogenic regions in coronavirus genomes, this study developed machine learning approaches and provides a systematic map of predicted descriptive genomic features in SARS-CoV-2.
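
The abstract mentions obtaining a discriminative score for every nucleotide position. As a rough illustration only, the sketch below scores each column of a set of aligned sequences by the mutual information between the residue at that position and a binary class label (for example, high case fatality rate group versus other). This is a simple stand-in technique, not the authors' method, which the abstract does not specify; the function name `position_scores` and the toy sequences and labels are hypothetical.

```python
import math
from collections import Counter


def position_scores(seqs, labels):
    """Return, for each aligned position, the mutual information (in bits)
    between the residue at that position and the class label.

    Assumption: `seqs` are already aligned to equal length. This is an
    illustrative stand-in, not the scoring used in the paper.
    """
    n = len(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")

    label_counts = Counter(labels)
    scores = []
    for i in range(length):
        column = [s[i] for s in seqs]
        residue_counts = Counter(column)
        joint_counts = Counter(zip(column, labels))

        # Sum p(r, y) * log2( p(r, y) / (p(r) * p(y)) ) over observed pairs.
        mi = 0.0
        for (residue, label), count in joint_counts.items():
            p_joint = count / n
            p_residue = residue_counts[residue] / n
            p_label = label_counts[label] / n
            mi += p_joint * math.log2(p_joint / (p_residue * p_label))
        scores.append(mi)
    return scores


# Made-up aligned fragments; label 1 = hypothetical high-CFR group, 0 = other.
toy_seqs = ["ACGT", "ACGA", "TCGT", "TCGA"]
toy_labels = [1, 1, 0, 0]
scores = position_scores(toy_seqs, toy_labels)
```

In this toy input only the first column separates the two groups, so it receives the highest score; the conserved columns and the column that varies independently of the label score zero. A real analysis would need a genome alignment, many more sequences, and a model that accounts for phylogenetic relatedness between genomes.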
- Published
- 2021