51. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph
- Author
- Adelme Bazin, Stéphane Cruveiller, Claudine Médigue, David Vallenet, Guillaume Gautreau, Rémi Planel, Eduardo P. C. Rocha, Mathieu Gachet, Amandine Perrin, Mathieu Dubois, Christophe Ambroise, Alexandra Calteau, Catherine Matias, Laura Burlot
- Affiliations: Analyse Bio-Informatique pour la Génomique et le Métabolisme (LABGeM), Génomique métabolique (UMR 8030), Genoscope - Centre national de séquençage [Evry] (GENOSCOPE), Direction de Recherche Fondamentale (CEA) (DRF (CEA)), Commissariat à l'énergie atomique et aux énergies alternatives (CEA), Université Paris-Saclay, Université d'Évry-Val-d'Essonne (UEVE), Centre National de la Recherche Scientifique (CNRS); Génomique évolutive des Microbes / Microbial Evolutionary Genomics, Institut Pasteur [Paris] (IP), Centre National de la Recherche Scientifique (CNRS); Collège Doctoral, Sorbonne Université (SU); Collège doctoral [Sorbonne universités]; ED 515 - Complexité du vivant; Laboratoire de Probabilités, Statistique et Modélisation (LPSM (UMR_8001)), Sorbonne Université (SU), Centre National de la Recherche Scientifique (CNRS), Université Paris Cité (UPCité), Université de Paris (UP); Laboratoire de Mathématiques et Modélisation d'Evry (LaMME), Université d'Évry-Val-d'Essonne (UEVE), ENSIIE, Université Paris-Saclay, Centre National de la Recherche Scientifique (CNRS), Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE)
- Funding: ANR-11-INBS-0013, IFB (ex Renabi-IFB), Institut français de bioinformatique (2011). This research was supported in part by the IRTELIS and Phare PhD programs of the French Alternative Energies and Atomic Energy Commission (CEA) for GG and AB respectively, the French Government 'Investissements d’Avenir' programs (namely FRANCE GENOMIQUE [ANR-10-INBS-09-08], the INSTITUT FRANÇAIS DE BIOINFORMATIQUE [ANR-11-INBS-0013], and the Agence Nationale de la Recherche [Projet ANR-16-CE12-29 for EPCR]).
- Acknowledgments: We acknowledge Alexandre Renaux and Jonathan Mercier for their preliminary insights on pangenome graphs. We thank Mélanie Buy for drawing the PPanGGOLiN logo. Finally, we thank Guilhem Royer, Valentin Sabatet, Johan Rollin, Mohammed-Amin Madoui, Tom Delmont, Nicolas Pons and Pierre Peterlongo for all their advice along this work.
- Subjects
Computer science, [SDV]Life Sciences [q-bio], 01 natural sciences, Genome, 010104 statistics & probability, Database and Informatics Methods, Biology (General), Genome Evolution, Genomic organization, Data Management, 0303 health sciences, Markov random field, Applied Mathematics, Simulation and Modeling, Genomics, Genomic Databases, Physical Sciences, Engineering and Technology, Synthetic Biology, Algorithms, Research Article, Computer and Information Sciences, QH301-705.5, Computational biology, Research and Analysis Methods, Molecular Evolution, 03 medical and health sciences, [SDV.BBM.GTP]Life Sciences [q-bio]/Biochemistry, Molecular Biology/Genomics [q-bio.GN], Genetics, Gene family, 0101 mathematics, Gene, Genome size, 030304 developmental biology, Taxonomy, Comparative genomics, Evolutionary Biology, Bacteria, Correction, Biology and Life Sciences, Computational Biology, 15. Life on land, Comparative Genomics, Synthetic Genomics, Genome Analysis, Biological Databases, Multivariate Analysis, Genome dynamics, Genome, Bacterial, Software, Mathematics
- Abstract
The use of comparative genomics for functional, evolutionary, and epidemiological studies requires methods to classify gene families in terms of occurrence in a given species. These methods usually lack multivariate statistical models to infer the partitions and the optimal number of classes, and they do not account for genome organization. We introduce a graph structure to model pangenomes in which nodes represent gene families and edges represent genomic neighborhood. Our method, named PPanGGOLiN, partitions nodes using an Expectation-Maximization algorithm based on a multivariate Bernoulli Mixture Model coupled with a Markov Random Field. This approach takes into account the topology of the graph and the presence/absence of genes in pangenomes to classify gene families into persistent, cloud, and one or several shell partitions. By analyzing the partitioned pangenome graphs of isolate genomes from 439 species and metagenome-assembled genomes from 78 species, we demonstrate that our method is effective in estimating the persistent genome. Interestingly, it shows that the shell genome is a key element for understanding genome dynamics, presumably because it reflects how genes present at intermediate frequencies drive adaptation of species, and that its proportion in genomes is independent of genome size. The graph-based approach proposed by PPanGGOLiN is useful to depict the overall genomic diversity of thousands of strains in a compact structure and provides an effective basis for very large scale comparative genomics. The software is freely available at https://github.com/labgem/PPanGGOLiN.

Author summary: Microorganisms have the greatest biodiversity and evolutionary history on Earth. At the genomic level, this is reflected by a highly variable gene content even among organisms from the same species, which explains the ability of microbes to be pathogenic or to grow in specific environments. We developed a new method called PPanGGOLiN which accurately represents the genomic diversity of a species (i.e. its pangenome) using a compact graph structure. Based on this pangenome graph, we classify genes by a statistical method according to their occurrence in the genomes. This method allowed us to build pangenomes even for uncultivated species at an unprecedented scale. We applied our method to all available genomes in databanks in order to depict the overall diversity of hundreds of species. Overall, our work enables microbiologists to explore and visualize pangenomes like a subway map.
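The graph structure described in the abstract (nodes for gene families, edges for genomic neighborhood) can be illustrated with a minimal sketch. This is not the PPanGGOLiN implementation: the input format, the function name `build_pangenome_graph`, the toy family names, and the choices to treat edges as undirected, to skip self-loops, and to record for each node and edge the set of genomes supporting it are all assumptions made for illustration. The statistical partitioning step (Bernoulli Mixture Model with a Markov Random Field) is not shown.

```python
from collections import defaultdict


def build_pangenome_graph(genomes):
    """Build a toy pangenome graph (illustrative sketch only).

    genomes: dict mapping a genome name to a list of contigs, where each
    contig is an ordered list of gene-family identifiers (hypothetical
    input format, not the one used by the real software).
    Returns (node_genomes, edge_genomes): for each gene family and for
    each adjacency, the set of genomes in which it occurs.
    """
    node_genomes = defaultdict(set)
    edge_genomes = defaultdict(set)
    for name, contigs in genomes.items():
        for contig in contigs:
            for family in contig:
                node_genomes[family].add(name)
            # Consecutive genes on a contig are treated as genomic neighbors.
            for left, right in zip(contig, contig[1:]):
                if left != right:  # assumption: skip self-loops
                    edge = tuple(sorted((left, right)))  # undirected edge
                    edge_genomes[edge].add(name)
    return dict(node_genomes), dict(edge_genomes)


# Toy data: three genomes with one contig each (family names are made up).
genomes = {
    "g1": [["famA", "famB", "famC", "famD"]],
    "g2": [["famA", "famB", "famD"]],
    "g3": [["famA", "famB", "famC", "famE"]],
}

nodes, edges = build_pangenome_graph(genomes)
for family in sorted(nodes):
    print(family, len(nodes[family]))
for edge in sorted(edges):
    print(edge, len(edges[edge]))
```

In this toy example, `famA` and `famB` occur in all three genomes and are adjacent in each, while `famE` occurs in only one. A partitioning method of the kind the abstract describes would use both the presence/absence pattern of each node and the neighborhood information carried by the edges.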
- Published
- 2020