CALDERA: Finding all significant de Bruijn subgraphs for bacterial GWAS

Authors :
Laurent Jacob
Fanny Perraudeau
Leandro Lima
Sandrine Dudoit
Hector Roux de Bézieux
Arnaud Mary
Pendulum Therapeutics [San Francisco]
European Bioinformatics Institute [Hinxton] (EMBL-EBI)
EMBL Heidelberg
Baobab
Département PEGASE [LBBE] (PEGASE)
Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE)
Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL)
Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE)
Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)
Equipe de recherche européenne en algorithmique et biologie formelle et expérimentale (ERABLE)
Inria Grenoble - Rhône-Alpes
Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)
University of California [Berkeley]
University of California
Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Inria Lyon
Institut National de Recherche en Informatique et en Automatique (Inria)
University of California [Berkeley] (UC Berkeley)
University of California (UC)
ANR-17-CE23-0011,FAST-BIG,Tests Statistiques efficaces pour les données de grande dimension: application à l'imagerie cérébrale et la génétique(2017)
Source :
Bioinformatics (Oxford, England), vol 38, iss Suppl 1, 2022, ⟨10.1093/bioinformatics/btac238⟩
Publication Year :
2021
Publisher :
HAL CCSD, 2021.

Abstract

Genome-wide association studies (GWAS), which aim to find genetic variants associated with a trait, have been widely used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single nucleotide polymorphisms to mobile genetic elements. Since many bacterial species include genes that are not shared among all strains, this approach avoids the reliance on a common reference genome. However, the same gene can exist in slightly different versions across different strains, leading to diluted effects when trying to detect its association with a phenotype through k-mer-based GWAS. Here we propose to overcome this by testing covariates built from closed connected subgraphs of the de Bruijn graph (DBG) defined over genomic k-mers. These covariates are able to capture polymorphic genes as a single entity, improving k-mer-based GWAS in terms of power and interpretability. As the number of subgraphs is exponential in the number of nodes in the DBG, a method naively testing all possible subgraphs would result in very low statistical power due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypotheses has successfully been used to address both problems in similar contexts. We leverage this concept to test all closed connected subgraphs by proposing a novel enumeration scheme for these objects which fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. We illustrate this on both real and simulated datasets and also demonstrate how considering subgraphs leads to a more powerful and interpretable method. Our method integrates with existing visual tools to facilitate interpretation.
We also provide an implementation of our method, as well as code to reproduce all results, at https://github.com/HectorRDB/Caldera_Recomb.
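
The testability idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the CALDERA implementation: it assumes a one-sided Fisher's exact test on a binary presence/absence covariate, which the abstract does not specify, and the function names `min_attainable_pvalue` and `is_testable` are hypothetical. The point it shows is that the smallest p-value a covariate can ever reach depends only on how many genomes carry it, so covariates whose lower bound exceeds the significance threshold can be pruned without being tested.

```python
from math import comb


def min_attainable_pvalue(x: int, n_cases: int, n: int) -> float:
    """Smallest one-sided Fisher p-value for a covariate present in x of n genomes.

    Assumption for illustration: the most extreme table puts as many
    carriers as possible among the n_cases genomes with the phenotype.
    """
    if not (0 <= x <= n and 0 <= n_cases <= n):
        raise ValueError("need 0 <= x <= n and 0 <= n_cases <= n")
    a_max = min(x, n_cases)  # carriers that can be cases at most
    # Hypergeometric probability of that single most extreme table.
    return comb(n_cases, a_max) * comb(n - n_cases, x - a_max) / comb(n, x)


def is_testable(x: int, n_cases: int, n: int, threshold: float) -> bool:
    """A covariate is testable only if its lower bound can reach the threshold."""
    return min_attainable_pvalue(x, n_cases, n) <= threshold


if __name__ == "__main__":
    # 20 genomes, 10 with the phenotype; vary how many genomes carry the covariate.
    for carriers in (1, 5, 10, 20):
        p_min = min_attainable_pvalue(carriers, 10, 20)
        print(carriers, p_min, is_testable(carriers, 10, 20, 0.05))
```

In this toy setting a covariate carried by a single genome, or by every genome, can never be significant at 0.05, so it need not count toward the multiple testing correction.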

Details

Language :
English
ISSN :
1367-4803 and 1367-4811
Database :
OpenAIRE
Journal :
Bioinformatics (Oxford, England), vol 38, iss Suppl 1, 2022, ⟨10.1093/bioinformatics/btac238⟩
Accession number :
edsair.doi.dedup.....30f5a7e26ccae000c1d4760f1e10d13a
Full Text :
https://doi.org/10.1093/bioinformatics/btac238