Systems-based approach for optimization of a scalable bacterial ST mapping assembly-free algorithm

Authors :
João Carlos Gomes-Neto
Jitender S. Deogun
Andrew K. Benson
Natasha Pavlovikj
Publication Year :
2021
Publisher :
Cold Spring Harbor Laboratory, 2021.

Abstract

Epidemiological surveillance of bacterial pathogens requires real-time data analysis with a fast turnaround, aimed at two main outcomes: (1) species-level identification and (2) variant mapping at different levels of genotypic resolution for population-based tracking, in addition to predicting traits such as antimicrobial resistance (AMR). With recent advances in, and the continued dissemination of, whole-genome sequencing technologies, large-scale population-based genotyping of bacterial pathogens has become possible. Since bacterial populations often present a high degree of clonality in the genomic backbone (i.e., low genetic diversity), the choice of genotyping scheme can facilitate the understanding of ancestral relationships and can be used to predict co-inherited traits such as AMR. Multi-locus sequence typing (MLST) fits that purpose and can identify sequence types (STs) based on seven ubiquitous, genome-scattered loci, which aids in genotyping isolates below the species level. ST-based mapping also standardizes genotyping across laboratories and is used worldwide. However, algorithms for inferring STs from Illumina paired-end sequencing data typically rely on genome assembly prior to classification. Genome assembly is computationally intensive and is a bottleneck for speed and scalability, both of which are important in genomic epidemiology. The stringMLST program uses an assembly-free, kmer-based algorithm for inferring STs, which can overcome the speed and scalability bottlenecks. Here we have systematically studied the accuracy and scalability of stringMLST relative to the standard MLST program across a wide array of phylogenetically divergent, public health-relevant bacterial pathogens. Our data show that the optimal kmer length for stringMLST is species-specific and that genome-intrinsic and -extrinsic features can affect the performance and accuracy of the program.
While suitable parameters could be identified for most organisms, there were a few instances in which the program may not be directly deployable in its current form. More importantly, we integrated stringMLST into our freely available and scalable hierarchical-based population genomics platform, ProkEvo, and further demonstrated how the implementation facilitates automated, reproducible bacterial population analysis. The ProkEvo implementation provides a rapidly deployable genomic epidemiology tool for ST mapping along with other pan-genomic data mining strategies, while providing specific guidance on how to optimize stringMLST performance for a wide variety of bacterial pathogens.
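
The assembly-free, kmer-based idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not stringMLST's actual implementation: the function names (`build_index`, `call_alleles`, `lookup_st`), the toy allele sequences, the two-locus scheme (real MLST schemes use seven loci), and the profile table are all hypothetical and chosen only for illustration. The sketch indexes the kmers of each known allele, lets the kmers found in raw reads vote for alleles, and looks up the resulting allele profile to obtain an ST, with the kmer length `k` exposed as the tunable parameter.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k (none if seq is shorter than k)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(alleles, k):
    """Map each kmer to the set of (locus, allele_id) pairs whose sequence contains it."""
    index = defaultdict(set)
    for (locus, allele_id), seq in alleles.items():
        for km in kmers(seq, k):
            index[km].add((locus, allele_id))
    return index


def call_alleles(reads, index, k):
    """Count kmer hits per allele and keep the best-supported allele for each locus."""
    votes = Counter()
    for read in reads:
        for km in kmers(read, k):
            for key in index.get(km, ()):
                votes[key] += 1
    best = {}
    for (locus, allele_id), n in votes.items():
        # Ties keep the first allele seen; a real tool would need a stricter rule.
        if locus not in best or n > votes[(locus, best[locus])]:
            best[locus] = allele_id
    return best


def lookup_st(profile_table, loci, calls):
    """Translate per-locus allele calls into an ST, or None if the profile is unknown."""
    profile = tuple(calls.get(locus) for locus in loci)
    return profile_table.get(profile)


# Hypothetical toy scheme: two loci, two alleles each (assumption, not real data).
ALLELES = {
    ("locA", 1): "ACGTACGTAGCTAGCTTGCA",
    ("locA", 2): "ACGTACGTAGGTAGCTTGCA",
    ("locB", 1): "TTGACCGGATATCCGGTCAA",
    ("locB", 2): "TTGACCGGTTATCCGGTCAA",
}
LOCI = ("locA", "locB")
PROFILES = {(1, 1): 1, (2, 1): 2, (1, 2): 3}  # allele profile -> ST

# Unassembled "reads" drawn from locA allele 2 and locB allele 1.
READS = ["CGTACGTAGGTAGCT", "GACCGGATATCCGG"]

if __name__ == "__main__":
    for k in (5, 16):
        calls = call_alleles(READS, build_index(ALLELES, k), k)
        print(k, calls, lookup_st(PROFILES, LOCI, calls))
```

In this toy setting a short kmer length recovers the profile, while a kmer length longer than the reads yields no hits and therefore no ST. This loosely mirrors the abstract's point that the kmer length has to be tuned to the data, although the species-specific effects reported by the authors are far richer than this example can show.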

Details

Database :
OpenAIRE
Accession number :
edsair.doi...........17fae1c86f6ab0f731c2457cae860928
Full Text :
https://doi.org/10.1101/2021.10.28.466354