A framework for the indexing, querying, clustering, and visualization of microbial genomes for surveillance and outbreak investigation

Authors :
Tremblay-Savard, Olivier (Computer Science)
Van Domselaar, Gary (Medical Microbiology and Infectious Diseases)
Taboada, Eduardo (Food and Human Nutritional Sciences)
Domaratzki, Mike (Computer Science)
Tremblay-Savard, Olivier
Van Domselaar, Gary
Petkau, Aaron
Publication Year :
2022

Abstract

Whole-genome sequencing (WGS) has increasingly become a routine part of monitoring infectious diseases. The genomes of bacteria, viruses, or other infectious agents are sequenced and used to identify nucleotide variants or other genetic differences, providing a wealth of detailed information. This has become particularly relevant with the COVID-19 pandemic, where sequencing millions of viral genomes has been essential to the early identification of new viral lineages. The continuous generation of WGS data at this scale has introduced a number of challenges for efficiently generating timely reports and searching for epidemiologically significant patterns. To address these problems, I have designed and implemented a framework, the Genomics Data Index (https://github.com/apetkau/genomics-data-index), which uses ideas from the field of information retrieval to transform WGS data into a collection of genomics features (nucleotide variants, k-mers, and genes) and index these features for rapid querying. I provide a command-line interface and Python API for incrementally adding new data and querying the index. The query API integrates with existing methods for working with tabular and phylogenetic data to provide a common interface for clustering, visualization, and statistical analysis of microbial genomes. I evaluated this framework using three datasets containing assembled genomes and sequence reads. Indexing assemblies was more sensitive for nucleotide variant detection when there were fewer variants (at 6.77% divergence, sensitivity was 0.948 for assemblies compared to 0.663 for reads), but sensitivity when indexing reads surpassed that of assemblies as the number of variants increased. The software was able to scale to tens of thousands of SARS-CoV-2 genomes (2.17 hours for loading 20,000 genomes) and construct phylogenies consistent with the existing Pangolin lineage system. Constructing phylogenies using nucleotide variants derived from bacterial WGS reads was fo
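The information-retrieval idea mentioned in the abstract, indexing genomic features so that samples can be looked up by feature, can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions: the class `FeatureIndex`, the helper `kmers`, the sample names, and the `ref:position:deletion:insertion` variant strings are all hypothetical and are not the API or data format of the Genomics Data Index.

```python
from collections import defaultdict


class FeatureIndex:
    """Toy inverted index (hypothetical): feature string -> set of sample names."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._samples = set()

    def add_sample(self, sample, features):
        """Incrementally add one sample and the features observed in it."""
        self._samples.add(sample)
        for feature in features:
            self._postings[feature].add(sample)

    def has(self, feature):
        """Return a copy of the set of samples carrying the feature."""
        return set(self._postings.get(feature, set()))

    def has_all(self, features):
        """Return the samples carrying every listed feature (set intersection)."""
        result = set(self._samples)
        for feature in features:
            result &= self._postings.get(feature, set())
        return result


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Assumed example data: variants written as illustrative strings, plus 3-mers.
index = FeatureIndex()
index.add_sample("sampleA", {"ref:10:A:T", "ref:25:G:C"} | kmers("ACGTAC", 3))
index.add_sample("sampleB", {"ref:10:A:T"} | kmers("ACGTTT", 3))

shared = index.has("ref:10:A:T")                       # samples with one variant
both = index.has_all(["ref:10:A:T", "ref:25:G:C"])     # samples with both variants
```

Because each lookup touches only the posting sets for the requested features, a query does not need to rescan every genome, which is the general property an inverted index offers for this kind of search.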

Details

Database :
OAIster
Notes :
English
Publication Type :
Electronic Resource
Accession number :
edsoai.on1346232169
Document Type :
Electronic Resource