1. A model-based clustering method to detect infectious disease transmission outbreaks from sequence variation
- Author
- Art F. Y. Poon and Rosemary M. McCloskey
- Subjects
RNA viruses; 0301 basic medicine; Epidemiology; Computer science; Pathology and Laboratory Medicine; Disease Outbreaks; Database and Informatics Methods; 0302 clinical medicine; Immunodeficiency Viruses; Medicine and Health Sciences; Cluster Analysis; 030212 general & internal medicine; Biology (General); Data Management; 0303 health sciences; Ecology; Infectious disease transmission; Simulation and Modeling; Phylogenetic Analysis; Markov Chains; 3. Good health; Phylogenetics; Computational Theory and Mathematics; Medical Microbiology; Viral Pathogens; Genetic Epidemiology; Modeling and Simulation; Viruses; Data mining; Pathogens; Sequence Analysis; Research Article; Computer and Information Sciences; QH301-705.5; Bioinformatics; Death Rates; Poisson process; Biology; Research and Analysis Methods; Communicable Diseases; Models, Biological; Microbiology; Cellular and Molecular Neuroscience; 03 medical and health sciences; Population Metrics; Model based clustering; CURE data clustering algorithm; Retroviruses; Genetics; Humans; Computer Simulation; Evolutionary Systematics; Sequence variation; Cluster analysis; Microbial Pathogens; Molecular Biology; Ecology, Evolution, Behavior and Systematics; Taxonomy; 030304 developmental biology; Evolutionary Biology; Population Biology; Markov chain; Lentivirus; Organisms; Nonparametric statistics; Computational Biology; Biology and Life Sciences; HIV; Outbreak; Pattern recognition; 030104 developmental biology; Infectious disease (medical specialty); Genetics of Disease; HIV-1; Artificial intelligence; Sequence Alignment
- Abstract
Clustering infections by genetic similarity is a popular technique for identifying potential outbreaks of infectious disease, in part because sequences are now routinely collected for clinical management of many infections. A diverse range of nonparametric clustering methods has been developed for this purpose. These methods are generally intuitive, rapid to compute, and scale readily to large data sets. However, we have found that nonparametric clustering methods can be biased towards identifying clusters of diagnosis (where individuals are sampled sooner post-infection) rather than the clusters of rapid transmission that are meant to be potential foci for public health efforts. We develop a fundamentally new approach to genetic clustering based on fitting a Markov-modulated Poisson process (MMPP), which represents the evolution of transmission rates along the tree relating different infections. We evaluated this model-based method alongside five nonparametric clustering methods using both simulated and actual HIV sequence data sets. For simulated clusters of rapid transmission, the MMPP clustering method obtained higher mean sensitivity (85%) and specificity (91%) than the nonparametric methods. When we applied these clustering methods to published sequences from a study of HIV-1 genetic clusters in Seattle, USA, the MMPP method assigned about half as many individuals (46%) to clusters as the other methods did. Furthermore, the mean internal branch lengths that approximate transmission rates were significantly shorter in clusters extracted using MMPP, but not in clusters extracted by the other methods. We determined that the computing time for the MMPP method scaled linearly with the size of trees, requiring about 30 seconds for a tree of 1,000 tips and about 20 minutes for 50,000 tips on a single computer. This new approach to genetic clustering has significant implications for the application of pathogen sequence analysis to public health, where it is critical to robustly and accurately identify clusters for the most cost-effective deployment of outbreak management and prevention resources.
- Author summary
Many pathogens evolve so rapidly that they accumulate genetic differences within a host before being transmitted to the next host. Consequently, clusters of sampled infections with nearly identical genomes may reveal outbreaks of recent or ongoing transmission. There is rapidly growing interest in using model-free genetic clustering methods to guide public health responses to epidemics in near real time, including HIV, Ebola virus and tuberculosis. However, we show that current methods are relatively ineffective at detecting transmission outbreaks; instead, they are predominantly influenced by how infections are sampled from the population. We describe a fundamentally new approach to genetic clustering that is based on modelling changes in transmission rates during the spread of the epidemic. We use simulated and real pathogen sequence data sets to demonstrate that this model-based approach is substantially more effective for detecting transmission outbreaks, and remains fast enough for real-time applications to large sequence databases.
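For readers unfamiliar with the term, a Markov-modulated Poisson process is a Poisson process whose event rate is set by a hidden continuous-time Markov chain. The sketch below simulates a two-state version on a single time axis. It is only an illustration of the process named in the abstract, not the authors' tree-based fitting method. The function name `simulate_mmpp` and all parameter values are hypothetical.

```python
import random


def simulate_mmpp(rates, switch_rates, t_end, seed=0):
    """Simulate a two-state Markov-modulated Poisson process on [0, t_end).

    rates[i]        -- Poisson event rate while the hidden chain is in state i
    switch_rates[i] -- rate at which the hidden chain leaves state i
    Returns (event_times, state_path); state_path is a list of
    (start_time, state) pairs. Illustrative only: a single time axis,
    not a phylogenetic tree.
    """
    rng = random.Random(seed)
    t, state = 0.0, 0
    events, path = [], [(0.0, 0)]
    while t < t_end:
        # Two competing exponential clocks: "emit an event" and "switch state".
        total = rates[state] + switch_rates[state]
        t += rng.expovariate(total)
        if t >= t_end:
            break
        if rng.random() < rates[state] / total:
            events.append(t)            # event at the current state's rate
        else:
            state = 1 - state           # hidden chain jumps to the other state
            path.append((t, state))
    return events, path


if __name__ == "__main__":
    # Hypothetical values: state 1 emits events ten times faster than state 0.
    events, path = simulate_mmpp(rates=(1.0, 10.0), switch_rates=(0.5, 0.5),
                                 t_end=50.0, seed=1)
    print(len(events), "events;", len(path) - 1, "state switches")
```

In the clustering setting described above, the events would correspond to branching events in the tree and the faster state to a cluster of rapid transmission. Inferring the hidden state from observed events is the harder problem that the paper addresses.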
- Published
- 2017