1. Straintables: An application that extracts sequences from genome assemblies and generates dissimilarity matrices
- Author
- Rangel Alp, Francis Rw, Araujo Gn, and Ferreira CdS
- Subjects
- business.industry, Computer science, Pattern recognition, computer.file_format, Python (programming language), File format, Visualization, Set (abstract data type), Matrix (mathematics), Software, Artificial intelligence, Executable, business, Cluster analysis, computer, computer.programming_language
- Abstract
Background and Objectives: The dissimilarity matrix (DM) is an important component of phylogenetic analysis, and many software packages exist to build and display DMs. However, because such software typically takes sequences in FASTA format as input, extracting and aligning each set of sequences to produce a large number of matrices can be laborious. Additionally, existing software does not facilitate comparing similarity clusters across several DMs built for the same group of individuals from different genomic regions. To meet this need, we designed Straintables to extract the sequences of specific genomic regions from a group of intraspecies genome assemblies and to build dissimilarity matrices from the extracted sequences.

Methods: A Python module with executable scripts was developed for a study of genetic diversity across strains of Toxoplasma gondii; it also serves as a general-purpose system for DM calculation and visualization in preliminary phylogenetic studies. For automatic extraction of region sequences from genome assemblies, we built a system that designs virtual primers from reference sequences located in genomic annotations, then matches those primers against genome files using regex patterns. Extracted sequences are then aligned with Clustal Omega and compared to generate matrices.

Results: Using this software saves the user from manually preparing and aligning the sequences, a process that can be laborious when a large number of assemblies or regions are involved. The automatic sequence extraction can be checked against BLAST results using the extracted sequences as queries; correct results were observed for same-species pools of various organisms. The package also contains a matrix visualization tool focused on cluster visualization, capable of drawing matrices to image files with custom settings, and provides methods for reordering matrices to facilitate comparing clustering patterns across two or more matrices.

Conclusion: Straintables may replace and extend the functionality of existing matrix-oriented phylogenetic software, featuring automatic region extraction from genome assemblies and enhanced matrix visualization capabilities that emphasize cluster identification. The module is open source, available on GitHub (https://github.com/Gab0/straintables) under an MIT license and also as a PyPI package.

Highlights:
  - Simple in-silico protocol for generation, visualization and comparison of dissimilarity matrices.
  - Accurate automatic sequence extraction from multiple genome assemblies by using virtual primers built from reference sequences in an annotation file.
  - Draws matrices as images, with enhanced cluster visualization and customized options.
  - Supports reordering of matrix indices to better visualize clustering pattern conservation across multiple regions.
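
The extraction and comparison steps described in the abstract can be illustrated with a minimal Python sketch. It is not the Straintables implementation. All function names here (`reverse_complement`, `extract_region`, `p_distance`, `dissimilarity_matrix`, `reorder`) are hypothetical, as is the `max_len` parameter. The abstract does not state which dissimilarity measure is used, so the sketch assumes a simple p-distance (the fraction of differing sites, skipping gapped columns). It also assumes the sequences have already been aligned (for example with Clustal Omega) and does not run the alignment itself.

```python
import re

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_region(genome, forward_primer, reverse_primer, max_len=5000):
    """Find the shortest span that starts with the forward primer and ends
    with the reverse complement of the reverse primer (assumed convention).
    Returns the matched sequence, or None if the primers do not match."""
    pattern = re.compile(
        re.escape(forward_primer)
        + r"[ACGTN]{0,%d}?" % max_len  # lazy: stop at the first reverse site
        + re.escape(reverse_complement(reverse_primer))
    )
    match = pattern.search(genome)
    return match.group(0) if match else None


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences,
    ignoring columns where either sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 0.0


def dissimilarity_matrix(aligned):
    """Build a square matrix of pairwise p-distances.
    `aligned` maps strain name -> aligned sequence."""
    names = list(aligned)
    matrix = [[p_distance(aligned[i], aligned[j]) for j in names] for i in names]
    return names, matrix


def reorder(names, matrix, order):
    """Apply one shared strain ordering to a matrix so that matrices built
    from different regions can be compared side by side."""
    idx = [names.index(n) for n in order]
    return list(order), [[matrix[i][j] for j in idx] for i in idx]
```

Applying the same `order` to matrices built from different regions keeps rows and columns aligned across them, which is the precondition for comparing clustering patterns visually.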
- Published
- 2021