1. FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets.
- Author
Zhang, Pinglu; Liu, Huan; Wei, Yanming; Zhai, Yixiao; Tian, Qinzhong; and Zou, Quan
- Subjects
SEQUENCE alignment, NUCLEOTIDE sequence, SOURCE code, RESEARCH personnel, BIOINFORMATICS
- Abstract
Motivation: In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle to align ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which uses the FM-index to extract common seeds and segment the sequences accordingly.

Results: FMAlign2 uses the suffix array to identify maximal exact matches, shifting FMAlign's approach from searching for global chains to searching for partial chains. With a vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result seamlessly. Compared to FMAlign, FMAlign2 markedly improves the segmentation of sequences and significantly reduces running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by enabling them to handle sequences reaching billions of nucleotides in length within an acceptable time frame.

Availability and implementation: Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770. [ABSTRACT FROM AUTHOR]
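
The vertical division idea described in the abstract can be illustrated with a minimal Python sketch. This is not the FMAlign2 implementation: as a simplifying assumption it replaces the suffix array and maximal exact matches with fixed-length k-mers looked up in dictionaries, and it builds the chain greedily rather than optimally. All function names (`unique_kmer_positions`, `common_anchors`, `split_at_anchors`) are hypothetical and chosen for illustration only.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {seq[i:i + k]: i
            for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] == 1}


def common_anchors(seqs, k):
    """Greedily chain k-mers that are unique in every sequence.

    Returns a list of position tuples (one start position per sequence).
    Each anchor starts at or after the end of the previous anchor in
    every sequence, so anchors are colinear and do not overlap.
    """
    tables = [unique_kmer_positions(s, k) for s in seqs]
    shared = set(tables[0]).intersection(*tables[1:])
    # Visit candidates in order of their position in the first sequence.
    candidates = sorted(tuple(t[kmer] for t in tables) for kmer in shared)
    chain = []
    prev_end = tuple(0 for _ in seqs)
    for pos in candidates:
        if all(p >= e for p, e in zip(pos, prev_end)):
            chain.append(pos)
            prev_end = tuple(p + k for p in pos)
    return chain


def split_at_anchors(seqs, chain, k):
    """Cut every sequence into blocks: gap, anchor, gap, ..., anchor, gap."""
    blocks = []
    starts = [0] * len(seqs)
    for pos in chain:
        # Region between the previous anchor and this one (still unaligned).
        blocks.append([s[b:p] for s, b, p in zip(seqs, starts, pos)])
        # The anchor itself, identical in every sequence.
        blocks.append([s[p:p + k] for s, p in zip(seqs, pos)])
        starts = [p + k for p in pos]
    # Trailing region after the last anchor.
    blocks.append([s[b:] for s, b in zip(seqs, starts)])
    return blocks


# Toy input: three short sequences sharing the 8-mer "ACGTACGG".
example = ["AAACGTACGGTTT", "CCACGTACGGGA", "ACGTACGGTC"]
example_chain = common_anchors(example, 8)
example_blocks = split_at_anchors(example, example_chain, 8)
```

In this sketch the even-indexed blocks are the regions that would still need aligning. Because they are independent of one another, each could be passed to an existing MSA tool in parallel and the results concatenated with the anchor blocks, which is the general shape of the strategy the abstract describes.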
- Published
2024