436 results on '"FM-index"'
Search Results
2. Centrifuger: Lossless Compression of Microbial Genomes for Efficient and Accurate Metagenomic Sequence Classification
- Author
-
Song, Li, Langmead, Ben, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, van Leeuwen, Jan, Series Editor, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Kobsa, Alfred, Series Editor, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Nierstrasz, Oscar, Series Editor, Pandu Rangan, C., Editorial Board Member, Sudan, Madhu, Series Editor, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Weikum, Gerhard, Series Editor, Vardi, Moshe Y, Series Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, and Ma, Jian, editor
- Published
- 2024
- Full Text
- View/download PDF
3. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification
- Author
-
Li Song and Ben Langmead
- Subjects
FM-index ,r-index ,Metagenomic ,Compact data structure ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
- Published
- 2024
- Full Text
- View/download PDF
4. Pfp-fm: an accelerated FM-index
- Author
-
Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie
- Subjects
FM-index ,Pangenomics ,Word-based indexing ,Random access ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing—which takes parameters that let us tune the average length of the phrases—instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in the memory. The source code for $$\texttt {PFP-FM}$$ PFP - FM is available at https://github.com/AaronHong1024/afm .
- Published
- 2024
- Full Text
- View/download PDF
5. Pfp-fm: an accelerated FM-index.
- Author
-
Hong, Aaron, Oliva, Marco, Köppl, Dominik, Bannai, Hideo, Boucher, Christina, and Gagie, Travis
- Subjects
- *
DNA structure , *SOURCE code , *DATA structures , *GENOMES , *SUFFIXES & prefixes (Grammar) - Abstract
FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing—which takes parameters that let us tune the average length of the phrases—instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in the memory. The source code for PFP - FM is available at https://github.com/AaronHong1024/afm. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. WMSA 2: a multiple DNA/RNA sequence alignment tool implemented with accurate progressive mode and a fast win-win mode combining the center star and progressive strategies.
- Author
-
Chen, Juntao, Chao, Jiannan, Liu, Huan, Yang, Fenglong, Zou, Quan, and Tang, Furong
- Subjects
- *
SEQUENCE alignment , *NUCLEOTIDE sequence , *HIERARCHICAL clustering (Cluster analysis) , *SOURCE code , *SEQUENCE analysis , *STAR-branched polymers - Abstract
Multiple sequence alignment is widely used for sequence analysis, such as identifying important sites and phylogenetic analysis. Traditional methods, such as progressive alignment, are time-consuming. To address this issue, we introduce StarTree, a novel method to fast construct a guide tree by combining sequence clustering and hierarchical clustering. Furthermore, we develop a new heuristic similar region detection algorithm using the FM-index and apply the k-banded dynamic program to the profile alignment. We also introduce a win-win alignment algorithm that applies the central star strategy within the clusters to fast the alignment process, then uses the progressive strategy to align the central-aligned profiles, guaranteeing the final alignment's accuracy. We present WMSA 2 based on these improvements and compare the speed and accuracy with other popular methods. The results show that the guide tree made by the StarTree clustering method can lead to better accuracy than that of PartTree while consuming less time and memory than that of UPGMA and mBed methods on datasets with thousands of sequences. During the alignment of simulated data sets, WMSA 2 can consume less time and memory while ranking at the top of Q and TC scores. The WMSA 2 is still better at the time, and memory efficiency on the real datasets and ranks at the top on the average sum of pairs score. For the alignment of 1 million SARS-CoV-2 genomes, the win-win mode of WMSA 2 significantly decreased the consumption time than the former version. The source code and data are available at https://github.com/malabz/WMSA2. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
7. 一种针对大规模 Read Mapping 的高效 DBG 索引方法.
- Author
-
于长永, 李俊杰, 马海涛, and 海宇海
- Subjects
- *
DE Bruijn graph , *ESCHERICHIA coli , *GENOMES , *BIOINFORMATICS - Abstract
In order to answer two questions about de Bruijn graph(DBG) in bioinformatics: ① For any k-mer, answer whether it is the vertex of DBG; ② For any vertex of DBG, answer its adjacency information(incoming edge and outgoing edge), an efficient DBG indexing method for large-scale read mapping is proposed. In this paper, the above two questions are transformed into the exact finding problem of k-mer and (k+1)-mer on non-repetitive multipath, and are solved by using FM-index. Firstly, a given reference sequence is compressed, which is the discovery of non-repetitive multipaths, thereby compressing the presence of large-scale repetitive (k+1)-mers in the sequence. Secondly, the DBG is indexed based on the non-repetitive multipath FM-index. The k-mer is searched whether it appears on the DBG, and if so, the direct predecessor and direct successor nodes of the k-mer are provided to improve the spatial-temporal efficiency. Finally, experiments are performed on the genomes of 62 E. coli strains. The experimental results show that the method proposed can efficiently index the DBG of multi-reference sequences. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
8. Scalable Text Index Construction
- Author
-
Bingmann, Timo, Dinklage, Patrick, Fischer, Johannes, Kurpicz, Florian, Ohlebusch, Enno, Sanders, Peter, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bast, Hannah, editor, Korzen, Claudius, editor, Meyer, Ulrich, editor, and Penschuck, Manuel, editor
- Published
- 2022
- Full Text
- View/download PDF
9. Approximate Pattern Matching Using Search Schemes and In-Text Verification
- Author
-
Renders, Luca, Depuydt, Lore, Fostier, Jan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Rojas, Ignacio, editor, Valenzuela, Olga, editor, Rojas, Fernando, editor, Herrera, Luis Javier, editor, and Ortuño, Francisco, editor
- Published
- 2022
- Full Text
- View/download PDF
10. Practical Space-Efficient Index for Structural Pattern Matching
- Author
-
Kim, Sung-Hwan, Cho, Hwan-Gue, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bazgan, Cristina, editor, and Fernau, Henning, editor
- Published
- 2022
- Full Text
- View/download PDF
11. Efficient privacy-preserving variable-length substring match for genome sequence
- Author
-
Yoshiki Nakagawa, Satsuya Ohata, and Kana Shimizu
- Subjects
Private genome sequence search ,Secure multiparty computation ,Secret sharing ,FM-index ,Suffix array ,LCP array ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract The development of a privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index for a secret-sharing scheme. More precisely, we developed an algorithm that can achieve a secure table lookup in such a way that $$V[V[\ldots V[p_0] \ldots ]]$$ V [ V [ … V [ p 0 ] … ] ] is computed for a given depth of recursion where $$p_0$$ p 0 is an initial position, and V is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that time, communication, and round complexities are not dependent on the table length N, after the query input. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently against the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence with the length of 10 million as the database and a query with the length of 100 and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under the realistic computation/network environment.
- Published
- 2022
- Full Text
- View/download PDF
12. An optimized FM-index library for nucleotide and amino acid search
- Author
-
Tim Anderson and Travis J. Wheeler
- Subjects
FM-index ,String matching ,SIMD vectorization ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is $$\sim $$ ∼ 2–4x faster than SeqAn3 for nucleotide search, and $$\sim $$ ∼ 2–6x faster for amino acid search; it is also $$\sim $$ ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
- Published
- 2021
- Full Text
- View/download PDF
13. Pro k rustean Graph: A substring index for rapid k-mer size analysis.
- Author
-
Park A and Koslickia D
- Abstract
Despite the widespread adoption of k -mer-based methods in bioinformatics, understanding the influence of k -mer sizes remains a persistent challenge. Selecting an optimal k -mer size or employing multiple k -mer sizes is often arbitrary, application-specific, and fraught with computational complexities. Typically, the influence of k -mer size is obscured by the outputs of complex bioinformatics tasks, such as genome analysis, comparison, assembly, alignment, and error correction. However, it is frequently overlooked that every method is built above a well-defined k -mer-based object like Jaccard Similarity, de Bruijn graphs, k -mer spectra, and Bray-Curtis Dissimilarity. Despite these objects offering a clearer perspective on the role of k -mer sizes, the dynamics of k -mer-based objects with respect to k -mer sizes remain surprisingly elusive. This paper introduces a computational framework that generalizes the transition of k -mer-based objects across k -mer sizes, utilizing a novel substring index, the Pro k rustean graph. The primary contribution of this framework is to compute quantities associated with k -mer-based objects for all k -mer sizes, where the computational complexity depends solely on the number of maximal repeats and is independent of the range of k -mer sizes. For example, counting vertices of compacted de Bruijn graphs for k = 1 , … , 100 can be accomplished in mere seconds with our substring index constructed on a gigabase-sized read set. Additionally, we derive a space-efficient algorithm to extract the Pro k rustean graph from the Burrows-Wheeler Transform. It becomes evident that modern substring indices, mostly based on longest common prefixes of suffix arrays, inherently face difficulties at exploring varying k -mer sizes due to their limitations at grouping co-occurring substrings. We have implemented four applications that utilize quantities critical in modern pangenomics and metagenomics. The code for these applications and the construction algorithm is available at https://github.com/KoslickiLab/prokrustean.
- Published
- 2024
- Full Text
- View/download PDF
14. Cache-Efficient FM-Index Variants for Mapping of DNA Sequences
- Author
-
Sitarčík, Jozef, Lucká, Mária, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Fdez-Riverola, Florentino, editor, Rocha, Miguel, editor, Mohamad, Mohd Saberi, editor, Zaki, Nazar, editor, and Castellanos-Garzón, José A., editor
- Published
- 2020
- Full Text
- View/download PDF
15. Efficient privacy-preserving variable-length substring match for genome sequence.
- Author
-
Nakagawa, Yoshiki, Ohata, Satsuya, and Shimizu, Kana
- Subjects
- *
HUMAN genome , *DATABASE searching , *INFORMATION sharing , *GENOMES , *ALGORITHMS - Abstract
The development of a privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index for a secret-sharing scheme. More precisely, we developed an algorithm that can achieve a secure table lookup in such a way that V [ V [ ... V [ p 0 ] ... ] ] is computed for a given depth of recursion where p 0 is an initial position, and V is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that time, communication, and round complexities are not dependent on the table length N, after the query input. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently against the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence with the length of 10 million as the database and a query with the length of 100 and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under the realistic computation/network environment. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
16. On the Complexity of Recognizing Wheeler Graphs.
- Author
-
Gibney, Daniel and Thankachan, Sharma V.
- Subjects
- *
DE Bruijn graph , *POLYNOMIAL time algorithms , *CHARTS, diagrams, etc. , *GRAPH algorithms , *ISOMORPHISM (Mathematics) , *APPROXIMATION algorithms - Abstract
In recent years, several compressed indexes based on variants of the Burrows–Wheeler transform have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index (Ferragina and Manzini in J. ACM 52(4):552–581, https://doi.org/10.1145/1082036.1082039, 2005). As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs (Gagie et al. in Theor Comput Sci 698:67–78, https://doi.org/10.1016/j.tcs.2017.06.016, 2017). Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs, and that Wheeler graphs can be indexed in a space-efficient way. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We show: The problem of recognizing whether a given graph G = (V , E) is a Wheeler graph is NP-complete for any edge label alphabet of size σ ≥ 2 , even when G is a DAG. This holds even on a restricted subset of graphs called d-NFAs for d ≥ 5 . This is in contrast to recent results demonstrating the problem can be solved in polynomial time for d-NFAs where d ≤ 2 . We also show that the recognition problem can be solved in linear time for σ = 1 on graphs without self-loops; There exists an 2 e log σ + O (n + e) time exact algorithm where n = | V | and e = | E | . This algorithm relies on graph isomorphism being computable in strictly sub-exponential time; We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to identify the smallest set of edges that have to be removed from a graph to obtain a Wheeler graph. We show WGV is APX-hard, even when G is a DAG, implying there exists a constant C > 1 for which there is no C-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C > 1 , it is NP-hard to find a C-approximation, implying WGV is not in APX; We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler Graph (the dual of WGV). In contrast to WGV, we give an O (σ) -approximation algorithm for the WS problem, implying it is in APX for σ = O (1) . The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial-time solvable, raising the question of which properties determine this problem's difficulty. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
17. Applications of V-Order: Suffix Arrays, the Burrows-Wheeler Transform & the FM-index
- Author
-
Alatabbi, Ali, Daykin, Jacqueline W., Mhaskar, Neerja, Rahman, M. Sohel, Smyth, W. F., Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Pandu Rangan, C., Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Das, Gautam K., editor, Mandal, Partha S., editor, Mukhopadhyaya, Krishnendu, editor, and Nakano, Shin-ichi, editor
- Published
- 2019
- Full Text
- View/download PDF
18. novel fast multiple nucleotide sequence alignment method based on FM-index.
- Author
-
Liu, Huan, Zou, Quan, and Xu, Yun
- Subjects
- *
NUCLEOTIDE sequence , *SEQUENCE alignment , *HUMAN genome , *SOURCE code - Abstract
Multiple sequence alignment (MSA) is fundamental to many biological applications. But most classical MSA algorithms are difficult to handle large-scale multiple sequences, especially long sequences. Therefore, some recent aligners adopt an efficient divide-and-conquer strategy to divide long sequences into several short sub-sequences. Selecting the common segments (i.e. anchors) for division of sequences is very critical as it directly affects the accuracy and time cost. So, we proposed a novel algorithm, FMAlign, to improve the performance of multiple nucleotide sequence alignment. We use FM-index to extract long common segments at a low cost rather than using a space-consuming hash table. Moreover, after finding the longer optimal common segments, the sequences are divided by the longer common segments. FMAlign has been tested on virus and bacteria genome and human mitochondrial genome datasets, and compared with existing MSA methods such as MAFFT, HAlign and FAME. The experiments show that our method outperforms the existing methods in terms of running time, and has a high accuracy on long sequence sets. All the results demonstrate that our method is applicable to the large-scale nucleotide sequences in terms of sequence length and sequence number. The source code and related data are accessible in https://github.com/iliuh/FMAlign. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
19. Data Structures to Represent a Set of k-long DNA Sequences.
- Author
-
CHIKHI, RAYAN, HOLUB, JAN, and MEDVEDEV, PAUL
- Abstract
The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
20. An optimized FM-index library for nucleotide and amino acid search.
- Author
-
Anderson, Tim and Wheeler, Travis J.
- Subjects
- *
OPTICAL disks , *DATA structures , *AMINO acids , *AMINO acid sequence , *BIOINFORMATICS software , *SEQUENCE analysis - Abstract
Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index's suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3's FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ∼ 2–4x faster than SeqAn3 for nucleotide search, and ∼ 2–6x faster for amino acid search; it is also ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
21. OffScan: a universal and fast CRISPR off-target sites detection tool
- Author
-
Yingbo Cui, Xiangke Liao, Shaoliang Peng, Tao Tang, Chun Huang, and Canqun Yang
- Subjects
CRISPR/Cas ,sgRNA ,Off-target ,FM-index ,Biotechnology ,TP248.13-248.65 ,Genetics ,QH426-470 - Abstract
Abstract Background The Type II clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated proteins (Cas) is a powerful genome editing technology, which is more and more popular in gene function analysis. In CRISPR/Cas, RNA guides Cas nuclease to the target site to perform DNA modification. Results The performance of CRISPR/Cas depends on well-designed single guide RNA (sgRNA). However, the off-target effect of sgRNA leads to undesired mutations in genome and limits the use of CRISPR/Cas. Here, we present OffScan, a universal and fast CRISPR off-target detection tool. Conclusions OffScan is not limited by the number of mismatches and allows custom protospacer-adjacent motif (PAM), which is the target site by Cas protein. Besides, OffScan adopts the FM-index, which efficiently improves query speed and reduce memory consumption.
- Published
- 2020
- Full Text
- View/download PDF
22. A Review on Sequence Alignment Algorithms for Short Reads Based on Next-Generation Sequencing
- Author
-
Jeongkyu Kim, Mingeun Ji, and Gangman Yi
- Subjects
NGS ,sequence alignment ,read alignment ,FM-index ,hashing ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
With recent advances in next-generation sequencing (NGS) technology, large volumes of data have been produced in the form of short reads. Sequence assembly involves using initial short reads to produce progressively longer contigs, and then using scaffolds to produce the final sequence. These processes each require evaluation of the extent of homology between different sequences. However, because the NGS platforms currently being developed are diverse, and the data being produced are of different sizes and read lengths, numerous algorithms are being developed with unique methodologies to process this complex data. It is difficult for biologists to manipulate the different features involved in these algorithms. Therefore, to reduce experimental trial-and-error, different strategies are required depending on the performance and purpose of the optimal algorithm, thereby facilitating understanding of algorithm methodologies and effective use of their various features. This study is a review of the different short read alignment algorithms and NGS platforms that have been developed to date, in order to aid efficient selection of algorithms for reference sequences and mapping of DNA data.
- Published
- 2020
- Full Text
- View/download PDF
23. Kohdista: an efficient method to index and query possible Rmap alignments
- Author
-
Martin D. Muggli, Simon J. Puglisi, and Christina Boucher
- Subjects
Optical mapping ,Index based data structures ,FM-index ,Graph algorithms ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Background Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging. Results We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Conclusion we demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.
- Published
- 2019
- Full Text
- View/download PDF
24. b-move: faster bidirectional character extensions in a run-length compressed index.
- Author
-
Depuydt L, Renders L, de Vyver SV, Veys L, Gagie T, and Fostier J
- Abstract
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
- Published
- 2024
- Full Text
- View/download PDF
25. Enabling fast and energy-efficient FM-index exact matching using processing-near-memory.
- Author
-
Herruzo, Jose M., Fernandez, Ivan, González-Navarro, Sonia, and Plata, Oscar
- Subjects
- *
PATTERN matching , *SEQUENCE alignment , *RADIO frequency modulation , *ALGORITHMS - Abstract
Memory bandwidth and latency constitutes a major performance bottleneck for many data-intensive applications. While high-locality access patterns take advantage of the deep cache hierarchies available in modern processors, unpredictable low-locality patterns cause a significant part of the execution time to be wasted waiting for data. An example of those memory bound applications is the exact matching algorithm based on FM-index, used in some well-known sequence alignment applications. Processing-Near-Memory (PNM) has been proposed as a strategy to overcome the memory wall problem, by placing computation close to data, speeding up memory bound workloads by reducing data movements. This paper presents a performance and energy evaluation of two classes of processor architectures when executing the FM-index exact matching algorithm, as a reference algorithm for exact sequence alignment. One architecture class is processor-centric, based on complex cores and DDR3/4 SDRAM memory technology. The other architecture class is memory-centric, based on simple cores and ultra-high-bandwidth hybrid memory cube (HMC) 3D-stacked memory technologies. The results show that the PNM solution improves performance between 1.26 × and 3.7 × and the energy consumption per operation is reduced between 21 × and 40 × . In addition, a synthetic benchmark RANDOM was developed that mimics the memory access pattern of the FM-index exact matching algorithm, but with a user configurable operational intensity. This benchmark allows us to extend the evaluation to the class of algorithms with similar memory behaviour but running over a range of operational intensity values. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
26. Computation of the suffix array, Burrows-Wheeler transform and FM-index in V-order.
- Author
-
Daykin, Jacqueline W., Mhaskar, Neerja, and Smyth, W.F.
- Subjects
- *
SUFFIXES & prefixes (Grammar) , *LINEAR orderings , *FACTORIZATION - Abstract
V -order is a total order on strings that determines an instance of Unique Maximal Factorization Families (UMFFs), a generalization of Lyndon words. The fundamental V -comparison of strings can be done in linear time and constant space. V -order has been proposed as an alternative to lexicographic order (lexorder) in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform (BWT). In line with the recent interest in the connection between suffix arrays and Lyndon factorization, in this paper we obtain similar results for the V -order factorization. Indeed, we show that the results describing the connection between suffix arrays and Lyndon factorization are matched by analogous V -order processing. We also describe a methodology for efficiently computing the FM-Index in V -order, as well as V -order substring pattern matching using backward search. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
27. Better quality score compression through sequence-based quality smoothing
- Author
-
Yoshihiro Shibuya and Matteo Comin
- Subjects
FASTQ compression ,BWT ,FM-Index ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Motivation Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data unfortunately not followed by an exponential growth of storage, leading to the necessity of compression. Most of the entropy of NGS data lies in the quality values associated to each read. Those values are often more diversified than necessary. Because of that, many tools such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analysis like SNP calling. Results We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers and an effective smoothing algorithm to maintain high precision for SNP calling pipelines, while reducing quality scores entropy. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality scores compression by smoothing leading to improved compressibility of FASTQ files. The succinct k-mers dictionary allows YALFF to run on consumer computers with only 5.7 GB of available free RAM. YALFF smoothing algorithm can improve genotyping accuracy while using less resources. Availability https://github.com/yhhshb/yalff
- Published
- 2019
- Full Text
- View/download PDF
28. FM-index for Dummies
- Author
-
Grabowski, Szymon, Raniszewski, Marcin, Deorowicz, Sebastian, Diniz Junqueira Barbosa, Simone, Series editor, Chen, Phoebe, Series editor, Du, Xiaoyong, Series editor, Filipe, Joaquim, Series editor, Kara, Orhun, Series editor, Kotenko, Igor, Series editor, Liu, Ting, Series editor, Sivalingam, Krishna M., Series editor, Washio, Takashi, Series editor, Kozielski, Stanisław, editor, Mrozek, Dariusz, editor, Kasprowski, Paweł, editor, Małysiak-Mrozek, Bożena, editor, and Kostrzewa, Daniel, editor
- Published
- 2017
- Full Text
- View/download PDF
29. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification.
- Author
-
Song L and Langmead B
- Subjects
- Software, Genome, Microbial, Genome, Bacterial, Sequence Analysis, DNA methods, Data Compression methods, Metagenomics methods
- Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels., (© 2024. The Author(s).)
- Published
- 2024
- Full Text
- View/download PDF
30. Short Read Alignment Based on Maximal Approximate Match Seeds
- Author
-
Wei Quan, Dengfeng Guan, Guangri Quan, Bo Liu, and Yadong Wang
- Subjects
whole-genome resequencing ,next-generation sequencing ,repeats ,sequence alignment ,FM-index ,Biology (General) ,QH301-705.5 - Abstract
Sequence alignment is a critical step in many critical genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, the alignment performance is largely affected by repetitive sequences in the reference genome, which extensively exist in species from bacteria to mammals. Aligning repeating sequences might lead to tremendous candidate locations, bringing about a challenging computational burden. Thus, most alignment tools prefer to simply discard highly repetitive seeds, but this may cause the true alignment to be missed. Using maximal approximate matches (MAMs) as seeds is an option, but MEMs seeds may fail due to sequencing errors or genomic variations in MEMs seeds. Here, we propose a novel sequence alignment algorithm, named MAM, which can efficiently align short DNA sequences. MAM first builds a modified Burrows-Wheeler transform (BWT) structure of a reference genome to accelerate approximate seed matching. Then, MAM uses maximal approximate matches (MAMs) seeds to reduce the candidate locations. Finally, MAM applies an affine-gap-penalty dynamic programming to extend MAMs seeds. Experimental results on simulated and real sequencing datasets show that MAM achieves better performance in speed than other state-of-the-art alignment tools. The source code is available at https://github.com/weiquan/mam.
- Published
- 2020
- Full Text
- View/download PDF
31. HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies
- Author
-
Qi Zheng, Casey Bartow-McKenney, Jacquelyn S. Meisel, and Elizabeth A. Grice
- Subjects
Microbiome ,16S rRNA gene ,FM-index ,HMM profile alignment ,Phylogenetic placement ,Taxonomic assignment ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Culture-independent analysis of microbial communities frequently relies on amplification and sequencing of the prokaryotic 16S ribosomal RNA gene. Typical analysis pipelines group sequences into operational taxonomic units (OTUs) to infer taxonomic and phylogenetic relationships. Here, we present HmmUFOtu, a novel tool for processing microbiome amplicon sequencing data, which performs rapid per-read phylogenetic placement, followed by phylogenetically informed clustering into OTUs and taxonomy assignment. Compared to standard pipelines, HmmUFOtu more accurately and reliably recapitulates microbial community diversity and composition in simulated and real datasets without relying on heuristics or sacrificing speed or accuracy.
- Published
- 2018
- Full Text
- View/download PDF
32. FMLRC: Hybrid long read error correction using an FM-index
- Author
-
Jeremy R. Wang, James Holt, Leonard McMillan, and Corbin D. Jones
- Subjects
de novo assembly ,Hybrid error correction ,Long read ,Pacbio ,BWT ,FM-Index ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limits their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging “hybrid” assemblies that use long reads for scaffolding and short reads for accuracy. Results We describe a novel method leveraging a multi-string Burrows-Wheeler Transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We demonstrate that our method efficiently produces significantly more high quality corrected sequence than existing hybrid error-correction methods. We also show that our method produces more contiguous assemblies, in many cases, than existing state-of-the-art hybrid and long-read only de novo assembly methods. Conclusion Our method accurately corrects long read sequence data using complementary short reads. We demonstrate higher total throughput of corrected long reads and a corresponding increase in contiguity of the resulting de novo assemblies. Improved throughput and computational efficiency than existing methods will help better economically utilize emerging long read sequencing technologies.
- Published
- 2018
- Full Text
- View/download PDF
33. Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor.
- Author
-
Herruzo, Jose M., Gonzalez-Navarro, Sonia, Ibanez-Marin, Pablo, Vinals-Yufera, Victor, Alastruey-Benede, Jesus, and Plata, Oscar
- Abstract
FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
34. Refining the r-index.
- Author
-
Bannai, Hideo, Gagie, Travis, and I, Tomohiro
- Subjects
- *
WASTE products , *GENOMES , *DATABASES - Abstract
Gagie, Navarro and Prezza's r -index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica , 2017), which inspired the r -index and plays an important role in its implementation. We then show how to update the r -index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza's algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the r -index such that, given a new genome and fast random access to the database, we can quickly compute the matching statistics and maximal exact matches of the new genome with respect to the database. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
35. OffScan: a universal and fast CRISPR off-target sites detection tool.
- Author
-
Cui, Yingbo, Liao, Xiangke, Peng, Shaoliang, Tang, Tao, Huang, Chun, and Yang, Canqun
- Subjects
- *
GENOME editing , *DNA , *RNA - Abstract
Background: The Type II clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated proteins (Cas) is a powerful genome editing technology, which is more and more popular in gene function analysis. In CRISPR/Cas, RNA guides Cas nuclease to the target site to perform DNA modification. Results: The performance of CRISPR/Cas depends on well-designed single guide RNA (sgRNA). However, the off-target effect of sgRNA leads to undesired mutations in genome and limits the use of CRISPR/Cas. Here, we present OffScan, a universal and fast CRISPR off-target detection tool. Conclusions: OffScan is not limited by the number of mismatches and allows custom protospacer-adjacent motif (PAM), which is the target site by Cas protein. Besides, OffScan adopts the FM-index, which efficiently improves query speed and reduce memory consumption. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
36. Kohdista: an efficient method to index and query possible Rmap alignments.
- Author
-
Muggli, Martin D., Puglisi, Simon J., and Boucher, Christina
- Subjects
- *
SINGLE molecules , *DATA structures , *GRAPH algorithms , *DATA mapping - Abstract
Background: Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging. Results: We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. Conclusion: we demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
37. Compressed Suffix Array
- Author
-
Belazzougui, Djamal, Mäkinen, Veli, Valenzuela, Daniel, and Kao, Ming-Yang, editor
- Published
- 2016
- Full Text
- View/download PDF
38. A Natural Encoding of Genetic Variation in a Burrows-Wheeler Transform to Enable Mapping and Genome Inference
- Author
-
Maciuca, Sorina, del Ojo Elias, Carlos, McVean, Gil, Iqbal, Zamin, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Frith, Martin, editor, and Storm Pedersen, Christian Nørgaard, editor
- Published
- 2016
- Full Text
- View/download PDF
39. Prefix Sorting DFAs: A Recursive Algorithm
- Author
-
Nicola Cotumaccio, Cotumaccio, Nicola, Nicola Cotumaccio, and Cotumaccio, Nicola
- Abstract
In the past thirty years, numerous algorithms for building the suffix array of a string have been proposed. In 2021, the notion of suffix array was extended from strings to DFAs, and it was shown that the resulting data structure can be built in O(m² + n^{5/2}) time, where n is the number of states and m is the number of edges [SODA 2021]. Recently, algorithms running in O(mn) and O(n²log n) time have been described [CPM 2023]. In this paper, we improve the previous bounds by proposing an O(n²) recursive algorithm inspired by Farach’s algorithm for building a suffix tree [FOCS 1997]. To this end, we provide insight into the rich lexicographic and combinatorial structure of a graph, so contributing to the fascinating journey which might lead to solve the long-standing open problem of building the suffix tree of a graph.
- Published
- 2023
- Full Text
- View/download PDF
40. Acceleration of FM-Index Queries Through Prefix-Free Parsing
- Author
-
Aaron Hong and Marco Oliva and Dominik Köppl and Hideo Bannai and Christina Boucher and Travis Gagie, Hong, Aaron, Oliva, Marco, Köppl, Dominik, Bannai, Hideo, Boucher, Christina, Gagie, Travis, Aaron Hong and Marco Oliva and Dominik Köppl and Hideo Bannai and Christina Boucher and Travis Gagie, Hong, Aaron, Oliva, Marco, Köppl, Dominik, Bannai, Hideo, Boucher, Christina, and Gagie, Travis
- Abstract
FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [Ferragina and Fischer, 2007] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [Deng et al., 2022] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing - which takes parameters that let us tune the average length of the phrases - instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38. And was consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it is very clear that our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory.
- Published
- 2023
- Full Text
- View/download PDF
41. An efficient error correction algorithm using FM-index
- Author
-
Yao-Ting Huang and Yu-Wen Huang
- Subjects
FM-index ,Next-generation sequencing ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background High-throughput sequencing offers higher throughput and lower cost for sequencing a genome. However, sequencing errors, including mismatches and indels, may be produced during sequencing. Because, errors may reduce the accuracy of subsequent de novo assembly, error correction is necessary prior to assembly. However, existing correction methods still face trade-offs among correction power, accuracy, and speed. Results We develop a novel overlap-based error correction algorithm using FM-index (called FMOE). FMOE first identifies overlapping reads by aligning a query read simultaneously against multiple reads compressed by FM-index. Subsequently, sequencing errors are corrected by k-mer voting from overlapping reads only. The experimental results indicate that FMOE has highest correction power with comparable accuracy and speed. Our algorithm performs better in long-read than short-read datasets when compared with others. The assembly results indicated different algorithms has its own strength and weakness, whereas FMOE is good for long or good-quality reads. Conclusions FMOE is freely available at https://github.com/ythuang0522/FMOC .
- Published
- 2017
- Full Text
- View/download PDF
42. Secure Wavelet Matrix: Alphabet-Friendly Privacy-Preserving String Search for Bioinformatics.
- Author
-
Sudo, Hiroki, Jimbo, Masanobu, Nuida, Koji, and Shimizu, Kana
- Abstract
Biomedical data often includes personal information, and the technology is demanded that enables the searching of such sensitive data while protecting privacy. We consider a case in which a server has a text database and a user searches the database to find substring matches. The user wants to conceal his/her query and the server wants to conceal the database except for the search results. The previous approach for this problem is based on a linear-time algorithm in terms of alphabet size $\mathbf{|\Sigma |}$|Σ|, and it cannot search on the database of large alphabet such as biomedical documents. We present a novel algorithm that can search a string in logarithmic time of $\mathbf{|\Sigma |}$|Σ|. In our algorithm, named secure wavelet matrix (sWM), we use an additively homomorphic encryption to build an efficient data structure called a wavelet matrix. In an experiment using a simulated string of length 10,000 whose alphabet size ranges from 4 to 1024, the run time of the sWM was up to around two orders of magnitude faster than that of the previous method. sWM enables the searching of a private database efficiently and thus it will facilitate utilizing sensitive biomedical information. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
43. Fixed Block Compression Boosting in FM-Indexes: Theory and Practice.
- Author
-
Gog, Simon, Kärkkäinen, Juha, Kempa, Dominik, Petri, Matthias, and Puglisi, Simon J.
- Subjects
- *
PATTERN matching , *THEORY-practice relationship , *DATA structures - Abstract
The FM index (Ferragina and Manzini in J ACM 52(4):552–581, 2005) is a widely-used compressed data structure that stores a string T in a compressed form and also supports fast pattern matching queries. In this paper, we describe new FM-index variants that combine nice theoretical properties, simple implementation and improved practical performance. Our main theoretical result is a new technique called fixed block compression boosting, which is a simpler and faster alternative to optimal compression boosting and implicit compression boosting used in previous FM-indexes. We also describe several new techniques for implementing fixed-block boosting efficiently, including a new, fast, and space-efficient implementation of wavelet trees. Our extensive experiments show the new indexes to be consistently fast and small relative to the state-of-the-art, and thus they make a good "off-the-shelf" choice for many applications. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
44. Ispravak pogrešaka u dugačkim očitanjima korištenjem FM-indeksa
- Author
-
Čeple, Kristijan and Domazet-Lošo, Mirjana
- Subjects
FMLRC ,TEHNIČKE ZNANOSTI. Računarstvo ,kratka očitanja ,sekvenciranje ,sequencing ,FM indeks ,long read ,TECHNICAL SCIENCES. Computing ,de Bruijn graf ,duga očitanja ,genome assembly ,short read ,sastavljanje genoma ,dugo očitanje ,kratko očitanje ,kmer ,FM-index ,de Bruijn graph - Abstract
Ovaj diplomski rad se bavi korištenjem FM-indeksa izgrađenog nad skupom kratkih očitanja u svrhu ispravljanja skupa dugih očitanja. Pomoću FM-indeksa implicitno se reprezentiraju svi de Bruijn grafovi za k-mer proizvoljne duljine. Pomoću 2 prolaska se ispravljaju duga očitanja, a zbog toga što su duga očitanja međusobno neovisna se ispravljanja mogu i paralelizirati. Prvi prolaz koristi manji k-mer i ispravlja većinu grešaka (orijentiranost na ispravak detalja), dok drugi prolaz koristi duži K-mer i ispravlja dugačke i/ili repetitivne regije. Nakon ispravljanja dugačkih očitanja, ona su puno veće točnosti, te se mogu dalje koristiti za sastavljanje genoma, koje je puno jednostavnije i efektivnije uspoređeno sa sastavljanjem genoma s kratkim očitanjima. This master’s thesis focuses on using an FM-index (built from a short read dataset) for correcting the long read dataset. The FM-index implicitly represents all de Bruijn graphs for arbitrary length kmers. Each long read shall be corrected using two passes - one with a shorter kmer, and the other with a longer K-mer. Short k-mer passes will do most of the correction, as they can pinpoint and correct defects in detail, while long K-mers will correct long and repetitive regions. After both passes are done, the accuracy of the long reads is increased. Because the long read dataset is made up of many independent long reads, this process can be parallelised. The corrected long reads can now be successfully used in downstream applications, such as genome assembly. The important thing to note is that genome assembly is much faster, simpler and more effective using the long read dataset, as compared to using the short read dataset.
- Published
- 2023
45. An optimized FM-index library for nucleotide and amino acid search
- Author
-
Travis J. Wheeler and Tim Anderson
- Subjects
Sequence analysis ,Computer science ,QH301-705.5 ,Context (language use) ,Parallel computing ,QH426-470 ,Compressed data structure ,Structural Biology ,Data_FILES ,Genetics ,Nucleotide ,Pattern matching ,Biology (General) ,Molecular Biology ,FM-index ,chemistry.chemical_classification ,Applied Mathematics ,Search engine indexing ,Software Article ,Amino acid ,Computational Theory and Mathematics ,chemistry ,Lookup table ,Key (cryptography) ,String matching ,SIMD vectorization - Abstract
Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is $$\sim $$ ∼ 2–4x faster than SeqAn3 for nucleotide search, and $$\sim $$ ∼ 2–6x faster for amino acid search; it is also $$\sim $$ ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
- Published
- 2021
46. Forty Years of Text Indexing
- Author
-
Apostolico, Alberto, Crochemore, Maxime, Farach-Colton, Martin, Galil, Zvi, Muthukrishnan, S., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Fischer, Johannes, editor, and Sanders, Peter, editor
- Published
- 2013
- Full Text
- View/download PDF
47. FEMTO: Fast Search of Large Sequence Collections
- Author
-
Ferguson, Michael P., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Kärkkäinen, Juha, editor, and Stoye, Jens, editor
- Published
- 2012
- Full Text
- View/download PDF
48. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification.
- Author
-
Song L and Langmead B
- Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
- Published
- 2023
- Full Text
- View/download PDF
49. PFP-FM: An Accelerated FM-index.
- Author
-
Hong A, Oliva M, Köppl D, Bannai H, Boucher C, and Gagie T
- Abstract
FM-indexes are a crucial data structure in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing-which takes parameters that let us tune the average length of the phrases-instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a minor increase in the memory. The source code for PFP-FM is available at https://github.com/marco-oliva/afm.
- Published
- 2023
- Full Text
- View/download PDF
50. A 135-mW Fully Integrated Data Processor for Next-Generation Sequencing.
- Author
-
Wu, Yi-Chung, Chang, Chia-Hua, Hung, Jui-Hung, and Yang, Chia-Hsiang
- Abstract
Next-generation sequencing (NGS) enables high-throughput sequencing, in which short DNA fragments can be sequenced in a massively parallel fashion. However, the essential algorithm behind the succeeding NGS data analysis, DNA mapping, is still excessively time consuming. DNA mapping can be partitioned into two parts: suffix array (SA) sorting and backward searching. Dedicated hardware designs for the less-complex backward searching have been proposed, but feasible hardware for the most complicated part, SA sorting, has never been explored. Based on the memory-efficient sBWT algorithm, this work is the first integrated NGS data processor for the entire DNA mapping. The $k$-ordered Ferragina and Manzini index used in the sBWT algorithm is leveraged to improve storage capacity and reduce hardware complexity. The proposed NGS data processor realizes the sBWT algorithm through bucket sorting, suffix grouping, and suffix sorting circuits. Key design parameters are analyzed to achieve the optimal performance with respect to hardware cost and execution time. Fabricated in 40-nm CMOS, the NGS data processor dissipates 135 mW at 200 MHz from a 0.9-V supply. With 1-GB external memory, the chip can analyze human DNA within 10 min. This work achieves 43 065 $\times$ and 8 971 $\times$ [3208 $\times$ and 402$\times$ ] higher energy efficiency (throughput-to-area ratio) than the high-end CPU and GPU solutions, respectively. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.