607 results on '"BURROWS-WHEELER TRANSFORM"'
Search Results
2. On arithmetically progressed suffix arrays and related Burrows–Wheeler transforms.
- Author
-
Daykin, Jacqueline W., Köppl, Dominik, Kübel, David, and Stober, Florian
- Subjects
- *
SUFFIXES & prefixes (Grammar) , *PERMUTATIONS , *PROBLEM solving , *ARITHMETIC series - Abstract
We characterize those strings whose suffix arrays are based on arithmetic progressions, in particular, arithmetically progressed permutations where all pairs of successive entries of the permutation have the same difference modulo the respective string length. We show that an arithmetically progressed permutation P coincides with the suffix array of a unary, binary, or ternary string. We further analyze the conditions of a given P under which we can find a uniquely defined string over either a binary or ternary alphabet having P as its suffix array. For the binary case, we show its connection to lower Christoffel words, balanced words, and Fibonacci words. In addition to solving the arithmetically progressed suffix array problem, we give the shape of the Burrows–Wheeler transform of those strings solving this problem. These results give rise to numerous future research directions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
3. Co-lexicographically Ordering Automata and Regular Languages - Part I.
- Author
-
COTUMACCIO, NICOLA, D'AGOSTINO, GIOVANNA, POLICRITI, ALBERTO, and PREZZA, NICOLA
- Abstract
Having established that the co-lex width of an automaton is a fundamental complexity measure, we proceed by (i) determining its computational complexity and (ii) extending this notion from automata to regular languages by studying their smallest-width accepting NFAs and DFAs. In this work we focus on the deterministic case and prove that a canonical minimum-width DFA accepting a language L-dubbed the Hasse automaton H of L-can be exhibited. H provides, in a precise sense, the best possible way to (partially) order the states of any DFA accepting L, as long as we want to maintain an operational link with the (colexicographic) order of L's prefixes. Finally, we explore the relationship between two conflicting objectives: minimizing the width and minimizing the number of states of a DFA. In this context, we provide an analogue of the Myhill-Nerode Theorem for co-lexicographically ordered regular languages. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
4. How to Find Long Maximal Exact Matches and Ignore Short Ones
- Author
-
Gagie, Travis, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Day, Joel D., editor, and Manea, Florin, editor
- Published
- 2024
- Full Text
- View/download PDF
5. A BWT-Based Algorithm for Random de Bruijn Sequence Construction
- Author
-
Lipták, Zsuzsanna, Parmigiani, Luca, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Soto, José A., editor, and Wiese, Andreas, editor
- Published
- 2024
- Full Text
- View/download PDF
6. Longest Common Prefix Arrays for Succinct k-Spectra
- Author
-
Alanko, Jarno N., Biagi, Elena, Puglisi, Simon J., Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Nardini, Franco Maria, editor, Pisanti, Nadia, editor, and Venturini, Rossano, editor
- Published
- 2023
- Full Text
- View/download PDF
7. Bit Catastrophes for the Burrows-Wheeler Transform
- Author
-
Giuliani, Sara, Inenaga, Shunsuke, Lipták, Zsuzsanna, Romana, Giuseppe, Sciortino, Marinella, Urbina, Cristian, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Drewes, Frank, editor, and Volkov, Mikhail, editor
- Published
- 2023
- Full Text
- View/download PDF
8. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space.
- Author
-
GAGIE, TRAVIS, NAVARRO, GONZALO, and PREZZA, NICOLA
- Subjects
PATTERNS (Mathematics) ,MAGNITUDE (Mathematics) ,TISSUE arrays ,TREES ,SPACE - Abstract
Indexing highly repetitive texts--such as genomic databases, software repositories and versioned text collections--has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w/log σ) factor, where σ is the alphabet size and w = Ω (log n) is the RAM machine size in bits, we support count and locate in O(⌈ m log (σ)/w ⌉) and O(⌈ m log (σ)/w ⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log (n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log (n/r)+ℓ log (σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log (n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log (n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1-2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1-2 orders of magnitude in space and/or 2-3 in time. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
9. Genomic Phylogeny Using the Maxwell TM Classifier Based on Burrows–Wheeler Transform.
- Author
-
Demongeot, Jacques, Gardes, Joël, Maldivi, Christophe, Boisset, Denis, Boufama, Kenza, and Touzouti, Imène
- Subjects
PHYLOGENY ,CIRCULAR RNA ,PEPTIDES ,RIBOSOMAL proteins ,GENOMES ,ARCHAEBACTERIA ,RNA - Abstract
Background: In present genomes, current relics of a circular RNA appear which could have played a central role as a primitive catalyst of the peptide genesis. Methods: Using a proximity measure to this circular RNA and the distance, a new unsupervised classifier called Maxwell
TM has been constructed based on the Burrows–Wheeler transform algorithm. Results: By applying the classifier to numerous genomes from various realms (Bacteria, Archaea, Vegetables and Animals), we obtain phylogenetic trees that are coherent with biological trees based on pure evolutionary arguments. Discussion: We discuss the role of the combinatorial operators responsible for the evolution of the genome of many species. Conclusions: We opened up possibilities for understanding the mechanisms of a primitive factory of peptides represented by an RNA ring. We showed that this ring was able to transmit some of its sub-sequences in the sequences of genes involved in the mechanisms of the current ribosomal production of proteins. [ABSTRACT FROM AUTHOR]- Published
- 2023
- Full Text
- View/download PDF
10. Scalable Text Index Construction
- Author
-
Bingmann, Timo, Dinklage, Patrick, Fischer, Johannes, Kurpicz, Florian, Ohlebusch, Enno, Sanders, Peter, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bast, Hannah, editor, Korzen, Claudius, editor, Meyer, Ulrich, editor, and Penschuck, Manuel, editor
- Published
- 2022
- Full Text
- View/download PDF
11. Accessing the Suffix Array via -Forest
- Author
-
Boucher, Christina, Köppl, Dominik, Perera, Herman, Rossi, Massimiliano, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Arroyuelo, Diego, editor, and Poblete, Barbara, editor
- Published
- 2022
- Full Text
- View/download PDF
12. Computing the Parameterized Burrows–Wheeler Transform Online
- Author
-
Hashimoto, Daiki, Hendrian, Diptarama, Köppl, Dominik, Yoshinaka, Ryo, Shinohara, Ayumi, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Arroyuelo, Diego, editor, and Poblete, Barbara, editor
- Published
- 2022
- Full Text
- View/download PDF
13. Logarithmic Equal-Letter Runs for BWT of Purely Morphic Words
- Author
-
Frosini, Andrea, Mancini, Ilaria, Rinaldi, Simone, Romana, Giuseppe, Sciortino, Marinella, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Diekert, Volker, editor, and Volkov, Mikhail, editor
- Published
- 2022
- Full Text
- View/download PDF
14. An efficient and secure compression technique for data protection using burrows-wheeler transform algorithm
- Author
-
M Baritha Begum, N. Deepa, Mueen Uddin, Rajesh Kaluri, Maha Abdelhaq, and Raed Alsaqour
- Subjects
Burrows-wheeler transform ,Run-length encoding (RLE) ,Encryption ,Move-to-front transform ,Keyed scrambling ,Science (General) ,Q1-390 ,Social sciences (General) ,H1-99 - Abstract
Data stored on physical storage devices and transmitted over communication channels often have a lot of redundant information, which can be reduced through compression techniques to conserve space and reduce the time it takes to transmit the data. The need for adequate security measures, such as secret key control in specific techniques, raises concerns about data exposure to potential attacks. Encryption plays a vital role in safeguarding information and maintaining its confidentiality by utilizing a secret key to make the data unreadable and unalterable. The focus of this paper is to tackle the challenge of simultaneously compressing and encrypting data without affecting the efficacy of either process. The authors propose an efficient and secure compression method incorporating a secret key to accomplish this goal. Encoding input data involves scrambling it with a generated key and then transforming it through the Burrows-Wheeler Transform (BWT). Subsequently, the output from the BWT is compressed through both Move-To-Front Transform and Run-Length Encoding. This method blends the cryptographic principles of confusion and diffusion into the compression process, enhancing its performance. The proposed technique is geared towards providing robust encryption and sufficient compression. Experimentation results show that it outperforms other techniques in terms of compression ratio. A security analysis of the technique has determined that it is susceptible to the secret key and plaintext, as measured by the unicity distance. Additionally, the results of the proposed technique showed a significant improvement with a compression ratio close to 90% after passing all the test text files.
- Published
- 2023
- Full Text
- View/download PDF
15. A fast algorithm for constructing suffix arrays for DNA alphabets
- Author
-
Zeinab Rabea, Sara El-Metwally, Samir Elmougy, and Magdi Zakaria
- Subjects
Suffixes ,Suffix arrays ,DNA alphabets ,Longest Common Prefix ,Burrows-Wheeler transform ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts.
- Published
- 2022
- Full Text
- View/download PDF
16. Genomic Phylogeny Using the MaxwellTM Classifier Based on Burrows–Wheeler Transform
- Author
-
Jacques Demongeot, Joël Gardes, Christophe Maldivi, Denis Boisset, Kenza Boufama, and Imène Touzouti
- Subjects
ribosome evolution ,unsupervised clustering ,MaxwellTM classifier ,Burrows–Wheeler transform ,primitive peptide factory ,Archetypal Loop ring ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Background: In present genomes, current relics of a circular RNA appear which could have played a central role as a primitive catalyst of the peptide genesis. Methods: Using a proximity measure to this circular RNA and the distance, a new unsupervised classifier called MaxwellTM has been constructed based on the Burrows–Wheeler transform algorithm. Results: By applying the classifier to numerous genomes from various realms (Bacteria, Archaea, Vegetables and Animals), we obtain phylogenetic trees that are coherent with biological trees based on pure evolutionary arguments. Discussion: We discuss the role of the combinatorial operators responsible for the evolution of the genome of many species. Conclusions: We opened up possibilities for understanding the mechanisms of a primitive factory of peptides represented by an RNA ring. We showed that this ring was able to transmit some of its sub-sequences in the sequences of genes involved in the mechanisms of the current ribosomal production of proteins.
- Published
- 2023
- Full Text
- View/download PDF
17. A fast algorithm for constructing suffix arrays for DNA alphabets.
- Author
-
Rabea, Zeinab, El-Metwally, Sara, Elmougy, Samir, and Zakaria, Magdi
- Subjects
SUFFIXES & prefixes (Grammar) ,DNA ,DATA structures ,ALGORITHMS ,SEQUENCE analysis - Abstract
The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
18. Wheeler graphs: A framework for BWT-based data structures.
- Author
-
Gagie, Travis, Manzini, Giovanni, and Sirén, Jouni
- Subjects
Burrows–Wheeler transform ,Compressed data structures ,Pattern matching - Abstract
The famous Burrows-Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more. We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly. We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs.
- Published
- 2017
19. r-indexing the eBWT.
- Author
-
Boucher, Christina, Cenzato, Davide, Lipták, Zsuzsanna, Rossi, Massimiliano, and Sciortino, Marinella
- Subjects
- *
DATA structures , *BACTERIAL genomes - Abstract
The extended Burrows-Wheeler Transform (eBWT) [Mantaci et al. TCS 2007] is a variant of the BWT, introduced for collections of strings. In this paper, we present the extended r-index , an analogous data structure to the r -index [Gagie et al. JACM 2020]. It occupies O (r) words, with r the number of runs of the eBWT, and offers the same functionalities as the r -index. We also show how to efficiently support finding maximal exact matches (MEMs). We implemented the extended r -index and tested it on circular bacterial genomes and plasmids, comparing it to five state-of-the-art compressed text indexes. While our data structure maintains similar time and memory requirements for answering pattern matching queries as the original r -index, it is the only index in the literature that can naturally be used for both circular and linear input collections. This is an extended version of [Boucher et al., r-indexing the eBWT, SPIRE 2021]. • Introducing the extended r -index (r -index for the eBWT). • Full cyclic pattern matching functionality in O (r) space. • Efficient construction of the extended r -index. • Computation of MEMs with the extended r -index. • Experimental comparison with several compressed indexes on real biological data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
20. On the Complexity of Recognizing Wheeler Graphs.
- Author
-
Gibney, Daniel and Thankachan, Sharma V.
- Subjects
- *
DE Bruijn graph , *POLYNOMIAL time algorithms , *CHARTS, diagrams, etc. , *GRAPH algorithms , *ISOMORPHISM (Mathematics) , *APPROXIMATION algorithms - Abstract
In recent years, several compressed indexes based on variants of the Burrows–Wheeler transform have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index (Ferragina and Manzini in J. ACM 52(4):552–581, https://doi.org/10.1145/1082036.1082039, 2005). As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs (Gagie et al. in Theor Comput Sci 698:67–78, https://doi.org/10.1016/j.tcs.2017.06.016, 2017). Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs, and that Wheeler graphs can be indexed in a space-efficient way. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We show: The problem of recognizing whether a given graph G = (V , E) is a Wheeler graph is NP-complete for any edge label alphabet of size σ ≥ 2 , even when G is a DAG. This holds even on a restricted subset of graphs called d-NFAs for d ≥ 5 . This is in contrast to recent results demonstrating the problem can be solved in polynomial time for d-NFAs where d ≤ 2 . We also show that the recognition problem can be solved in linear time for σ = 1 on graphs without self-loops; There exists an 2 e log σ + O (n + e) time exact algorithm where n = | V | and e = | E | . This algorithm relies on graph isomorphism being computable in strictly sub-exponential time; We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to identify the smallest set of edges that have to be removed from a graph to obtain a Wheeler graph. We show WGV is APX-hard, even when G is a DAG, implying there exists a constant C > 1 for which there is no C-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C > 1 , it is NP-hard to find a C-approximation, implying WGV is not in APX; We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler Graph (the dual of WGV). In contrast to WGV, we give an O (σ) -approximation algorithm for the WS problem, implying it is in APX for σ = O (1) . The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial-time solvable, raising the question of which properties determine this problem's difficulty. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. Burrows-Wheeler Transform of Words Defined by Morphisms
- Author
-
Brlek, Srecko, Frosini, Andrea, Mancini, Ilaria, Pergola, Elisa, Rinaldi, Simone, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Pandu Rangan, C., Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Colbourn, Charles J., editor, Grossi, Roberto, editor, and Pisanti, Nadia, editor
- Published
- 2019
- Full Text
- View/download PDF
22. Buffering updates enables efficient dynamic de Bruijn graphs
- Author
-
Jarno Alanko, Bahar Alipanahi, Jonathen Settle, Christina Boucher, and Travis Gagie
- Subjects
de Bruijn graph ,Dynamic data structures ,Succinct data structures ,Burrows-Wheeler transform ,Biotechnology ,TP248.13-248.65 - Abstract
Motivation: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space and time efficient manner. Results: With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al.,2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a), relies on dynamic bit vectors which are slow in theory and practice. To address this shortcoming, we present a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method, which we refer to as BufBOSS, and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude.
- Published
- 2021
- Full Text
- View/download PDF
23. An effective image compression technique based on burrows wheeler transform with set partitioning in hierarchical trees.
- Author
-
Arunpandian, Sekar and Dhenakaran, Subbaiah S.
- Subjects
IMAGE compression ,TREES ,EROSION - Abstract
In this article, the efficient image compression which consists of Burrows–Wheeler transform (BWT) with set partitioning in hierarchical trees (SPIHT). The main phases of the proposed system are: partitioning, compression of non‐ROI areas, Fusion and compression of ROI areas. To enhance the propose of the proposed methodology, the morphological functions are updated by dividing two types of images with the consideration of dilation and erosion control. After that, the convolution and correlation in the deformation provide good accuracy of the segmentation at the fastest speed. In this proposed methodology, SPIHT encryption is a lossy compression technique aimed at understanding the non‐ROI area, while the BWT is a lossless compression technique performed to achieve a summary of the ROI area. Finally, separating these two parts of the image merges the image and reconstructs it to the desired quality. The test sends a variety of images and analyzes the performance of the proposed system using compression ratio and PSNR measurements. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
24. Clustering and presorting for parallel burrows wheeler-based compression.
- Author
-
Voronin, Sergey, Borovikov, Eugene, and Hasan, Raqibul
- Subjects
LOSSLESS data compression ,DATA compression ,BIG data ,PERMUTATIONS - Abstract
We describe practical improvements for parallel BWT-based lossless compressors frequently utilized in modern day big data applications. We propose a clustering-based data permutation approach for improving compression ratio for data with significant alphabet variation along with a faster string sorting approach based on the application of the O (n) complexity counting sort with permutation reindexing. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
25. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections
- Author
-
Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, and Giovanna Rosone
- Subjects
Suffix array ,LCP array ,Burrows–Wheeler transform ,Document array ,String collections ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Background The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, as the Longest Common Prefix array, the Burrows–Wheeler transform, and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Result In this paper we introduce gsufsort, an open source software for constructing the suffix array and related data indexing structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI/C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22–39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
- Published
- 2020
- Full Text
- View/download PDF
26. Dynamic construction of pan-genome subgraphs
- Author
-
Dede Kadir and Ohlebusch Enno
- Subjects
compressed de bruijn graph ,burrows-wheeler transform ,backward search ,pan-genome analysis ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Marcus et al. (Bioinformatics 2014) proposed to use a compressed de Bruijn graph as a description of a pan-genome, comprising the genomes of many individuals/strains of the same or closely related species. Subsequent work improved the construction of the compressed de Bruijn graph in terms of run-time and memory consumption. According to the Computational Pan-Genomics Consortium (Briefings in Bioinformatics 2016), a pan-genome data structure should support the following functionality: “All information within a data structure should be easily accessible for human eyes by visualization support on different scales.” However, a pan-genome graph can have thousands to millions of nodes and such an amount of information is certainly not easily accessible for human eyes. Thus, the possibility to construct pangenome subgraphs on demand would be quite valuable. In this article, we use the space-efficient representation of the compressed de Bruijn graph devised by Beller and Ohle-busch (Algorithms for Molecular Biology 2016) to construct pan-genome subgraphs on the fly. The user can specify a region in one of the genomes and the software tool will build a subgraph that contains the path corresponding to that region and all paths that are in the neighborhood of that path. The size of the neighborhood can be controlled by the user.
- Published
- 2020
- Full Text
- View/download PDF
27. Computing Burrows-Wheeler Similarity Distributions for String Collections
- Author
-
Louza, Felipe A., Telles, Guilherme P., Gog, Simon, Zhao, Liang, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Weikum, Gerhard, Series Editor, Gagie, Travis, editor, Moffat, Alistair, editor, Navarro, Gonzalo, editor, and Cuadros-Vargas, Ernesto, editor
- Published
- 2018
- Full Text
- View/download PDF
28. The Colored Longest Common Prefix Array Computed via Sequential Scans
- Author
-
Garofalo, Fabio, Rosone, Giovanna, Sciortino, Marinella, Verzotto, Davide, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Weikum, Gerhard, Series Editor, Gagie, Travis, editor, Moffat, Alistair, editor, Navarro, Gonzalo, editor, and Cuadros-Vargas, Ernesto, editor
- Published
- 2018
- Full Text
- View/download PDF
29. Computation of the suffix array, Burrows-Wheeler transform and FM-index in V-order.
- Author
-
Daykin, Jacqueline W., Mhaskar, Neerja, and Smyth, W.F.
- Subjects
- *
SUFFIXES & prefixes (Grammar) , *LINEAR orderings , *FACTORIZATION - Abstract
V -order is a total order on strings that determines an instance of Unique Maximal Factorization Families (UMFFs), a generalization of Lyndon words. The fundamental V -comparison of strings can be done in linear time and constant space. V -order has been proposed as an alternative to lexicographic order (lexorder) in the computation of suffix arrays and in the suffix-sorting induced by the Burrows-Wheeler transform (BWT). In line with the recent interest in the connection between suffix arrays and Lyndon factorization, in this paper we obtain similar results for the V -order factorization. Indeed, we show that the results describing the connection between suffix arrays and Lyndon factorization are matched by analogous V -order processing. We also describe a methodology for efficiently computing the FM-Index in V -order, as well as V -order substring pattern matching using backward search. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
30. Constructing and indexing the bijective and extended Burrows–Wheeler transform.
- Author
-
Bannai, Hideo, Kärkkäinen, Juha, Köppl, Dominik, and Pia̧tkowski, Marcin
- Subjects
- *
DATA compression , *INDEXING - Abstract
The Burrows–Wheeler transform (BWT) is a permutation whose applications are prevalent in data compression and text indexing. The bijective BWT is a bijective variant of it that has not yet been studied for text indexing applications. We fill this gap by proposing a self-index built on the bijective BWT. The self-index applies the backward search technique of the FM-index to find a pattern P with O (| P | lg | P |) backward search steps. Additionally, we propose the first linear-time construction algorithm that is based on SAIS, improving the best known result of O (n lg n / lg lg n) time to linear. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
31. Prefix-free parsing for building big BWTs
- Author
-
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun
- Subjects
Burrows-Wheeler Transform ,Prefix-free parsing ,Compression-aware algorithms ,Genomic databases ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract High-throughput sequencing technologies have led to explosive growth of genomic databases; one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly-repetitive—a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input, and in one-pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|)-time. Our experiments show that D and P are significantly smaller than T in practice, and thus, can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build an 131-MB run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 h using 21 GB of memory, suggesting that we can build a 6.73 GB index for 1000 complete human-genome haplotypes in approximately 102 h using about 1 TB of memory.
- Published
- 2019
- Full Text
- View/download PDF
32. External memory BWT and LCP computation for sequence collections with applications
- Author
-
Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles
- Subjects
Burrows–Wheeler Transform ,Longest common prefix array ,Maximal repeats ,All pairs suffix–prefix overlaps ,Succinct de Bruijn graph ,External memory algorithms ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Background Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. Results We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions We prove that our algorithm performs $${\mathcal {O}}(n\, \mathsf {maxlcp})$$ O(nmaxlcp) sequential I/Os, where n is the total length of the collection and $$\mathsf {maxlcp}$$ maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
- Published
- 2019
- Full Text
- View/download PDF
33. Greedy Shortest Common Superstring Approximation in Compact Space
- Author
-
Alanko, Jarno, Norri, Tuukka, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Fici, Gabriele, editor, Sciortino, Marinella, editor, and Venturini, Rossano, editor
- Published
- 2017
- Full Text
- View/download PDF
34. Burrows-Wheeler Transform and Run-Length Enconding
- Author
-
Mantaci, Sabrina, Restivo, Antonio, Rosone, Giovanna, Sciortino, Marinella, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Brlek, Srečko, editor, Dolce, Francesco, editor, Reutenauer, Christophe, editor, and Vandomme, Élise, editor
- Published
- 2017
- Full Text
- View/download PDF
35. Computing the multi-string BWT and LCP array in external memory.
- Author
-
Bonizzoni, Paola, Della Vedova, Gianluca, Pirola, Yuri, Previtali, Marco, and Rizzi, Raffaella
- Subjects
- *
MEMORY , *SUFFIXES & prefixes (Grammar) , *GENERALIZATION , *ALGORITHMS , *GENOMES - Abstract
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows–Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O (m k l) time and I/O volume, using O (k + m) main memory, where l is the maximum value in the LCP array. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
36. Weighted Burrows–Wheeler Compression
- Author
-
Fruchtman, Aharon, Gross, Yoav, Klein, Shmuel T., and Shapira, Dana
- Published
- 2023
- Full Text
- View/download PDF
37. Space-efficient construction of compressed suffix trees.
- Author
-
Prezza, Nicola and Rosone, Giovanna
- Subjects
- *
SUFFIXES & prefixes (Grammar) , *TREES , *DATA structures , *TIME management - Abstract
We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let n be the text length and σ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O (n log σ) time using just o (n log σ) bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 0 < ϵ ≤ 1 we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in O (n (log σ + ϵ − 1 ⋅ log log n)) time using at most n log σ ⋅ (ϵ + o (1)) bits of working space on top of the input re-writable BWT and the output. For example, we can build a compressed suffix tree from the BWT using just succinct working space (i.e. o (n log σ) bits) and Θ (n log σ + n (log log n) 1 + δ) time, for any constant δ > 0. This improves the previous most space-efficient algorithms, which worked in O (n) bits and O (n log n) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O (n log σ) time and using just o (n log σ) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as n bits on top of a packed representation of the input/output and process data as fast as 2.92 megabases per second. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
38. A combinatorial view on string attractors.
- Author
-
Mantaci, Sabrina, Restivo, Antonio, Romana, Giuseppe, Rosone, Giovanna, and Sciortino, Marinella
- Subjects
- *
CONJUGACY classes - Abstract
The notion of string attractor has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word w = w 1 w 2 ⋯ w n is a subset Γ of the positions { 1 , ... , n } , such that all distinct factors of w have an occurrence crossing at least one of the elements of Γ. In this paper we explore the notion of string attractor by focusing on its combinatorial properties. In particular, we show how the size of the smallest string attractor of a word varies when combinatorial operations are applied and we deduce that such a measure is not monotone. Moreover, we introduce a circular variant of the notion of string attractor to provide a characterization of the conjugacy classes of standard Sturmian words. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
39. A survey of BWT variants for string collections.
- Author
-
Cenzato D and Lipták Z
- Abstract
Motivation: In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows-Wheeler Transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT employed has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including the number r of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different order., Results: We studied 18 dedicated tools for computing the BWT of string collections and were able to identify 6 different BWT variants computed by these tools. We review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on 8 real-life biological datasets with different characteristics. We find that the differences can be extensive, depending on the datasets, and are largest on collections of many similar short sequences. The parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2., Availability: Source code and scripts to replicate the results and download the data used in the article are available at https://github.com/davidecenzato/BWT-variants-for-string-collections., Supplementary Information: Supplementary data are available at Bioinformatics online., (© The Author(s) 2024. Published by Oxford University Press.)
- Published
- 2024
- Full Text
- View/download PDF
40. Large-Alphabet Semi-Static Entropy Coding Via Asymmetric Numeral Systems.
- Author
-
MOFFAT, ALISTAIR and PETRI, MATTHIAS
- Subjects
- *
NUMBER systems , *ENTROPY (Information theory) , *HUFFMAN codes , *VIDEO coding , *CODING theory , *TOPOLOGICAL entropy - Abstract
An entropy coder takes as input a sequence of symbol identifiers over some specified alphabet and represents that sequence as a bitstring using as few bits as possible, typically assuming that the elements of the sequence are independent of each other. Previous entropy coding methods include the well-known Huffman and arithmetic approaches. Here we examine the newer asymmetric numeral systems (ANS) technique for entropy coding and develop mechanisms that allow it to be efficiently used when the size of the source alphabet is large--thousands or millions of symbols. In particular, we examine different ways in which probability distributions over large alphabets can be approximated and in doing so infer techniques that allow the ANS mechanism to be extended to support large-alphabet entropy coding. As well as providing a full description of ANS, we also present detailed experiments using several different types of input, including data streams arising as typical output from the modeling stages of text compression software, and compare the proposed ANS variants with Huffman and arithmetic coding baselines, measuring both compression effectiveness and also encoding and decoding throughput. We demonstrate that in applications in which semi-static compression is appropriate, ANS-based coders can provide an excellent balance between compression effectiveness and speed, even when the alphabet is large. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
41. Burrows‐Wheeler post‐transformation with effective clustering and interpolative coding.
- Author
-
Niemi, Arto and Teuhola, Jukka
- Subjects
RUN-length encoding ,LOSSLESS data compression ,HUFFMAN codes ,CIPHERS ,INTEGERS - Abstract
Summary: Lossless compression methods based on the Burrows‐Wheeler transform (BWT) are regarded as an excellent compromise between speed and compression efficiency: they provide compression rates close to the PPM algorithms, with the speed of dictionary‐based methods. Instead of the laborious statistics‐gathering process used in PPM, the BWT reversibly sorts the input symbols, using as the sort key as many following characters as necessary to make the sort unique. Characters occurring in similar contexts are sorted close together, resulting in a clustered symbol sequence. Run‐length encoding and Move‐to‐Front (MTF) recoding, combined with a statistical Huffman or arithmetic coder, is then typically used to exploit the clustering. A drawback of the MTF recoding is that knowledge of the character that produced the MTF number is lost. In this paper, we present a new, competitive Burrows‐Wheeler posttransform stage that takes advantage of interpolative coding—a fast binary encoding method for integer sequences, being able to exploit clusters without requiring explicit statistics. We introduce a fast and simple way to retain knowledge of the run characters during the MTF recoding and use this to improve the clustering of MTF numbers and run‐lengths by applying reversible, stable sorting, with the run characters as sort keys, achieving significant improvement in the compression rate, as shown here by experiments on common text corpora. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
42. A hybrid encoding method for reducing code stream in EICA-optimized SMVQ reversible data hiding.
- Author
-
Prabha, K. R. and Jagadeeswari, M.
- Subjects
- *
IMPERIALIST competitive algorithm , *VECTOR quantization , *RIVERS , *DATA compression , *CIPHERS , *CODING theory - Abstract
An enhanced imperialist competitive algorithm (EICA)-optimized side-match vector quantization (SMVQ)-based reversible data hiding (VQ-SMVQ with EICA) was developed for embedding higher number of secret bits. This technique improved the embedding capacity rate of compression and quality of reconstruction. However, for improving security and minimizing the length of the output code stream left-upper coding (LUC) was along with SMVQ, since LUC algorithm is more appropriate for compression of SMVQ index table. In this paper, the code stream size is reduced by proposing an efficient encoding approach. Initially an improved locally adaptive coding (ILA) scheme is used for effective distribution of the position of SMVQ indices based on adaptive threshold value. A locally adaptive data compression scheme (LAS) is utilized with IAS to improve the compression rate of secret data simultaneously with embedding capacity. The combined technique is achieved highly reduced length of output code stream. Thus, EICA-optimized SMVQ reversible data hiding with hybridized ILA and LAS outperforms than existing schemes. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
43. Parallel computation of the Burrows Wheeler Transform in compact space.
- Author
-
Fuentes-Sepúlveda, José, Navarro, Gonzalo, and Nekrich, Yakov
- Subjects
- *
PARALLEL algorithms , *DATA structures , *ALGORITHMS - Abstract
The Burrows-Wheeler Transform (BWT) has become since its introduction a key tool for representing large text collections in compressed space while supporting indexed searching: on a text of length n over an alphabet of size σ , it requires O (n lg σ) bits of space, instead of the O (n lg n) bits required by classical indexes. A challenge for its adoption is to build it within O (n lg σ) bits as well. There are some sequential algorithms building it within this space, but no such a parallel algorithm. In this paper we present a PRAM CREW algorithm to build the BWT using O (n lg n) work, O (lg 3 n / lg σ) depth, and O (n lg σ) bits. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
44. Efficient Construction of a Complete Index for Pan-Genomics Read Alignment.
- Author
-
Kuhnle, Alan, Mun, Taher, Boucher, Christina, Gagie, Travis, Langmead, Ben, and Manzini, Giovanni
- Subjects
- *
DATA structures , *CONSTRUCTION , *SEQUENCE alignment , *GENOMES - Abstract
Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that—when used with the rank data structure—allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that (SODA 2018) has defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time and over the hybrid index-based CHIC method with respect to query time and memory required for indexing. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
45. Matching Reads to Many Genomes with the r-Index.
- Author
-
Mun, Taher, Kuhnle, Alan, Boucher, Christina, Gagie, Travis, Langmead, Ben, and Manzini, Giovanni
- Subjects
- *
GENOMES , *FILES (Records) , *PATTERN matching - Abstract
The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This article shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on an FASTA file to build an r-index for that file; and how to query that index with ri-align. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
46. Linear-time String Indexing and Analysis in Small Space.
- Author
-
BELAZZOUGUI, DJAMAL, CUNIAL, FABIO, KÄRKKÄINEN, JUHA, and MÄKINEN, VELI
- Subjects
TIME management ,GENETIC techniques ,SPACETIME ,DATA structures ,NUCLEOTIDE sequence - Abstract
The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many largescale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis: We show that the BWT of a string T â {1, . . ., σ }n can be built in deterministic O(n) time using just O(n log σ) bits of space, where σ ⤠n. Deterministic linear time is achieved by exploiting a new partial rank data structure that supports queries in constant time and that might have independent interest. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of T. Many fundamental string analysis problems, such as maximal repeats, maximal unique matches, and string kernels, can be mapped to such enumeration and can thus be solved in deterministic O(n) time and in O(n log σ) bits of space fromthe input string by tailoring the enumeration algorithm to some problem-specific computations. We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O(n) time and in O(n log σ) bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used O(n log σ) bits of space, took O(n log log σ) time for the first two structures and O(n logϵ n) time for the third, where ϵ is any positive constant smaller than one. Alternatively, the BWT could be previously built in linear time if onewas willing to spendO(n log σ log logσ n) bits of space. Contrary to the state-of-the-art, our bidirectional BWT index supports every operation in constant time per element in its output. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
47. Dynamic construction of pan-genome subgraphs.
- Author
-
Dede, Kadir and Ohlebusch, Enno
- Published
- 2020
- Full Text
- View/download PDF
48. Generalized enhanced suffix array construction in external memory
- Author
-
Felipe A. Louza, Guilherme P. Telles, Steve Hoffmann, and Cristina D. A. Ciferri
- Subjects
Suffix array ,LCP array ,Burrows–Wheeler transform ,External memory algorithms ,String collections ,Biology (General) ,QH301-705.5 ,Genetics ,QH426-470 - Abstract
Abstract Background Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. Results In this article we present and analyze $$\mathsf {eGSA}$$ eGSA [introduced in CPM (External memory generalized suffix and $$\mathsf {LCP}$$ LCP arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, $$\mathsf {eGSA}$$ eGSA showed a competitive performance when compared to $$\mathsf {eSAIS}$$ eSAIS and $$\mathsf {SAscan}$$ SAscan , which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. Conclusions The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time.
- Published
- 2017
- Full Text
- View/download PDF
49. Compressed Suffix Array
- Author
-
Belazzougui, Djamal, Mäkinen, Veli, Valenzuela, Daniel, and Kao, Ming-Yang, editor
- Published
- 2016
- Full Text
- View/download PDF
50. Burrows-Wheeler Transform
- Author
-
Ferragina, Paolo, Manzini, Giovanni, and Kao, Ming-Yang, editor
- Published
- 2016
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.