191 results for "Puglisi, Simon J."
Search Results
2. Computing Longest Lyndon Subsequences and Longest Common Lyndon Subsequences
- Author
-
Bannai, Hideo, I, Tomohiro, Kociumaka, Tomasz, Köppl, Dominik, and Puglisi, Simon J.
- Published
- 2024
- Full Text
- View/download PDF
3. String inference from longest-common-prefix array
- Author
-
Kärkkäinen, Juha, Piątkowski, Marcin, and Puglisi, Simon J.
- Published
- 2023
- Full Text
- View/download PDF
4. Fast and Simple Compact Hashing via Bucketing
- Author
-
Köppl, Dominik, Puglisi, Simon J., and Raman, Rajeev
- Published
- 2022
- Full Text
- View/download PDF
5. Block trees
- Author
-
Belazzougui, Djamal, Cáceres, Manuel, Gagie, Travis, Gawrychowski, Paweł, Kärkkäinen, Juha, Navarro, Gonzalo, Ordóñez, Alberto, Puglisi, Simon J., and Tabei, Yasuo
- Published
- 2021
- Full Text
- View/download PDF
6. Tight upper and lower bounds on suffix tree breadth
- Author
-
Badkobeh, Golnaz, Gawrychowski, Paweł, Kärkkäinen, Juha, Puglisi, Simon J., and Zhukova, Bella
- Published
- 2021
- Full Text
- View/download PDF
7. Lempel–Ziv-Like Parsing in Small Space
- Author
-
Kosolobov, Dmitry, Valenzuela, Daniel, Navarro, Gonzalo, and Puglisi, Simon J.
- Published
- 2020
- Full Text
- View/download PDF
8. Fixed Block Compression Boosting in FM-Indexes: Theory and Practice
- Author
-
Gog, Simon, Kärkkäinen, Juha, Kempa, Dominik, Petri, Matthias, and Puglisi, Simon J.
- Published
- 2019
- Full Text
- View/download PDF
9. Closed factorization
- Author
-
Badkobeh, Golnaz, Bannai, Hideo, Goto, Keisuke, I, Tomohiro, Iliopoulos, Costas S., Inenaga, Shunsuke, Puglisi, Simon J., and Sugimoto, Shiho
- Published
- 2016
- Full Text
- View/download PDF
10. Longest Common Prefix Arrays for Succinct k-Spectra
- Author
-
Alanko, Jarno N., Biagi, Elena, and Puglisi, Simon J.
- Subjects
FOS: Computer and information sciences ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) - Abstract
The k-spectrum of a string is the set of all distinct substrings of length k occurring in the string. K-spectra have many applications in bioinformatics including pseudoalignment and genome assembly. The Spectral Burrows-Wheeler Transform (SBWT) has been recently introduced as an algorithmic tool to efficiently represent and query these objects. The longest common prefix (LCP) array for a k-spectrum is an array of length n that stores the length of the longest common prefix of adjacent k-mers as they occur in lexicographical order. The LCP array has at least two important applications, namely to accelerate pseudoalignment algorithms using the SBWT and to allow simulation of variable-order de Bruijn graphs within the SBWT framework. In this paper we explore algorithms to compute the LCP array efficiently from the SBWT representation of the k-spectrum. Starting with a straightforward O(nk) time algorithm, we describe algorithms that are efficient in both theory and practice. We show that the LCP array can be computed in optimal O(n) time, where n is the length of the SBWT of the spectrum. In practical genomics scenarios, we show that this theoretically optimal algorithm is indeed practical, but is often outperformed on smaller values of k by an asymptotically suboptimal algorithm that interacts better with the CPU cache. Our algorithms share some features with both classical Burrows-Wheeler inversion algorithms and LCP array construction algorithms for suffix arrays.
- Published
- 2023
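The LCP array described in the abstract above is easy to pin down with a small reference computation. The sketch below (function name hypothetical, not from the paper) implements a straightforward O(nk)-style baseline over the sorted k-mer list; the paper's contribution is computing the array directly from the SBWT, ultimately in optimal O(n) time.

```python
def kmer_lcp_array(text, k):
    """Compute the LCP array of a k-spectrum by direct comparison.

    Collect the distinct k-mers, sort them lexicographically, and compare
    each adjacent pair character by character: O(n*k) overall. The paper's
    algorithms avoid materializing the k-mers and work on the SBWT instead.
    """
    spectrum = sorted({text[i:i + k] for i in range(len(text) - k + 1)})
    lcp = [0]  # by convention, the first k-mer has no predecessor
    for prev, cur in zip(spectrum, spectrum[1:]):
        l = 0
        while l < k and prev[l] == cur[l]:
            l += 1
        lcp.append(l)
    return spectrum, lcp
```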
11. Kohdista: an efficient method to index and query possible Rmap alignments
- Author
-
Muggli, Martin D., Puglisi, Simon J., and Boucher, Christina
- Published
- 2019
- Full Text
- View/download PDF
12. Approximate pattern matching in LZ77-compressed texts
- Author
-
Gagie, Travis, Gawrychowski, Paweł, and Puglisi, Simon J.
- Published
- 2015
- Full Text
- View/download PDF
13. Block Graphs in Practice
- Author
-
Gagie, Travis, Hoobin, Christopher, and Puglisi, Simon J.
- Published
- 2017
- Full Text
- View/download PDF
14. Document retrieval on repetitive string collections
- Author
-
Gagie, Travis, Hartikainen, Aleksi, Karhu, Kalle, Kärkkäinen, Juha, Navarro, Gonzalo, Puglisi, Simon J., and Sirén, Jouni
- Published
- 2017
- Full Text
- View/download PDF
15. Simple Runs-Bounded FM-Index Designs Are Fast
- Author
-
Díaz-Domínguez, Diego, Dönges, Saska, Puglisi, Simon J., and Salmela, Leena
- Subjects
data structures ,Theory of computation → Design and analysis of algorithms ,efficient algorithms - Abstract
Given a string X of length n over an alphabet of size σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position, counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice., LIPIcs, Vol. 265, 21st International Symposium on Experimental Algorithms (SEA 2023), pages 7:1-7:16
- Published
- 2023
- Full Text
- View/download PDF
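The first design in the abstract above can be sketched, minus the run-length coding of each block, as a blocked rank structure: absolute per-symbol counts at block boundaries plus a short scan inside the block. This is an illustrative simplification (0-based, prefix-exclusive indexing), not the paper's engineered implementation.

```python
class BlockedRank:
    """Blocked rank over a string L: cut L into blocks of b symbols and,
    at each block boundary, store absolute counts of every symbol seen so
    far. rank(c, i) counts occurrences of c in the prefix L[:i] (0-based,
    exclusive; the paper uses 1-based L[1..i])."""

    def __init__(self, L, b=4):
        self.L, self.b = L, b
        self.counts = []  # counts[j] = symbol counts in L[:j*b]
        running = {}
        for j in range(0, len(L), b):
            self.counts.append(dict(running))
            for c in L[j:j + b]:
                running[c] = running.get(c, 0) + 1

    def rank(self, c, i):
        # Locate the block containing position i, take the stored count of c
        # before that block, then scan inside the block up to position i.
        j = min(i // self.b, len(self.counts) - 1)
        return self.counts[j].get(c, 0) + self.L[j * self.b:i].count(c)
```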
16. Subset Wavelet Trees
- Author
-
Alanko, Jarno N., Biagi, Elena, Puglisi, Simon J., and Vuohtoniemi, Jaakko
- Subjects
degenerate strings ,compressed data structures ,data structures ,string processing ,succinct data structures ,Theory of computation → Design and analysis of algorithms ,efficient algorithms ,Theory of computation → Data structures design and analysis - Abstract
Given an alphabet Σ of σ = |Σ| symbols, a degenerate (or indeterminate) string X is a sequence X = X[0], X[1], …, X[n-1] of n subsets of Σ. Since their introduction in the mid-1970s, degenerate strings have been widely studied, with applications driven by their being a natural model for sequences in which there is a degree of uncertainty about the precise symbol at a given position, such as those arising in genomics and proteomics. In this paper we introduce a new data structural tool for degenerate strings, called the subset wavelet tree (SubsetWT). A SubsetWT supports two basic operations on degenerate strings: subset-rank(i,c), which returns the number of subsets up to the i-th subset in the degenerate string that contain the symbol c; and subset-select(i,c), which returns the index in the degenerate string of the i-th subset that contains symbol c. These queries are analogs of rank and select queries that have been widely studied for ordinary strings. Via experiments in a real genomics application in which degenerate strings are fundamental, we show that subset wavelet trees are practical data structures, and in particular offer an attractive space-time tradeoff. Along the way we investigate data structures for supporting (normal) rank queries on base-4 and base-3 sequences, which may be of independent interest. Our C++ implementations of the data structures are available at https://github.com/jnalanko/SubsetWT., LIPIcs, Vol. 265, 21st International Symposium on Experimental Algorithms (SEA 2023), pages 4:1-4:14
- Published
- 2023
- Full Text
- View/download PDF
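The two queries defined in the abstract above have a simple linear-scan reference semantics, sketched below. These naive implementations only fix what the queries mean (i treated as a 1-based count, as usual for rank/select); the SubsetWT's point is to answer them far faster in succinct space.

```python
def subset_rank(X, i, c):
    """subset-rank(i, c): number of subsets among the first i positions of
    the degenerate string X (a list of sets) that contain symbol c."""
    return sum(1 for S in X[:i] if c in S)

def subset_select(X, i, c):
    """subset-select(i, c): 0-based index of the i-th subset containing c,
    or None if fewer than i subsets contain c."""
    seen = 0
    for pos, S in enumerate(X):
        if c in S:
            seen += 1
            if seen == i:
                return pos
    return None
```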
17. Order-preserving matching
- Author
-
Kim, Jinil, Eades, Peter, Fleischer, Rudolf, Hong, Seok-Hee, Iliopoulos, Costas S., Park, Kunsoo, Puglisi, Simon J., and Tokuyama, Takeshi
- Published
- 2014
- Full Text
- View/download PDF
18. Enhanced string covering
- Author
-
Flouri, Tomáš, Iliopoulos, Costas S., Kociumaka, Tomasz, Pissis, Solon P., Puglisi, Simon J., Smyth, W.F., and Tyczyński, Wojciech
- Published
- 2013
- Full Text
- View/download PDF
19. Colored range queries and document retrieval
- Author
-
Gagie, Travis, Kärkkäinen, Juha, Navarro, Gonzalo, and Puglisi, Simon J.
- Published
- 2013
- Full Text
- View/download PDF
20. Hierarchical Relative Lempel-Ziv Compression
- Author
-
Bille, Philip, Gørtz, Inge Li, Puglisi, Simon J., and Tarnow, Simon R.
- Subjects
FOS: Computer and information sciences ,compressed representation ,data structures ,LZ77 ,RLZ ,Theory of computation → Design and analysis of algorithms ,Computer Science - Data Structures and Algorithms ,string collections ,efficient algorithms ,Data Structures and Algorithms (cs.DS) ,Lempel-Ziv compression ,Relative compression - Abstract
Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the now cheap cost of DNA sequencing, such datasets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme, hierarchical relative Lempel-Ziv (HRLZ), which forms a rooted tree (or hierarchy) on the strings and then compresses each string using RLZ with its parent as reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing children with respect to their parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single-reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a complete weighted digraph of the strings, with edge weights given by the number of phrases in the RLZ parsing between source and destination vertices. We further show that instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely affecting compression performance., LIPIcs, Vol. 265, 21st International Symposium on Experimental Algorithms (SEA 2023), pages 18:1-18:16
- Published
- 2022
21. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes.
- Author
-
Alanko, Jarno N., Vuohtoniemi, Jaakko, Mäklin, Tommi, and Puglisi, Simon J.
- Subjects
MICROBIAL genomes ,SALMONELLA enterica ,BACTERIAL genomes ,DATA structures ,METAGENOMICS ,C++ - Abstract
Motivation Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. In order to efficiently make use of these datasets, efficient indexing data structures, both scalable and with rapid query throughput, are paramount. Results Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11,000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. Availability and implementation Themisto is available and documented as a C++ package at https://github.com/algbio/themisto available under the GPLv2 license.
- Published
- 2023
- Full Text
- View/download PDF
22. New algorithms on wavelet trees and applications to information retrieval
- Author
-
Gagie, Travis, Navarro, Gonzalo, and Puglisi, Simon J.
- Published
- 2012
- Full Text
- View/download PDF
23. Editorial: Special Issue on “Combinatorial Algorithms” (IWOCA 2016)
- Author
-
Mäkinen, Veli and Puglisi, Simon J.
- Published
- 2018
- Full Text
- View/download PDF
24. Computing Longest (Common) Lyndon Subsequences
- Author
-
Bannai, Hideo, I, Tomohiro, Kociumaka, Tomasz, Köppl, Dominik, and Puglisi, Simon J.
- Subjects
FOS: Computer and information sciences ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) - Abstract
Given a string $T$ with length $n$ whose characters are drawn from an ordered alphabet of size $\sigma$, its longest Lyndon subsequence is a longest subsequence of $T$ that is a Lyndon word. We propose algorithms for finding such a subsequence in $O(n^3)$ time with $O(n)$ space, or online in $O(n^3 \sigma)$ space and time. Our first result can be extended to find the longest common Lyndon subsequence of two strings of length $n$ in $O(n^4 \sigma)$ time using $O(n^3)$ space.
- Published
- 2022
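The object sought in the abstract above is easy to pin down in code. A nonempty string is a Lyndon word iff it is strictly smaller, lexicographically, than every one of its proper suffixes. The brute-force search below is exponential and usable only for tiny inputs; it is included purely to fix the definition, not as a substitute for the paper's O(n³)-time algorithm.

```python
from itertools import combinations

def is_lyndon(w):
    """A nonempty string is a Lyndon word iff it is strictly smaller,
    lexicographically, than all of its proper suffixes."""
    return all(w < w[i:] for i in range(1, len(w)))

def longest_lyndon_subsequence(t):
    """Exponential brute force over subsequences, longest first; for tiny
    inputs only (the paper achieves O(n^3) time and O(n) space)."""
    for r in range(len(t), 0, -1):
        for idx in combinations(range(len(t)), r):
            cand = "".join(t[i] for i in idx)
            if is_lyndon(cand):
                return cand
    return ""
```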
25. Suffix Sorting via Matching Statistics
- Author
-
Lipták, Zsuzsanna, Masillo, Francesco, Puglisi, Simon J., Boucher, Christina, Rahmann, Sven, Department of Computer Science, Bioinformatics, Algorithmic Bioinformatics, and Helsinki Institute for Information Technology
- Subjects
FOS: Computer and information sciences ,compressed representation ,data structures ,Theory of computation → Design and analysis of algorithms ,Computer Science - Data Structures and Algorithms ,Generalized suffix array ,string collections ,efficient algorithms ,Data Structures and Algorithms (cs.DS) ,matching statistics ,113 Computer and information sciences - Abstract
We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest., Comment: 16 pages, 4 figures; accepted at WABI 2022 (Workshop on Algorithms in Bioinformatics, Sept. 5-9, 2022, Potsdam, Germany)
- Published
- 2022
- Full Text
- View/download PDF
26. Document retrieval hacks
- Author
-
Puglisi, Simon J., Zhukova, Bella, Coudert, David, Natale, Emanuele, Helsinki Institute for Information Technology, Algorithmic Bioinformatics, Bioinformatics, and Department of Computer Science
- Subjects
Document retrieval ,Document listing ,Repetitive text collections ,String processing ,Succinct data structures ,Information systems → Data compression ,Pattern matching ,113 Computer and information sciences - Abstract
Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics - to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space consuming than the best specialized indexes for the problem., LIPIcs, Vol. 190, 19th International Symposium on Experimental Algorithms (SEA 2021), pages 12:1-12:12
- Published
- 2021
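In the spirit of the simple scanning-and-hashing combinations the abstract above advocates, a minimal document-listing baseline can be sketched as follows. All names and the interface are illustrative assumptions, not the paper's code: occurrence positions are presumed to come from some pattern-matching index over the concatenated collection.

```python
from bisect import bisect_right

def list_documents(occ_positions, doc_starts):
    """Scan-and-hash baseline for document listing: given the text
    positions where the pattern occurs (e.g. reported by a suffix-array
    search over the concatenation) and the sorted start offsets of the
    documents, map each occurrence to its document by binary search and
    deduplicate the document IDs with a hash set."""
    return sorted({bisect_right(doc_starts, p) - 1 for p in occ_positions})
```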
27. How many runs can a string contain?
- Author
-
Puglisi, Simon J., Simpson, Jamie, and Smyth, W.F.
- Published
- 2008
- Full Text
- View/download PDF
28. Fast, Practical Algorithms for Computing All the Repeats in a String
- Author
-
Puglisi, Simon J., Smyth, W. F., and Yusufu, Munina
- Published
- 2010
- Full Text
- View/download PDF
29. Weighted Ancestors in Suffix Trees Revisited
- Author
-
Belazzougui, Djamal, Kosolobov, Dmitry, Puglisi, Simon J., Raman, Rajeev, Gawrychowski, Pawel, Starikovskaya, Tatiana, Department of Computer Science, Helsinki Institute for Information Technology, Algorithmic Bioinformatics, and Bioinformatics
- Subjects
FOS: Computer and information sciences ,Computer Science - Data Structures and Algorithms ,suffix tree ,Data Structures and Algorithms (cs.DS) ,deterministic substring hashing ,Theory of computation → Pattern matching ,weighted ancestors ,113 Computer and information sciences ,irreducible LCP - Abstract
The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require $\Omega(\log\log n)$ time for queries provided $O(n\mathop{\mathrm{polylog}} n)$ space is available and weights are from $[0..n]$, where $n$ is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an $O(n)$-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string $s$, we need a data structure that can locate in the tree any substring $s[p..q]$ of $s$ in $O(1)$ time (as if one descended from the root reading $s[p..q]$ along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.
- Published
- 2021
30. Front Matter, Table of Contents, Preface, Conference Organization
- Author
-
Bonchi, Filippo and Puglisi, Simon J.
- Subjects
Conference Organization ,Table of Contents ,Preface ,Front Matter ,Theory of computation - Abstract
Front Matter, Table of Contents, Preface, Conference Organization, LIPIcs, Vol. 202, 46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021), pages 0:i-0:xvi
- Published
- 2021
- Full Text
- View/download PDF
31. LIPIcs, Volume 202, MFCS 2021, Complete Volume
- Author
-
Bonchi, Filippo and Puglisi, Simon J.
- Subjects
LIPIcs, Volume 202, MFCS 2021, Complete Volume ,Data processing Computer science ,ddc:004 ,Theory of computation - Abstract
LIPIcs, Volume 202, MFCS 2021, Complete Volume, LIPIcs, Vol. 202, 46th International Symposium on Mathematical Foundations of Computer Science (MFCS 2021), pages 1-1560
- Published
- 2021
- Full Text
- View/download PDF
32. Lempel–Ziv Factorization Using Less Time & Space
- Author
-
Chen, Gang, Puglisi, Simon J., and Smyth, W. F.
- Published
- 2008
- Full Text
- View/download PDF
33. Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data.
- Author
-
Walve, Riku, Puglisi, Simon J., and Salmela, Leena
- Abstract
A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic data sets. We present Selkie, an Rmap overlap aligner based on a spaced (ℓ,k)-mer index which was pioneered in the Rmap error correction tool Elmeri. Here we present a space-efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human data set. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas Elmeri used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), Selkie maintains or increases the accuracy of finding overlapping Rmaps on a bacterial dataset while being at least four times faster. Furthermore, for finding overlaps in a human dataset, Selkie is up to two orders of magnitude faster than previous methods.
- Published
- 2022
- Full Text
- View/download PDF
34. A fast hybrid short read fragment assembly algorithm
- Author
-
Schmidt, Bertil, Sinha, Ranjan, Beresford-Smith, Bryan, and Puglisi, Simon J.
- Published
- 2009
35. SHREC: a short-read error correction method
- Author
-
Schröder, Jan, Schröder, Heiko, Puglisi, Simon J., Sinha, Ranjan, and Schmidt, Bertil
- Published
- 2009
36. Fast and Simple Compact Hashing via Bucketing
- Author
-
Köppl, Dominik, Puglisi, Simon J., Raman, Rajeev, University of Helsinki, Department of Computer Science, Helsinki Institute for Information Technology, Algorithmic Bioinformatics, and Bioinformatics
- Subjects
Dynamic dictionary ,General Computer Science ,separate chaining ,Applied Mathematics ,BONSAI ,Closed addressing ,Word-packing ,113 Computer and information sciences ,compact hashing ,Computer Science Applications ,DYNAMIC DICTIONARIES ,TABLES ,hash table - Abstract
Compact hash tables store a set S of n key-value pairs, where the keys are from the universe U = {0,…,u-1}, and the values are v-bit integers, in close to B(u, n) + nv bits of space, where B(u, n) = log₂ binom(u, n) is the information-theoretic lower bound for representing the set of keys in S, and support operations insert, delete and lookup on S. Compact hash tables have received significant attention in recent years, and approaches dating back to Cleary [IEEE T. Comput., 1984], as well as more recent ones, have been implemented and used in a number of applications. However, the wins on space usage of these approaches are outweighed by their slowness relative to conventional hash tables. In this paper, we demonstrate that compact hash tables based upon a simple idea of bucketing practically outperform existing compact hash table implementations in terms of memory usage and construction time, and existing fast hash table implementations in terms of memory usage (and sometimes also in terms of construction time). A related notion is that of a compact Hash ID map, which stores a set Ŝ of n keys from U, and implicitly associates each key in Ŝ with a unique value (its ID), chosen by the data structure itself, which is an integer of magnitude O(n), and supports inserts and lookups on Ŝ, while using close to B(u, n) bits., LIPIcs, Vol. 160, 18th International Symposium on Experimental Algorithms (SEA 2020), pages 7:1-7:14. One of our approaches is suitable for use as a compact Hash ID map.
- Published
- 2020
- Full Text
- View/download PDF
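The space saving behind compact hashing via bucketing can be illustrated with a toy quotienting scheme: the bucket index encodes a key's high bits, so only the low-order remainder needs to be stored per entry. This is a simplified sketch of the general idea (separate chaining inside buckets, Python lists instead of packed words), not the paper's data structure or its space bound.

```python
class CompactBucketMap:
    """Toy compact hash map: a key from U = {0..u-1} is split into a
    bucket index (high bits) and a remainder (low bits). Only the
    remainder is stored, since the bucket implicitly identifies the high
    bits. Collisions inside a bucket are resolved by chaining."""

    def __init__(self, universe_bits, bucket_bits):
        self.r_bits = universe_bits - bucket_bits
        self.buckets = [[] for _ in range(1 << bucket_bits)]

    def _split(self, key):
        return key >> self.r_bits, key & ((1 << self.r_bits) - 1)

    def insert(self, key, value):
        b, r = self._split(key)
        for idx, (rem, _) in enumerate(self.buckets[b]):
            if rem == r:                     # key present: overwrite
                self.buckets[b][idx] = (r, value)
                return
        self.buckets[b].append((r, value))

    def lookup(self, key):
        b, r = self._split(key)
        for rem, value in self.buckets[b]:
            if rem == r:
                return value
        return None
```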
37. Algorithms for anti-powers in strings
- Author
-
Badkobeh, Golnaz, Fici, Gabriele, and Puglisi, Simon J.
- Published
- 2018
- Full Text
- View/download PDF
38. Succinct dynamic de Bruijn graphs.
- Author
-
Alipanahi, Bahar, Kuhnle, Alan, Puglisi, Simon J., Salmela, Leena, and Boucher, Christina
- Subjects
DE Bruijn graph ,NUCLEOTIDE sequencing ,COMPLETE graphs ,DATA structures - Abstract
Motivation The de Bruijn graph is one of the fundamental data structures for analysis of high-throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space-efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. Results In this article, we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods either cannot be constructed on large-scale datasets (e.g. FDBG) or cannot support both addition and deletion (e.g. BiFrost). Availability and implementation DynamicBOSS is publicly available at https://github.com/baharpan/dynboss. Supplementary information Supplementary data are available at Bioinformatics online.
- Published
- 2021
- Full Text
- View/download PDF
39. Pattern Discovery in Colored Strings.
- Author
-
Lipták, Zsuzsanna, Puglisi, Simon J., and Rossi, Massimiliano
- Subjects
PATTERNS (Mathematics) ,TRACE analysis ,RUNNING speed - Abstract
In this article, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by O(n²), where n is the length of the string. We introduce a baseline algorithm, running in O(n²) time, which identifies all patterns of interest satisfying certain minimality conditions for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm that runs in O(n² log n) time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.
- Published
- 2020
- Full Text
- View/download PDF
40. LIPIcs, Volume 75, SEA'17, Complete Volume
- Author
-
Iliopoulos, Costas S., Pissis, Solon P., Puglisi, Simon J., and Raman, Rajeev
- Subjects
Data processing Computer science ,ddc:004 ,Analysis of Algorithms and Problem Complexity, Algorithms - Abstract
LIPIcs, Volume 75, SEA'17, Complete Volume, LIPIcs, Vol. 75, 16th International Symposium on Experimental Algorithms (SEA 2017), pages 0-0
- Published
- 2017
- Full Text
- View/download PDF
41. Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers
- Author
-
Iliopoulos, Costas S., Pissis, Solon P., Puglisi, Simon J., and Raman, Rajeev
- Subjects
000 Computer science, knowledge, general works ,Computer Science - Abstract
Front Matter, Table of Contents, Preface, Conference Organization, External Reviewers
- Published
- 2017
- Full Text
- View/download PDF
42. A taxonomy of suffix array construction algorithms
- Author
-
Puglisi, Simon J., Smyth, W.F., and Turpin, Andrew H.
- Subjects
Algorithm ,Algorithms -- Analysis - Abstract
In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms' worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
- Published
- 2007
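As a reference point for the constructions the survey above classifies, the conceptually simplest algorithm just sorts the suffix start positions by the suffixes they index. Its worst case is superquadratic because of the suffix comparisons, whereas several surveyed algorithms run in linear time; this sketch only fixes what a suffix array is.

```python
def suffix_array(s):
    """Naive suffix array construction: sort all suffix start positions
    lexicographically by the suffixes they index. Simple but slow
    (each comparison may inspect O(n) characters)."""
    return sorted(range(len(s)), key=lambda i: s[i:])
```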
43. Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
- Author
-
Kuruppu, Shanika, Puglisi, Simon J., and Zobel, Justin
- Subjects
Base Sequence ,Related Genome ,Individual Genome ,Compression Performance ,Article ,Query Length - Abstract
Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just $nH_k(T) + s\log n + s\log \frac{N}{s} + O(s)$ bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires $O(\ell + \log n)$ time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
- Published
- 2010
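The relative parsing at the heart of RLZ can be sketched as a greedy factorization. The version below uses naive O(nm) matching purely for illustration – the paper's implementation finds longest matches with a self-index over the base sequence – and the function names are mine, not the paper's.

```python
def rlz_parse(text, ref):
    # Greedy relative-LZ sketch: factor `text` into (pos, len) phrases,
    # each the longest prefix of the remainder that occurs in `ref`.
    phrases, i = [], 0
    while i < len(text):
        best_pos, best_len = -1, 0
        for j in range(len(ref)):
            l = 0
            while i + l < len(text) and j + l < len(ref) and text[i + l] == ref[j + l]:
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len == 0:          # symbol absent from reference: literal phrase
            phrases.append((text[i], 0))
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases

def rlz_decode(phrases, ref):
    # Recover the text by copying phrase substrings out of the reference.
    return "".join(ref[p:p + l] if l else p for p, l in phrases)
```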
44. Deterministic sub-linear space LCE data structures with efficient construction
- Author
-
Tanimura, Yuka, I, Tomohiro, Bannai, Hideo, Inenaga, Shunsuke, Puglisi, Simon J., and Takeda, Masayuki
- Subjects
FOS: Computer and information sciences ,000 Computer science, knowledge, general works ,010201 computation theory & mathematics ,Computer Science ,Computer Science - Data Structures and Algorithms ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Data Structures and Algorithms (cs.DS) ,0102 computer and information sciences ,02 engineering and technology ,01 natural sciences - Abstract
Given a string $S$ of $n$ symbols, a longest common extension query $\mathsf{LCE}(i,j)$ asks for the length of the longest common prefix of the $i$th and $j$th suffixes of $S$. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offer a space-time trade-off between data structure size and query time. In particular, for a parameter $1 \leq \tau \leq n$, their best deterministic solution is a data structure of size $O(n/\tau)$ which allows LCE queries to be answered in $O(\tau)$ time. However, the construction time for all deterministic versions of their data structure is quadratic in $n$. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of $O(\tau\min\{\log\tau,\log\frac{n}{\tau}\})$ query time using $O(n/\tau)$ space, but significantly improves the construction time to $O(n\tau)$.
- Published
- 2016
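The query being supported here takes only a few lines to state. The sketch below answers an LCE query by direct character comparison, which is O(n) worst-case time per query; the point of the data structures in the paper is to answer the same query much faster from only O(n/τ) words of space.

```python
def lce(s, i, j):
    # Longest common extension: length of the longest common prefix of
    # the suffixes s[i:] and s[j:], by direct comparison.
    n, l = len(s), 0
    while i + l < n and j + l < n and s[i + l] == s[j + l]:
        l += 1
    return l

print(lce("abracadabra", 0, 7))  # 4 ("abra")
```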
45. Fast and accurate correction of optical mapping data via spaced seeds.
- Author
-
Salmela, Leena, Mukherjee, Kingshuk, Puglisi, Simon J., Muggli, Martin D., and Boucher, Christina
- Subjects
DATA mapping ,ERROR correction (Information theory) ,HUMAN genome ,SEEDS ,GENOMES - Abstract
Motivation Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes assembly and alignment of them challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. Results We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaption and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. Availability and implementation Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. Supplementary information Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
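Elmeri operates on Rmaps (sequences of fragment lengths), but the spaced-seed idea it adapts is easiest to see on plain strings: a seed keeps only the characters at the "care" positions of a mask, so windows that disagree only at don't-care positions still produce identical seeds. The sketch below assumes the usual convention that '1' marks a care position and '0' a don't-care; it is an illustration of spaced seeds in general, not the paper's Rmap-specific scheme.

```python
def spaced_seeds(s, mask):
    # For each window of len(mask) characters, keep only the characters
    # at the '1' positions of the mask. Two windows with the same seed
    # may still differ at the '0' (don't-care) positions.
    m = len(mask)
    care = [i for i, b in enumerate(mask) if b == '1']
    return ["".join(s[i + c] for c in care) for i in range(len(s) - m + 1)]

# Mask "101": keep positions 0 and 2 of each length-3 window.
print(spaced_seeds("ACGT", "101"))  # ['AG', 'CT']
```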
46. AliBI: An Alignment-Based Index for Genomic Datasets
- Author
-
Ferrada, Hector, Gagie, Travis, Hirvola, Tommi, and Puglisi, Simon J.
- Subjects
Computational Engineering, Finance, and Science (cs.CE) ,FOS: Computer and information sciences ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) ,Computer Science - Computational Engineering, Finance, and Science - Abstract
With current hardware and software, a standard computer can now hold in RAM an index for approximate pattern matching on about half a dozen human genomes. Sequencing technologies have improved so quickly, however, that scientists will soon demand indexes for thousands of genomes. Whereas most researchers who have addressed this problem have proposed completely new kinds of indexes, we recently described a simple technique that scales standard indexes to work on more genomes. Our main idea was to filter the dataset with LZ77, build a standard index for the filtered file, and then create a hybrid of that standard index and an LZ77-based index. In this paper we describe how to adapt our technique to use alignments instead of LZ77, in order to simplify and speed up both preprocessing and random access.
- Published
- 2013
47. m-Bonsai: A Practical Compact Dynamic Trie.
- Author
-
Poyias, Andreas, Puglisi, Simon J., and Raman, Rajeev
- Subjects
- *
DATA structures , *MATHEMATICAL bounds , *INFORMATION theory , *DIGITAL technology , *PERFORMANCE evaluation - Abstract
We consider the problem of implementing a space-efficient dynamic trie, with an emphasis on good practical performance. For a trie with n nodes over an alphabet of size σ, the information-theoretic space lower bound is n log σ + O(n) bits. The Bonsai data structure is a compact trie proposed by Darragh et al. (Softw. Pract. Exper. 23(3) (1993) 277–291). Its disadvantages include the user having to specify an upper bound M on the trie size in advance (which cannot be changed easily after initialization), a space usage of M log σ + O(M log log M) bits (which is asymptotically non-optimal for smaller σ or if n ≪ M), and a lack of support for deletions. It supports traversal and update operations in O(1/𝜖) expected time (based on assumptions about the behaviour of hash functions), where 𝜖 = (M − n)/M, and has excellent speed performance in practice. We propose an alternative, m-Bonsai, that addresses the above problems, obtaining a trie that uses (1 + β)n(log σ + O(1)) bits in expectation, and supports traversal and update operations in O(1/β) expected time and O(1/β²) amortized expected time, for any user-specified parameter β > 0 (again based on assumptions about the behaviour of hash functions). We give an implementation of m-Bonsai which uses considerably less memory and is slightly faster than the original Bonsai. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
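The Bonsai representation that m-Bonsai builds on admits a toy sketch: a trie node is identified by where the pair (parent id, symbol) lands in a hash table, so no explicit pointers are stored. In the sketch below a Python dict stands in for the open-addressed compact hash table, so none of the space bounds above apply here; the class and method names are mine.

```python
class BonsaiSketch:
    # Toy version of the Bonsai idea: the trie is a hash table keyed by
    # (parent_id, symbol). The compact original stores only a small
    # quotient per slot; a dict is used here purely for clarity.
    def __init__(self):
        self.table = {}          # (parent_id, symbol) -> node_id
        self.next_id = 1         # node 0 is the root

    def insert(self, word):
        node = 0
        for ch in word:
            key = (node, ch)
            if key not in self.table:
                self.table[key] = self.next_id
                self.next_id += 1
            node = self.table[key]
        return node

    def contains_prefix(self, word):
        node = 0
        for ch in word:
            if (node, ch) not in self.table:
                return False
            node = self.table[(node, ch)]
        return True

trie = BonsaiSketch()
trie.insert("cat")
trie.insert("car")
```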
48. Bidirectional Variable-Order de Bruijn Graphs.
- Author
-
Belazzougui, Djamal, Gagie, Travis, Mäkinen, Veli, Previtali, Marco, and Puglisi, Simon J.
- Subjects
QUERYING (Computer science) ,GRAPH theory ,DATA compression ,REPRESENTATION theory ,MATHEMATICAL variables - Abstract
Compressed suffix trees and bidirectional FM-indexes can store a set of strings and support queries that let us explore the set of substrings they contain, adding and deleting characters on both the left and right, but they can use much more space than a de Bruijn graph for the strings. Bowe et al.'s BWT-based de Bruijn graph representation (Proc. Workshop on Algorithms for Bioinformatics, pp. 225–235, 2012) can be made bidirectional as well, at the cost of increasing its space usage by a small constant, but it fixes the length of the substrings. Boucher et al. (Proc. Data Compression Conference, pp. 383–392, 2015) generalized Bowe et al.'s representation to support queries about variable-length substrings, but at the cost of bidirectionality. In this paper we show how to make Boucher et al.'s variable-order implementation of de Bruijn graphs bidirectional. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
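For reference, the object these succinct representations simulate is simple to state: a node-centric de Bruijn graph has the (k−1)-mers of the input as nodes and an edge for each k-mer. The dict-based sketch below fixes k (so it has neither the variable order nor the bidirectionality discussed above) and only illustrates the graph the BWT-based indexes navigate in compressed form.

```python
def de_bruijn(strings, k):
    # Build a node-centric de Bruijn graph: nodes are (k-1)-mers,
    # and each k-mer s[i:i+k] contributes the edge prefix -> suffix.
    kmers = set()
    for s in strings:
        for i in range(len(s) - k + 1):
            kmers.add(s[i:i + k])
    graph = {}
    for kmer in kmers:
        graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph

print(de_bruijn(["ACGT"], 3))  # {'AC': {'CG'}, 'CG': {'GT'}}
```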
49. Relative Suffix Trees.
- Author
-
Farruggia, Andrea, Gagie, Travis, Navarro, Gonzalo, Puglisi, Simon J., and Sirén, Jouni
- Subjects
BIOINFORMATICS ,NUMERICAL analysis ,MATHEMATICAL analysis ,GENOMES ,COMPUTATIONAL biology - Abstract
Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections and competitive in time with the largest and fastest compressed suffix trees. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
50. Lightweight Lempel-Ziv Parsing
- Author
-
Kärkkäinen, Juha, Kempa, Dominik, and Puglisi, Simon J.
- Subjects
FOS: Computer and information sciences ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) - Abstract
We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.
- Published
- 2013
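For concreteness, here is what an LZ77 factorization is, computed the naive way in O(n²) time with the whole string in memory; the paper's contribution is computing the same factorization in O(dn) time using only O(n/d) words of working space.

```python
def lz77_factorize(s):
    # Greedy LZ77: each phrase is the longest prefix of the remainder
    # that also occurs starting at an earlier position (sources may
    # overlap the current position), or a single fresh symbol.
    phrases, i, n = [], 0, len(s)
    while i < n:
        best_pos, best_len = -1, 0
        for j in range(i):
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len == 0:
            phrases.append((s[i], 0))   # literal phrase: new symbol
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases

print(lz77_factorize("aabab"))  # [('a', 0), (0, 1), ('b', 0), (1, 2)]
```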