119 results on '"approximate pattern matching"'
Search Results
2. Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching
- Author
-
Renders, Luca, Depuydt, Lore, Rahmann, Sven, Fostier, Jan, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, van Leeuwen, Jan, Series Editor, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Kobsa, Alfred, Series Editor, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Nierstrasz, Oscar, Series Editor, Pandu Rangan, C., Editorial Board Member, Sudan, Madhu, Series Editor, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Weikum, Gerhard, Series Editor, Vardi, Moshe Y, Series Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, and Ma, Jian, editor
- Published
- 2024
- Full Text
- View/download PDF
3. Pan-genome de Bruijn graph using the bidirectional FM-index
- Author
-
Lore Depuydt, Luca Renders, Thomas Abeel, and Jan Fostier
- Subjects
Approximate pattern matching ,Sequence-to-graph alignment ,Search schemes ,Lossless alignment ,Pan-genome visualization ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index’ backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. Results We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph’s topology through visualization and sequence alignment. Conclusions We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
- Published
- 2023
- Full Text
- View/download PDF
4. Approximate Cartesian Tree Matching: An Approach Using Swaps
- Author
-
Auvray, Bastien, David, Julien, Groult, Richard, Lecroq, Thierry, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Nardini, Franco Maria, editor, Pisanti, Nadia, editor, and Venturini, Rossano, editor
- Published
- 2023
- Full Text
- View/download PDF
5. Multiallelic Maximal Perfect Haplotype Blocks with Wildcards via PBWT
- Author
-
Bonizzoni, Paola, Della Vedova, Gianluca, Pirola, Yuri, Rizzi, Raffaella, Sgrò, Mattia, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Rojas, Ignacio, editor, Valenzuela, Olga, editor, Rojas Ruiz, Fernando, editor, Herrera, Luis Javier, editor, and Ortuño, Francisco, editor
- Published
- 2023
- Full Text
- View/download PDF
6. Pan-genome de Bruijn graph using the bidirectional FM-index.
- Author
-
Depuydt, Lore, Renders, Luca, Abeel, Thomas, and Fostier, Jan
- Subjects
DE Bruijn graph ,PATTERN matching ,PAN-genome ,DATA structures ,REPRESENTATIONS of graphs ,SEQUENCE alignment - Abstract
Background: Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. Results: We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. Conclusions: We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
7. The Complexity of Approximate Pattern Matching on de Bruijn Graphs
- Author
-
Gibney, Daniel, Thankachan, Sharma V., Aluru, Srinivas, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, and Pe'er, Itsik, editor
- Published
- 2022
- Full Text
- View/download PDF
8. On the Hardness of Sequence Alignment on De Bruijn Graphs.
- Author
-
Gibney, Daniel, Thankachan, Sharma V., and Aluru, Srinivas
- Subjects
- *
DE Bruijn graph , *SEQUENCE alignment , *NP-complete problems , *MATCHING theory , *BIPARTITE graphs , *COMPUTATIONAL biology , *HARDNESS - Abstract
The problem of aligning a sequence to a walk in a labeled graph is of fundamental importance to Computational Biology. For an arbitrary graph G = (V , E) and a pattern P of length m, a lower bound based on the Strong Exponential Time Hypothesis implies that an algorithm for finding a walk in G exactly matching P significantly faster than O (| E | m) time is unlikely. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is solvable in O (| E | m) time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets. Moreover, NP-completeness continues to hold even when edits are restricted to only substitutions. Despite the popularity of the de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on the de Bruijn graphs remained unknown. We investigate this problem and show that the properties that make the de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. We also demonstrate that an algorithm significantly faster than O (| E | m) is unlikely for the de Bruijn graphs in the case where substitutions are only allowed to the pattern. This stands in contrast to pattern-to-text matching where exact matching is solvable in linear time, such as on the de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic Õ (n m ) time, where n is the text's length. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
9. NetNDP: Nonoverlapping (delta, gamma)-approximate pattern matching.
- Author
-
Youxi Wu, Bojing Jian, Yan Li, He Jiang, and Xindong Wu
- Subjects
- *
PATTERN matching , *HAMMING distance , *SEQUENTIAL pattern mining , *PARENT-child relationships , *LARGE deviations (Mathematics) - Abstract
Pattern matching can be used to calculate the support of patterns, and is a key issue in sequential pattern mining (or sequence pattern mining). Nonoverlapping pattern matching means that two occurrences cannot use the same character in the sequence at the same position. Approximate pattern matching allows for some data noise, and is more general than exact pattern matching. At present, nonoverlapping approximate pattern matching is based on Hamming distance, which cannot be used to measure the local approximation between the subsequence and pattern, resulting in large deviations in matching results. To tackle this issue, we present a Nonoverlapping Delta and gamma approximate Pattern matching (NDP) scheme that employs the (δ, ϒ)-distance to give an approximate pattern matching, where the local and the global distances do not exceed ϒ and, respectively. We first transform the NDP problem into a local approximate Nettree and then construct an efficient algorithm, called the local approximate Nettree for NDP (NetNDP). We propose a new approach called the Minimal Root Distance which allows us to determine whether or not a node has root paths that satisfy the global constraint and to prune invalid nodes and parent-child relationships. NetNDP finds the rightmost absolute leaf of the max root, searches for the rightmost occurrence from the rightmost absolute leaf, and deletes this occurrence. We iterate the above steps until there are no new occurrences. Numerous experiments are used to verify the performance of the proposed algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
10. Programmable Multiplexed Nucleic Acid Detection by Harnessing Specificity Defect of CRISPR-Cas12a.
- Author
-
Guan X, Yang R, Zhang J, Moon J, Hou C, Guo C, Avery L, Scarola D, Roberts DS, LaSala R, and Liu C
- Abstract
CRISPR-Cas12a works like a sophisticated algorithm in nucleic acid detection, yet its challenge lies in sometimes failing to distinguish targets with mismatches due to its specificity limitations. Here, the mismatch profiles, including the quantity, location, and type of mismatches in the CRISPR-Cas12a reaction, are investigated and its various tolerances to mismatches are discovered. By harnessing the specificity defect of the CRISPR-Cas12a enzyme, a dual-mode detection strategy is designed, which includes approximate matching and precise querying of target sequences and develop a programmable multiplexed nucleic acid assay. With the assay, 14 high-risk human papillomavirus (HPV) subtypes are simultaneously detected, collectively responsible for 99% of cervical cancer cases, with attomolar sensitivity. Specifically, the assay not only distinguishes HPV16 and HPV18, the two most common subtypes but also detects 12 other high-risk pooled HPV subtypes. To enable low-cost point-of-care testing, the assay is incorporated into a paper-based microfluidic chip. Furthermore, the clinical performance of the paper-based microfluidic chip is validated by testing 75 clinical swab samples, achieving performance comparable to that of PCR. This programmable multiplexed nucleic acid assay has the potential to be widely applied for sensitive, specific, and simultaneous detection of different pathogens., (© 2024 The Author(s). Advanced Science published by Wiley‐VCH GmbH.)
- Published
- 2024
- Full Text
- View/download PDF
11. b-move: Faster Lossless Approximate Pattern Matching in a Run-Length Compressed Index.
- Author
-
Depuydt L, Renders L, Van de Vyver S, Veys L, Gagie T, and Fostier J
- Abstract
Background: Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns., Results: We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs ϕ and ϕ - 1 operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index, for example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop., Conclusions: b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license., Competing Interests: Competing interests The authors declare that they have no competing interests. Additional Declarations: No competing interests reported.
- Published
- 2024
- Full Text
- View/download PDF
12. NetDPO: (delta, gamma)-approximate pattern matching with gap constraints under one-off condition.
- Author
-
Li, Yan, Yu, Lei, Liu, Jing, Guo, Lei, Wu, Youxi, and Wu, Xindong
- Subjects
PATTERN matching ,DATA structures ,HAMMING distance ,NP-hard problems ,HEURISTIC algorithms - Abstract
Approximate pattern matching not only is more general than exact pattern matching, but also allows some data noise. Most of them adopt the Hamming distance to measure similarity, which indicates the number of different characters in two sequences, but it cannot reflect the approximation between two characters. This paper addresses the approximate pattern matching with a local distance no larger than δ and a global distance no larger than γ, which is named Delta and gamma Pattern matching with gap constraints under One-off condition (DPO). First, we show that the problem is an NP-Hard problem. Therefore, we construct a heuristic algorithm named approximate Nettree for DPO (NetDPO), which transforms the problem into an approximate Nettree based on δ distance which is a specially designed data structure. Then, NetDPO calculates the number of paths that reach the roots within γ distance. To find the maximal occurrences, we employ the rightmost parent strategy and the optimal parent strategy to select the better occurrence which can minimize the influence after removing the occurrence. Iterate this process until there are no occurrences. Finally, we analyze the time and space complexities of NetDPO. Extensive experimental results verify the superiority of the proposed algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
13. Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes.
- Author
-
Renders L, Depuydt L, Rahmann S, and Fostier J
- Subjects
- Humans, Programming, Linear, Pattern Recognition, Automated methods, Algorithms, Software, Computational Biology methods
- Abstract
This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of k errors. While existing literature offers designed schemes for up to k = 4 errors, designing search schemes for larger k values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to k = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher k values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset.
- Published
- 2024
- Full Text
- View/download PDF
14. Faster Recovery of Approximate Periods over Edit Distance
- Author
-
Kociumaka, Tomasz, Radoszewski, Jakub, Rytter, Wojciech, Straszyński, Juliusz, Waleń, Tomasz, Zuba, Wiktor, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Weikum, Gerhard, Series Editor, Gagie, Travis, editor, Moffat, Alistair, editor, Navarro, Gonzalo, editor, and Cuadros-Vargas, Ernesto, editor
- Published
- 2018
- Full Text
- View/download PDF
15. Circular pattern matching with k mismatches.
- Author
-
Charalampopoulos, Panagiotis, Kociumaka, Tomasz, Pissis, Solon P., Radoszewski, Jakub, Rytter, Wojciech, Straszyński, Juliusz, Waleń, Tomasz, and Zuba, Wiktor
- Subjects
- *
PATTERN matching , *ALGORITHMS , *HAMMING distance - Abstract
We consider the circular pattern matching with k mismatches (k -CPM) problem in which one is to compute the minimal Hamming distance of every length- m substring of T and any cyclic rotation of P , if this distance is no more than k. It is a variation of the well-studied k -mismatch problem. A multitude of papers has been devoted to solving the k -CPM problem, but only average-case upper bounds are known. In this paper, we present the first non-trivial worst-case upper bounds for this problem. Specifically, we show an O (n k) -time algorithm and an O (n + n m k 4) -time algorithm. The latter algorithm applies in an extended way a technique that was very recently developed for the k -mismatch problem Bringmann et al. (2019) [10]. A preliminary version of this work appeared at FCT 2019 [35]. In this version we improve the time complexity of the second algorithm from O (n + n m k 5) to O (n + n m k 4). [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
16. b-move: faster bidirectional character extensions in a run-length compressed index.
- Author
-
Depuydt L, Renders L, de Vyver SV, Veys L, Gagie T, and Fostier J
- Abstract
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
- Published
- 2024
- Full Text
- View/download PDF
17. A New Automata Based Approximate String Matching Approach and Web Interface for Bioinformatics Algorithms
- Author
-
Gıyasettin Özcan and Burak Koca
- Subjects
a-bom ,biyoinformatik ,ara yüz ,yaklaşık desen eşleştirme ,bioinformatics ,interface ,approximate pattern matching ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 - Abstract
In this study, we present a new web interface for major bioinformatics algorithms and introduce a novel approximate string matching algorithm. Our web interface executes major algorithms on the field for the use of computational biologists, students or any other interested researchers. In the web interface, algorithms come under three sections: Sequence alignment, pattern matching and motif finding. In each section, we introduce algorithms in order to find best fitting one for specific dataset and problem. The interface introduces execution time, memory usage and context specific results of algorithms such as alignment score. The interface utilizes emerging open source languages and tools. In order to develop light and user-friendly interface, all parts of the interface coded with Python language. On the other hand, Django is used for web interface. Second contribution of the study is novel A-BOM algorithm, which is designed for approximate pattern matching problem. The algorithm is approximate matching variation of Backward Oracle Matching. We compare our algorithm with popular approximate string matching algorithms. Results denote that A-BOM introduces %30 to %80 short runtime improvement when compared to current approximate pattern matching algorithms on long patterns.
- Published
- 2018
- Full Text
- View/download PDF
18. Memristive-Based Associative Memory for Approximate Computational Reuse
- Author
-
Rahimi, Abbas, Benini, Luca, Gupta, Rajesh K., Rahimi, Abbas, Benini, Luca, and Gupta, Rajesh K.
- Published
- 2017
- Full Text
- View/download PDF
19. NetDAP: (δ, γ) −approximate pattern matching with length constraints.
- Author
-
Wu, Youxi, Fan, Jinquan, Li, Yan, Guo, Lei, and Wu, Xindong
- Subjects
PRUNING ,PATTERN matching ,SEQUENTIAL pattern mining ,PARENT-child relationships ,ONLINE algorithms ,HAMMING distance ,ALGORITHMS - Abstract
Pattern matching(PM) with gap constraints has been applied to compute the support of a pattern in a sequence, which is an essential task of the repetitive sequential pattern mining (or sequence pattern mining). Compared with exact PM, approximate PM allows data noise (differences) between the pattern and the matched subsequence. Therefore, more valuable patterns can be found. Approximate PM with gap constraints mainly adopts the Hamming distance to measure the approximation degree which only reflects the number of different characters between two sequences, but ignores the distance between different characters. Hence, this paper addresses (δ, γ) approximate PM with length constraints which employs local-global constraints to improve the accuracy of the PM, namely, the maximal distance between two corresponding characters is less or equal to the local threshold δ, and the sum of all the δ distances is also less or equal to the global threshold γ. To tackle the problem effectively, this paper proposes an effective online algorithm, named NetDAP, which employs a special designed data structure named approximate single-leaf Nettree. An approximate single-leaf Nettree can be created by adopting dynamic programming to determine the range of rootleaf, the minimal root, the maximal root, the range of nodes for each level, and the range of parents for each node. To improve the performance, two pruning strategies are proposed to prune the nodes and the parent-child relationships which do not satisfy the δ and γ distance constraints respectively. Finally, extensive experimental results on real protein data sets and time series verify the performance of the proposed algorithm. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
20. On the Complexity of Sequence-to-Graph Alignment.
- Author
-
Jain, Chirag, Zhang, Haowen, Gao, Yu, and Aluru, Srinivas
- Subjects
- *
HAMMING distance , *PATTERN matching , *HYPERTEXT systems , *REPRESENTATIONS of graphs - Abstract
Availability of extensive genetic data across multiple individuals and populations is driving the growing importance of graph-based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Although research on sequence-to-graph alignments is nascent, it can draw from related work on pattern matching in hypertext. In this article, we study sequence-to-graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in query alone, graph alone, or in both. We prove that when changes are permitted in graphs either standalone or in conjunction with changes in the query, the sequence-to-graph alignment problem is N P -complete under both Hamming and edit distance models for alphabets of size ≥2. For the case where only changes to the sequence are permitted, we present an O (| V | + m | E |) time algorithm, where m denotes the query size, and V and E denote the vertex and edge sets of the graph, respectively. Our result is generalizable to both linear and affine gap penalty functions, and improves upon the runtime complexity of existing algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
21. NETASPNO: Approximate Strict Pattern Matching Under Nonoverlapping Condition
- Author
-
Youxi Wu, Shasha Li, Jingyu Liu, Lei Guo, and Xindong Wu
- Subjects
Approximate pattern matching ,wildcard ,gap constraint ,sequence ,occurrence ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
In pattern matching, a gap constraint is a more flexible wildcard than traditional wildcards “?”and “*”. Pattern matching with gap constraints is more difficult to handle and fulfills user's enquiries more easily. Pattern matching with gap constraints has therefore been carried out in numerous research works, such as music information retrieval, searching protein sites, and sequence pattern mining. Strict pattern matching under a nonoverlapping condition, as a type of pattern matching with gap constraints, is a key issue of sequence pattern mining with gap constraints since it can be used to compute the frequency of a pattern. Exact matching limits the flexibility of the match to some extent since it requires each character to be matched exactly. We therefore address approximate strict pattern matching under the nonoverlapping constraints (ASPNO) and propose an effective algorithm, named NETtree for ASPNO (NETASPNO), which first transforms the problem into a Nettree data structure, an extensive tree structure. To find the nonoverlapping occurrences effectively, we propose the concept of number of roots paths with distance constraints (NRPDC) which indicates the number of path from a node to the roots with distanced and can be used to delete useless parent-child relationships and useless nodes. We iteratively recalculate the NRPDCs of each node on the subnettree with the rightmost root. Then we can get a path from the rightmost leaf to its rightmost root without using the backtracking strategy. NETASPNO therefore iteratively gets the rightmost root-leaf-path and prunes the path on the Nettree. Extensive experimental results demonstrate that NETASPNO has better performance than the other competitive algorithms.
- Published
- 2018
- Full Text
- View/download PDF
22. Multiallelic Maximal Perfect Haplotype Blocks with Wildcards via PBWT
- Author
-
Rojas, I, Valenzuela, O, Rojas Ruiz, F, Herrera, LJ, Ortuño, F, Bonizzoni, P, Della Vedova, G, Pirola, Y, Rizzi, R, Sgrò, M, Rojas, I, Valenzuela, O, Rojas Ruiz, F, Herrera, LJ, Ortuño, F, Bonizzoni, P, Della Vedova, G, Pirola, Y, Rizzi, R, and Sgrò, M
- Abstract
Computing maximal perfect blocks of a given panel of haplotypes is a crucial task for efficiently solving problems such as polyploid haplotype reconstruction and finding identical-by-descent segments shared among individuals of a population. Unfortunately, the presence of missing data in the haplotype panel limits the usefulness of the notion of perfect blocks. We propose a novel algorithm for computing maximal blocks in a panel with missing data (represented as wildcards). The algorithm is based on the Positional Burrows-Wheeler Transform (PBWT) and has been implemented in the tool Wild-pBWT, available at https://github.com/AlgoLab/Wild-pBWT/. Experimental comparison showed that Wild-pBWT is 10–15 times faster than another state-of-the-art approach, while using a negligible amount of memory.
- Published
- 2023
23. Pan-genome de Bruijn graph using the bidirectional FM-index
- Author
-
Depuydt, Lore (author), Renders, Luca (author), Abeel, T.E.P.M.F. (author), Fostier, Jan (author), Depuydt, Lore (author), Renders, Luca (author), Abeel, T.E.P.M.F. (author), and Fostier, Jan (author)
- Abstract
Background: Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index’ backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. Results: We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph’s topology through visualization and sequence alignment. Conclusions: We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license., Pattern Recognition and Bioinformatics
- Published
- 2023
- Full Text
- View/download PDF
24. Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares
- Author
-
Gabriel Bathie and Tomasz Kociumaka and Tatiana Starikovskaya, Bathie, Gabriel, Kociumaka, Tomasz, Starikovskaya, Tatiana, Gabriel Bathie and Tomasz Kociumaka and Tatiana Starikovskaya, Bathie, Gabriel, Kociumaka, Tomasz, and Starikovskaya, Tatiana
- Abstract
We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language L, we are given a string T of length n, and the task is to compute the minimal distance to L from every prefix of T. We focus on the low-distance regime, where one must compute only the distances smaller than a given threshold k. In this work, our contribution is twofold: 1) First, we show streaming algorithms, which access the input string T only through a single left-to-right scan. Both for palindromes and squares, our algorithms use O(k polylog n) space and time per character in the Hamming-distance case and O(k² polylog n) space and time per character in the edit-distance case. These algorithms are randomised by necessity, and they err with probability inverse-polynomial in n. 2) Second, we show deterministic read-only online algorithms, which are also provided with read-only random access to the already processed characters of T. Both for palindromes and squares, our algorithms use O(k polylog n) space and time per character in the Hamming-distance case and O(k⁴ polylog n) space and amortised time per character in the edit-distance case.
- Published
- 2023
- Full Text
- View/download PDF
25. Streaming k-Edit Approximate Pattern Matching via String Decomposition
- Author
-
Sudatta Bhattacharya and Michal Koucký, Bhattacharya, Sudatta, Koucký, Michal, Sudatta Bhattacharya and Michal Koucký, Bhattacharya, Sudatta, and Koucký, Michal
- Abstract
In this paper we give an algorithm for streaming k-edit approximate pattern matching which uses space Õ(k²) and time Õ(k²) per arriving symbol. This improves substantially on the recent algorithm of Kociumaka, Porat and Starikovskaya [Kociumaka et al., 2022] which uses space Õ(k⁵) and time Õ(k⁸) per arriving symbol. In the k-edit approximate pattern matching problem we get a pattern P and text T and we want to identify all substrings of the text T that are at edit distance at most k from P. In the streaming version of this problem both the pattern and the text arrive in a streaming fashion symbol by symbol and after each symbol of the text we need to report whether there is a current suffix of the text with edit distance at most k from P. We measure the total space needed by the algorithm and time needed per arriving symbol.
- Published
- 2023
- Full Text
- View/download PDF
26. Streaming $k$-edit approximate pattern matching via string decomposition
- Author
-
Bhattacharya, Sudatta and Koucký, Michal
- Subjects
FOS: Computer and information sciences ,Theory of computation → Sketching and sampling ,streaming algorithms ,Approximate pattern matching ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) ,edit distance ,Theory of computation → Pattern matching - Abstract
In this paper we give an algorithm for streaming k-edit approximate pattern matching which uses space Õ(k²) and time Õ(k²) per arriving symbol. This improves substantially on the recent algorithm of Kociumaka, Porat and Starikovskaya [Kociumaka et al., 2022] which uses space Õ(k⁵) and time Õ(k⁸) per arriving symbol. In the k-edit approximate pattern matching problem we get a pattern P and text T and we want to identify all substrings of the text T that are at edit distance at most k from P. In the streaming version of this problem both the pattern and the text arrive in a streaming fashion symbol by symbol and after each symbol of the text we need to report whether there is a current suffix of the text with edit distance at most k from P. We measure the total space needed by the algorithm and time needed per arriving symbol., LIPIcs, Vol. 261, 50th International Colloquium on Automata, Languages, and Programming (ICALP 2023), pages 22:1-22:14
- Published
- 2023
27. Beating in Approximate LZW-Compressed Pattern Matching
- Author
-
Gawrychowski, Paweł, Straszak, Damian, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Cai, Leizhen, editor, Cheng, Siu-Wing, editor, and Lam, Tak-Wah, editor
- Published
- 2013
- Full Text
- View/download PDF
28. Near Linear Time Construction of an Approximate Index for All Maximum Consecutive Sub-sums of a Sequence
- Author
-
Cicalese, Ferdinando, Laber, Eduardo, Weimann, Oren, Yuster, Raphael, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Kärkkäinen, Juha, editor, and Stoye, Jens, editor
- Published
- 2012
- Full Text
- View/download PDF
29. Finite Automata for Generalized Approach to Backward Pattern Matching
- Author
-
Antoš, Jan, Melichar, Bořivoj, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Domaratzki, Michael, editor, and Salomaa, Kai, editor
- Published
- 2011
- Full Text
- View/download PDF
30. Efficient and secure outsourced approximate pattern matching protocol.
- Author
-
Xiaochao Wei, Minghao Zhao, and Qiuliang Xu
- Subjects
- *
PATTERN matching , *CLOUD computing , *SIMULATION methods & models , *CLIENT/SERVER computing , *APPROXIMATION theory - Abstract
Pattern matching is a basic algorithmic problem that identifies the appearance as well as the location of a pattern in a specific text, and one of the most important variants of that, approximate pattern matching, can be used to discern a substring in the text that is similar to the pattern, as long as their differences stay within a certain threshold. It serves as a basic component in many real-world applications, such as facial recognition, DNA matching and music retrieval. Motivated by the newly emerging secure outsourced computing, in this paper we proposed protocols to realize these functionalities in a privacy-preserving manner. Specifically, we constructed exact and approximate matching protocols, and both of them ensure that the party holds the text (with length of n) learns noting about the pattern (with length of m). We composed a novel idea to combine secret sharing scheme with oblivious transfer (OT), such as to transform the secure pattern matching problem into reconstructing of a shared secret, which means that if a shared secret can be correctly reconstructed, it indicates the pattern indeed exists in the text. Our protocol for approximate pattern matching is generated in the cloud-assisted setting, where the reconstruction phase is outsourced to an honest-but-curious cloud server. Using oblivious transfer extension technique, a powerful method to use few integrated OTs to implement large-scale single OTs, our protocol is efficiently constructed. Both of the protocols are secure in semi-honest model, and we present a detailed secure simulation-based proof in this paper. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
31. An efficient approach for faster matching of approximate patterns in graphs.
- Author
-
Khan, Muhammad Ghufran, Halim, Zahid, and Baig, Abdul Rauf
- Subjects
- *
PATTERN matching , *DATA management , *KNOWLEDGE representation (Information theory) - Abstract
Graphs have proven to be an efficient problem representation scheme in many real-world applications and can serve to address mining of patterns in large volumes of data. This work addresses the data management issue in graph databases for shortest path queries, verification of reachability, and pattern matching queries. The key issue is to match all possible patterns in a large data graph that matches a user-provided pattern, known as the query graph. The large size of the graph makes the problem complex when searching for all patterns that are similar to the user's query. For a given query, a match occurs when a graph with n vertices has the same label as the corresponding vertices in the query graph. This work addresses such pattern matching query problems on large graphs and presents an efficient algorithm to mine all patterns that satisfies the given query. The current work presents a novel Fast Graph Pattern (FGP) mining algorithm that performs a label-specific walk on a graph and stores all vertices that are visited during the walk and hashes them on a labeled root. Experiments are performed using five benchmark datasets, and the proposed approach is compared with three state-of-the-art methods. The obtained results suggest better performance of the current proposal by achieving five times quicker results. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
32. An Interval Temporal Logic-Based Matching Framework for Finding Occurrences of Multi-event Attack Signatures
- Author
-
Nowicka, Elzbieta, Zawada, Marcin, Gorodetsky, Vladimir, editor, Kotenko, Igor, editor, and Skormin, Victor A., editor
- Published
- 2007
- Full Text
- View/download PDF
33. Cubyhum: Algorithms for Query by Humming
- Author
-
Pauws, Steffen, Toolenaar, Frank, editor, Verhaegh, Wim F. J., editor, Aarts, Emile, editor, and Korst, Jan, editor
- Published
- 2004
- Full Text
- View/download PDF
34. Approximate Swapped Matching
- Author
-
Amir, Amihood, Lewenstein, Moshe, Porat, Ely, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, Kapoor, Sanjiv, editor, and Prasad, Sanjiva, editor
- Published
- 2000
- Full Text
- View/download PDF
35. Efficient special cases of pattern matching with swaps
- Author
-
Amir, Amihood, Landau, Gad M., Lewenstein, Moshe, Lewensteint, Noa, Goos, Gerhard, editor, Hartmanis, Juris, editor, van Leeuwen, Jan, editor, and Farach-Colton, Martin, editor
- Published
- 1998
- Full Text
- View/download PDF
36. Polynomial-time algorithms for computing characteristic strings
- Author
-
Ito, Minoru, Shimizu, Kuniyasu, Nakanishi, Michio, Hashimoto, Akihiro, Goos, Gerhard, editor, Hartmanis, Juris, editor, Crochemore, Maxime, editor, and Gusfield, Dan, editor
- Published
- 1994
- Full Text
- View/download PDF
37. Searching and Indexing Genomic Databases via Kernelization
- Author
-
Travis eGagie and Simon ePuglisi
- Subjects
Data Compression ,Kernelization ,String algorithms ,Indexes ,Approximate pattern matching ,genomic databases ,Biotechnology ,TP248.13-248.65 - Abstract
The rapid advance of DNA sequencing technologies has yielded databases of thousands of genomes. To search and index these databases effectively, it is important that we take advantage of the similarity between those genomes. Several authors have recently suggested searching or indexing only one reference genome and the parts of the other genomes where they differ. In this paper we survey the twenty-year history of this idea and discuss its relation to kernelization in parameterized complexity.
- Published
- 2015
- Full Text
- View/download PDF
38. Approximate pattern matching with gap constraints.
- Author
-
Wu, Youxi, Tang, Zhiqiang, Jiang, He, and Wu, Xindong
- Subjects
- *
PATTERN matching , *APPROXIMATION theory , *CONSTRAINT satisfaction , *ALGORITHMS , *BIOLOGICAL databases - Abstract
Pattern matching is a key issue in sequential pattern mining. Many researchers now focus on pattern matching with gap constraints. However, most of these studies involve exact pattern matching problems, a special case of approximate pattern matching and a more challenging task. In this study, we introduce an approximate pattern matching problem with Hamming distance. Its objective is to compute the number of approximate occurrences of pattern P with gap constraints in sequence S under similarity constraint d. We propose an efficient algorithm named Single-rOot Nettree for approximate pattern matchinG with gap constraints (SONG) based on a new non-linear data structure Single-root Nettree to effectively solve the problem. Theoretical analysis and experiments demonstrate an interesting law that the ratio M(P,S,d)/N(P,S,m) approximately follows a binomial distribution, where M(P,S,d) and N(P,S,m) are the numbers of the approximate occurrences whose distances to pattern P are d (0≤d≤m) and no more than m (the length of pattern P), respectively. Experimental results for real biological data validate the efficiency and effectiveness of SONG. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
39. Order-preserving pattern matching with k mismatches.
- Author
-
Gawrychowski, Paweł and Uznański, Przemysław
- Subjects
- *
PATTERN matching , *MATCHING theory , *ALGORITHMS , *APPROXIMATION theory , *MATHEMATICAL analysis - Abstract
We consider a generalization of the recently introduced order-preserving pattern matching. The difference between the standard pattern matching and the order-preserving variant is that instead of looking for an exact copy of the pattern, we only require that the relative order between the elements is the same. In our generalization, we additionally allow up to k mismatches between the pattern of length m and the text of length n , and the goal is to construct an efficient algorithm for small values of k . Our solution detects an order-preserving occurrence with up to k mismatches in O ( n ( log log m + k log log k ) ) time. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
40. Approximate string matching using a bidirectional index.
- Author
-
Kucherov, Gregory, Salikhov, Kamil, and Tsur, Dekel
- Subjects
- *
BIDIRECTIONAL associative memories (Computer science) , *COMPUTERS , *DATABASE searching , *PATTERN matching , *PROBABILISTIC databases - Abstract
We study strategies of approximate pattern matching that exploit bidirectional text indexes, extending and generalizing ideas of [9] . We introduce a formalism, called search schemes, to specify search strategies of this type, then develop a probabilistic measure for the efficiency of a search scheme, prove several combinatorial results on efficient search schemes, and finally, provide experimental computations supporting the superiority of our strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
41. Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi.
- Author
-
Tran, Tuan Tu, Liu, Yongchao, and Schmidt, Bertil
- Subjects
- *
PARALLEL computers , *APPROXIMATION theory , *GRAPHICS processing units , *DETERMINISTIC finite automata , *ERROR analysis in mathematics - Abstract
Approximate pattern matching (APM) targets to find the occurrences of a pattern inside a subject text allowing a limited number of errors. It has been widely used in many application areas such as bioinformatics and information retrieval. Bit-parallel APM takes advantage of the intrinsic parallelism of bitwise operations inside a machine word. This approach typically encodes non-deterministic finite automaton (NFA) states or value differences between adjacent cells of a dynamic programming matrix in the form of bit arrays. Wu–Manber (WM) is a well-known bit-parallel APM algorithm, which simulates an NFA and gains parallel efficiency by performing multiple state updates within a machine word. An important parameter is the machine word size (e.g. 32 or 64 bits for CPUs). Due to increasing vector capabilities, efficient mapping of bit-parallel APM algorithms onto modern high performance computing architectures is an interesting research topic. Prominent examples are Xeon Phi coprocessors and CUDA-enabled GPUs, which provide words of size 512 bits (by means of vector registers) and 1024 bits (by means of warps), respectively. In this paper, we investigate mappings of the WM algorithm onto these two accelerator types. Both architectures are able to achieve around two orders-of-magnitude speedups compared to a single-threaded CPU implementation. Moreover, our tile-based implementation on a GeForce Titan graphics card runs up to 2.9 × faster than our implementation on an Intel Xeon Phi 5110P. Source code is available at http://xbitpar.sourceforge.net . [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
42. Circular pattern matching with k mismatches
- Author
-
Charalampopoulos, P. (Panagiotis), Kociumaka, T. (Tomasz), Pissis, S. (Solon), Radoszewski, J. (Jakub), Rytter, W. (Wojciech), Straszyński, J. (Juliusz), Waleń, T. (Tomasz), Zuba, W.P. (Wiktor), Charalampopoulos, P. (Panagiotis), Kociumaka, T. (Tomasz), Pissis, S. (Solon), Radoszewski, J. (Jakub), Rytter, W. (Wojciech), Straszyński, J. (Juliusz), Waleń, T. (Tomasz), and Zuba, W.P. (Wiktor)
- Abstract
We consider the circular pattern matching with k mismatches (k-CPM) problem in which one is to compute the minimal Hamming distance of every length-m substring of T and any cyclic rotation of P, if this distance is no more than k. It is a variation of the well-studied k-mismatch problem. A multitude of papers has been devoted to solving the k-CPM problem, but only average-case upper bounds are known. In this paper, we present the first non-trivial worst-case upper bounds for this problem. Specifically, we show an O(nk)-time algorithm and an [Formula presented]-time algorithm. The latter algorithm applies in an extended way a technique that was very recently developed for the k-mismatch problem Bringmann et al. (2019) [10]. A preliminary version of this work appeared at FCT 2019 [35]. In this version we improve the time complexity of the second algorithm from [Formula presented] to [Formula presented].
- Published
- 2021
- Full Text
- View/download PDF
43. Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing
- Author
-
Samuel McCauley, McCauley, Samuel, Samuel McCauley, and McCauley, Samuel
- Abstract
Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess n strings of length d, to quickly answer queries q of the form: if there is a database string within edit distance r of q, return a database string within edit distance cr of q. Previous approaches to this problem either rely on very large (superconstant) approximation ratios c, or very small search radii r. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all n strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time Õ(d3^rn^{1/c}). The best known practical results require c ≫ r to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by 24^r. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.
- Published
- 2021
- Full Text
- View/download PDF
44. Strict approximate pattern matching with general gaps.
- Author
-
Wu, Youxi, Fu, Shuai, Jiang, He, and Wu, Xindong
- Subjects
PATTERN matching ,ONLINE algorithms ,HAMMING distance ,INFORMATION theory ,COMPUTER science - Abstract
Pattern matching with gap constraints is one of the essential problems in computer science such as music information retrieval and sequential pattern mining. One of the cases is called loose matching, which only considers the matching position of the last pattern substring in the sequence. One more challenging problem is considering the matching positions of each character in the sequence, called strict pattern matching which is one of the essential tasks of sequential pattern mining with gap constraints. Some strict pattern matching algorithms were designed to handle pattern mining tasks, since strict pattern matching can be used to compute the frequency of some patterns occurring in the given sequence and then the frequent patterns can be derived. In this article, we address a more general strict approximate pattern matching with Hamming distance, named SAP (Strict Approximate Pattern matching with general gaps and length constraints), which means that the gap constraints can be negative. We show that a SAP instance can be transformed into an exponential amount of the exact pattern matching with general gaps instances. Hence, we propose an effective online algorithm, named SETA (SubnETtree for sAp), based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots) and show the completeness of the algorithm. The space and time complexities of the algorithm are O( m × Maxlen × W × d) and O( Maxlen × W × m × n × d), respectively, where m, Maxlen, W, and d are the length of pattern P, the maximal length constraint, the maximal gap length of pattern P and the approximate threshold. Extensive experimental results validate the correctness and effectiveness of SETA. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
45. Approximating the maximum consecutive subsums of a sequence.
- Author
-
Cicalese, Ferdinando, Laber, Eduardo, Weimann, Oren, and Yuster, Raphael
- Subjects
- *
APPROXIMATION theory , *MATHEMATICAL sequences , *VECTOR analysis , *BINARY number system , *INTEGERS , *DATA transmission systems - Abstract
Abstract: We present a novel approach for computing all maximum consecutive subsums in a sequence of positive integers in near-linear time. Solutions for this problem over binary sequences can be used for reporting existence of Parikh vectors in a bit string. Recently, several attempts have been made to build indexes for all Parikh vectors of a binary string in subquadratic time. However, no algorithm is known to date which can beat by more than a polylogarithmic factor the naive procedure. We show how to construct a -approximate index for all Parikh vectors of a binary string in time, for any constant . Such index is approximate, in the sense that it leaves a small chance for false positives (no false negatives are possible). However, we can tune the parameters of the algorithm so that we can strictly control such a chance of error while still guaranteeing strong subquadratic running time. [Copyright &y& Elsevier]
- Published
- 2014
- Full Text
- View/download PDF
46. Approximating Text-to-Pattern Distance via Dimensionality Reduction
- Author
-
Uznański, Przemysław
- Subjects
FOS: Computer and information sciences ,Combinatorial Algorithms ,𝓁₁ Distance ,Hamming Distance ,Computer Science - Data Structures and Algorithms ,Approximate Pattern Matching ,Data Structures and Algorithms (cs.DS) ,𝓁₂ Distance ,Approximation Algorithms - Abstract
Text-to-pattern distance is a fundamental problem in string matching, where given a pattern of length m and a text of length n, over an integer alphabet, we are asked to compute the distance between pattern and the text at every location. The distance function can be e.g. Hamming distance or 𝓁_p distance for some parameter p > 0. Almost all state-of-the-art exact and approximate algorithms developed in the past ∼ 40 years were using FFT as a black-box. In this work we present 𝒪~(n/ε²) time algorithms for (1±ε)-approximation of 𝓁₂ distances, and 𝒪~(n/ε³) algorithm for approximation of Hamming and 𝓁₁ distances, all without use of FFT. This is independent to the very recent development by Chan et al. [STOC 2020], where 𝒪(n/ε²) algorithm for Hamming distances not using FFT was presented - although their algorithm is much more "combinatorial", our techniques apply to other norms than Hamming., LIPIcs, Vol. 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020), pages 29:1-29:11
- Published
- 2020
47. Compressed text indexing with wildcards.
- Author
-
Hon, Wing-Kai, Ku, Tsung-Han, Shah, Rahul, Thankachan, Sharma V., and Vitter, Jeffrey Scott
- Subjects
SIGNS & symbols ,INDEXING ,ALPHABETS ,MATHEMATICAL sequences ,SINGLE nucleotide polymorphisms ,EMPIRICAL research - Abstract
Abstract: Let be a text of total length n, where characters of each are chosen from an alphabet Σ of size σ, and ϕ denotes a wildcard symbol. The text indexing with wildcards problem is to index T such that when we are given a query pattern P, we can locate the occurrences of P in T efficiently. This problem has been applied in indexing genomic sequences that contain single-nucleotide polymorphisms (SNP) because SNP can be modeled as wildcards. Recently Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes for this problem. In this paper, we present the first compressed index for this problem, which takes only bits of space, where is the hth-order empirical entropy () of T. [Copyright &y& Elsevier]
- Published
- 2013
- Full Text
- View/download PDF
48. Approximate regular expression matching with multi-strings.
- Author
-
Belazzougui, Djamal and Raffinot, Mathieu
- Subjects
APPROXIMATION theory ,MATCHING theory ,PROBLEM solving ,QUERY (Information retrieval system) ,ERROR analysis in mathematics ,MATHEMATICAL bounds - Abstract
Abstract: In this paper, we are interested in solving the approximate regular expression matching problem: we are given a regular expression R in advance and we wish to answer the following query: given a text T and a parameter k, find all the substrings of T which match the regular expression R with at most k errors (an error consist in deleting inserting, or substituting a character). There exists a well known solution for this problem in time where m is the size of the regular expression (the number of operators and characters appearing in R) and n the length of the text. There also exists a solution for the case (exact regular expression matching) which solves the problem in time , where d is the number of strings in the regular expression (a string is a sequence of characters connected with concatenation operator). In this paper, we show that both methods can be combined to solve the approximate regular approximate matching problem in time for arbitrary k. This bound can be much better than the bound achieved by the best actual regular expression matching algorithm in case (that is k is not too large and R contains much less occurrences of ∪ and ⁎ than occurrences of ). [Copyright &y& Elsevier]
- Published
- 2013
- Full Text
- View/download PDF
49. Approximate all-pairs suffix/prefix overlaps
- Author
-
Välimäki, Niko, Ladra, Susana, and Mäkinen, Veli
- Subjects
- *
APPROXIMATION theory , *SUFFIXES & prefixes (Grammar) , *PROBLEM solving , *ENTROPY , *DNA , *ERROR rates - Abstract
Abstract: Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of strings of total length n and an error-rate ϵ, the goal is to find, for all-pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance , where ℓ is the length of the overlap. We propose a new solution for this problem based on backward backtracking (Lam, et al., 2008) and suffix filters (Kärkkäinen and Na, 2008). Our technique uses bits of space, where is the k-th order entropy and σ the alphabet size. In practice, it is more scalable in terms of space, and comparable in terms of time, than q-gram filters (Rasmussen, et al., 2006). Our method is also easy to parallelize and scales up to millions of DNA reads. [Copyright &y& Elsevier]
- Published
- 2012
- Full Text
- View/download PDF
50. A linear size index for approximate pattern matching.
- Author
-
Chan, Ho-Leung, Lam, Tak-Wah, Sung, Wing-Kin, Tam, Siu-Lung, and Wong, Swee-Seong
- Subjects
APPROXIMATION theory ,PATTERN perception ,MATCHING theory ,INDEXING ,NUCLEOTIDE sequence ,COMPUTATIONAL complexity - Abstract
Abstract: This paper revisits the problem of indexing a text for pattern matching with up to k errors. A naive solution either has a worst-case matching time complexity of or requires space, where m is the length of the pattern. Devising a solution with better performance has been a challenge until Cole et al. (2004) showed an -space index that can support k-error matching in time, where occ is the number of occurrences. Motivated by the indexing of long sequences like DNA, we have investigated the feasibility of devising a linear-size index that still has a time complexity linear in pattern length. This paper in particular presents an -space index that supports k-error matching in worst-case time. This index can be further compressed from words into bits with a slight increase in the time complexity. [Copyright &y& Elsevier]
- Published
- 2011
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.