1. Elastic-Degenerate String Matching with 1 Error
- Author
-
Bernardini, Giulia, Gabory, Esteban, Pissis, Solon P., Stougie, Leen, Sweering, Michelle, Zuba, Wiktor, Castañeda, Armando, Rodríguez-Henríquez, Francisco, Università degli studi di Trieste = University of Trieste, Centrum Wiskunde & Informatica (CWI), Vrije Universiteit Amsterdam [Amsterdam] (VU), Equipe de recherche européenne en algorithmique et biologie formelle et expérimentale (ERABLE), Laboratoire de Biométrie et Biologie Evolutive - UMR 5558 (LBBE), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-VetAgro Sup - Institut national d'enseignement supérieur et de recherche en alimentation, santé animale, sciences agronomiques et de l'environnement (VAS)-Centre National de la Recherche Scientifique (CNRS)-Inria Lyon, Institut National de Recherche en Informatique et en Automatique (Inria), Netherlands Organisation for Scientific Research (NWO), project OCENW.GROOT.2019.015 and 024.002.003, MUR - FSE REACT EU - PON R&I 2014-2020, European Project: 872539,H2020-EU.1.3. - EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions, H2020-EU.1.3.3. - Stimulating innovation by means of cross-fertilisation of knowledge,H2020-MSCA-RISE-2019,PANGAIA(2020), European Project: 956229,H2020-EU.1.3. - EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions,ALPACA(2021), Bioinformatics, AIMMS, Bio Informatics (IBIVU), Operations Analytics, Tinbergen Institute, Amsterdam Business Research Institute, Castañeda, Armando, Rodríguez-Henríquez, Francisco, Bernardini, Giulia, Gabory, Esteban, Pissis, Solon P., Stougie, Leen, Sweering, Michelle, and Zuba, Wiktor
- Subjects
FOS: Computer and information sciences ,String algorithms ,Elastic-degenerate strings ,SDG 16 - Peace ,Approximate string matching ,Degenerate strings ,Edit distance ,SDG 16 - Peace, Justice and Strong Institutions ,Justice and Strong Institutions ,String algorithm ,Degenerate string ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) ,[INFO]Computer Science [cs] - Abstract
An elastic-degenerate string is a sequence of $n$ finite sets of strings of total length $N$, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length $m$ in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication exponent and the $\tilde{\mathcal{O}}(\cdot)$ notation suppresses polylog factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be solved in $\mathcal{O}(k^2mG+kN)$ time, under edit distance, or $\mathcal{O}(kmG+kN)$ time, under Hamming distance, where $G$ denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, $G$ is only bounded by $N$, and so even for $k=1$, the existing algorithms run in $\Omega(mN)$ time in the worst case. In this paper we show that $1$-EDSM can be solved in $\mathcal{O}((nm^2 + N)\log m)$ or $\mathcal{O}(nm^3 + N)$ time under edit distance. For the decision version, we present a faster $\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)$-time algorithm. We also show that $1$-EDSM can be solved in $\mathcal{O}(nm^2 + N\log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from $1$-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the $k$-errata trees for indexing with errors [Cole et al., STOC 2004]., Comment: This is an extended version of a paper accepted at LATIN 2022
- Published
- 2022
- Full Text
- View/download PDF