Back to Search
Start Over
Constructing Antidictionaries in Output-Sensitive Space
- Source :
- DCC, Theory of Computing Systems, Theory of Computing Systems, Springer Verlag, 2021, 65 (5), pp.777-797. ⟨10.1007/s00224-020-10018-5⟩, Ayad, L A K, Badkobeh, G, Fici, G, Heliou, A & Pissis, S P 2019, Constructing Antidictionaries in Output-Sensitive Space . in J A Storer, A Bilgin, J Serra-Sagrista & M W Marcellin (eds), Proceedings-DCC 2019 : 2019 Data Compression Conference . vol. 2019-March, 8712742, pp. 538-547 . https://doi.org/10.1109/DCC.2019.00062
- Publication Year :
- 2019
- Publisher :
- IEEE, 2019.
-
Abstract
- A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1,y_2,\ldots,y_k$ over an alphabet $\Sigma$, we are asked to compute the set $\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{k}}$ of minimal absent words of length at most $\ell$ of word $y=y_1\#y_2\#\ldots\#y_k$, $\#\notin\Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n=|y|$ using any of the plenty available $\mathcal{O}(n)$-time algorithms. This is because an $\Omega(n)$-sized text index is constructed over $y$ which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $||\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{N}}||=o(n)$, for all $N\in[1,k]$. For instance, in the human genome, $n \approx 3\times 10^9$ but $||\mathrm{M}^{12}_{y_{1}\#\ldots\#y_{k}}|| \approx 10^6$. We consider a constant-sized alphabet for stating our results. We show that all $\mathrm{M}^{\ell}_{y_{1}},\ldots,\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{k}}$ can be computed in $\mathcal{O}(kn+\sum^{k}_{N=1}||\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{N}}||)$ total time using $\mathcal{O}(\mathrm{MaxIn}+\mathrm{MaxOut})$ space, where $\mathrm{MaxIn}$ is the length of the longest word in $\{y_1,\ldots,y_{k}\}$ and $\mathrm{MaxOut}=\max\{||\mathrm{M}^{\ell}_{y_{1}\#\ldots\#y_{N}}||:N\in[1,k]\}$. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.<br />Comment: Version accepted to DCC 2019
- Subjects :
- FOS: Computer and information sciences
Settore ING-INF/05 - Sistemi Di Elaborazione Delle Informazioni
Output sensitive algorithms
String algorithms
Physics
Antidictionarie
Settore INF/01 - Informatica
Output sensitive algorithm
0102 computer and information sciences
Absent words
Space (mathematics)
01 natural sciences
Antidictionaries
Combinatorics
010201 computation theory & mathematics
TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY
Data compression
Computer Science - Data Structures and Algorithms
Data Structures and Algorithms (cs.DS)
Computer Science::Symbolic Computation
[INFO]Computer Science [cs]
Absent word
Alphabet
Word (group theory)
Subjects
Details
- ISSN :
- 14324350, 14330490, and 23750359
- Database :
- OpenAIRE
- Journal :
- 2019 Data Compression Conference (DCC)
- Accession number :
- edsair.doi.dedup.....a9a9bb8ff81d4da2652483a07f783eda
- Full Text :
- https://doi.org/10.1109/dcc.2019.00062