Back to Search
Start Over
Compressed Representations of Sequences and Full-Text Indexes.
- Source :
- ACM Transactions on Algorithms; 2007, Vol. 3 Issue 2, p1-24, 24p, 1 Diagram, 5 Charts
- Publication Year :
- 2007
-
Abstract
- Given a sequence S = s<subscript>1</subscript>s<subscript>2</subscript> … s<subscript>n</subscript> of integers smaller than r = O(polylog(n)), we show how S can be represented using nH<subscript>0</subscript>(S) + o(n) bits, so that we can know any s<subscript>q</subscript>, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH<subscript>0</subscript>(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r ) time. For larger r, we can still represent S in nH<subscript>0</subscript>(S) + o(n log r ) bits and answer queries in O(log r/ log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet ∑. Specifically, we design a variant of the FM-index that indexes a string T [1, n] within nH<subscript>k</subscript> (T )+o(n) bits of storage, where H<subscript>k</subscript> (T ) is the kth-order empirical entropy of T . This space bound holds simultaneously for all k ≤ α log<subscript>∣∑∣</subscript>n, constant 0 < α < 1, and <subscript>∣∑∣</subscript> = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log<superscript>1+ε</superscript> n) time for any constant 0 < ε < 1; and reports a text substring of length ℓ in O(ℓ + log<superscript>1+ε</superscript> n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T , which is the best space obtained in previous work. We can also handle larger alphabets of size ∣∑∣ = O(n<superscript>β</superscript>), for any 0 < β < 1, by paying o(n log ∣∑∣) extra space and multiplying all query times by O(log ∣∑∣/ log log n). [ABSTRACT FROM AUTHOR]
Details
- Language :
- English
- ISSN :
- 15496325
- Volume :
- 3
- Issue :
- 2
- Database :
- Complementary Index
- Journal :
- ACM Transactions on Algorithms
- Publication Type :
- Academic Journal
- Accession number :
- 25299915
- Full Text :
- https://doi.org/10.1145/1240233.1240243