Start Over

Compressed Representations of Sequences and Full-Text Indexes.

Authors :: Ferragina, Paolo
Manzini, Giovanni
Mäkinen, Veli
Navarro, Gonzalo
Source :: ACM Transactions on Algorithms; 2007, Vol. 3 Issue 2, p1-24, 24p, 1 Diagram, 5 Charts
Publication Year :: 2007
Abstract: Given a sequence S = s<subscript>1</subscript>s<subscript>2</subscript> … s<subscript>n</subscript> of integers smaller than r = O(polylog(n)), we show how S can be represented using nH<subscript>0</subscript>(S) + o(n) bits, so that we can know any s<subscript>q</subscript>, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH<subscript>0</subscript>(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r ) time. For larger r, we can still represent S in nH<subscript>0</subscript>(S) + o(n log r ) bits and answer queries in O(log r/ log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet ∑. Specifically, we design a variant of the FM-index that indexes a string T [1, n] within nH<subscript>k</subscript> (T )+o(n) bits of storage, where H<subscript>k</subscript> (T ) is the kth-order empirical entropy of T . This space bound holds simultaneously for all k ≤ α log<subscript>∣∑∣</subscript>n, constant 0 < α < 1, and <subscript>∣∑∣</subscript> = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log<superscript>1+ε</superscript> n) time for any constant 0 < ε < 1; and reports a text substring of length ℓ in O(ℓ + log<superscript>1+ε</superscript> n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T , which is the best space obtained in previous work. We can also handle larger alphabets of size ∣∑∣ = O(n<superscript>β</superscript>), for any 0 < β < 1, by paying o(n log ∣∑∣) extra space and multiplying all query times by O(log ∣∑∣/ log log n). [ABSTRACT FROM AUTHOR]