1. XML stream processing using tree-edit distance embeddings
- Author
-
Garofalakis, Minos and Kumar, Amit
- Subjects
XML ,Algorithm ,Technology application ,Database administration -- Equipment and supplies ,Database administration -- Research ,XML (Document markup language) -- Usage ,XML (Document markup language) -- Research ,Algorithms -- Technology application - Abstract
We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O([log.sup.2] n log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model. Categories and Subject Descriptors: H.2.4 [Database Management]: Systems--Query processing; G.2.1 [Discrete Mathematics]: Combinatorics--Combinatorial algorithms General Terms: Algorithms, Performance, Theory Additional Key Words and Phrases: XML, data streams, data synopses, approximate query processing, tree-edit distance, metric-space embeddings
- Published
- 2005