Sampling-based Techniques for Interactive Exploration of Large Datasets
- Author
- Kamat, Niranjan Ganesh
- Subjects
- Computer Engineering, Computer Science, Database, Sampling, Interactive, Data Cube, Visualization, Online, Cyclic Scan, DICE, Sesame, Join Sampling, FluxQuery, Error, Variance, Reuse, Result Reuse, Speculation, Query Execution, Shard, Scale up, Scale out, Query Rewriting
- Abstract
Extracting useful information from large datasets has become increasingly important. While our computational capabilities have improved greatly in the last few decades, the data explosion has left the growth rate of data vastly outpacing that of computational power. In such scenarios, obtaining results within interactive (a few seconds at most) response times is extremely difficult. Exploring data interactively has numerous advantages, such as shortening the feedback loop, enabling many experiments, and providing a smoother user experience. As a result, answering queries over a sample of the data to improve response time has become hugely important. In this work, I present our efforts toward interactive exploration of large-scale datasets through projects such as \emph{DICE}, \emph{Sesame}, \emph{FluxQuery}, and a unified join sampling approach. We use techniques such as speculative execution, data sampling, faceted exploration, and scan sharing toward this end.

In an OLAP scenario, queries occur not in isolation but as part of a larger query session, where each query is often similar to previously executed ones. \emph{DICE} exploits this session-oriented behavior to speculatively execute and cache the likely follow-up queries, so that the user's query can be answered quickly from the cache. \emph{DICE} scales to a billion tuples with sub-second response times using 50 nodes. Further, \emph{DICE} has a novel, intuitive interface designed to work synergistically with the speculation and faceted exploration used by the backend.

In the context of sampled aggregations, results are far less meaningful to the user unless accompanied by the corresponding error bars, which depend on the variance of the measure.
Variance can be expensive to compute. In \emph{Sesame}, we investigate techniques for reusing variance computations in order to speed up error delivery. We use the entire user query session to build an optimal set of speculative queries to run, delivering results alongside their errors nearly $25 \times$ faster.

Supporting sampling in the presence of joins is an important problem in data analysis, but it is inherently challenging due to the need to avoid correlation between output tuples. To sample joins, we present a unified approach through two key contributions. First, when a \emph{correlated} sample is \emph{acceptable}, we provide techniques, for all join types, to sample the base relations so that their join is \emph{as random as possible}. Second, when a correlated sample is \emph{not acceptable}, we enhance the state-of-the-art algorithms to reduce their execution time and intermediate data size.

Modern computing devices and user interfaces have necessitated highly interactive querying; some of these interfaces issue a large number of dynamically changing queries to the backend. In \emph{FluxQuery}, we propose a novel model to capture the variability of likely queries in a workload, and we implement a cyclic scan-based approach that processes queries from such workloads efficiently and practically while reducing overall system load.
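As a minimal illustration of the error bars discussed above (a sketch of the standard sample-based estimation idea, not the dissertation's implementation): for a uniform random sample of size $n$, a CLT-based confidence interval on an average has half-width $z \sqrt{s^2/n}$, where $s^2$ is the sample variance — which is why variance computation sits on the critical path of error delivery. The function and data names here are hypothetical.

```python
import math
import random

def sampled_avg_with_error(population, sample_size, z=1.96):
    """Estimate AVG(population) from a uniform random sample.

    Returns (estimate, half_width), where half_width is a CLT-based
    95% error bar (for the default z). The sample variance s^2 is the
    expensive term that systems like Sesame aim to reuse across queries.
    """
    sample = random.sample(population, sample_size)
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance; drives the width of the error bar.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, z * math.sqrt(var / n)

# Illustrative usage on synthetic data (true mean 100, std dev 15):
random.seed(42)
data = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]
est, err = sampled_avg_with_error(data, 10_000)
print(f"AVG ~ {est:.2f} +/- {err:.2f}")
```

With 10,000 of 1,000,000 tuples sampled, the estimate lands close to the true mean with an error bar of roughly $\pm 1.96 \cdot 15/\sqrt{10{,}000} \approx \pm 0.3$, showing how a small sample can yield a tight, quantified answer.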
- Published
- 2018