Meta-dataflows: efficient exploratory dataflow jobs

Authors :: Castro Fernandez, R
Culhane, W
Watcharapichat, P
Weidlich, M
Pietzuch, PR
BP International Limited (0946)
Source :: ACM Conference on Management of Data (SIGMOD)
Publication Year :: 2018
Publisher :: Association for Computing Machinery (ACM), 2018.
Abstract: Distributed dataflow systems such as Apache Spark and Apache Flink are used to derive new insights from large datasets. While they efficiently execute concrete data processing workflows, expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct processing pipeline, e.g. in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. This, however, misses out on optimisation opportunities for exploratory workflows, both in terms of scheduling and memory allocation. We describe meta-dataflows (MDFs), a new model to effectively express exploratory workflows and efficiently execute them on compute clusters. With MDFs, users specify a family of dataflows using two primitives: (a) an explore operator automatically considers choices in a dataflow; and (b) a choose operator assesses the result quality of explored dataflow branches and selects a subset of the results. We propose optimisations to execute MDFs: a system can (i) avoid redundant computation when exploring branches by reusing intermediate results and discarding results from underperforming branches; and (ii) consider future data access patterns in the MDF when allocating cluster memory. Our evaluation shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential execution.
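
Illustrative sketch (not from the paper): the two primitives named in the abstract can be mimicked on a single machine in plain Python. The function names `explore` and `choose`, their signatures, the toy data, and the scoring rule are all assumptions made for this example and are not the MDF API. The sketch only shows the idea of computing a shared prefix once, fanning out over alternative cleaning strategies, and keeping the best-scoring branch.

```python
from typing import Any, Callable, Dict, List, Tuple


def explore(shared_input: Any,
            variants: Dict[str, Callable[[Any], Any]]) -> Dict[str, Any]:
    """Hypothetical 'explore': run every variant on the same precomputed input."""
    return {name: fn(shared_input) for name, fn in variants.items()}


def choose(results: Dict[str, Any],
           score: Callable[[Any], float],
           keep: int = 1) -> List[Tuple[str, Any]]:
    """Hypothetical 'choose': rank branches by score (higher is better), keep the top ones."""
    ranked = sorted(results.items(), key=lambda kv: score(kv[1]), reverse=True)
    return ranked[:keep]


# Count how often the shared prefix runs, to show it is not repeated per branch.
calls = {"parse": 0}


def parse(raw: List[str]) -> List[float]:
    calls["parse"] += 1
    return [float(x) for x in raw]


raw = ["1.0", "2.0", "3.0", "100.0"]
parsed = parse(raw)  # shared prefix, computed once and reused by every branch

# Alternative data cleaning strategies (assumed for illustration).
variants = {
    "keep_all": lambda xs: xs,
    "drop_above_50": lambda xs: [x for x in xs if x <= 50],
    "drop_above_2": lambda xs: [x for x in xs if x <= 2],
}


def score(xs: List[float]) -> float:
    # Assumed quality metric: prefer a small spread, penalise each dropped row.
    spread = max(xs) - min(xs)
    dropped = len(parsed) - len(xs)
    return -(spread + 10 * dropped)


best = choose(explore(parsed, variants), score, keep=1)
print(best)
```

Submitting each variant as a separate job would rerun `parse` three times; here it runs once. A real system would additionally have to decide which intermediate results to keep in cluster memory, which this sketch does not model.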