Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance.

Authors :
Ramdane, Yassine
Boussaid, Omar
Boukraà, Doulkifli
Kabachi, Nadia
Bentayeb, Fadila
Source :
Parallel Computing. Jul2022, Vol. 111, pN.PAG-N.PAG. 1p.
Publication Year :
2022

Abstract

Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain open, such as reducing the number of Spark stages and the network I/O for an OLAP query executed on a distributed system. In previous work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system's query optimizer can perform a star join locally, in only one Spark stage and without a shuffle phase. The system can also skip loading unnecessary data blocks when evaluating the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve group-by aggregation. To evaluate our approach, we conduct experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.

• The group-by aggregation may involve high communication cost during the shuffle phase.
• We propose a dynamic technique for the Partitioning and Load Balancing (PLB) of data.
• Our approach enhances OLAP query execution time over Hadoop clusters compared to existing approaches.
• Our approach combines data-driven and workload-driven models.
• The star join operation is performed in only one Spark stage, without a shuffle phase.

[ABSTRACT FROM AUTHOR]
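The idea of a star join that runs locally, without a shuffle, can be illustrated with a small simulation. The sketch below is not the authors' placement strategy or PLB technique; it is a minimal plain-Python toy, assuming a single dimension table and a simple modulo placement rule. All names (`node_for`, `co_partition`, `local_star_join`, `group_by_sum`) and the sample data are hypothetical. It shows only the general principle: when fact and dimension rows that share a join key are placed on the same node, each node can join its own rows without receiving data from other nodes.

```python
from collections import defaultdict

NUM_NODES = 3  # assumed toy cluster size, not the 15 nodes used in the paper


def node_for(key, num_nodes=NUM_NODES):
    """Hypothetical placement rule: the join key decides the node."""
    return key % num_nodes


def co_partition(fact_rows, dim_rows, num_nodes=NUM_NODES):
    """Place fact and dimension rows with the same key on the same node."""
    nodes = [{"fact": [], "dim": {}} for _ in range(num_nodes)]
    for row in fact_rows:
        nodes[node_for(row["dim_id"], num_nodes)]["fact"].append(row)
    for row in dim_rows:
        nodes[node_for(row["dim_id"], num_nodes)]["dim"][row["dim_id"]] = row
    return nodes


def local_star_join(nodes):
    """Join on each node using only that node's rows (no data exchange)."""
    joined = []
    for node in nodes:
        for fact in node["fact"]:
            dim = node["dim"].get(fact["dim_id"])
            if dim is not None:
                joined.append({**fact, **dim})
    return joined


def group_by_sum(rows, key, value):
    """Final group-by aggregation over the joined rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Made-up sample data: 6 dimension rows, 12 fact rows.
dim_rows = [
    {"dim_id": i, "region": "north" if i % 2 == 0 else "south"}
    for i in range(6)
]
fact_rows = [{"dim_id": i % 6, "amount": 10 * (i + 1)} for i in range(12)]

nodes = co_partition(fact_rows, dim_rows)
joined = local_star_join(nodes)
totals = group_by_sum(joined, "region", "amount")
print(totals)
```

Every fact row finds its dimension row on its own node, so the join step needs no cross-node traffic. The group-by on `region` is a different matter: in a real cluster, rows for one region may sit on several nodes, which is the shuffle cost the abstract's dynamic approach targets.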

Details

Language :
English
ISSN :
0167-8191
Volume :
111
Database :
Academic Search Index
Journal :
Parallel Computing
Publication Type :
Academic Journal
Accession number :
156765003
Full Text :
https://doi.org/10.1016/j.parco.2022.102918