Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance.
- Source :
- Parallel Computing. Jul2022, Vol. 111, pN.PAG-N.PAG. 1p.
- Publication Year :
- 2022
Abstract
- Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain open, such as reducing the number of Spark stages and the network I/O for an OLAP query executed on a distributed system. In previous work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system's query optimizer can perform a star join locally, in only one Spark stage and without a shuffle phase. The system can also skip loading unnecessary data blocks when evaluating the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.
• The group-by aggregation may involve high communication cost during the shuffle phase.
• We propose a dynamic technique for the Partitioning and Load Balancing (PLB) of data.
• Our approach enhances OLAP query execution time over Hadoop clusters compared to existing approaches.
• Our approach combines data-driven and workload-driven models.
• The star join operation is performed in only one Spark stage, without a shuffle phase.
[ABSTRACT FROM AUTHOR]
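The general idea behind a shuffle-free join can be illustrated with a toy sketch in plain Python. It is not the authors' placement strategy, whose details the abstract does not give. It only assumes that a fact table and one dimension table are partitioned by the same join key, so each partition can be joined and pre-aggregated locally, and only small partial results are merged afterwards. The table contents, the column names (`store_id`, `region`, `amount`), and `NUM_PARTITIONS` are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical number of worker nodes


def partition_by_key(rows, key, n):
    """Place each row in partition (row[key] % n), so equal keys share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def local_join_and_aggregate(fact_part, dim_part):
    """Join one fact partition with its co-located dimension partition,
    then sum amounts per region without looking at any other partition."""
    dim_by_id = {d["store_id"]: d for d in dim_part}
    partial = defaultdict(int)
    for f in fact_part:
        d = dim_by_id.get(f["store_id"])
        if d is not None:
            partial[d["region"]] += f["amount"]
    return dict(partial)


def merge_partials(partials):
    """Combine the small per-partition aggregates into the final group-by result."""
    total = defaultdict(int)
    for p in partials:
        for group, value in p.items():
            total[group] += value
    return dict(total)


# Hypothetical dimension and fact tables.
stores = [
    {"store_id": 1, "region": "north"},
    {"store_id": 2, "region": "south"},
    {"store_id": 3, "region": "north"},
    {"store_id": 4, "region": "south"},
]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 5},
    {"store_id": 3, "amount": 7},
    {"store_id": 4, "amount": 1},
    {"store_id": 1, "amount": 2},
]

# Both tables use the same partitioning function on the join key,
# so matching rows always land in the same partition.
fact_parts = partition_by_key(sales, "store_id", NUM_PARTITIONS)
dim_parts = partition_by_key(stores, "store_id", NUM_PARTITIONS)

partials = [local_join_and_aggregate(f, d) for f, d in zip(fact_parts, dim_parts)]
result = merge_partials(partials)
print(result)
```

In this sketch only the per-partition aggregates cross partition boundaries, which is the kind of network saving the abstract refers to; a real system would also have to handle several dimension tables and skewed keys.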
Details
- Language :
- English
- ISSN :
- 01678191
- Volume :
- 111
- Database :
- Academic Search Index
- Journal :
- Parallel Computing
- Publication Type :
- Academic Journal
- Accession number :
- 156765003
- Full Text :
- https://doi.org/10.1016/j.parco.2022.102918