On exploring data lakes by finding compact, isolated clusters.

Authors :
Jiménez, Patricia
Roldán, Juan C.
Corchuelo, Rafael
Source :
Information Sciences. Apr2022, Vol. 591, p103-127. 25p.
Publication Year :
2022

Abstract

• Data lakes store unprocessed business data at large scale.
• Clustering helps data engineers understand the structure of their data lakes.
• RóMULO is a meta-heuristic multi-way clustering proposal to cluster data lakes.
• The results confirm that RóMULO is a promising contribution to assist data engineers.

Data engineers are very interested in data lake technologies due to the abundance of available datasets. They typically use clustering to understand the structure of the datasets before applying other methods to infer knowledge from them. This article presents the first proposal that explores how to use a meta-heuristic to address the problem of multi-way single-subspace automatic clustering, which is well suited to the context of data lakes. The proposal was compared with five strong competitors that combine the state-of-the-art attribute selection proposal with three classical single-way clustering proposals, a recent quantum-inspired one, and a recent deep-learning one. The evaluation focused on their ability to find compact and isolated clusterings, as well as the extent to which such clusterings can be considered good classifications. The statistical analyses conducted on the experimental results show that the proposal ranks first in effectiveness according to six standard coefficients and that it is very efficient in terms of CPU time; in addition, it did not produce any degraded clusterings or timeouts. In summary, this proposal contributes to the array of techniques that data engineers can use to explore their data lakes. [ABSTRACT FROM AUTHOR]

Subjects

Subjects :
*LAKES
*STRUCTURAL engineering

Details

Language :
English
ISSN :
00200255
Volume :
591
Database :
Academic Search Index
Journal :
Information Sciences
Publication Type :
Periodical
Accession number :
155260947
Full Text :
https://doi.org/10.1016/j.ins.2021.12.045