1. Reproducible experiments for generating pre-processing pipelines for AutoETL
- Author
Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació; Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Services, Information and Data Engineering; Giovanelli, Joseph; Bilalli, Besim; Abelló Gamazo, Alberto; Silva Coira, Fernando; de Bernardo Roca, Guillermo
- Abstract
This work is a companion reproducibility paper for the experiments and results reported in Giovanelli et al. (2022), where data pre-processing pipelines are evaluated in order to find pipeline prototypes that reduce the classification error of supervised learning algorithms. With the recent shift towards data-centric approaches, in which the dataset rather than the model is systematically changed to improve model performance, data pre-processing is receiving considerable attention. Yet its impact on the final analysis is not widely recognized, primarily due to the lack of publicly available experiments that quantify it. To bridge this gap, this work introduces a set of reproducible experiments on the impact of data pre-processing by providing a detailed reproducibility protocol together with a software tool and a set of extensible datasets, which allow all the experiments and results of our aforementioned work to be reproduced. We introduce a set of strongly reproducible experiments based on a collection of intermediate results, and a set of weakly reproducible experiments (Lastra-Diaz, 0000) that allows reproducing our end-to-end optimization process and the evaluation of all the methods reported in our primary paper. The reproducibility protocol is built with Docker and tested on Windows and Linux. In brief, our primary work (i) develops a method for generating effective prototypes, as templates or logical sequences of pre-processing transformations, and (ii) instantiates the prototypes into pipelines, in the form of executable or physical sequences of actual operators that implement the respective transformations. For the former, a set of heuristic rules learned from extensive experiments is used; for the latter, techniques from Automated Machine Learning (AutoML) are applied.
This work is partially supported by the EU’s Horizon Programme call, under Grant Agreements No. 101093164 (ExtremeXP), and by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under the project/funding scheme PID2020-117191RB-I00 / AEI / 10.13039/501100011033.
Peer Reviewed. Postprint (author's final draft).
- Published
- 2024