1. A Theoretical Model for Estimating Entity Resolution Costs in Cloud Computing Environments
- Author
- Cassimiro Nascimento, Dimas; Santos Pires, Carlos; Brasileiro Araújo, Tiago; Universidade Federal Rural de Pernambuco; Universidade Federal de Campina Grande [Campina Grande] (UFCG)
- Subjects
- [INFO] Computer Science [cs], ACM: I.: Computing Methodologies/I.6: SIMULATION AND MODELING, Entity resolution, Theoretical Costs, Cloud Computing, Data Quality
- Abstract
Entity resolution (ER) is the task of identifying duplicate entities within a dataset or across multiple datasets. In the era of Big Data, this task has gained considerable attention due to the intrinsic quadratic complexity of the problem in relation to the size of the dataset. In practice, this task can be outsourced to a cloud service, so a service customer may be interested in estimating the costs of an ER solution before executing it. Because the execution time of an ER solution depends on the combination of algorithms employed, their parameter values, and the cloud infrastructure, an a priori estimation of the infrastructure costs of an ER task is hard to perform. Besides estimating customer costs, estimating ER costs is also important for evaluating whether a set of ER parameter values can be used to execute a task within predefined time and budget restrictions. To tackle these challenges, we formalize the problem of estimating ER costs, taking into account the main parameters that may influence the execution time of the ER task. We also propose an algorithm, named TBF, for evaluating the feasibility of ER parameter values given a set of predefined customer restrictions. Since the efficacy of the proposed algorithm is strongly tied to the accuracy of the theoretical estimations of the ER costs, we also present a number of guidelines that can be explored to further improve the efficacy of the proposed model.
- Published
- 2018