9 results for "Pires, Carlos A."
Search Results
2. The EUChinaHealthCloud Project: Towards Open Science in the Twenty-first Century
- Author
- Su, Ying, Pires, Carlos Morais, Li, YunPing, Zhang, Zhenji, editor, Zhang, Runtong, editor, and Zhang, Juliang, editor
- Published
- 2013
- Full Text
- View/download PDF
3. Applying machine learning techniques for scaling out data quality algorithms in cloud computing environments
- Author
- Nascimento, Dimas Cassimiro, Pires, Carlos Eduardo, and Mestre, Demetrio Gomes
- Published
- 2016
- Full Text
- View/download PDF
4. A Theoretical Model for Estimating Entity Resolution Costs in Cloud Computing Environments
- Author
- Cassimiro Nascimento, Dimas, Santos Pires, Carlos, and Brasileiro Araújo, Tiago (Universidade Federal Rural de Pernambuco; Universidade Federal de Campina Grande (UFCG))
- Subjects
- [INFO] Computer Science [cs], ACM: I.: Computing Methodologies/I.6: SIMULATION AND MODELING, Entity resolution, Theoretical Costs, Cloud Computing, Data Quality
- Abstract
Entity resolution is the task of identifying duplicate entities in a dataset or across multiple datasets. In the era of Big Data, this task has gained considerable attention due to the intrinsic quadratic complexity of the problem in relation to the size of the dataset. In practice, this task can be outsourced to a cloud service, and thus a service customer may be interested in estimating the costs of an entity resolution solution before executing it. Since the execution time of an entity resolution solution depends on a combination of various algorithms, their respective parameter values and the employed cloud infrastructure, in practice it is hard to perform an a priori estimation of the infrastructure costs of executing an entity resolution task. Besides estimating customer costs, the estimation of entity resolution costs is also important to evaluate whether a set of ER parameter values can be employed to execute a task that meets predefined time and budget restrictions. Aiming to tackle these challenges, we formalize the problem of estimating ER costs, taking into account the main parameters that may influence the execution time of the ER task. We also propose an algorithm, denominated TBF, for evaluating the feasibility of ER parameter values, given a set of predefined customer restrictions. Since the efficacy of the proposed algorithm is strongly tied to the accuracy of the theoretical estimations of the ER costs, we also present a number of guidelines that can be further explored to improve the efficacy of the proposed model.
- Published
- 2018
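The abstract of result 4 formalizes ER cost estimation and a feasibility check (the TBF algorithm). The paper's actual model is not reproduced here; the snippet below is only a minimal sketch, assuming that pairwise comparisons inside blocks dominate the runtime and that the infrastructure is billed per worker-hour. The function names, the per-comparison time and the hourly rate are illustrative, not the paper's notation.

```python
from math import comb

def estimate_comparisons(block_sizes):
    """Pairwise comparisons performed inside each block (quadratic in block size)."""
    return sum(comb(n, 2) for n in block_sizes)

def estimate_cost(block_sizes, per_comparison_seconds, hourly_rate, num_workers=1):
    """Rough time/budget estimate for an ER run on pay-per-hour infrastructure."""
    comparisons = estimate_comparisons(block_sizes)
    hours = comparisons * per_comparison_seconds / 3600.0 / num_workers
    return hours, hours * num_workers * hourly_rate

def feasible(block_sizes, per_comparison_seconds, hourly_rate,
             num_workers, max_hours, max_budget):
    """Accept a parameter configuration only if it meets both restrictions."""
    hours, cost = estimate_cost(block_sizes, per_comparison_seconds,
                                hourly_rate, num_workers)
    return hours <= max_hours and cost <= max_budget

# Example: 3 blocks of 10,000 records, 50 microseconds per comparison, 4 workers at $0.20/h.
print(feasible([10_000] * 3, 50e-6, 0.20, 4, max_hours=2.0, max_budget=5.0))
```

A TBF-style check would apply such an estimator to many candidate parameter configurations and keep only those that satisfy both the time and the budget restriction.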
5. Estimating record linkage costs in distributed environments.
- Author
- Nascimento, Dimas Cassimiro, Pires, Carlos Eduardo Santos, Araujo, Tiago Brasileiro, and Mestre, Demetrio Gomes
- Subjects
- CLOUD storage, COST estimates, TIME management, COST accounting, CUSTOMER services, ESTIMATES, BIG data, PARAMETER estimation
- Abstract
Record Linkage (RL) is the task of identifying duplicate entities in a dataset or across multiple datasets. In the era of Big Data, this task has gained considerable attention due to the intrinsic quadratic complexity of the problem in relation to the size of the dataset. In practice, this task can be outsourced to a cloud service, and thus a service customer may be interested in estimating the costs of a record linkage solution before executing it. Since the execution time of a record linkage solution depends on a combination of various algorithms, their respective parameter values and the employed cloud infrastructure, in practice it is hard to perform an a priori estimation of the infrastructure costs of executing a record linkage task. Besides estimating customer costs, the estimation of record linkage costs is also important to evaluate whether the application of a set of RL parameter values will satisfy predefined time and budget restrictions. Aiming to tackle these challenges, we propose a theoretical model for estimating RL costs, taking into account the main steps that may influence the execution time of the RL task. We also propose an algorithm, denoted TBF, for evaluating the feasibility of RL parameter values, given a set of predefined customer restrictions. We evaluate the efficacy of the proposed model combined with regression techniques using record linkage results processed in real distributed environments. Based on the experimental results, we show that the employed regression technique has significant influence over the estimated record linkage costs. Moreover, we conclude that specific regression techniques are more suitable for estimating record linkage costs, depending on the evaluated scenario. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
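Result 5 reports that the choice of regression technique strongly influences the estimated record linkage costs. As a rough illustration only (not the paper's model), the sketch below fits a scikit-learn regressor to hypothetical (parameters, runtime) observations and converts a predicted runtime into a monetary estimate; the feature set, the sample values and the choice of RandomForestRegressor are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each row: [dataset_size, window_or_block_size, num_workers]; target: runtime in seconds.
# Illustrative numbers only, not measurements from the paper.
X = np.array([
    [100_000, 10, 2],
    [100_000, 20, 2],
    [500_000, 10, 4],
    [500_000, 20, 4],
    [1_000_000, 10, 8],
    [1_000_000, 20, 8],
])
y = np.array([120.0, 210.0, 480.0, 850.0, 900.0, 1600.0])

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

new_setup = np.array([[750_000, 15, 4]])
predicted_seconds = model.predict(new_setup)[0]
hourly_rate = 0.25  # assumed price of one worker-hour
estimated_cost = predicted_seconds / 3600 * new_setup[0, 2] * hourly_rate
print(f"predicted runtime: {predicted_seconds:.0f} s, estimated cost: ${estimated_cost:.2f}")
```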
6. Towards the efficient parallelization of multi-pass adaptive blocking for entity matching.
- Author
- Mestre, Demetrio Gomes, Pires, Carlos Eduardo Santos, and Nascimento, Dimas Cassimiro
- Subjects
- PARALLEL programs (Computer programs), BIG data, CLOUD computing, WORKFLOW, INFRASTRUCTURE (Economics)
- Abstract
Modern parallel programming models, such as MapReduce (MR), have proven to be powerful tools for the efficient parallel execution of data-intensive tasks such as Entity Matching (EM) in the era of Big Data. For this reason, studies of the challenges of, and possible solutions for, applying this well-known cloud computing programming model to EM are in high demand. Furthermore, the effectiveness and scalability of MR-based implementations for EM depend on how well the workload is balanced among all reduce tasks. In this article, we investigate how MapReduce can be used to perform efficient (load-balanced) parallel EM using a variation of the multi-pass Sorted Neighborhood Method (SNM) with a varying-size (adaptive) window. We propose the Multi-pass MapReduce Duplicate Count Strategy (MultiMR-DCS++), an MR-based approach for multi-pass adaptive SNM, aiming to further increase the performance of SNM. Evaluation results based on real-world datasets and a cluster infrastructure show that our approach improves the performance of MapReduce-based SNM with respect to both EM execution time and detection quality. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
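Result 6 parallelizes multi-pass adaptive Sorted Neighborhood with MapReduce (MultiMR-DCS++). A load-balanced MR job cannot be shown in a few lines, so the sketch below only illustrates the sequential idea behind an adaptive window in the spirit of the Duplicate Count Strategy: the window grows while duplicates keep being found. The growth rule, sort key and similarity predicate are simplified assumptions, not the paper's algorithm.

```python
def adaptive_snm(records, key, similar, initial_window=3):
    """Sequential sketch of adaptive Sorted Neighborhood (DCS-like window growth).

    records: list of dicts; key: sort-key function; similar: pair predicate.
    Returns the set of matched index pairs (indices into the original list).
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    matches = set()
    for i in range(len(order)):
        window = initial_window
        j = i + 1
        while j < len(order) and j < i + window:
            if similar(records[order[i]], records[order[j]]):
                matches.add((order[i], order[j]))
                window += 1  # grow the window while duplicates keep appearing
            j += 1
    return matches

# Toy example: sort by the first 3 characters of the name, match on exact name equality.
people = [{"name": "alice"}, {"name": "alicia"}, {"name": "alice"}, {"name": "bob"}]
print(adaptive_snm(people, key=lambda r: r["name"][:3],
                   similar=lambda a, b: a["name"] == b["name"]))
```

In the MR setting described by the paper, the difficulty is distributing these overlapping, variable-size windows across reduce tasks so that no task receives a disproportionate share of the comparisons.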
7. Explanation and answers to critiques on: Blockchain-based Privacy-Preserving Record Linkage.
- Author
- Nóbrega, Thiago, Pires, Carlos Eduardo S., and Nascimento, Dimas Cassimiro
- Subjects
- BLOCKCHAINS, CLOUD computing, EXPLANATION, PYTHON programming language, INFORMATION sharing
- Abstract
The "Blockchain-based Privacy-Preserving Record Linkage—Enhancing Data Privacy in an Untrusted Environment" (BC-PPRL) uses Blockchain technology to provide accountability to the computation performed during the comparison step of PPRL. The BC-PPRL utilizes small fragments (splits) of the encoded records to iterative compute the similarity of the records and classify them into matches and non-matches, without sharing the complete information of the encoded records. Christen et al. propose a novel attack that leverages the exchanged information by the BC-PPRL. In this work, we acknowledge the Christen et al. findings and provide a detailed explanation of how the privacy of BB-PPRL could be compromised. We also make available a simplified version of the BC-PPRL, the datasets, and version (ported to python 3) of the attack that could be executed in the google cloud environment at: https://github.com/thiagonobrega/bcpprl-simplified. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
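Result 7 discusses BC-PPRL, which compares encoded records iteratively over small fragments (splits) rather than exchanging whole encodings. The sketch below is not the protocol itself (there is no blockchain or privacy machinery here); it only shows how a Dice similarity over bit-vector encodings can be accumulated split by split. The split count, the 0.8 match threshold and the toy bit vectors are assumptions.

```python
def dice_over_splits(bf_a, bf_b, num_splits=4):
    """Accumulate the Dice coefficient of two equal-length bit vectors split by split."""
    assert len(bf_a) == len(bf_b) and len(bf_a) % num_splits == 0
    size = len(bf_a) // num_splits
    common = ones_a = ones_b = 0
    for s in range(num_splits):
        lo, hi = s * size, (s + 1) * size
        a, b = bf_a[lo:hi], bf_b[lo:hi]  # only this fragment is handled per round
        common += sum(x & y for x, y in zip(a, b))
        ones_a += sum(a)
        ones_b += sum(b)
        # A protocol could stop early here once a match/non-match decision is settled.
    return 2 * common / (ones_a + ones_b) if (ones_a + ones_b) else 0.0

# Toy 16-bit "Bloom filters"; records are classified as a match above an assumed threshold.
a = [1,0,1,1,0,0,1,0, 1,1,0,0,0,1,0,1]
b = [1,0,1,0,0,0,1,0, 1,1,0,0,0,1,1,1]
sim = dice_over_splits(a, b, num_splits=4)
print(sim, "match" if sim >= 0.8 else "non-match")
```

The attack discussed in the paper exploits exactly the fact that these per-split counts leak information about the underlying encodings.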
8. Accuracy of daily estimation of grass reference evapotranspiration using ERA-Interim reanalysis products with assessment of alternative bias correction schemes.
- Author
- Paredes, Paula, Martins, Diogo S., Pereira, Luis Santos, Cadima, Jorge, and Pires, Carlos
- Subjects
- BIAS correction (Topology), EVAPOTRANSPIRATION, CLIMATE change, CLOUD computing, REGRESSION analysis, METEOROLOGICAL stations
- Abstract
Highlights: • ERA-Interim reanalysis weather variables were used for computing daily PM-ETo. • The accuracy of the weather variable estimates was assessed and temperature was corrected for elevation. • Various bias correction approaches were tested for ETo,REAN using cross-validation. • Bias correction used data aggregated quarterly to account for the seasonality of the climate. • Additive bias correction relative to the nearest grid point was selected because it is accurate and simple. Abstract: This study aims at assessing the accuracy of estimating daily grass reference evapotranspiration (PM-ETo) computed with ERA-Interim reanalysis products, as well as assessing the quality of the reanalysis products as predictors of daily maximum and minimum temperature, net radiation, dew point temperature and wind speed, which are used to compute PM-ETo. For this purpose, ETo computed from local observations of weather variables at 24 weather stations distributed across Continental Portugal was compared with reanalysis-based values of ETo (ETo,REAN). Three different versions of these reanalysis-based ETo were computed: (i) an (uncorrected) ETo based on the individual weather variables for the grid point nearest to the weather station; (ii) the previously calculated ETo corrected for bias with a simple bias-correction rule based only on the nearest grid point; and (iii) the ETo corrected for bias with a more complex rule involving all grid points within a 100 km radius of the weather station. Both bias correction approaches were tested aggregating data on a monthly, quarterly and a single overall basis. Cross-validation was used to evaluate the uncertainties modelled independently of any forcing; for this purpose, the data sets were divided into two groups. Results show that ETo,REAN without bias correction is strongly correlated with PM-ETo (R² > 0.80) but tends to overestimate PM-ETo, with the slope of the regression forced through the origin b0 ≥ 1.05, a mean RMSE of 0.79 mm day⁻¹, and EF generally above 0.70. Cross-validation results showed that both bias correction methods improved the accuracy of the estimates, in particular when a monthly aggregation was used. In addition, results showed that the multiple regression correction method outperforms the additive bias correction, leading to lower RMSE, with mean RMSE of 0.57 and 0.64 mm day⁻¹, respectively. The selection of the bias correction approach to be adopted should balance ease of use, the quality of results and the ability to capture the intra-annual seasonality of ETo. Thus, for operational irrigation scheduling purposes, we propose the use of the additive bias correction with a quarterly aggregation. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
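Result 8 recommends, for operational irrigation scheduling, an additive bias correction of reanalysis-based ETo relative to the nearest grid point, with quarterly aggregation. The pandas sketch below illustrates that idea only in outline: the per-quarter mean bias between station and reanalysis ETo is added back to the reanalysis series. The column names and the synthetic data are assumptions, and the calibration/validation split used in the paper's cross-validation is omitted.

```python
import pandas as pd

def additive_quarterly_correction(df):
    """df has a DatetimeIndex and columns 'eto_station' and 'eto_rean' (mm/day)."""
    quarters = df.index.quarter
    # Mean station-minus-reanalysis difference per quarter, added back to the reanalysis series.
    bias = (df["eto_station"] - df["eto_rean"]).groupby(quarters).mean()
    return df["eto_rean"] + bias.reindex(quarters).to_numpy()

# Synthetic example: a toy seasonal signal where the reanalysis overestimates by 0.3 mm/day.
dates = pd.date_range("2015-01-01", "2015-12-31", freq="D")
df = pd.DataFrame({
    "eto_station": 3.0 + 2.0 * (dates.dayofyear / 365),
    "eto_rean": 3.3 + 2.0 * (dates.dayofyear / 365),
}, index=dates)
print(additive_quarterly_correction(df).head())
```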
9. Reduzindo custos da deduplicação de dados utilizando heurísticas e computação em nuvem [Reducing data deduplication costs using heuristics and cloud computing]
- Author
- NASCIMENTO FILHO, Dimas Cassimiro do, PIRES, Carlos Eduardo Santos, CAMPELO, Cláudio Elizio Calazans, MARINHO, Leandro Balby, GALANTE, Renata de Matos, and MONTEIRO FILHO, José Maria da Silva
- Subjects
- Big Data, Computer Science, Sciences, Heuristics, Data Deduplication, Cloud Computing, Data Quality
- Abstract
In the era of Big Data, in which the scale of the data poses many challenges for classical algorithms, the task of assessing the quality of datasets may become costly and complex. For this reason, business managers may opt to outsource data quality monitoring to a specific cloud service. In this context, this work proposes approaches for reducing the costs generated by solutions to the data deduplication problem, which aims to detect duplicate entities in datasets, in the context of a cloud data quality monitoring service. This work investigates the deduplication task due to its importance in a variety of contexts and its high complexity. We propose a high-level architecture of a service for data quality monitoring, which employs provisioning algorithms that use heuristics and machine learning techniques. Furthermore, we propose approaches for the adoption of incremental data quality algorithms and heuristics for controlling the size of the blocks produced in the indexing phase of the investigated problem. Four different experiments were conducted to evaluate the effectiveness of the proposed provisioning algorithms, the heuristics for incremental record linkage, and the heuristics to control block sizes for entity resolution. The results of the experiments show a range of options covering different trade-offs, chiefly between the infrastructure costs of the service and the number of SLA violations over time. In turn, the empirical evaluation of the proposed heuristics for incremental record linkage also revealed a number of patterns in the results, mainly trade-offs between the runtime of the heuristics and the obtained efficacy results. Lastly, the evaluation of the heuristics proposed to control block sizes presented a large number of trade-offs regarding execution time, the adopted pruning approach, and the obtained efficacy results; the efficiency of these heuristics may vary significantly depending on the adopted pruning strategy. The results of the four experiments support the conclusion that different approaches (associated with cloud resource provisioning and with the employed data quality algorithms) adopted by a data quality service can significantly influence the service costs and, consequently, the costs forwarded to the service customers.
- Published
- 2017
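Result 9 (a PhD thesis) proposes, among other things, heuristics to control the size of the blocks produced in the indexing step of deduplication, since oversized blocks dominate the quadratic comparison cost. The sketch below shows one generic way to bound block sizes by splitting oversized blocks; the blocking key, the cap and the splitting rule are illustrative assumptions, not the thesis' heuristics.

```python
from collections import defaultdict

def build_blocks(records, blocking_key, max_block_size=100):
    """Group records by a blocking key, then split blocks larger than the cap."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    bounded = []
    for members in blocks.values():
        # Split oversized blocks into chunks of at most max_block_size records.
        for start in range(0, len(members), max_block_size):
            bounded.append(members[start:start + max_block_size])
    return bounded

def comparisons(blocks):
    """Number of pairwise comparisons implied by a set of blocks."""
    return sum(len(b) * (len(b) - 1) // 2 for b in blocks)

records = [{"name": f"name{i % 7}", "id": i} for i in range(1000)]
unbounded = build_blocks(records, lambda r: r["name"], max_block_size=10**9)
bounded = build_blocks(records, lambda r: r["name"], max_block_size=50)
print(comparisons(unbounded), "->", comparisons(bounded))
```

Splitting a block this way trades some recall for cost, since records that end up in different sub-blocks are never compared; the thesis evaluates exactly such trade-offs between pruning strategy, runtime and efficacy.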