Automated Parallel Data Processing Engine with Application to Large-Scale Feature Extraction

Authors :
Jonathan B. Ajo-Franklin
Kesheng Wu
Bin Dong
Xin Xing
Source :
2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC).
Publication Year :
2018
Publisher :
IEEE, 2018.

Abstract

As new scientific instruments generate ever more data, advanced data analysis algorithms such as machine learning must be parallelized to harness the available computing power. The success of commercial Big Data systems has demonstrated that many algorithms can be parallelized automatically. However, these Big Data tools have trouble handling the complex analysis operations found in scientific applications. To overcome this difficulty, we have started to build an automated parallel data processing engine for science, known as ARRAYUDF. This paper provides an overview of the engine and presents a use case: a feature extraction task on data from a large-scale seismic recording technology called distributed acoustic sensing (DAS). The key challenge with DAS data sets is that they are vast in volume and noisy. The existing methods the DAS team uses to extract useful signals, such as traveling seismic waves, are complex and very time-consuming. Our parallel data processing engine reduces the job execution time from tens of hours to tens of seconds and achieves 95% parallelization efficiency. ARRAYUDF could be used to implement more advanced data processing algorithms, including machine learning, and could support many more applications.

Details

Database :
OpenAIRE
Journal :
2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
Accession number :
edsair.doi...........a6a292c55dc908514e0134a4f3481fa8
Full Text :
https://doi.org/10.1109/mlhpc.2018.8638638