A Parallel Matrix-Based Method for Computing Approximations in Incomplete Information Systems

Authors :: Jian-Syuan Wong
Yi Pan
Tianrui Li
Junbo Zhang
Source :: IEEE Transactions on Knowledge and Data Engineering. 27:326-339
Publication Year :: 2015
Publisher :: Institute of Electrical and Electronics Engineers (IEEE), 2015.
Abstract: As the volume of data grows at an unprecedented rate, large-scale data mining and knowledge discovery present a tremendous challenge. Rough set theory, which has been used successfully to solve problems in pattern recognition, machine learning, and data mining, centers on the idea that a set of distinct objects may be approximated by a lower and an upper bound. To obtain the benefits that rough sets can provide for data mining and related tasks, efficient computation of these approximations is vital. The recently introduced cloud computing model MapReduce has attracted considerable attention from the scientific community for its applicability to large-scale data analysis. In previous research, we proposed a MapReduce-based method for computing approximations in parallel, which can efficiently process complete data but fails in the case of missing (incomplete) data. To address this shortcoming, three different parallel matrix-based methods are introduced to process large-scale, incomplete data. All of them are built on MapReduce and implemented on Twister, a lightweight MapReduce runtime system. The proposed parallel methods are then experimentally shown to be efficient for processing large-scale data.

Subjects :: Theoretical computer science
Computer science
Data stream mining
Approximation algorithm
Cloud computing
Computer Science Applications
Set (abstract data type)
Runtime system
Computational Theory and Mathematics
Knowledge extraction
Pattern recognition (psychology)
Information system
Data mining
Rough set
Information Systems
