1. Data cleansing mechanisms and approaches for big data analytics: a systematic study
- Author
- Omed Hassan Ahmed, Bay Vo, Marwan Yassin Ghafour, Elham Azhir, Sarkar Hasan Ahmed, Amir Masoud Rahmani, and Mehdi Hosseinzadeh
- Subjects
Data cleansing, Dirty data, General Computer Science, Computer science, Data management, Big data, Context (language use), Usability, Missing data, Data science, Scalability
- Abstract
With the evolution of new technologies, the production of digital data is constantly growing, making it necessary to develop data management strategies that can handle large-scale datasets. Data gathered from different sources, such as sensor networks, social media, and business transactions, is inherently uncertain due to noise, missing values, inconsistencies, and other problems that impact the quality of big data analytics. One of the key challenges in this context is detecting and repairing dirty data, i.e. data cleansing, and various techniques have been presented to address it. However, to the best of our knowledge, there has not been any comprehensive review of data cleansing techniques for big data analytics. This survey therefore presents a comprehensive and systematic study of the state-of-the-art mechanisms for big data cleansing. The mechanisms are reviewed in five categories: machine learning-based, sample-based, expert-based, rule-based, and framework-based. A number of articles are reviewed in each category. Furthermore, this paper outlines the advantages and disadvantages of the chosen data cleansing techniques and discusses the related parameters, comparing the techniques in terms of scalability, efficiency, accuracy, and usability. Finally, some suggestions for further work are provided to improve big data cleansing mechanisms in the future.
- Published
- 2021