Improving the Quality of Web-Based Data Imputation With Crowd Intervention.

Authors :: Gu, Binbin
Li, Zhixu
Liu, An
Xu, Jiajie
Zhao, Lei
Zhou, Xiaofang
Source :: IEEE Transactions on Knowledge & Data Engineering. Jun2021, Vol. 33 Issue 6, p2534-2547. 14p.
Publication Year :: 2021
Abstract: Data incompleteness is a common data quality problem in databases. Recent work proposes retrieving missing string values from the World Wide Web to achieve higher imputation recall, but this risks introducing web noise into the imputation results. So far there has been no effective way to control the quality of web-based data imputation, given the complexity of the quality model and the lack of sufficient ground truth data. In this article, an EM-based (Expectation Maximization) quality model is first built for web-based data imputation, which jointly investigates three key factors: the precision of web sources, the correlation among web sources, and the precision and recall of the employed extractors. However, the accuracy of the EM-based quality model can be harmed in cases where the EM assumption that “the majority agree on the truth” does not hold. To solve this problem, we introduce crowd intervention to help improve the quality model. A straightforward but expensive approach is to let the crowd identify all such undesirable cases and provide the right imputation values for these blanks; a more crowd-economic approach is to select a small set of blanks for crowd-based imputation, whose results help adjust the EM-based quality model towards a better one. To achieve this, an adaptive blank selection strategy is proposed to select a sequence of blanks for crowd-based imputation. We also work on finding a proper time to stop further crowd intervention, balancing crowd efficiency against quality improvement. Our experiments on three real-world data collections and one simulated data collection show that the proposed quality model improves the quality of the web-based imputation results by more than 15 percent, while our crowd cost saving strategy saves more than 75 percent of the crowd cost. [ABSTRACT FROM AUTHOR]