151. TaintStream: fine-grained taint tracking for big data platforms through dynamic code translation
- Author
Gang Huang, Yunxin Liu, Xuanzhe Liu, Chengxu Yang, Zhenpeng Chen, Yuanchun Li, and Mengwei Xu
- Subjects
Database, Computer science, Privacy policy, Data management, Data erasure, Big data, Access control, Scripting language, Overhead (computing), Data retention
- Abstract
Big data has become valuable property for enterprises and enabled various intelligent applications. Today, it is common to host data in big data platforms (e.g., Spark), where developers can submit scripts to process the original and intermediate data tables. Meanwhile, it is highly desirable to manage the data to comply with various privacy requirements. To enable flexible and automated privacy policy enforcement, we propose TaintStream, a fine-grained taint tracking framework for Spark-like big data platforms. TaintStream works by automatically injecting taint tracking logic into the data processing scripts, and the injected scripts are dynamically translated to maintain a taint tag for each cell during execution. The dynamic translation rules are carefully designed to guarantee non-interference in the original data operation. By defining different semantics of taint tags, TaintStream can enable various data management applications such as access control, data retention, and user data erasure. Our experiments on a self-crafted benchmark suite show that TaintStream is able to achieve accurate cell-level taint tracking with a precision of 93.0% and less than 15% overhead. We also demonstrate the usefulness of TaintStream through several real-world use cases of privacy policy enforcement.
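The idea of keeping a taint tag for each cell can be illustrated with a small sketch in plain Python. This is not TaintStream's implementation or API, and it does not use Spark: the names `Cell`, `tainted_map`, `with_column`, and `redact`, as well as the `"pii"` tag, are hypothetical and chosen only for illustration. The sketch assumes that a derived cell inherits the union of the tags of the cells it was computed from, and that the data values themselves are computed exactly as they would be without tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """A table cell paired with its taint tags (hypothetical structure)."""
    value: object
    tags: frozenset = frozenset()


def tainted_map(fn, *cells):
    """Apply fn to the raw values; the result carries the union of input tags."""
    tags = frozenset().union(*(c.tags for c in cells))
    return Cell(fn(*(c.value for c in cells)), tags)


def with_column(rows, name, fn, *inputs):
    """Add a derived column, propagating tags from the named input columns."""
    return [
        {**row, name: tainted_map(fn, *(row[c] for c in inputs))}
        for row in rows
    ]


def redact(rows, forbidden):
    """One possible tag semantics (access control): hide cells with a forbidden tag."""
    return [
        {col: Cell(None, cell.tags) if cell.tags & forbidden else cell
         for col, cell in row.items()}
        for row in rows
    ]


rows = [
    {"name": Cell("Ada", frozenset({"pii"})), "city": Cell("Paris"), "spend": Cell(120)},
    {"name": Cell("Bob", frozenset({"pii"})), "city": Cell("Oslo"), "spend": Cell(80)},
]

# "label" depends on a tagged cell, so it is tagged; "spend_x2" is not.
derived = with_column(rows, "label", lambda n, c: f"{n}@{c}", "name", "city")
derived = with_column(derived, "spend_x2", lambda s: s * 2, "spend")

visible = redact(derived, frozenset({"pii"}))
```

Here `redact` stands in for just one interpretation of the tags. Under the same assumptions, a retention or erasure policy could be expressed by attaching a different meaning to the tags (for example, an expiry date or a user identifier) and filtering on that instead.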
- Published
- 2021