1. Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder
- Author
- Jens Grossklags, Xuebing Zhou, and Xue Jiang
- Subjects
Clustering high-dimensional data, Ethics, federated learning, generative models, Computer science, business.industry, QA75.5-76.95, Machine learning, computer.software_genre, BJ1-1725, Autoencoder, local differential privacy, Privacy preserving, Electronic computers. Computer science, General Earth and Planetary Sciences, Artificial intelligence, business, computer, high-dimensional data collection, Generative grammar, General Environmental Science
- Abstract
Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information about individuals, direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data, not only because of the increased computation and communication cost, but also because of poor data utility. In this paper, we aim to address the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the idea of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. By combining a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high-utility synthetic data on the server side without revealing users’ private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and in generating high-utility synthetic data. With a local privacy guarantee of ε = 8, machine learning models trained on the synthetic data generated by the baseline algorithm suffer an accuracy loss of 10% to 30%, whereas with our framework the accuracy loss is reduced to less than 3%, and at best to less than 1%. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.
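To make the notion of a local privacy guarantee ε more concrete, the following is a minimal sketch of k-ary randomized response, a standard textbook LDP mechanism for a single categorical attribute. It is not the DP-Fed-Wae framework and not necessarily the baseline used in the paper; the function names `krr_probs`, `krr_perturb`, and `krr_estimate` are hypothetical and chosen only for illustration. The sketch assumes ε > 0 and a domain with at least two values.

```python
import math
import random
from collections import Counter


def krr_probs(epsilon, k):
    """Return (p, q) for k-ary randomized response.

    p is the probability of reporting the true value, q the probability
    of reporting any one specific other value. p / q = exp(epsilon),
    which is what gives the epsilon-LDP guarantee.
    """
    e = math.exp(epsilon)
    return e / (e + k - 1), 1.0 / (e + k - 1)


def krr_perturb(value, domain, epsilon, rng):
    """Perturb one categorical value on the user's side."""
    p, _ = krr_probs(epsilon, len(domain))
    if rng.random() < p:
        return value
    # Otherwise report one of the other values uniformly at random.
    return rng.choice([v for v in domain if v != value])


def krr_estimate(reports, domain, epsilon):
    """Server side: unbiased frequency estimates from perturbed reports."""
    n = len(reports)
    p, q = krr_probs(epsilon, len(domain))
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}
```

If a total budget ε has to be split evenly across d attributes, each attribute is perturbed with only ε/d, so `p` moves toward `1/k` and the estimates become noisier as d grows. This is one simple way to see the utility problem that the abstract refers to as the curse of dimensionality.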
- Published
- 2022