General Utility and Privacy of Synthetic Datasets

Releasing Differentially Private Synthetic Micro-Data with Bayesian GANs


This paper shows how to generate differentially private synthetic data using generative adversarial nets (GANs). We bring together insights from three literatures. First, generating artificial copies of original data is considered the gold standard in differential privacy, since any further analysis of such data does not spend any additional privacy budget. While this literature has used machine learning to generate synthetic data only for relatively trivial data sets, we show how to handle even complex data structures. Second, GANs have become prominent in learning and generating representations of visual and audio data. However, unlike synthetic visual and audio data, synthetic micro-data must capture not only point estimates but also the diversity of the original data. Third, we therefore apply the Bayesian GAN (BayesGAN). We show how BayesGAN can generate differentially private data when the right amount of noise is injected during training with a Stochastic Gradient Langevin Dynamics (SGLD) sampler. To our knowledge, we are the first to generate differentially private data using BayesGAN. Our experiments so far show that the resulting differentially private micro-data are at least as useful for analysis and prediction as synthetic data generated with the other methods considered to date. In addition, we incorporate the privacy loss parameters epsilon and delta into our framework, which allows users to control the desired privacy loss of the synthetic data.
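To make the idea of noise injection during SGLD training concrete, the following is a minimal sketch of a single update step, not the paper's implementation. It assumes that per-example gradients are clipped to a fixed norm before averaging, a common device for bounding sensitivity that the abstract does not specify. The names clip_by_norm, sgld_step, step_size, and clip_norm are hypothetical, the usual rescaling by data set size is omitted, and no accounting of epsilon or delta is attempted.

```python
import math
import random


def clip_by_norm(grad, clip_norm):
    """Scale grad down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def sgld_step(theta, per_example_grads, step_size, clip_norm, rng):
    """One SGLD update: half a gradient step plus Gaussian noise.

    per_example_grads is a hypothetical input holding one gradient of the
    loss (negative log posterior term) per training example.
    """
    # Assumption: clipping bounds the influence of any single example.
    clipped = [clip_by_norm(g, clip_norm) for g in per_example_grads]
    n = len(clipped)
    mean_grad = [sum(column) / n for column in zip(*clipped)]
    # SGLD injects noise whose variance equals the step size.
    noise_std = math.sqrt(step_size)
    return [
        t - 0.5 * step_size * g + rng.gauss(0.0, noise_std)
        for t, g in zip(theta, mean_grad)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    theta = [0.0, 0.0]
    grads = [[3.0, 4.0], [0.5, -0.5]]  # made-up per-example gradients
    print(sgld_step(theta, grads, step_size=0.01, clip_norm=1.0, rng=rng))
```

In this sketch the step size controls both the size of the gradient move and the noise variance. How such settings translate into particular values of epsilon and delta depends on the privacy analysis, which is outside the scope of this illustration.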

Working Paper