UniformGAN: generative adversarial networks in uniform probability spaces
Improving correlation by leveraging integral probability transform
Abstract
Sharing data is becoming increasingly difficult due to the regulatory constraints imposed by the General Data Protection Regulation (GDPR). Businesses are not allowed to share data that contains privacy-sensitive information. Synthetic data generation has emerged as a solution to this problem. State-of-the-art generative adversarial networks (GANs) can generate synthetic data that statistically resembles the original data, while changing privacy-sensitive information so that it cannot be traced back to a person.
However, generating synthetic data is still very time-consuming for data scientists.
One challenge in synthetic data generation is modeling the raw data appropriately: transforming it into numerical form and specifying hyper-parameters, such as which columns are categorical, mixed-type, numerical, or log-distributed, is a non-trivial task. Another challenge is estimating the underlying distributions of the data and how these distributions are correlated.
The proposed solution, UniformGAN, addresses these issues by adopting a transformer that handles raw data, detects the data type of each column, and transforms it into a numerical equivalent. It uses the data type and estimated distribution to set the hyper-parameters for categorical, mixed, and log columns.
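As a rough illustration of what automatic column typing could look like, the sketch below classifies a raw column as categorical, mixed, log, or numerical. The function name `infer_column_type`, its thresholds, and the heuristics themselves (few distinct values, one dominant repeated value, positive and right-skewed) are assumptions made for this example and are not taken from UniformGAN.

```python
from collections import Counter


def _to_float(value):
    """Return value as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def infer_column_type(values, max_categories=10, spike_share=0.2,
                      skew_threshold=1.0):
    """Hypothetical heuristic; all thresholds are illustrative assumptions."""
    parsed = [_to_float(v) for v in values]
    # Any value that is not a number makes the column categorical.
    if any(p is None for p in parsed):
        return "categorical"
    counts = Counter(parsed)
    # Few distinct numeric values: treat as category codes.
    if len(counts) <= max_categories:
        return "categorical"
    # One value dominating an otherwise continuous column: mixed type.
    top_share = counts.most_common(1)[0][1] / len(parsed)
    if top_share >= spike_share:
        return "mixed"
    # Strictly positive and strongly right-skewed: candidate for log scaling.
    if min(parsed) > 0 and skewness(parsed) > skew_threshold:
        return "log"
    return "numerical"
```

A detected type could then drive the per-column settings that would otherwise be specified by hand, for example by collecting the names of all columns labeled `"log"` into the corresponding hyper-parameter list.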
Furthermore, it estimates the underlying distributions of the data and leverages a statistical transformation so that the machine learning model can more easily learn the dependence structure of the variables.
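The subtitle suggests that the statistical transformation is the probability integral transform, which maps a variable through its own cumulative distribution function so that the result is approximately uniform on (0, 1). The sketch below is a minimal illustration under that assumption. It uses a rank-based empirical CDF rather than a fitted parametric distribution, and the helper names `empirical_pit` and `pearson` are hypothetical.

```python
import math
import random


def empirical_pit(values):
    """Map each value to rank / (n + 1), an empirical CDF estimate in (0, 1).

    Ties are broken by input order; a real pipeline may prefer average ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Two dependent columns with very different, skewed marginals.
random.seed(0)
x = [random.expovariate(1.0) for _ in range(1000)]
y = [math.exp(a) + random.gauss(0, 0.1) for a in x]

# After the transform both marginals are (approximately) uniform, so only
# the dependence structure between the columns is left to be modeled.
u_x, u_y = empirical_pit(x), empirical_pit(y)
print("correlation on raw scale:    ", round(pearson(x, y), 3))
print("correlation on uniform scale:", round(pearson(u_x, u_y), 3))
```

Because the transform removes the shape of each marginal, a monotone but nonlinear relationship such as the one between `x` and `y` appears as a near-linear one on the uniform scale. Mapping generated samples back to the data scale would require the inverse of the estimated CDF, which this sketch omits.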
The evaluation of machine learning utility, statistical similarity, and privacy preservability shows that UniformGAN improves accuracy in decision tree classification utility and improves averaged machine learning utility by 2% compared to CTAB-GAN and by 19.21% compared to copulaGAN, while maintaining statistical similarity and privacy preservability relative to state-of-the-art tabular data modeling techniques.