The Impact of Synthetic and Real Training Data on Model Vulnerability

Jacob Michael Swoveland
MS, 2023
Cheng, Guang
Membership inference attacks threaten the privacy of the records used to train machine learning models by enabling an adversary to determine whether a given record was part of a model's training set. In this paper we explore the use of synthetic training data as a defense against this form of attack. The synthetic data preserves the attributes of the original training data set while maintaining machine learning utility. We use CTGAN and DP-CTGAN to generate high-quality tabular synthetic training data. We evaluate the effectiveness of this approach empirically by comparing the vulnerability and utility of models trained on synthetic data with those of models trained on real data. We also analyze the privacy-utility trade-off that comes with using synthetic data. Synthetic data appears to be a promising defense against membership inference attacks, providing increased privacy at a reasonable loss in utility.
2023
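
As an illustration of the kind of vulnerability comparison the abstract describes, the following is a minimal sketch of a confidence-threshold membership inference attack. It is not taken from the thesis: the attack shown here, the function names membership_advantage and best_advantage, and the confidence values are all assumptions chosen for illustration. The attack guesses "member" whenever the target model's confidence on a record reaches a threshold, and vulnerability is summarized as the attacker's advantage (true positive rate minus false positive rate).

```python
import numpy as np


def membership_advantage(member_conf, nonmember_conf, threshold):
    """Advantage (TPR - FPR) of an attack that guesses 'member'
    whenever the model's confidence is at least `threshold`."""
    member_conf = np.asarray(member_conf, dtype=float)
    nonmember_conf = np.asarray(nonmember_conf, dtype=float)
    tpr = float(np.mean(member_conf >= threshold))     # members correctly flagged
    fpr = float(np.mean(nonmember_conf >= threshold))  # non-members wrongly flagged
    return tpr - fpr


def best_advantage(member_conf, nonmember_conf):
    """Largest advantage over all thresholds taken from the observed confidences."""
    thresholds = np.unique(np.concatenate([member_conf, nonmember_conf]))
    return max(
        membership_advantage(member_conf, nonmember_conf, t) for t in thresholds
    )


# Hypothetical toy confidences, NOT results from the thesis.
# A model trained on real data is often far more confident on its own training records.
real_members = [0.99, 0.97, 0.95, 0.90]
real_nonmembers = [0.80, 0.85, 0.60, 0.70]

# A model trained on synthetic data never saw the real records directly,
# so in this toy case its confidences on them look similar in both groups.
synth_members = [0.85, 0.70, 0.90, 0.65]
synth_nonmembers = [0.80, 0.72, 0.88, 0.60]

print("advantage, real-data model:     ", best_advantage(real_members, real_nonmembers))
print("advantage, synthetic-data model:", best_advantage(synth_members, synth_nonmembers))
```

An advantage near 0 means the attacker does little better than guessing, while an advantage near 1 means members are almost perfectly identifiable. Evaluating the same attack against both models, alongside each model's test accuracy, gives one simple view of the privacy-utility trade-off.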