Abstract
This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method adds DP noise to a cross-tabulation of all the variables and creates synthetic data by taking a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The second method creates a set of noisy marginal distributions that are made to agree with each other by an iterative proportional fitting algorithm, and then uses the fitted probabilities as above. This provided usable synthetic data for most of these data sets at values of the differential privacy parameter \(\epsilon \) as low as 0.5. The relationship between disclosure risk and \(\epsilon \) is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.
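The following R sketch illustrates the two approaches at a high level. It is not the synthpop implementation: the function names (lap_noise, dp_alltab, ipf_fit, dp_ipf, counts_to_data), the use of all two-way margins with an equal split of the privacy budget, and the truncation of negative noisy counts at zero are illustrative assumptions only.

```r
## Illustrative sketch of the two DP synthesis methods for a data frame
## `data` of factors and privacy parameter `eps`. Not the synthpop code.

lap_noise <- function(n, scale) {
  ## Laplace(0, scale) noise as the difference of two exponential variates
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

## Turn a vector of synthetic cell counts back into categorical records
counts_to_data <- function(counts, tab) {
  cells <- expand.grid(dimnames(tab), stringsAsFactors = TRUE)
  cells[rep(seq_len(nrow(cells)), counts), , drop = FALSE]
}

## Method 1: Laplace noise on the full cross-tabulation, then a
## multinomial sample from the post-processed cell probabilities
dp_alltab <- function(data, eps) {
  tab    <- table(data)
  noisy  <- pmax(as.vector(tab) + lap_noise(length(tab), scale = 1 / eps), 0)
  counts <- rmultinom(1, size = nrow(data), prob = noisy / sum(noisy))
  counts_to_data(as.vector(counts), tab)
}

## Simple iterative proportional fitting of a seed table to a set of margins
ipf_fit <- function(seed, target.list, target.data, iter = 200) {
  fit <- seed
  for (i in seq_len(iter)) {
    for (k in seq_along(target.list)) {
      dims    <- target.list[[k]]
      current <- apply(fit, dims, sum)
      ratio   <- ifelse(current > 0, target.data[[k]] / current, 0)
      fit     <- sweep(fit, dims, ratio, `*`)
    }
  }
  fit
}

## Method 2: Laplace noise on a set of margins (all two-way margins here,
## splitting the budget equally), reconciled by IPF, then sampled as above
dp_ipf <- function(data, eps) {
  tab     <- table(data)
  pairs   <- combn(length(dim(tab)), 2, simplify = FALSE)
  eps_m   <- eps / length(pairs)                        # equal budget split
  margins <- lapply(pairs, function(d) {
    m <- apply(tab, d, sum)
    array(pmax(m + lap_noise(length(m), scale = 1 / eps_m), 0),
          dim = dim(m), dimnames = dimnames(m))
  })
  seed   <- array(1, dim = dim(tab), dimnames = dimnames(tab))
  fitted <- ipf_fit(seed, pairs, margins)
  counts <- rmultinom(1, size = nrow(data), prob = as.vector(fitted) / sum(fitted))
  counts_to_data(as.vector(counts), tab)
}
```

The choice of margins and the allocation of the privacy budget in the paper may differ from this sketch; the actual routines are available in the development version of synthpop (see footnote 2).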
Notes
- 1.
This acronym will also be used for “differentially private”.
- 2.
It can be installed from https://github.com/bnowok/synthpop.
- 3.
Also known as the RAS algorithm and as raking in Computer Science.
- 4.
Hazy https://hazy.com/, Accessed 18 May 2022.
- 5.
Mostly AI https://mostly.ai/ebook/synthetic-data-for-enterprises, Accessed 18 May 2022.
- 6.
AIMultiple: In-Depth Synthetic Data Guide, https://research.aimultiple.com/synthetic-data/, Accessed 18 May 2022.
- 7.
See https://lehd.ces.census.gov/applications/, Accessed 16 March 2022.
- 8.
https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges, where you will find details of the winning methods and links to some of the software used.
- 9.
This is described as robustness to post-processing.
- 10.
See footnote 2. It will be made available on CRAN after Version 1.7.0.
- 11.
- 12.
Also sometimes termed “correct matches”.
- 13.
Available as a data set in the synthpop package.
- 14.
The only numeric variables selected from SD2011 were Age and Income, and from the PUMS data only Age.
- 15.
- 16.
It is possible that the added noise produces margins that are impossible to reconcile. Such cases do not yield usable syntheses.
- 17.
A count determined by the parameter nprior is distributed equally over the table or margin entries, except those defined as structural zeros (see the short sketch after these notes). This is important for non-DP synthesis, to prevent cells that are zero in the observed data only by chance from remaining zero in the synthetic data. The default value of 1 for nprior was used.
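As a rough illustration of note 17, the prior count can be thought of as the following small helper (a sketch only; the function name add_prior and the structural-zero indicator are assumptions, not synthpop code):

```r
## Spread a prior count `nprior` equally over the cells of a table that are
## not structural zeros, so that sampling zeros receive non-zero probability
add_prior <- function(tab, nprior = 1, struct_zero = rep(FALSE, length(tab))) {
  free      <- !struct_zero            # cells allowed to be non-zero
  tab[free] <- tab[free] + nprior / sum(free)
  tab
}
```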
References
UNECE High-Level Group for the Modernisation of Official Statistics: Synthetic data for official statistics: a starter guide. United Nations, Geneva (2022, forthcoming)
Abowd, J.M.: The U.S. Census Bureau adopts differential privacy. In: 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2018). https://doi.org/10.1145/3219819.3226070. Accessed May 2022
Abowd, J.M., et al.: The 2020 census disclosure avoidance system TopDown algorithm (2022). https://arxiv.org/abs/2204.08986
Abowd, J.M., Vilhuber, L.: How protective are synthetic data? In: Domingo-Ferrer, J., Saygın, Y. (eds.) PSD 2008. LNCS, vol. 5262, pp. 239–246. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87471-3_20
Bowen, C.M., Liu, F.: Comparative study of differentially private data synthesis methods. Stat. Sci. 35(2), 280–307 (2020)
Bowen, C.M., Snoke, J.: Comparative study of differentially private synthetic data algorithms from the NIST PSCR differential privacy synthetic data challenge. J. Priv. Confid. 11(1), 12704 (2021)
Cole, D., Sautmann, V. (eds.): Handbook on Using Administrative Data for Research and Evidence-Based Policy, Chap. 6: Designing Access with Differential Privacy, pp. 173–239 (2020). https://admindatahandbook.mit.edu/book/v1.0/diffpriv.html. Accessed 19 May 2022
Drechsler, J.: Synthetic Data Sets for Statistical Disclosure Control: Theory and Implementation. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-0326-5
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
Garfinkel, S.: Differential privacy and the 2020 US census. MIT Case Studies in Social and Ethical Responsibilities of Computing, Winter 2022 (2022). https://mit-serc.pubpub.org/pub/differential-privacy-2020-us-census
Goodfellow, I., et al.: Generative adversarial networks. In: Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680 (2014). https://arxiv.org/abs/1406.2661
Hawes, M.B.: Implementing differential privacy: seven lessons from the 2020 United States Census. Harv. Data Sci. Rev. 2(2) (2020). https://hdsr.mitpress.mit.edu/pub/dgg03vo6
Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., Zhang, D.: Principled evaluation of differentially private algorithms using DPBENCH. In: Proceedings of the 2016 International Conference on Management of Data (2016). https://dl.acm.org/doi/10.1145/2882903.2882931
Jackson, J., Mitra, R., Francis, B., Dove, I.: Using saturated count models for user-friendly synthesis of categorical data. J. Roy. Stat. Soc. Ser. A (2022, accepted). https://arxiv.org/abs/2107.08062v2
Kenny, C.T., Kuriwaki, S., McCartan, C., Rosenman, E., Simko, T., Imai, K.: The use of differential privacy for census data and its impact on redistricting: the case of the 2020 US Census. Sci. Adv. 7(7), 1–17 (2021). https://imai.fas.harvard.edu/research/DAS.html
Little, R.J.A.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–26 (1993)
Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286 (2008)
McKenna, R., Miklau, G., Sheldon, D.: Winning the NIST contest: a scalable and general approach to differentially private synthetic data. J. Priv. Confidentiality 11(3), 1–30 (2021). https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407
Muralidhar, K., Domingo-Ferrer, J., Martínez, S.: \(\epsilon \)-differential privacy for microdata releases does not guarantee confidentiality (let alone utility). In: Privacy in Statistical Databases 2020 (2020)
Nowok, B., Raab, G.M., Dibben, C.: synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control, R package version 5.0-0 (2018). https://CRAN.R-project.org/package=synthpop
Nowok, B., Raab, G.M., Dibben, C.: synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control (2021). https://CRAN.R-project.org/package=synthpop, R package version 1.7-0
Pejó, B.: Guide to Differential Privacy Modifications: A Taxonomy of Variants and Extensions. SpringerBriefs in Computer Science. Springer International Publishing AG, Cham (2022). https://doi.org/10.1007/978-3-030-96398-9_12
Raab, G., Nowok, B., Dibben, C.: Practical data synthesis for large samples. J. Priv. Confidentiality 7, 67–97 (2017). https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407
Raab, G.M., Nowok, B., Dibben, C.: Assessing, visualizing and improving the utility of synthetic data. Available as a vignette for the synthpop package at https://cran.r-project.org/web/packages/synthpop/vignettes/utility.pdf. Accessed 1 May 2022
Raghunathan, T.E., Reiter, J.P., Rubin, D.B.: Multiple imputation for statistical disclosure limitation. J. Off. Stat. 19(1), 1–17 (2003)
Rubin, D.B.: Discussion: Statistical disclosure limitation. J. Off. Stat. 9(2), 461–468 (1993)
Shlomo, N.: Integrating differential privacy in the statistical disclosure control tool-kit for synthetic data production. In: Domingo-Ferrer, J., Muralidhar, K. (eds.) PSD 2020. LNCS, vol. 12276, pp. 271–280. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57521-2_19
Snoke, J., Raab, G., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. Roy. Stat. Soc. Ser. A 181(3), 663–688 (2018)
Taub, J., Elliot, M., Pampaka, M., Smith, D.: Differential correct attribution probability for synthetic data: an exploration. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_9
Voas, D., Williamson, P.: Evaluating goodness-of-fit measures for synthetic microdata. Geog. Environ. Model. 5, 177–200 (2001)
Zhang, J., Cormode, G., Procopiuc, C., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014, pp. 1423–1434. ACM (2014)
Acknowledgements
This work would not have been possible without the work of Beata Nowok, the main author of synthpop. Any errors found in the new DP routines on GitHub (see footnote 2) are entirely my responsibility. The ESRC/UKRI provided support for the Administrative Data Research Centre and the Scottish Longitudinal Study. I would also like to thank two anonymous referees for helpful comments on an earlier version of this paper.
A Appendix
A.1 Details of the Variables in the Data Sets
Tables 4 and 5 give details of the variables selected from each of the two data sets. See the Data section for how each of the data sets was created. The two Age variables were each grouped into 5 categories.
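As an illustration of this grouping step, something like the following could be used (a sketch only; the data frame dat, the variable Age and the equal-width breaks are assumptions, not the boundaries actually used, which are given in Tables 4 and 5):

```r
## Group a numeric Age variable into 5 categories (equal-width bands here)
dat$Age5 <- cut(dat$Age, breaks = 5)
table(dat$Age5)    # inspect the resulting category counts
```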