
Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data

  • Conference paper
Privacy in Statistical Databases (PSD 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13463)


Abstract

This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method is to add DP noise to a cross-tabulation of all the variables and create synthetic data by a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The other method is to create a set of noisy marginal distributions that are made to agree with each other with an iterative proportional fitting algorithm and then to use the fitted probabilities as above. This proved to provide useable synthetic data for most of these data sets at values of the differential privacy parameter \(\epsilon \) as low as 0.5. The relationship between the disclosure risk and \(\epsilon \) is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.
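The two methods summarised in the abstract can be illustrated with a minimal Python sketch. It is not the synthpop implementation (which is in R); it assumes Laplace noise with L1 sensitivity 1, an equal split of \(\epsilon \) across the margins, and clipping of negative noisy counts to zero before normalising. All function names here (`laplace`, `noisy_probs`, `tabulate`, `ipf`, `synthesise_full_table`, `synthesise_margins`) and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import product


def laplace(scale, rng):
    """Laplace(0, scale) draw: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_probs(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise to every count, clip negatives to zero, normalise."""
    scale = sensitivity / epsilon  # assumption: sensitivity 1 (add/remove one record)
    noisy = {k: max(0.0, c + laplace(scale, rng)) for k, c in counts.items()}
    total = sum(noisy.values())
    if total == 0.0:  # every entry was clipped: fall back to a uniform table
        return {k: 1.0 / len(noisy) for k in noisy}
    return {k: v / total for k, v in noisy.items()}


def tabulate(records, levels, positions):
    """Count records over the chosen variables, keeping empty cells as zeros."""
    tally = Counter(tuple(r[i] for i in positions) for r in records)
    keys = product(*(levels[i] for i in positions))
    return {k: tally.get(k, 0) for k in keys}


def ipf(cells, margins, iters=100):
    """Iterative proportional fitting from a uniform start.

    `margins` is a list of (variable_positions, {margin_cell: target_prob}).
    """
    fitted = {cell: 1.0 / len(cells) for cell in cells}
    for _ in range(iters):
        for positions, target in margins:
            current = defaultdict(float)
            for cell, p in fitted.items():
                current[tuple(cell[i] for i in positions)] += p
            for cell in fitted:
                key = tuple(cell[i] for i in positions)
                if current[key] > 0.0:
                    fitted[cell] *= target[key] / current[key]
    total = sum(fitted.values())
    if total == 0.0:  # the noisy margins contradict each other completely
        raise ValueError("noisy margins could not be reconciled")
    return {cell: p / total for cell, p in fitted.items()}


def synthesise_full_table(records, levels, epsilon, n_out, rng):
    """Method 1: noise on the full cross-tabulation, then a multinomial sample."""
    positions = tuple(range(len(levels)))
    probs = noisy_probs(tabulate(records, levels, positions), epsilon, rng)
    cells = list(probs)
    return rng.choices(cells, weights=[probs[c] for c in cells], k=n_out)


def synthesise_margins(records, levels, margin_positions, epsilon, n_out, rng):
    """Method 2: noisy margins reconciled by IPF, then a multinomial sample."""
    eps_each = epsilon / len(margin_positions)  # assumption: equal budget split
    margins = [(pos, noisy_probs(tabulate(records, levels, pos), eps_each, rng))
               for pos in margin_positions]
    fitted = ipf(list(product(*levels)), margins)
    cells = list(fitted)
    return rng.choices(cells, weights=[fitted[c] for c in cells], k=n_out)


# Toy data: three categorical variables, 500 records (illustrative only).
rng = random.Random(2022)
levels = [("a", "b"), ("x", "y", "z"), (0, 1)]
records = [tuple(rng.choice(lv) for lv in levels) for _ in range(500)]
syn1 = synthesise_full_table(records, levels, epsilon=0.5, n_out=500, rng=rng)
syn2 = synthesise_margins(records, levels, [(0, 1), (1, 2)], epsilon=0.5,
                          n_out=500, rng=rng)
print(Counter(syn1).most_common(3))
print(Counter(syn2).most_common(3))
```

In this sketch the full table spreads noise over every cell, including empty ones, whereas the margins have far fewer entries, each with a larger count. That is one plausible reading of why the second method retains more utility at the same \(\epsilon \). Clipping, normalising, fitting and sampling all operate only on the noisy counts, so they are post-processing steps.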


Notes

  1. This acronym will also be used for “differentially private”.

  2. It can be installed from https://github.com/bnowok/synthpop.

  3. Also known as the RAS algorithm and as raking in Computer Science.

  4. Hazy https://hazy.com/, Accessed 18 May 2022.

  5. Mostly AI https://mostly.ai/ebook/synthetic-data-for-enterprises, Accessed 18 May 2022.

  6. https://research.aimultiple.com/synthetic-data/, Multiple AI: In-Depth Synthetic Data Guide, Accessed 18 May 2022.

  7. See https://lehd.ces.census.gov/applications/, Accessed 16 March 2022.

  8. https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges, where you will find details of the winning methods and links to some of the software used.

  9. This is described as robustness to post-processing.

  10. See footnote 2. It will be made available on CRAN after Version 1.7.0.

  11. See https://CRAN.R-project.org/package=mipfp.

  12. Also sometimes termed “correct matches”.

  13. Available as a data set in the synthpop package.

  14. The only numeric variables selected from SD2011 were Age and Income, and from the PUMS data only Age.

  15. https://www.ipums.org/.

  16. It is possible that the added noise produced margins that are impossible to reconcile. These cases do not correspond to useable syntheses.

  17. A count determined by the parameter nprior is distributed equally over the table or margin entries except those defined as structural zeros. This is important for non-DP synthesis, to prevent empty entries from remaining as zeros. The default value of 1 for nprior was used.
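The prior count described in note 17 can be sketched as follows. This is a minimal illustration in Python, not the synthpop code; the function name `spread_prior` and the `structural_zeros` argument are hypothetical, and only the parameter name `nprior` and its default of 1 come from the note.

```python
def spread_prior(counts, nprior=1.0, structural_zeros=frozenset()):
    """Share a total count of `nprior` equally over all entries
    that are not declared structural zeros."""
    free = {k for k in counts if k not in structural_zeros}
    if not free:  # nothing to spread over: return the counts unchanged
        return dict(counts)
    share = nprior / len(free)
    return {k: (c + share if k in free else c) for k, c in counts.items()}


# One observed cell, two sampling zeros, one structural zero.
table = {"A": 3, "B": 0, "C": 0, "D": 0}
smoothed = spread_prior(table, nprior=1.0, structural_zeros={"D"})
print(smoothed)
```

The two sampling zeros each receive a small positive count and so can appear in a synthesis, while the structural zero stays at exactly zero.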


Acknowledgements

This work would not have been possible without the work of Beata Nowok, the main author of synthpop. Any errors found in the new DP routines on GitHub (see note 2) are entirely my responsibility. The ESRC/UKRI provided support for the Administrative Data Research Centre and the Scottish Longitudinal Study. I would also like to thank two anonymous referees for helpful comments on an earlier version of this paper.

Author information

Corresponding author

Correspondence to Gillian M. Raab.

A Appendix

A.1 Details of the Variables in Data Sets

Tables 4 and 5 give details of the variables selected from each of the two data sets. See the section on data for how each of the data sets was created. The two Age variables were each grouped into 5 categories.

Table 4. Variables selected from the SD2011 data set.
Table 5. Variables selected from the IPUMS data set.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Raab, G.M. (2022). Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_18

  • DOI: https://doi.org/10.1007/978-3-031-13945-1_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-13944-4

  • Online ISBN: 978-3-031-13945-1

  • eBook Packages: Computer Science, Computer Science (R0)
