Abstract
This paper introduces two methods of creating differentially private (DP) synthetic data that are now incorporated into the synthpop package for R. Both are suitable for synthesising categorical data, or numeric data grouped into categories. Ten data sets with varying characteristics were used to evaluate the methods. Measures of disclosiveness and of utility were defined and calculated. The first method adds DP noise to a cross-tabulation of all the variables and creates synthetic data by taking a multinomial sample from the resulting probabilities. While this method certainly reduced disclosure risk, it did not provide synthetic data of adequate quality for any of the data sets. The second method creates a set of noisy marginal distributions that are made to agree with each other by an iterative proportional fitting algorithm, and then uses the fitted probabilities as above. This provided usable synthetic data for most of these data sets at values of the differential privacy parameter \(\epsilon \) as low as 0.5. The relationship between disclosure risk and \(\epsilon \) is illustrated for each of the data sets. Results show how the trade-off between disclosiveness and data utility depends on the characteristics of the data sets.
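The following R sketch illustrates the two approaches at a high level. It is not the synthpop implementation: the function names (lap_noise, dp_alltab, ipf_fit, dp_ipf, counts_to_data), the use of all two-way margins with an equal split of the privacy budget, and the truncation of negative noisy counts at zero are illustrative assumptions only.

```r
## Illustrative sketch of the two DP synthesis methods for a data frame
## `data` of factors and privacy parameter `eps`. Not the synthpop code.

lap_noise <- function(n, scale) {
  ## Laplace(0, scale) noise as the difference of two exponential variates
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

## Turn a vector of synthetic cell counts back into categorical records
counts_to_data <- function(counts, tab) {
  cells <- expand.grid(dimnames(tab), stringsAsFactors = TRUE)
  cells[rep(seq_len(nrow(cells)), counts), , drop = FALSE]
}

## Method 1: Laplace noise on the full cross-tabulation, then a
## multinomial sample from the post-processed cell probabilities
dp_alltab <- function(data, eps) {
  tab    <- table(data)
  noisy  <- pmax(as.vector(tab) + lap_noise(length(tab), scale = 1 / eps), 0)
  counts <- rmultinom(1, size = nrow(data), prob = noisy / sum(noisy))
  counts_to_data(as.vector(counts), tab)
}

## Simple iterative proportional fitting of a seed table to a set of margins
ipf_fit <- function(seed, target.list, target.data, iter = 200) {
  fit <- seed
  for (i in seq_len(iter)) {
    for (k in seq_along(target.list)) {
      dims    <- target.list[[k]]
      current <- apply(fit, dims, sum)
      ratio   <- ifelse(current > 0, target.data[[k]] / current, 0)
      fit     <- sweep(fit, dims, ratio, `*`)
    }
  }
  fit
}

## Method 2: Laplace noise on a set of margins (all two-way margins here,
## splitting the budget equally), reconciled by IPF, then sampled as above
dp_ipf <- function(data, eps) {
  tab     <- table(data)
  pairs   <- combn(length(dim(tab)), 2, simplify = FALSE)
  eps_m   <- eps / length(pairs)                        # equal budget split
  margins <- lapply(pairs, function(d) {
    m <- apply(tab, d, sum)
    array(pmax(m + lap_noise(length(m), scale = 1 / eps_m), 0),
          dim = dim(m), dimnames = dimnames(m))
  })
  seed   <- array(1, dim = dim(tab), dimnames = dimnames(tab))
  fitted <- ipf_fit(seed, pairs, margins)
  counts <- rmultinom(1, size = nrow(data), prob = as.vector(fitted) / sum(fitted))
  counts_to_data(as.vector(counts), tab)
}
```

The choice of margins and the allocation of the privacy budget in the paper may differ from this sketch; the actual routines are available in the development version of synthpop (see footnote 2).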
Notes
- 1.
This acronym will also be used for “differentially private”.
- 2.
It can be installed from https://github.com/bnowok/synthpop.
- 3.
Also known as the RAS algorithm and as raking in Computer Science.
- 4.
Hazy https://hazy.com/, Accessed 18 May 2022.
- 5.
Mostly AI https://mostly.ai/ebook/synthetic-data-for-enterprises, Accessed 18 May 2022.
- 6.
AIMultiple: In-Depth Synthetic Data Guide, https://research.aimultiple.com/synthetic-data/, Accessed 18 May 2022.
- 7.
See https://lehd.ces.census.gov/applications/, Accessed 16 March 2022.
- 8.
https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges, where you will find details of the winning methods and links to some of the software used.
- 9.
This is described as robustness to post-processing.
- 10.
See footnote 2. It will be made available on CRAN after Version 1.7.0.
- 11.
- 12.
Also sometimes termed “correct matches”.
- 13.
Available as a data set in the synthpop package.
- 14.
The only numeric variables selected from SD2011 were Age and Income, and from the PUMS data only Age.
- 15.
- 16.
It is possible that the added noise produces margins that are impossible to reconcile. Such cases do not yield usable syntheses.
- 17.
A count determined by the parameter nprior is distributed equally over the table or margin entries, except those defined as structural zeros (see the short sketch after these notes). This is important for non-DP synthesis, to prevent cells that are zero in the observed data only by chance from remaining zero in the synthetic data. The default value of 1 for nprior was used.
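As a rough illustration of note 17, the prior count can be thought of as the following small helper (a sketch only; the function name add_prior and the structural-zero indicator are assumptions, not synthpop code):

```r
## Spread a prior count `nprior` equally over the cells of a table that are
## not structural zeros, so that sampling zeros receive non-zero probability
add_prior <- function(tab, nprior = 1, struct_zero = rep(FALSE, length(tab))) {
  free      <- !struct_zero            # cells allowed to be non-zero
  tab[free] <- tab[free] + nprior / sum(free)
  tab
}
```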
References
UNECE High-Level Group for the Modernisation of Official Statistics: Synthetic data for official statistics: a starter guide. United Nations, Geneva (2022, forthcoming)
Abowd, J.M.: The U.S. Census Bureau adopts differential privacy. In: 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2018). https://doi.org/10.1145/3219819.3226070. Accessed May 2022
Abowd, J.M., et al.: The 2020 census disclosure avoidance system TopDown algorithm (2022). https://arxiv.org/abs/2204.08986
Abowd, J.M., Vilhuber, L.: How protective are synthetic data? In: Domingo-Ferrer, J., Saygın, Y. (eds.) PSD 2008. LNCS, vol. 5262, pp. 239–246. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87471-3_20
Bowen, C.M., Liu, F.: Comparative study of differentially private data synthesis methods. Stat. Sci. 35(2), 280–307 (2020)
Bowen, C.M., Snoke, J.: Comparative study of differentially private synthetic data algorithms from the NIST PSCR differential privacy synthetic data challenge. J. Priv. Confid. 11(1), 12704 (2021)
Cole, D., Sautmann, V. (eds.): Handbook on Using Administrative Data for Research and Evidence-Based Policy, Chap. 6: Designing Access with Differential Privacy, pp. 173–239 (2020). https://admindatahandbook.mit.edu/book/v1.0/diffpriv.html. Accessed 19 May 2022
Drechsler, J.: Synthetic Data Sets for Statistical Disclosure Control: Theory and Implementation. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-0326-5
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
Garfinkel, S.: Differential privacy and the 2020 US census. MIT Case Studies in Social and Ethical Responsibilities of Computing, Winter 2022 (2022). https://mit-serc.pubpub.org/pub/differential-privacy-2020-us-census
Goodfellow, I., et al.: Generative adversarial networks. In: Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680 (2014). https://arxiv.org/abs/1406.2661
Hawes, M.B.: Implementing differential privacy: seven lessons from the 2020 United States Census. Harv. Data Sci. Rev. 2(2) (2020). https://hdsr.mitpress.mit.edu/pub/dgg03vo6
Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., Zhang, D.: Principled evaluation of differentially private algorithms using DPBENCH. In: Proceedings of the 2016 International Conference on Management of Data (2016). https://dl.acm.org/doi/10.1145/2882903.2882931
Jackson, J., Mitra, R., Francis, B., Dove, I.: Using saturated count models for user-friendly synthesis of categorical data. J. Roy. Stat. Soc. Ser. A (2022, accepted). https://arxiv.org/abs/2107.08062v2
Kenny, C.T., Kuriwaki, S., McCartan, C., Rosenman, E., Simko, T., Imai, K.: The use of differential privacy for census data and its impact on redistricting: the case of the 2020 US Census. Sci. Adv. 7(7), 1–17 (2021). https://imai.fas.harvard.edu/research/DAS.html
Little, R.J.A.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–26 (1993)
Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286 (2008)
McKenna, R., Miklau, G., Sheldon, D.: Winning the NIST contest: a scalable and general approach to differentially private synthetic data. J. Priv. Confidentiality 11(3), 1–30 (2021). https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407
Muralidhar, K., Domingo-Ferrer, J., Martínez, S.: \(\epsilon \)-differential privacy for microdata releases does not guarantee confidentiality (let alone utility). In: Privacy in Statistical Databases 2020 (2020)
Nowok, B., Raab, G.M., Dibben, C.: synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control, R package version 5.0-0 (2018). https://CRAN.R-project.org/package=synthpop
Nowok, B., Raab, G.M., Dibben, C.: synthpop: Generating Synthetic Versions of Sensitive Microdata for Statistical Disclosure Control (2021). https://CRAN.R-project.org/package=synthpop, R package version 1.7-0
Pejó, B.: Guide to Differential Privacy Modifications: A Taxonomy of Variants and Extensions. SpringerBriefs in Computer Science. Springer International Publishing AG, Cham (2022). https://doi.org/10.1007/978-3-030-96398-9_12
Raab, G., Nowok, B., Dibben, C.: Practical data synthesis for large samples. J. Priv. Confidentiality 7, 67–97 (2017). https://journalprivacyconfidentiality.org/index.php/jpc/article/view/407
Raab, G.M., Nowok, B., Dibben, C.: Assessing, visualizing and improving the utility of synthetic data. Available as a vignette for the synthpop package at https://cran.r-project.org/web/packages/synthpop/vignettes/utility.pdf. Accessed 1 May 2022
Raghunathan, T.E., Reiter, J.P., Rubin, D.B.: Multiple imputation for statistical disclosure limitation. J. Off. Stat. 19(1), 1–17 (2003)
Rubin, D.B.: Discussion: Statistical disclosure limitation. J. Off. Stat. 9(2), 461–468 (1993)
Shlomo, N.: Integrating differential privacy in the statistical disclosure control tool-kit for synthetic data production. In: Domingo-Ferrer, J., Muralidhar, K. (eds.) PSD 2020. LNCS, vol. 12276, pp. 271–280. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57521-2_19
Snoke, J., Raab, G., Nowok, B., Dibben, C., Slavkovic, A.: General and specific utility measures for synthetic data. J. Roy. Stat. Soc. Ser. A 181(3), 663–688 (2018)
Taub, J., Elliot, M., Pampaka, M., Smith, D.: Differential correct attribution probability for synthetic data: an exploration. In: Domingo-Ferrer, J., Montes, F. (eds.) PSD 2018. LNCS, vol. 11126, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99771-1_9
Voas, D., Williamson, P.: Evaluating goodness-of-fit measures for synthetic microdata. Geog. Environ. Model. 5, 177–200 (2001)
Zhang, J., Cormode, G., Procopiuc, C., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014, pp. 1423–1434. ACM (2014)
Acknowledgements
This work would not have been possible without the work of Beata Nowok, the main author of synthpop. Any errors found in the new DP routines on GitHub (see footnote 2) are entirely my responsibility. The ESRC/UKRI provided support for the Administrative Data Research Centre and the Scottish Longitudinal Study. I would also like to thank two anonymous referees for helpful comments on an earlier version of this paper.
A Appendix
A.1 Details of the Variables in the Data Sets
Tables 4 and 5 give details of the variables selected from each of the two data sets. See the Data section for how each of the data sets was created. The two Age variables were each grouped into 5 categories.
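As an illustration of this grouping step, something like the following could be used (a sketch only; the data frame dat, the variable Age and the equal-width breaks are assumptions, not the boundaries actually used, which are given in Tables 4 and 5):

```r
## Group a numeric Age variable into 5 categories (equal-width bands here)
dat$Age5 <- cut(dat$Age, breaks = 5)
table(dat$Age5)    # inspect the resulting category counts
```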