
Publishing Set-Valued Data via Differential Privacy


Presentation Transcript


  1. Publishing Set-Valued Data via Differential Privacy Rui Chen, Concordia University Noman Mohammed, Concordia University Benjamin C. M. Fung, Concordia University Bipin C. Desai, Concordia University Li Xiong, Emory University VLDB 2011

  2. Outline • Introduction • Preliminaries • Sanitization algorithm • Experimental results • Conclusions

  3. Introduction • The problem: non-interactive set-valued data publication under differential privacy • Typical set-valued data: transaction data, web search queries

  4. Introduction • Set-valued data refers to data in which each record owner is associated with a set of items drawn from an item universe.

  5. Introduction • Existing works [1, 2, 3, 4, 5, 6, 7] on publishing set-valued data are based on partition-based privacy models [8]. • They provide insufficient privacy protection and are vulnerable to background knowledge: • Composition attack [8] • deFinetti attack [9] • Foreground knowledge attack [10]

  6. Introduction • Differential privacy is independent of an adversary's background knowledge and computational power (with exceptions [11]). • The outcome of any analysis should not overly depend on a single data record. • Existing differentially private data publishing approaches are not adequate in terms of either utility or scalability for our problem.

  7. Introduction • Problems of data-independent publishing approaches (e.g., for a universe I = {I1, I2, I3}): • Scalability: the output domain contains 2^|I| possible records • Utility: noise accumulates exponentially
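
A minimal sketch of why the data-independent approach blows up (hypothetical function and item names, assuming records are sets of items): every subset of the item universe must receive its own noisy count, so both the output size and the accumulated noise grow as 2^|I|.

    import itertools
    import numpy as np

    def data_independent_release(records, universe, epsilon):
        # A data-independent approach publishes a noisy count for every
        # possible record, i.e. every subset of the item universe.
        universe = sorted(universe)
        cells = []
        for r in range(len(universe) + 1):
            cells.extend(itertools.combinations(universe, r))  # 2^|I| cells
        counts = {c: 0 for c in cells}
        for rec in records:
            counts[tuple(sorted(rec))] += 1
        # One draw of Laplace(1/epsilon) per cell: total noise mass is 2^|I|.
        return {c: n + np.random.laplace(0, 1.0 / epsilon)
                for c, n in counts.items()}

    release = data_independent_release([{'I1'}, {'I1', 'I2'}],
                                       ['I1', 'I2', 'I3'], epsilon=1.0)
    print(len(release))  # already 2^3 = 8 cells for a 3-item universe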

  8. Outline • Introduction • Preliminaries • Sanitization algorithm • Experimental results • Conclusions

  9. Preliminaries • Context-free taxonomy tree • Each internal node represents the set of its leaf items; it need not be a semantic generalization of them.
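
A minimal sketch of a context-free taxonomy tree over a four-item universe (hypothetical item names): each node is modelled simply as the set of leaf items it covers.

    leaves = [frozenset({i}) for i in ('I1', 'I2', 'I3', 'I4')]
    I12, I34 = frozenset({'I1', 'I2'}), frozenset({'I3', 'I4'})
    root = frozenset({'I1', 'I2', 'I3', 'I4'})
    # Internal nodes carry no semantic meaning; they are just leaf sets.
    children = {root: [I12, I34], I12: leaves[:2], I34: leaves[2:],
                **{leaf: [] for leaf in leaves}}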

  10. Preliminaries • Differential privacy [12] • D and D' are neighbours if they differ on at most one record. • A non-interactive privacy mechanism A gives ε-differential privacy if, for all neighbours D, D' and for any possible sanitized database D* ∈ Range(A), Pr[A(D) = D*] ≤ exp(ε) × Pr[A(D') = D*], where the probability is taken over the randomness of A.

  11. Preliminaries • Laplace mechanism [12] • Global sensitivity of a query Q: ΔQ = max over all neighbours D, D' of |Q(D) − Q(D')|. • The Laplace mechanism returns Q(D) + Laplace(ΔQ/ε). • For example, a single counting query Q over a dataset D has ΔQ = 1, so returning Q(D) + Laplace(1/ε) gives ε-differential privacy.
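
A minimal sketch of the Laplace mechanism for a counting query (hypothetical names): a counting query has global sensitivity 1, so Laplace(1/ε) noise suffices.

    import numpy as np

    def laplace_count(dataset, predicate, epsilon):
        # Counting query: adding or removing one record changes the
        # count by at most 1, so the global sensitivity is 1.
        true_count = sum(1 for record in dataset if predicate(record))
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

    # Noisy number of records containing item 'I1', with epsilon = 0.5.
    print(laplace_count([{'I1'}, {'I1', 'I2'}, {'I3'}],
                        lambda r: 'I1' in r, epsilon=0.5))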

  12. Preliminaries • Exponential mechanism [13] • Given a utility function q : (D × R) → ℝ for a database instance D, the mechanism A(D, q) = { return r with probability ∝ exp(ε × q(D, r) / (2Δq)) } gives ε-differential privacy, where Δq is the sensitivity of q.
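
A minimal sketch of the exponential mechanism (hypothetical names): sample a candidate r with probability proportional to exp(ε × q(D, r) / (2Δq)).

    import math
    import random

    def exponential_mechanism(dataset, candidates, quality, sensitivity, epsilon):
        scores = [epsilon * quality(dataset, r) / (2.0 * sensitivity)
                  for r in candidates]
        m = max(scores)                       # shift for numerical stability
        weights = [math.exp(s - m) for s in scores]
        return random.choices(candidates, weights=weights, k=1)[0]

    # Example: privately pick a frequent item (quality = its support count,
    # which has sensitivity 1 under record addition/removal).
    D = [{'I1', 'I2'}, {'I1'}, {'I3'}]
    pick = exponential_mechanism(D, ['I1', 'I2', 'I3'],
                                 lambda d, r: sum(r in t for t in d),
                                 sensitivity=1, epsilon=1.0)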

  13. Preliminaries • Composition properties [14] • Sequential composition: a sequence of computations over the same data, each giving εi-differential privacy, gives (∑i εi)-differential privacy. • Parallel composition: computations on disjoint subsets of the data give max(εi)-differential privacy.
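
A toy illustration of how the two rules account for the budget (a sketch, not a general accountant):

    def sequential_budget(epsilons):
        # Mechanisms run on the same data: budgets add up.
        return sum(epsilons)

    def parallel_budget(epsilons):
        # Mechanisms run on disjoint subsets: only the maximum counts.
        return max(epsilons)

    print(sequential_budget([0.1, 0.2, 0.2]))  # 0.5 in total
    print(parallel_budget([0.1, 0.2, 0.2]))    # 0.2 in total

This is why a top-down algorithm can spend budget at every level of a tree: the partitions within one level are disjoint, so they compose in parallel, while the levels along a root-to-leaf path compose sequentially.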

  14. Preliminaries • Utility metrics • For a given itemset I' ⊆ I, a counting query Q over a dataset D is defined as Q(D) = |{t ∈ D : I' ⊆ t}|. • A privacy mechanism A is (α, δ)-useful if, with probability at least 1 − δ, for every counting query Q and every dataset D, |Q(D*) − Q(D)| ≤ α, where D* = A(D) [15].
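
A direct transcription of the counting query into code (hypothetical names):

    def counting_query(dataset, itemset):
        # Q(D) = |{t in D : I' is a subset of t}|
        target = set(itemset)
        return sum(1 for record in dataset if target <= set(record))

    D = [{'I1', 'I2'}, {'I1'}, {'I2', 'I3'}]
    print(counting_query(D, {'I1'}))        # 2
    print(counting_query(D, {'I1', 'I2'}))  # 1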

  15. Outline • Introduction • Preliminaries • Sanitization algorithm • Experimental results • Conclusions

  16. Sanitization Algorithm • Top-down partitioning • Generalize all records to a single partition • Keep partitioning non-empty partitions until leaf partitions are reached
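
A simplified sketch of the top-down idea (an illustration only: the paper partitions records by the exact subset of taxonomy children they intersect and allocates the privacy budget adaptively, neither of which this sketch reproduces):

    import numpy as np

    leaves = [frozenset({i}) for i in ('I1', 'I2', 'I3', 'I4')]
    I12, I34 = frozenset({'I1', 'I2'}), frozenset({'I3', 'I4'})
    root = frozenset({'I1', 'I2', 'I3', 'I4'})
    children = {root: [I12, I34], I12: leaves[:2], I34: leaves[2:],
                **{leaf: [] for leaf in leaves}}

    def partition(records, node, eps_level, threshold):
        noisy_size = len(records) + np.random.laplace(0, 1.0 / eps_level)
        if not children[node]:          # leaf partition: publish its noisy size
            return {node: noisy_size}
        if noisy_size < threshold:      # deemed (noisily) empty: stop here
            return {}
        out = {}
        for child in children[node]:    # keep partitioning non-empty partitions
            out.update(partition([r for r in records if r & child],
                                 child, eps_level, threshold))
        return out

    print(partition([{'I1'}, {'I1', 'I2'}, {'I3'}], root,
                    eps_level=0.5, threshold=1.0))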

  17. Sanitization Algorithm • Privacy budget allocation • We reserve B/2 to obtain noisy sizes of leaf partitions and use the remaining B/2 to guide the partitioning. • Assign less budget to more general partitions and more budget to more specific partitions.

  18. Sanitization Algorithm • Privacy budget allocation • A hierarchy cut needs at most ∑u∈cut height(u) partition operations to reach leaf partitions. • Example: {I{1,2}, I{3,4}} needs at most two partition operations to reach leaf partitions (each node has height 1).

  19. Sanitization Algorithm • Privacy budget allocation • We reserve B/2 to obtain noisy sizes of leaf partitions and use the remaining B/2 to guide the partitioning. • Assign less budget to more general partitions and more budget to more specific partitions. • Example (at most 3 partition operations remaining): the first partition operation gets (B/2)/3 = B/6; the second gets (B/2 − B/6)/2 = B/6; a leaf partition's noisy size is then obtained with the unused B/6 plus the reserved B/2, i.e. 2B/3.
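
A few lines reproducing the arithmetic above (a sketch of the allocation pattern only; the paper gives the exact adaptive rule):

    def allocate(B, max_ops, ops_used):
        # max_ops: upper bound on partition operations to reach leaves;
        # ops_used: operations actually performed on this path.
        guide = B / 2.0                  # half of B guides the partitioning
        per_op, remaining = [], max_ops
        for _ in range(ops_used):
            share = guide / remaining    # spread what is left evenly
            per_op.append(share)
            guide -= share
            remaining -= 1
        leaf_budget = B / 2.0 + guide    # reserved half + unused guiding budget
        return per_op, leaf_budget

    # The slide's numbers: per_op = [B/6, B/6], leaf_budget = 2B/3.
    print(allocate(1.0, max_ops=3, ops_used=2))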

  20. Sanitization Algorithm • Sub-partition generation • For a non-leaf partition, we need to consider all possible sub-partitions to satisfy differential privacy. • Efficient implementation: separately handling empty and non-empty partitions (inspired by [16]).
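
A hypothetical sketch of the flavour of that optimization: rather than drawing Laplace noise for each of the exponentially many empty sub-partitions, draw only how many of them would survive the threshold.

    import numpy as np

    def surviving_empty_partitions(n_empty, epsilon, threshold):
        # P[Laplace(1/epsilon) > threshold] = 0.5 * exp(-epsilon * threshold)
        # for threshold >= 0; the survivors' noisy sizes would then be
        # drawn from the tail above the threshold (not shown here).
        p = 0.5 * np.exp(-epsilon * threshold)
        return np.random.binomial(n_empty, p)

    print(surviving_empty_partitions(n_empty=10**6, epsilon=0.5, threshold=20))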

  21. Outline • Introduction • Preliminaries • Sanitization algorithm • Experimental results • Conclusions

  22. Experiments • Two real-life set-valued datasets are used. • MSNBC is publicly available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.html). • STM is provided by the Société de transport de Montréal (STM) (http://www.stm.info).

  23. Experiments • Average relative error vs. privacy budget • [Figures: average relative error for privacy budgets B = 0.5, 0.75, 1.0]

  24. Experiments • Utility for frequent itemset mining • [Figures: frequent itemset mining utility for privacy budgets B = 0.5, 0.75, 1.0]

  25. Experiments • Scalability: O(|D| × |I|) • [Figures: runtime vs. |D| and runtime vs. |I|]

  26. Outline • Introduction • Preliminaries • Sanitization algorithm • Experimental results • Conclusions

  27. Conclusions • Differential privacy can be successfully applied to non-interactive set-valued data publishing with guaranteed utility. • Differential privacy can be achieved by data-dependent solutions with improved efficiency and accuracy. • The general idea of data-dependent solutions applies to other types of data, for example, relational data [17] and trajectory data [18].

  28. References [1] J. Cao, P. Karras, C. Raissi, and K.-L. Tan. ρ-uncertainty: Inference-proof transaction anonymization. In VLDB, pp. 1033–1044, 2010. [2] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In ICDE, pp. 715–724, 2008. [3] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. In VLDB, pp. 934–945, 2009. [4] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of set-valued data. In VLDB, pp. 115–125, 2008. [5] M. Terrovitis, N. Mamoulis, and P. Kalnis. Local and global recoding methods for anonymizing set-valued data. VLDBJ, 20(1):83–106, 2011. [6] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM, pp. 1109–1114, 2008. [7] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In SIGKDD, pp. 767–775, 2008.

  29. References [8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In SIGKDD, pp. 265–273, 2008. [9] D. Kifer. Attacks on privacy and deFinetti's theorem. In SIGMOD, pp. 127–138, 2009. [10] R. C. W. Wong, A. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized data be used for privacy breaches? ACM Transactions on Knowledge Discovery from Data, to appear. [11] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, 2011. [12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284, 2006. [13] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pp. 94–103, 2007. [14] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pp. 19–30, 2009. [15] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pp. 609–618, 2008.

  30. References [16] G. Cormode, M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially private publication of sparse data. CoRR, 2011. [17] N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In SIGKDD, 2011. [18] R. Chen, B. C. M. Fung, and B. C. Desai. Differentially private trajectory data publication. ICDE, under review, 2012.

  31. Thank you! Q & A

  32. Backup Slides

  33. Lower Bound Results • In the interactive setting, only a limited number of queries can be answered; otherwise, an adversary would be able to precisely reconstruct almost the entire original database. • In the non-interactive setting, one can only guarantee the utility of restricted classes of queries.

  34. Threshold Selection • We design the threshold as a function of the standard deviation of the noise and the height of a partition's hierarchy cut.
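
One plausible instantiation of such a threshold (an assumption for illustration; the paper's exact formula is not reproduced here): scale the standard deviation of Laplace(1/ε) noise, which is √2/ε, by the height of the cut.

    import math

    def threshold(epsilon, height, c=1.0):
        # Hypothetical: c * (noise standard deviation) * (cut height).
        return c * (math.sqrt(2.0) / epsilon) * height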

  35. Relative error • (α, δ)-usefulness is effective for giving an overall estimate of utility, but fails to produce intuitive experimental results. • We therefore experimentally measure the utility of sanitized data for counting queries by the relative error |Q(D*) − Q(D)| / max{Q(D), s}, where the sanity bound s mitigates the effect of queries with very small true counts.
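
In code, the metric reads (hypothetical names and example numbers):

    def relative_error(true_count, noisy_count, sanity_bound):
        # Divide by max(Q(D), s) so queries with tiny true counts
        # do not dominate the average.
        return abs(noisy_count - true_count) / max(true_count, sanity_bound)

    # Example: true count 4, noisy count 6, sanity bound 10.
    print(relative_error(true_count=4, noisy_count=6, sanity_bound=10))  # 0.2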

  36. Experiments • Average relative error vs. taxonomy tree fan-out • [Figures: average relative error for privacy budgets B = 0.5, 0.75, 1.0]
