Differential Privacy in US Census. CompSci 590.03. Instructor: Ashwin Machanavajjhala. Lecture 17: 590.03 Fall 12
Announcements
• No class on Wednesday, Oct 31.
• Guest lecture: Prof. Jerry Reiter (Duke Stats), Friday, Nov 2: “Privacy in the U.S. Census and Synthetic Data Generation”.
Outline
• (Continuation of last class) Relaxing differential privacy for utility
• E-privacy [M et al VLDB ‘09]
• Application of Differential Privacy in US Census [M et al ICDE ‘08]
E-Privacy
Defining Privacy
“... nothing about an individual should be learnable from the database that cannot be learned without access to the database.” (T. Dalenius, 1977)
Problem with this approach:
• Analyst knows Bob has green hair.
• Analyst learns from published data that people with green hair have 99% probability of cancer.
• Therefore analyst knows Bob has high risk of cancer, even if Bob is not in the published data.
Defining Privacy
• Therefore analyst knows Bob has high risk of cancer, even if Bob is not in the published data.
• This should not be considered a privacy breach: such correlations are exactly what we want the analyst to learn.
Counterfactual Approach
Consider 2 distributions:
• Pr[Bob has cancer | adversary’s prior + output of mechanism on D]: “What the adversary learns about Bob after seeing the published information.”
• Pr[Bob has cancer | adversary’s prior + output of mechanism on D-Bob], where D-Bob = D - {Bob}: “What the adversary would have learnt about Bob even in the hypothetical case where Bob was not in the data.”
• Must be careful: when removing Bob you may also need to remove other tuples correlated with Bob.
Counterfactual Privacy
• Consider a set of data evolution scenarios (adversaries) {θ}.
• For every property s_Bob about Bob, and every output w of the mechanism:
  | log P(s_Bob | θ, M(D) = w) - log P(s_Bob | θ, M(D-Bob) = w) | ≤ ε
• When {θ} is the set of all product distributions (independent across individuals) and D-Bob = D - {Bob’s record}, a mechanism satisfies the above definition if and only if it satisfies differential privacy.
Counterfactual Privacy
• What about other sets of prior distributions {θ}?
• Open question: if {θ} contains correlations, then the definition of D-Bob itself is not very clear (as discussed in the previous class for count constraints and social networks).
Certain vs Uncertain Adversaries
Suppose an adversary has an uncertain prior. Consider a two-sided coin:
• A certain adversary knows the bias of the coin exactly: bias = p (for some p). It knows every coin flip is a random draw that is H with probability p.
• An uncertain adversary may think the coin’s bias is in [p - δ, p + δ]. It does not know the bias exactly, and assumes the bias θ is drawn from some probability distribution π. Given θ, every coin flip is a random draw that is H with probability θ.
Learning
• In machine learning/statistics, you want to use the observed data to learn something about the population.
• E.g., given 10 flips of a coin, what is the bias of the coin?
• Assume your population is drawn from some prior distribution θ.
• We don’t know θ, but may know that some θ’s are more likely than others (described by π, a probability distribution over θ’s).
• We want to learn the best θ that explains the observations.
• If you are certain about θ in the first place, there is no need for statistics/machine learning.
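The coin example can be made concrete with a small sketch. It assumes, purely for illustration, that the uncertain adversary's π is a Beta(a, b) distribution over the bias; the slides only say "some probability distribution π". The function names are hypothetical.

```python
from fractions import Fraction


def certain_prediction(p, heads, tails):
    """A certain adversary already knows the bias p, so observed flips change nothing."""
    return p


def uncertain_prediction(a, b, heads, tails):
    """Uncertain adversary with an assumed Beta(a, b) prior over the bias.

    After seeing the flips, the predictive probability of heads is the
    posterior mean (a + heads) / (a + b + heads + tails).
    """
    return Fraction(a + heads, a + b + heads + tails)


# 10 flips: 8 heads, 2 tails.
fixed = certain_prediction(Fraction(1, 2), heads=8, tails=2)
learned = uncertain_prediction(1, 1, heads=8, tails=2)
print(fixed, learned)
```

Only the uncertain adversary updates its belief from the data, which is the sense in which a certain prior leaves nothing to learn.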
Uncertain Priors and Learning
• In many privacy scenarios, the statistician (advertiser, epidemiologist, machine learning operator, …) is the adversary.
• But the statistician does not have a certain prior (otherwise there would be nothing to learn).
• Maybe we can model a class of “realistic” adversaries using uncertain priors.
E-Privacy
• Counterfactual privacy with realistic adversaries.
• Consider a set of data evolution scenarios (uncertain adversaries) {π}.
• For every property s_Bob about Bob, and every output w of the mechanism:
  | log P(s_Bob | π, M(D) = w) - log P(s_Bob | π, M(D-Bob) = w) | ≤ ε
  where P(s_Bob | π, M(D) = w) = ∫ P(s_Bob | θ, M(D) = w) P(θ | π) dθ, integrating over θ.
Realistic Adversaries
• Suppose your domain has 3 values: green (g), red (r) and blue (b).
• Suppose individuals are assumed to be drawn from some common distribution θ = {pg, pr, pb}.
[Figure: probability simplex with corners {pg = 1, pr = 0, pb = 0}, {pg = 0, pr = 1, pb = 0}, {pg = 0, pr = 0, pb = 1}]
Modeling Realistic Adversaries
• E.g., Dirichlet D(αg, αr, αb) priors to model uncertainty.
• The prior is centered on its mean {p*g, p*r, p*b}, where p*g = αg / (αg + αr + αb).
[Figure: Dirichlet densities on the simplex for D(3,7,5) {p*g = 0.2, p*r = 0.47, p*b = 0.33}, D(6,2,2), D(2,3,4), and D(6,2,6) {p*g = 0.43, p*r = 0.14, p*b = 0.43}]
E.g., Dirichlet Prior
• Dirichlet D(αg, αr, αb).
• Call α = (αg + αr + αb) the stubbornness of the prior.
• As α increases, more probability mass concentrates near {p*g, p*r, p*b}.
• As α → ∞, {p*g, p*r, p*b} has probability 1 and we get the independence assumption.
[Figure: Dirichlet density plots; labels as extracted: D(2,3,4), α = 10, α = 14, D(6,2,6)]
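The effect of stubbornness can be checked numerically. The sketch below uses two standard Dirichlet facts: the mean of component i is αi / α, and its variance is p*i (1 - p*i) / (α + 1). The function names are hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def dirichlet_mean(alphas):
    """Mean of each component: alpha_i divided by the stubbornness alpha."""
    total = sum(alphas)
    return [Fraction(a, total) for a in alphas]


def dirichlet_variance(alphas):
    """Variance of each component: mean * (1 - mean) / (alpha + 1)."""
    total = sum(alphas)
    return [m * (1 - m) / (total + 1) for m in dirichlet_mean(alphas)]


base = (3, 7, 5)        # D(3,7,5), stubbornness 15
stubborn = (30, 70, 50) # same shape, stubbornness 150

print(dirichlet_mean(base))          # same mean for both priors
print(dirichlet_variance(base))
print(dirichlet_variance(stubborn))  # smaller spread around the mean
```

Scaling all parameters by the same factor keeps the mean fixed while shrinking the variance, which matches the slide's statement that a more stubborn prior puts more mass near {p*g, p*r, p*b}.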
Better Utility …
• Suppose we consider a class of uncertain adversaries, characterized by Dirichlet distributions with known stubbornness (α) but unknown shape (αg, αb, αr).
• Algorithm: Generalization
• Size of each group > α/ε-1.
• If size of a group = δ(α/ε-1), then the most frequent sensitive value appears in at most a 1-1/(ε+δ) fraction of tuples in the group.
• Hence, we now also have a sliding scale to assess the power of adversaries (based on stubbornness).
• Larger α implies more coarsening, which implies less utility.
Summary
Lots of Open Questions:
• What is the relationship between counterfactual privacy and Pufferfish?
• In what other ways can causality theory be used? For defining correlations?
• What are other interesting ways to instantiate E-privacy, and what are efficient algorithms for E-privacy?
• …
Differential privacy in US Census
OnTheMap: A Census application that plots commuting patterns of workers
http://onthemap.ces.census.gov/
[Figure: map of commuting patterns; labels: Residences (Sensitive), Workplace (Public)]
OnTheMap: A Census application that plots commuting patterns of workers
[Figure: map of Census blocks; labels: Workplace (Quasi-identifier), Residence (Sensitive), Census Blocks]
Why publish commute patterns?
• To compute Quarterly Workforce Indicators (QWI):
  • Total employment
  • Average earnings
  • New hires & separations
  • Unemployment statistics
E.g., Missouri state used this data to formulate a method allowing QWI to suggest industrial sectors where transitional training might be most effective … to proactively reduce time spent on unemployment insurance …
A Synthetic Data Generator (Dirichlet resampling)
Step 1: Noise Addition (for each destination)
D (7, 5, 4) + A (2, 3, 3) = D+A (9, 8, 7)
• D: multi-set of origins for workers in Washington DC.
• A: noise (fake workers).
• D+A: noise-infused data.
• Noise added to an origin with at least 1 worker is > 0.
[Figure: map with labels Washington DC, Somerset, Fuller]
A Synthetic Data Generator (Dirichlet resampling)
Step 2: Dirichlet Resampling (for each destination)
• Draw a point at random from the noise-infused data and add it to S, the synthetic data.
• Replace two of the same kind: (9, 8, 7) → (9, 7, 7) → (9, 9, 7).
• If the frequency of block b in D+A is 0, then the frequency of b in S is 0, i.e., block b is ignored by the algorithm.
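The two steps can be sketched as a small urn simulation. This is a minimal illustration, not the OnTheMap implementation: the function name, the dictionary representation, and the assignment of the counts (7, 5, 4) to specific block names are assumptions made for the example.

```python
import random


def dirichlet_resample(original, noise, m, rng=None):
    """Sketch of noise addition followed by Dirichlet resampling for one destination.

    original, noise: dicts mapping origin block -> count (D and A on the slides).
    m: number of synthetic workers to draw.
    """
    rng = rng or random.Random()
    blocks = sorted(set(original) | set(noise))
    # Step 1: noise-infused data D + A.
    urn = {b: original.get(b, 0) + noise.get(b, 0) for b in blocks}
    if sum(urn.values()) == 0:
        raise ValueError("noise-infused data is empty")
    synthetic = {b: 0 for b in blocks}
    # Step 2: draw a point at random, record it, and put two of that kind back.
    for _ in range(m):
        weights = [urn[b] for b in blocks]
        drawn = rng.choices(blocks, weights=weights)[0]
        synthetic[drawn] += 1
        urn[drawn] += 1  # removing one and replacing two is a net gain of one
    return synthetic


D = {"Washington DC": 7, "Somerset": 5, "Fuller": 4}
A = {"Washington DC": 2, "Somerset": 3, "Fuller": 3, "Fayette": 0}
S = dirichlet_resample(D, A, m=16, rng=random.Random(0))
print(S)
```

A block whose count in D+A is 0 (here the hypothetical "Fayette") has weight 0 in every draw, so it can never appear in S.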
How should we add noise?
• Intuitively, more noise yields more privacy.
• How much noise should we add?
• To which blocks should we add noise?
• Currently this is poorly understood.
• Total amount of noise added is a state secret.
• Only 3-4 people in the US know this value in the current implementation of OnTheMap.
Privacy of Synthetic Data
Theorem 1: The Dirichlet resampling algorithm preserves privacy if and only if for every destination d, the noise added to each block is at least m(d) / (e^ε - 1), where m(d) is the size of the synthetic population for destination d and ε is the privacy parameter.
1. How much noise should we add?
Noise required per block (differential privacy):
• Add noise to every block on the map.
• There are 8 million Census blocks on the map!
• 1 million original workers and 16 billion fake workers!!!
[Figure: noise required per block, with privacy decreasing along the axis (“lesser privacy”); 1 million original and synthetic workers]
2. To which blocks should we add noise?
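The scale of these numbers can be checked with a back-of-the-envelope sketch. It assumes that Theorem 1's bound reads m(d) / (e^ε - 1) per block; the value of ε below is back-calculated from the slide's figures and is not stated in the lecture.

```python
import math


def noise_per_block(m, eps):
    """Assumed Theorem 1 bound: minimum noise per block is m / (e^eps - 1)."""
    return m / math.expm1(eps)


def implied_epsilon(m, noise):
    """Invert the bound: the eps for which `noise` per block is exactly enough."""
    return math.log1p(m / noise)


blocks = 8_000_000            # Census blocks on the map (from the slide)
workers = 1_000_000           # original and synthetic workers (from the slide)
total_noise = 16_000_000_000  # fake workers (from the slide)

per_block = total_noise / blocks            # fake workers added to each block
eps = implied_epsilon(workers, per_block)   # back-calculated, not from the slides
print(per_block, round(eps, 2))
```

Under these assumptions each block receives thousands of fake workers, which is why the fake workers outnumber the real ones by several orders of magnitude.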
Intuition behind Theorem 1
• Two possible inputs, D1 and D2.
• Adversary knows individual 1 is either blue or red.
• Adversary knows individuals [2..n] are blue.
• Blue and red are two different origin blocks.
Intuition behind Theorem 1
• Noise addition is applied to each of the two possible inputs, D1 and D2.
• Blue and red are two different origin blocks.
Intuition behind Theorem 1
• Dirichlet resampling maps the noise-infused inputs D1 and D2 to an output O. For every output O, compare Pr[D1 → O] with Pr[D2 → O]. For the output shown:
  Pr[D1 → O] = 1/10 * 2/11 * 3/12 * 4/13 * 5/14 * 6/15
  Pr[D2 → O] = 2/10 * 3/11 * 4/12 * 5/13 * 6/14 * 7/15
  Pr[D2 → O] / Pr[D1 → O] = 7
• Blue and red are two different origin blocks.
Intuition behind Theorem 1
• Seeing O, the adversary infers that it is very likely individual 1 is red, unless the noise added is very large.
• Blue and red are two different origin blocks.
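The two products in this example can be verified with exact arithmetic. The sketch assumes the reading suggested by the fractions on the slide: the noise-infused data has 10 points in total, red starts with 1 point in D1 and 2 points in D2, and the output O consists of 6 red draws in a row. The function name is hypothetical.

```python
from fractions import Fraction


def all_same_color_probability(start_count, total, draws):
    """Probability that every one of `draws` urn draws hits the same color.

    Each draw picks that color with probability count / total, then both
    the count and the total grow by one (the drawn point is replaced twice).
    """
    prob = Fraction(1)
    for i in range(draws):
        prob *= Fraction(start_count + i, total + i)
    return prob


p_d1 = all_same_color_probability(1, 10, 6)  # 1/10 * 2/11 * ... * 6/15
p_d2 = all_same_color_probability(2, 10, 6)  # 2/10 * 3/11 * ... * 7/15
ratio = p_d2 / p_d1
print(p_d1, p_d2, ratio)
```

With noise a on the red block and m draws, the same telescoping product gives a ratio of (a + m) / a, so the ratio only approaches 1 when the noise is large relative to the synthetic population size.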
Privacy Analysis: Summary
• Chose differential privacy.
  • Guards against powerful adversaries.
  • Measures privacy as a distance between prior and posterior.
• Derived necessary and sufficient conditions under which OnTheMap preserves privacy.
• The above conditions make the data published by OnTheMap useless.
But, breach occurs with very low probability.
• For the output O in the example (Dirichlet resampling applied to the noise-infused inputs D1 and D2), probability of O ≈ 10^-4.
• Blue and red are two different origin blocks.
Negligible function
Definition: f(x) is negligible if it goes to 0 faster than the inverse of any polynomial.
E.g., 2^-x and e^-x are negligible functions.
(ε,δ)-Indistinguishability
• For every pair of inputs D1, D2 that differ in one value, and for any subset of outputs T:
  Pr[D1 → T] ≤ e^ε Pr[D2 → T] + δ(|D2|)
• If T occurs with negligible probability, the adversary is allowed to distinguish between D1 and D2 by a factor greater than e^ε using outputs Oi in T.
Conditions for (ε,δ)-Indistinguishability
Theorem 2: The Dirichlet resampling algorithm preserves (ε,δ)-indistinguishability if for every destination d, the noise added to each block is at least log n(d), where n(d) is the number of workers commuting to d and m(d) ≤ n(d).
Probabilistic Differential Privacy
• (ε,δ)-Indistinguishability is an asymptotic measure.
• May not guarantee privacy when the number of workers at a destination is small.
Definition (Disclosure Set Disc(D, ε)): The set of output tables that breach ε-differential privacy for D and some other table D’ that differs from D in one value.
Probabilistic Differential Privacy
• For every pair of inputs D1, D2 that differ in one value, and for every probable output O:
  Pr[ O : Pr[D1 → O] / Pr[D2 → O] < e^ε ] > 1 - δ
• The adversary may distinguish between D1 and D2 based on a set of unlikely outputs, which occur with probability at most δ.
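For a tiny instance, the disclosure set and its probability can be enumerated directly. The sketch below reuses the two-block blue/red example with 10 noise-infused points and m = 6 draws, summarizes an output by its number of red draws (valid because every sequence with the same count has the same probability), and compares only one pair of neighboring inputs. The function names and the choice ε = 1 are assumptions for illustration.

```python
from fractions import Fraction
from math import comb, exp


def rising(x, k):
    """Rising factorial x * (x + 1) * ... * (x + k - 1)."""
    out = 1
    for i in range(k):
        out *= x + i
    return out


def output_distribution(red, blue, m):
    """Pr[k red draws out of m] under Dirichlet resampling, for k = 0..m."""
    total = rising(red + blue, m)
    return [Fraction(comb(m, k) * rising(red, k) * rising(blue, m - k), total)
            for k in range(m + 1)]


def disclosure_set(d1, d2, m, eps):
    """Outputs whose probability ratio exceeds e^eps in either direction,
    and the probability of landing in that set when the input is d1."""
    p1 = output_distribution(d1[0], d1[1], m)
    p2 = output_distribution(d2[0], d2[1], m)
    bound = exp(eps)
    bad = [k for k in range(m + 1)
           if max(p1[k] / p2[k], p2[k] / p1[k]) > bound]
    return bad, sum(p1[k] for k in bad)


# D1: 1 red, 9 blue; D2: 2 red, 8 blue (noise-infused counts).
bad_outputs, delta = disclosure_set((1, 9), (2, 8), m=6, eps=1.0)
print(bad_outputs, float(delta))
```

The mechanism is probabilistically private for this pair only if the returned probability is at most δ; adding more noise to both blocks shrinks it.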
1. How much noise should we add?
[Figure: noise required per block as privacy decreases (“lesser privacy”), comparing Differential Privacy with Probabilistic Differential Privacy (δ = 10^-5); 1 million original and synthetic workers]
Prob. Differential Privacy: Summary
• Ignoring privacy breaches that occur due to low probability outputs drastically reduces noise.
• Two ways to bound low probability outputs:
  • (ε,δ)-Indistinguishability and negligible functions. Noise required for privacy ≥ (log n(d)) per block.
  • (ε,δ)-Probabilistic differential privacy and disclosure sets. Efficient algorithm to calculate noise per block (see paper).
• Does probabilistic differential privacy allow useful information to be published?
2. To which blocks should we add noise?
• Why not add noise to every block?
Why not add noise to every block?
Noise required per block (probabilistic differential privacy):
• There are about 8 million blocks on the map!
• Total noise added is about 6 million.
• Causes non-trivial spurious commute patterns.
• Roughly 1 million fake workers from the West Coast (out of a total of 7 million points in D+A).
• Hence, 1/7 of the synthetic data have residences on the West Coast and work in Washington DC.
[Figure: noise required per block as privacy decreases (“lesser privacy”); 1 million original and synthetic workers]
2. To which blocks should we add noise?
• Adding noise to all blocks (at the level required for probabilistic differential privacy) creates spurious commute patterns.
• Why not add noise only to blocks that appear in the original data?
Theorem 3: Adding noise only to blocks that appear in the data breaches privacy.
If a block b does not appear in the original data and no noise is added to b, then b cannot appear in the synthetic data.
Theorem 3: Adding noise only to blocks that appear in the data breaches privacy.
• Two possible inputs: (Somerset 1, Fayette 0) or (Somerset 0, Fayette 1).
• Worker W comes from Somerset or Fayette. No one else comes from there.
• If S has a synthetic worker from Somerset, then W comes from Somerset!!
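The Somerset/Fayette argument can be restated as a check on which blocks are reachable in the synthetic data. The sketch assumes a fixed, positive amount of noise on every block with at least one real worker and none elsewhere; the function name and the noise amount are hypothetical.

```python
def reachable_blocks(original, noise_amount):
    """Blocks that can appear in the synthetic data when noise is added
    only to blocks that already have at least one real worker."""
    noisy = {block: count + (noise_amount if count > 0 else 0)
             for block, count in original.items()}
    # Resampling can only ever draw blocks with a positive noise-infused count.
    return {block for block, count in noisy.items() if count > 0}


input_1 = {"Somerset": 1, "Fayette": 0}  # W lives in Somerset
input_2 = {"Somerset": 0, "Fayette": 1}  # W lives in Fayette

support_1 = reachable_blocks(input_1, noise_amount=5)
support_2 = reachable_blocks(input_2, noise_amount=5)
print(support_1, support_2)
```

Because the two reachable sets do not overlap, any synthetic worker from Somerset tells the adversary with certainty which input was used, no matter how much noise the supported block received.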
Ignoring outliers degrades utility
• Each of these points is an outlier.
• Together they contribute about half the workers.
Our solution to “Where to add noise?”
Step 1: Coarsen the domain
• Based on an existing public dataset (Census Transportation Planning Package, CTPP).
Our solution to “Where to add noise?”
Step 1: Coarsen the domain
Step 2: Probabilistically drop blocks with 0 support
• Pick a function f: {b1, …, bk} → (0,1] (based on external data).
• For every block b with 0 support, ignore b with probability f(b).
Theorem 4: Parameter ε increases by max_b ( max ( 2 noise per block, f(b) ) )
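Step 2 can be sketched as a filter that decides which blocks take part in noise addition. This is an illustration only: the function name, the dictionary inputs, and the example values of f are hypothetical, and the sketch does not compute the resulting change in ε.

```python
import random


def blocks_receiving_noise(support, f, rng=None):
    """Keep every block with real workers; ignore each zero-support block b
    with probability f[b], as described in Step 2."""
    rng = rng or random.Random()
    kept = []
    for block in sorted(support):
        if support[block] > 0:
            kept.append(block)           # blocks with support are always kept
        elif rng.random() >= f[block]:   # ignored with probability f[block]
            kept.append(block)
    return kept


support = {"b1": 3, "b2": 0, "b3": 0}
f = {"b2": 1.0, "b3": 0.5}  # hypothetical drop probabilities from external data
print(blocks_receiving_noise(support, f, rng=random.Random(1)))
```

With f(b) = 1 a zero-support block is always ignored, while smaller values leave it a chance of receiving noise and therefore of appearing in the synthetic data.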
Utility of the provably private algorithm
• Utility measured by average commute distance for each destination block.
Experimental Setup:
• OTM: Currently published OnTheMap data used as original data.
• All destinations in Minnesota.
• 120,690 origins per destination, chosen by pruning out blocks that are > 100 miles from the destination.
• ε = 100, δ = 10^-5
• Additional leakage due to probabilistic pruning = 4 (min f(b) = 0.0378)