790 likes | 965 Views
Differential Privacy on Linked Data: Theory and Implementation. Yotam Aron. Table of Contents. Introduction Differential Privacy for Linked Data SPIM implementation Evaluation. Contributions. Theory: how to apply differential privacy to linked data.
E N D
Differential Privacy on Linked Data: Theory and Implementation YotamAron
Table of Contents • Introduction • Differential Privacy for Linked Data • SPIM implementation • Evaluation
Contributions • Theory: how to apply differential privacy to linked data. • Implementation: privacy module for SPARQL queries. • Experimental evaluation: differential privacy on linked data.
Overview: Privacy Risk • Statistical data can leak privacy. • Mosaic Theory: Different data sources harmful when combined. • Examples: • Netflix Prize Data set • GIC Medical Data set • AOL Data logs • Linked data has added ontologies and meta-data, making it even more vulnerable.
Current Solutions • Accountability: • Privacy Ontologies • Privacy Policies and Laws • Problems: • Requires agreement among parties. • Does not actually prevent breaches, just a deterrent.
Current Solutions (Cont’d) • Anonymization • Delete “private” data • K – anonymity (Strong Privacy Guarantee) • Problems • Deletion provides no strong guarantees • Must be carried out for every data set • What data should be anonymized? • High computational cost (k-anonimity is np-hard)
Differential Privacy • Definition for relational databases (from PINQ paper): A randomized function K gives Ɛ-differential privacy if for all data sets and differing on at most one record, and all,
Differential Privacy • What does this mean? • Adversaries get roughly same results from and , meaning a single individual’s data will not greatly affect their knowledge acquired from each data set.
How Is This Achieved? • Add noise to result. • Simplest: Add Laplace noise
Laplace Noise Parameters • Mean = 0 (so don’t add bias) • Variance = , where is defined, for a record j, as • Theorem: For query Q result R,the output R + Laplace(0, ) is differentially private.
Other Benefit of Laplace Noise • A set of queries each with sensitivity will have an overall sensitivity of • Implementation-wise, can allocate an “budget” Ɛ for a client and for each query client specifies to use.
Benefits of Differential Privacy • Strong Privacy Guarantee • Mechanism-Based, so don’t have to mess with data. • Independent of data set’s structure. • Works well with for statistical analysis algorithms.
Problems with Differential Privacy • Potentially poor performance • Complexity • Noise • Only works with statistical data (though this has fixes) • How to calculate sensitivity of arbitrary query without brute-force?
Differential Privacy and Linked Data • Want same privacy guarantees for linked data without, but no “records.” • What should be “unit of difference”? • One triple • All URIs related to person’s URI • All links going out from person’s URI
Differential Privacy and Linked Data • Want same privacy guarantees for linked data without, but no “records.” • What should be “unit of difference”? • One triple • All URIs related to person’s URI • All links going out from person’s URI
Differential Privacy and Linked Data • Want same privacy guarantees for linked data without, but no “records.” • What should be “unit of difference”? • One triple • All URIs related to person’s URI • All links going out from person’s URI
Differential Privacy and Linked Data • Want same privacy guarantees for linked data without, but no “records.” • What should be “unit of difference”? • One triple • All URIs related to person’s URI • All links going out from person’s URI
“Records” for Linked Data • Reduce links in graph to attributes • Idea: • Identify individual contributions from a single individual to total answer. • Find contribution that affects answer most.
“Records” for Linked Data • Reduce links in graph to attributes, makes it a record. P1 P2 Knows
“Records” for Linked Data • Repeated attributes and null values allowed P1 P2 Knows Loves Knows P3 P4 Knows
“Records” for Linked Data • Repeated attributes and null values allowed (not good RDBMS form but makes definitions easier)
Query Sensitivity in Practice • Need to find triples that “belong” to a person. • Idea: • Identify individual contributions from a single individual to total answer. • Find contribution that affects answer most. • Done using sorting and limiting functions in SPARQL
Example S1 • COUNT of places visited S2 P1 MA P2 State of Residence S3 Visited
Example S1 • COUNT of places visited S2 P1 MA P2 State of Residence S3 Visited
Example S1 • COUNT of places visited S2 P1 MA P2 State of Residence S3 Visited Answer: Sensitivity of 2
Using SPARQL • Query: (COUNT(?s) as ?num_places_visited) WHERE{ ?p :visited ?s }
Using SPARQL • Sensitivity Calculation Query (Ideally): SELECT ?p (COUNT(ABS(?s)) as ?num_places_visited) WHERE{ ?p :visited ?s; ?p foaf:name ?n } GROUP BY ?p ORDER BY ?num_places_visited LIMIT 1
In reality… • LIMIT, ORDER BY, GROUP BY doesn’t work together in 4store… • For now: Don’t use LIMIT and get top answers manually. • I.e. Simulate using these keywords in python • Will affect results, so better testing should be carried out in the future. • Would like to keep it on sparql-side ideally so there is less transmitted data (e.g. on large data sets)
(Side rant) 4store limitations • Many operations not supported in unison • E.g. cannot always filter and use “order by” for some reason • Severely limits the types of queries I could use to test. • May be desirable to work with a different triplestore that is more up-to-date (ARQ). • Didn’t because wanted to keep code in python. • Also had already written all code for 4store
Problems with this Approach • Need to identify “people” in graph. • Assume, for example, that URI with a foaf:name is a personand use its triples in privacy calculations. • Imposes some constraints on linked data format for this to work. • For future work, look if there’s a way to automatically identify private data, maybe by using ontologies. • Complexity is tied to speed of performing query over large data set. • Still not generalizable to all functions.
…and on the Plus Side • Model for sensitivity calculation can be expanded to arbitrary statistical functions. • e.g. dot products, distance functions, variance, etc. • Relatively simple to implement using SPARQL 1.1
SPARQL Privacy Insurance Module • i.e. SPIM • Use authentication, AIR, and differential privacy in one system. • Authentication to manage Ɛ-budgets. • AIR to control flow of information and non-statistical data. • Differential privacy for statistics. • Goal: Provide a module that can integrate into SPARQL 1.1 endpoints and provide privacy.
Design HTTP Server OpenID Authentication AIR Reasoner Differential Privacy Module SPIM Main Process Triplestore Privacy Policies User Data
HTTP Server and Authentication • HTTP Server: Django server that handles http requests. • OpenID Authentication: Django module. HTTP Server OpenID Authentication
SPIM Main Process • Controls flow of information. • First checks user’s budget, then uses AIR, then performs final differentially-private query. SPIM Main Process
AIR Reasoner • Performs access control by translating SPARQL queries to n3 and checking against policies. • Can potentially perform more complicated operations (e.g. check user credentials) AIR Reasoner Privacy Policies
Differential Privacy Protocol Differential Privacy Module SPARQL Endpoint Client Scenario: Client wishes to make standard SPARQL 1.1 statistical query. Client has Ɛ “budget” of overall accuracy for all queries.
Differential Privacy Protocol Differential Privacy Module SPARQL Endpoint Query, Ɛ > 0 Client Step 1: Query and epsilon value sent to the endpoint and intercepted by the enforcement module.
Differential Privacy Protocol Differential Privacy Module SPARQL Endpoint Client Sens Query Step 2: The sensitivity of the query is calculated using a re-written, related query.
Differential Privacy Protocol Differential Privacy Module SPARQL Endpoint Client Query Step 3: Actual query sent.
Differential Privacy Protocol Differential Privacy Module SPARQL Endpoint Result and Noise Client Step 4: Result with Laplace noise sent over.
Evaluation • Three things to evaluate: • Correctness of operation • Correctness of differential privacy • Runtime • Used an anonymized clinical database as the test data and added fake names, social security numbers, and addresses.
Correctness of Operation • Can the system do what we want? • Authentication provides access control • AIR restricts information and types of queries • Differential privacy gives strong privacy guarantees. • Can we do better?
Use Case Used in Thesis • Clinical database data protection • HIPAA: Federal protection of private information fields, such as name and social security number, for patients. • 3 users • Alice: Works in CDC, needs unhindered access • Bob: Researcher that needs access to private fields (e.g. addresses) • Charlie: Amateur researcher to whom HIPAA should apply • Assumptions: • Django is secure enough to handle “clever attacks” • Users do not collude, so can allocate individual epsilon values.
Use Case Solution Overview • What should happen: • Dynamically apply different AIR policies at runtime. • Give different epsilon-budgets. • How allocated: • Alice: No AIR Policy, no noise. • Bob: Give access to addresses but hide all other private information fields. • Epsilon budget: E1 • Charlie: Hide all private information fields in accordance with HIPAA • Epsilon budget: E2
Use Case Solution Overview • Alice: No AIR Policy • Bob: Give access to addresses but hide all other private information fields. • Epsilon budget: E1 • Charlie: Hide all private information fields in accordance with HIPAA • Epsilon budget: E2