Anonymization of Set-Valued Data via Top-Down, Local Generalization
Yeye He, Jeffrey F. Naughton
University of Wisconsin-Madison
Overview
• The problem:
  • Anonymizing set-valued data presents challenges not seen in relational data
  • Previous solutions explored parts but not all of the problem space
• Our goals:
  • Develop a scalable algorithm for the new variant of the problem
  • Perform experiments to explore strengths and weaknesses of the approach
What is set-valued data?
• "Relational data"
  • One sensitive attribute for each tuple
• "Set-valued data"
  • Logically: (personid, {item1, item2, …, itemn})
  • Multiple sensitive values are possible in one record
An attack scenario
• A retailer publishes market basket data
• The adversary knows Alice has bought milk, beer, and diapers
• The adversary infers Alice has also bought a pregnancy test and diabetes medicine
  • Known to the adversary: {beer, milk, diapers}
  • Matching published record: {beer, milk, diapers, pregnancy test, diabetes medicine}
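The inference in this scenario amounts to a subset lookup over the published baskets. The following is a minimal sketch of that lookup; the function name `matching_records` and the three-record `published` list are hypothetical illustrations, not data from the talk.

```python
def matching_records(published, known_items):
    """Return every published basket that contains all items the adversary knows."""
    return [basket for basket in published if known_items <= basket]


# Hypothetical published market basket data (illustration only).
published = [
    {"beer", "milk", "diapers", "pregnancy test", "diabetes medicine"},
    {"wine", "bread"},
    {"milk", "bread"},
]

known = {"beer", "milk", "diapers"}
candidates = matching_records(published, known)

# If exactly one basket matches, everything else in it is revealed.
if len(candidates) == 1:
    print(sorted(candidates[0] - known))
```

When only one basket matches, the remaining items in it are attributed to Alice, which is exactly the disclosure the anonymization is meant to prevent.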
Existing work: a priori QI/SI partition
• Scenarios where set elements can be partitioned a priori into Quasi-Identifier (QI) items and Sensitive Items (SI)
  • {beer, milk, diapers, pregnancy test, diabetes medicine}
• Substantial existing work & good algorithms
  • [Ghinita+08] [Xu+08a] [Xu+08b] [Nergiz+07]
• But what if a priori partitioning is not possible?
  • Individuals may have different privacy requirements
  • The adversary may see sensitive items and use them as QI
[Diagram: Set-valued data anonymization → a priori QI/SI partition possible | ?]
Existing work: no QI/SI partition
• Prior work [Terrovitis+08] proposed the k^m-anonymity model
• k^m-anonymity
  • For any transaction (data record) T and any subset of m items in T, there are at least k-1 other transactions containing the same m items
[Diagram: Set-valued data anonymization → a priori QI/SI partition possible | no a priori QI/SI partition]
The m in k^m-anonymity [Terrovitis+08]
• Attack revisited
  • The data is 10^3-anonymized; the adversary sees {beer, milk, diapers}
  • The adversary cannot tell Alice's transaction from the other 9
  • Effective assuming the adversary never sees more than m=3 items
• m in k^m-anonymity
  • Requires some identified m s.t. no adversary will ever see more than m items
• What about the case where there is no such m?
  • The case we consider
[Diagram: Set-valued data anonymization → a priori QI/SI partition possible | no a priori QI/SI partition → has identified m | no identified m]
Our model: k-anonymity for set-valued data
• Transactional database D is k-anonymous if
  • Every transaction (data record) occurs at least k times
• Different from k^m-anonymity [Terrovitis+08]
  • No limit on m, i.e., valid for any m
  • Thus a stronger privacy model
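The definition above translates directly into a counting check. This is a minimal sketch; the function name `is_k_anonymous` and the sample databases are assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(transactions, k):
    """True if every distinct transaction (as a set of items) occurs at least k times."""
    counts = Counter(frozenset(t) for t in transactions)
    return all(count >= k for count in counts.values())


# Hypothetical databases (illustration only).
db_ok = [{"beer", "milk"}, {"beer", "milk"}, {"wine"}, {"wine"}]
db_bad = [{"beer", "milk"}, {"beer", "milk"}, {"wine"}]

print(is_k_anonymous(db_ok, 2))
print(is_k_anonymous(db_bad, 2))
```

Transactions are converted to `frozenset` so that item order does not matter and the sets can be used as dictionary keys.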
k-anonymity subsumes k^m-anonymity [Terrovitis+08]
• Every database D that satisfies k-anonymity also satisfies k^m-anonymity
• There exists a database D that satisfies k^m-anonymity for all m but not k-anonymity
• Example: 2^3-anonymous but not 2-anonymous
  • T1 = {A, B, C}
  • T2 = {A, B, C}
  • T3 = {A, B}
[Diagram: within "no QI/SI partition", the k-anonymous databases are a subset of the k^m-anonymous databases]
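The three-transaction example can be verified mechanically. The sketch below assumes the common reading of k^m-anonymity in which every itemset of at most m items that occurs in some transaction must be contained in at least k transactions; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """Assumed reading: every itemset of size <= m occurring in a transaction
    is contained in at least k transactions."""
    support = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, min(m, len(items)) + 1):
            for subset in combinations(items, size):
                support[subset] += 1  # each subset counted once per transaction
    return all(count >= k for count in support.values())


def is_k_anonymous(transactions, k):
    """True if every distinct transaction occurs at least k times."""
    counts = Counter(frozenset(t) for t in transactions)
    return all(count >= k for count in counts.values())


# The example from the slide.
db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}]

print(is_km_anonymous(db, k=2, m=3))  # every itemset is supported by >= 2 transactions
print(is_k_anonymous(db, k=2))        # T3 = {A, B} occurs only once
```

The itemset {A, B} is contained in all three transactions, so it passes the k^m test, yet T3 as a whole record is unique, which is why the database fails plain k-anonymity.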
Problem statement
• Given a transactional database D, find a transformation D' of D s.t.:
  • D' satisfies k-anonymity
  • The transformation minimizes information loss between (D, D')
Hierarchical generalization
[Hierarchy: All → Alcohol → {Beer, Wine}; All → Health care → {Diaper, Pregnancy test}]
• Transaction generalization
  • Ti: {"Beer", "Wine", "Diaper"} → {"Alcohol", "Health care"}
  • Duplicates removed
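A small sketch of transaction generalization over this hierarchy follows. The `PARENT` map mirrors the hierarchy on the slide; the helper names and the idea of passing the target nodes as a set (`targets`) are assumptions made for illustration.

```python
# Child -> parent links for the hierarchy shown on the slide.
PARENT = {
    "Beer": "Alcohol",
    "Wine": "Alcohol",
    "Diaper": "Health care",
    "Pregnancy test": "Health care",
    "Alcohol": "All",
    "Health care": "All",
}


def generalize_item(item, targets):
    """Walk up the hierarchy until a node in `targets` is reached."""
    node = item
    while node not in targets:
        node = PARENT[node]  # raises KeyError if `targets` does not cover the item
    return node


def generalize(transaction, targets):
    """Generalize every item; building a set removes duplicates automatically."""
    return {generalize_item(item, targets) for item in transaction}


print(generalize({"Beer", "Wine", "Diaper"}, {"Alcohol", "Health care"}))
```

Because the result is a set, "Beer" and "Wine" both mapping to "Alcohol" collapse into a single element, matching the "duplicates removed" note.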
Information loss metric
• Normalized Certainty Penalty (NCP) [Xu+06]
  • Also used in previous work [Terrovitis+08]
[Hierarchy: All → Alcohol → {Beer, Wine}; All → Health care → {Diaper, Pregnancy test}]
• Example:
  • Generalize "Beer" to "Alcohol": 2/4 = 0.5 info loss
  • Generalize "Beer" to "All": 4/4 = 1 info loss
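The two example values can be reproduced by counting leaves. This sketch assumes the per-item penalty is the number of leaves under the generalized node divided by the total number of leaves, with an ungeneralized leaf costing 0; the function names are hypothetical.

```python
# Parent -> children links for the hierarchy shown on the slide.
CHILDREN = {
    "All": ["Alcohol", "Health care"],
    "Alcohol": ["Beer", "Wine"],
    "Health care": ["Diaper", "Pregnancy test"],
}


def leaf_count(node):
    """Number of leaf items in the subtree rooted at `node`."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return 1
    return sum(leaf_count(child) for child in kids)


def ncp(node, root="All"):
    """Assumed per-item penalty: 0 for a leaf, else leaves(node) / leaves(root)."""
    if node not in CHILDREN:
        return 0.0
    return leaf_count(node) / leaf_count(root)


print(ncp("Alcohol"))  # 2 of 4 leaves
print(ncp("All"))      # 4 of 4 leaves
```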
Our algorithm: Partition-based anonymization
• Top-down
  • Generalize everything to the root representation
  • This results in one initial partition
• Divide and conquer
  • Choose a node to specialize for each partition
    • Based on information gain heuristics
  • Recursively partition the resulting sub-partitions
Generalize all data to root
• Every transaction is generalized to {ALL}:
  • {a1} → {ALL}
  • {a1,a2} → {ALL}
  • {b1,b2} → {ALL}
  • {b1,b2} → {ALL}
  • {a1,a2,b2} → {ALL}
  • {a1,a2,b2} → {ALL}
  • {a1,a2,b1,b2} → {ALL}
• One initial partition
Initial partition: specialize using ALL → {A, B}
  • {ALL} → {A}
  • {ALL} → {A}
  • {ALL} → {B}
  • {ALL} → {B}
  • {ALL} → {A, B}
  • {ALL} → {A, B}
  • {ALL} → {A, B}
• Produces three sub-partitions
Green partition: specialize using A → {a1, a2}
  • {A} → {a1}
  • {A} → {a1,a2}
  • Other partitions unchanged: {B}, {B}, {A, B}, {A, B}, {A, B}
• Specialization violates 2-anonymity; roll back
Blue partition: specialize using B → {b1, b2}
  • {B} → {b1,b2}
  • {B} → {b1,b2}
  • Other partitions unchanged: {A}, {A}, {A, B}, {A, B}, {A, B}
• Specialization OK; leaf level reached, stop
Red partition: specialize using A → {a1, a2}
  • {A, B} → {a1,a2,B}
  • {A, B} → {a1,a2,B}
  • {A, B} → {a1,a2,B}
  • Other partitions unchanged: {A}, {A}, {b1,b2}, {b1,b2}
• A is chosen over B based on the max information gain heuristic
Red partition: specialize using B → {b1, b2}
  • {a1,a2,B} → {a1,a2,b2}
  • {a1,a2,B} → {a1,a2,b2}
  • {a1,a2,B} → {a1,a2,b1,b2}
  • Other partitions unchanged: {A}, {A}, {b1,b2}, {b1,b2}
• Specializing B violates 2-anonymity; roll back
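The walkthrough above can be sketched as a short recursive procedure. This is an illustration under stated assumptions, not the authors' implementation: the slides do not define the information gain heuristic, so the sketch substitutes a simple stand-in (try the node covering the most leaves first, ties broken alphabetically), and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hierarchy used in the walkthrough: ALL -> A, B; A -> a1, a2; B -> b1, b2.
CHILDREN = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}


def leaves(node):
    """Set of leaf items under `node`."""
    kids = CHILDREN.get(node)
    if not kids:
        return {node}
    return set().union(*(leaves(child) for child in kids))


def specialize(rep, node, original):
    """Replace `node` in the generalized form `rep` by those of its children
    that cover at least one item of the original transaction."""
    kids = {child for child in CHILDREN[node] if leaves(child) & original}
    return frozenset((rep - {node}) | kids)


def anonymize(transactions, rep, k):
    """Anonymize one partition whose members all share the generalized form `rep`.
    Returns a list of (original, generalized) pairs."""
    # Stand-in for the information gain heuristic (an assumption of this sketch).
    candidates = sorted((n for n in rep if n in CHILDREN),
                        key=lambda n: (-len(leaves(n)), n))
    for node in candidates:
        groups = defaultdict(list)
        for t in transactions:
            groups[specialize(rep, node, t)].append(t)
        if all(len(group) >= k for group in groups.values()):
            result = []
            for sub_rep, group in groups.items():
                result.extend(anonymize(group, sub_rep, k))  # recurse per sub-partition
            return result
        # Some sub-partition is smaller than k: roll back, try the next node.
    return [(t, rep) for t in transactions]


data = [frozenset(t) for t in (
    {"a1"}, {"a1", "a2"}, {"b1", "b2"}, {"b1", "b2"},
    {"a1", "a2", "b2"}, {"a1", "a2", "b2"}, {"a1", "a2", "b1", "b2"},
)]
result = anonymize(data, frozenset({"ALL"}), k=2)
final_counts = Counter(generalized for _, generalized in result)
for generalized, count in final_counts.items():
    print(sorted(generalized), count)
```

On the seven transactions of the walkthrough this sketch ends with {A} twice, {b1, b2} twice, and {a1, a2, B} three times, the same state the slides reach after the last rollback. Because each partition is specialized independently, the same item can end up at different levels in different partitions, which is the local recoding the talk refers to.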
Main advantages
• Effective (less information loss)
  • Even though we impose a stronger privacy criterion
  • Local recoding vs. global recoding
• Efficient (less execution time)
  • Divide and conquer vs. bottom-up (exhaustive) enumeration
  • Linear in the input data size and the number of hierarchy levels vs. worst-case exponential in previous work
Experimental setup: market basket data
• Real-world benchmark data
  • BMS-WebView-1, BMS-WebView-2, BMS-POS
• No accompanying hierarchy data
  • Used a synthetic hierarchy (as in the previous work)
• Comparing our partition-based algorithm (Partition) with the previous Apriori-Anonymization (AA) [Terrovitis+08]
Less information loss on market basket data
• Why? Local recoding
Sensitivity analysis: consistently faster with varied parameters
Experimental setup: AOL query log
• From a set-valued perspective
• No accompanying hierarchy data, again
  • Use an alphabetical hierarchy
  • Use the WordNet hierarchy
• Compare with earlier work [Adar07]
Reasonably efficient on AOL query log
• Efficient given the size of the query log (2.2GB)
• Information loss not as satisfactory as on market basket data
  • Words generalized to "event", "process", "thing"…
Conclusion
• Developed a faster anonymization algorithm that preserves more information
  • For set-valued data with no QI/SI distinction
• Performed well on market basket data
  • Less satisfying for search log data
• Open and important question: stronger privacy models
  • What is a good privacy model, stronger than k-anonymity, for set-valued data with no QI/SI distinction?