
Data Anonymization (1)


ramya

Presentation Transcript


  1. Data Anonymization (1)

  2. Outline • Problem • Concepts • Algorithms on domain generalization hierarchy • Algorithms on numerical data

  3. The Massachusetts Governor Privacy Breach • Governor of MA uniquely identified using ZipCode, Birth Date, and Sex • Name linked to Diagnosis • Medical Data: Name, SSN, Visit Date, Diagnosis, Procedure, Medication, Total Charge, Zip, Birth date, Sex • Voter List: Name, Address, Date Registered, Party affiliation, Date last voted, Zip, Birth date, Sex • Quasi-identifier shared by both tables: Zip, Birth date, Sex (unique for 87% of the US population) • Source: Sweeney, IJUFKS 2002

  4. Definition • Table • Column: attribute, row: record • Quasi-identifier (QI) • A list of attributes that can potentially be used to identify individuals • K-anonymity • Every combination of QI values that occurs in the table occurs in at least k records
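
A minimal sketch of the k-anonymity definition on this slide, assuming the table is a list of dictionaries. The function name `is_k_anonymous` and the toy records are hypothetical and only illustrate the check.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """Return True if every QI value combination occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy table (values invented for illustration).
table = [
    {"zip": "0213*", "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "sex": "F", "diagnosis": "asthma"},
    {"zip": "0214*", "sex": "M", "diagnosis": "flu"},
]

# The third record is alone in its QI group, so the table is not 2-anonymous.
print(is_k_anonymous(table, ["zip", "sex"], 2))
```

Only the quasi-identifier columns are counted; sensitive columns such as the diagnosis play no role in the check.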

  5. Basic techniques • Generalization • Zip {02138, 02139} → 0213* • Domain generalization hierarchy • A0 → A1 → … → An • E.g. {02138, 02139} → 0213* → 021* → 02* → 0** • This hierarchy is a tree structure • Suppression
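
The ZIP code generalization on this slide can be sketched as masking trailing digits. This is an illustration under one assumption: the masked value keeps a fixed width (for example 021** rather than 021*). The helper name `generalize_zip` is hypothetical.

```python
def generalize_zip(zip_code, level):
    """Replace the last `level` digits of a ZIP code with '*'."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[: len(zip_code) - level] + "*" * level

# One path through the domain generalization hierarchy, from A0 (raw value)
# up to the fully masked root.
hierarchy = [generalize_zip("02138", level) for level in range(6)]
print(hierarchy)
```

Because 02138 and 02139 map to the same value at level 1, the mapping forms a tree: each value has exactly one parent at the next level.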

  6. Balance • Trade-off: better privacy guarantee, lower data utility • There are many schemes satisfying the k-anonymity specification. • We want to minimize the distortion of the table in order to maximize data utility. • Suppression is required if we cannot find a k-anonymity group for a record.
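
Suppression as described on this slide can be sketched as dropping every record whose quasi-identifier group has fewer than k members. The function name `suppress_small_groups` and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def suppress_small_groups(records, quasi_identifier, k):
    """Drop records whose QI group is smaller than k.

    Returns the kept records and the number of suppressed records.
    """
    def key(record):
        return tuple(record[a] for a in quasi_identifier)

    counts = Counter(key(r) for r in records)
    kept = [r for r in records if counts[key(r)] >= k]
    return kept, len(records) - len(kept)

# Hypothetical rows: the last one has no partner in its group.
rows = [
    {"zip": "0213*", "age": "20-40"},
    {"zip": "0213*", "age": "20-40"},
    {"zip": "0214*", "age": "40-60"},
]
kept, suppressed = suppress_small_groups(rows, ["zip", "age"], 2)
print(len(kept), suppressed)
```

The count of suppressed records is one simple way to see the utility cost of reaching k-anonymity.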

  7. Criteria • Minimal generalization • The minimal generalization that satisfies the k-anonymization specification • Minimal table distortion • Minimal generalization with minimal utility loss • Use precision to evaluate the loss [Sweeney's papers] • Application-specific utility

  8. Complexity of finding the optimal generalization • NP-hard (Bayardo, ICDE05) • So all proposed algorithms are approximate algorithms

  9. Shared features in different solutions • Always satisfy the k-anonymity specification • If some records cannot satisfy it, suppress them • Differences lie in the utility loss/cost function • Sweeney's precision metric • Discernibility & classification metrics • Information-privacy metric • Algorithms • Assume the domain generalization hierarchy is given • Efficiency • Utility maximization

  10. Metrics to be optimized • Two cost metrics that we want to minimize (Bayardo, ICDE05) • Discernibility: based on the # of items in each k-anonymity group • Classification: based on the # of records in minority classes in each group • The dataset has a class label column; the goal is to preserve the classification model
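
The slide's formulas did not survive in the transcript, so the sketch below uses the commonly cited forms of the two metrics and should be treated as an assumption: discernibility charges |E|² for each group E with at least k records and |D|·|E| for smaller (suppressed) groups, where |D| is the table size; classification charges one unit per record outside its group's majority class, and one unit per suppressed record. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def group_by_qi(records, quasi_identifier):
    """Split records into groups that share the same QI values."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in quasi_identifier)].append(r)
    return list(groups.values())

def discernibility(groups, k):
    """Assumed form: |E|^2 for groups of size >= k, |D|*|E| otherwise."""
    n = sum(len(g) for g in groups)
    return sum(len(g) ** 2 if len(g) >= k else n * len(g) for g in groups)

def classification(groups, k, label):
    """Assumed form: minority-class records per group, plus suppressed records."""
    cost = 0
    for g in groups:
        if len(g) >= k:
            majority = Counter(r[label] for r in g).most_common(1)[0][1]
            cost += len(g) - majority
        else:
            cost += len(g)
    return cost

# Hypothetical anonymized table with a class label column.
data = [
    {"zip": "0213*", "label": "y"},
    {"zip": "0213*", "label": "y"},
    {"zip": "0213*", "label": "n"},
    {"zip": "0214*", "label": "y"},
]
groups = group_by_qi(data, ["zip"])
print(discernibility(groups, 2), classification(groups, 2, "label"))
```

Both costs grow as groups get larger or less pure, which is why an anonymization algorithm tries to minimize them.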

  11. Metrics • A combination of information loss and anonymity gain (Wang, ICDE04) • Information loss, anonymity gain • Information-privacy metric

  12. Metrics • Information loss • Dataset has class labels • Entropy • A set S, labeled by different classes • Entropy is used to calculate the impurity of labels • Information loss of a generalization G: {c_1, c_2, …, c_n} → p • $I(G) = \mathrm{Info}(S_p) - \sum_{i=1}^{n} \frac{|S_{c_i}|}{|S_p|}\,\mathrm{Info}(S_{c_i})$ • $\mathrm{Info}(S) = -\sum_i P_i \log_2 P_i$, where $P_i$ is the percentage of label i in S
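
A minimal sketch of the entropy-based information loss on this slide. It assumes the usual entropy definition and, as a further assumption, a size-weighted average of the child entropies; the function names are hypothetical, and every child set is assumed to be non-empty.

```python
import math
from collections import Counter

def info(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_loss(children):
    """Entropy of the merged parent minus the weighted entropy of its children.

    `children` holds one list of class labels per child value c_i that the
    generalization merges into the parent value p.
    """
    parent = [label for child in children for label in child]
    weighted = sum(len(c) / len(parent) * info(c) for c in children)
    return info(parent) - weighted

# Merging two pure children with different labels loses the most information;
# merging two children with identical label mixes loses none.
print(information_loss([["y", "y"], ["n", "n"]]))
print(information_loss([["y", "n"], ["y", "n"]]))
```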

  13. Anonymity gain • A(VID): # of records with the VID • A_G(VID) >= A(VID): generalization G improves or does not change A(VID) • Anonymity gain P(G) = x - A(VID), where x = A_G(VID) if A_G(VID) <= K, and x = K otherwise • As long as k-anonymity is satisfied, further generalization of the VID yields no additional gain

  14. Information-privacy combined metric • IP = info loss / anonymity gain = I(G)/P(G) • We want to minimize IP • If P(G) = 0, use I(G) only • Either small I(G) or large P(G) will reduce IP • If the P(G) values are the same, pick the generalization with minimum I(G)
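
The anonymity gain and the combined IP score can be sketched as follows. The function names and the candidate tuples are hypothetical, and the sketch assumes the anonymity before generalization is below K so that the gain is never negative.

```python
def anonymity_gain(a_before, a_after, k):
    """P(G) = x - A(VID), with x capped at K."""
    x = a_after if a_after <= k else k
    return x - a_before

def ip_score(info_loss, gain):
    """IP = I(G) / P(G); fall back to I(G) alone when the gain is zero."""
    return info_loss if gain == 0 else info_loss / gain

def pick_generalization(candidates, k):
    """Pick the candidate (name, info_loss, a_before, a_after) with minimal IP."""
    def score(candidate):
        _, loss, before, after = candidate
        return ip_score(loss, anonymity_gain(before, after, k))
    return min(candidates, key=score)

# Hypothetical candidates: generalizing "zip" loses more information but
# gains more anonymity, so its IP score is lower.
candidates = [("zip", 0.6, 2, 5), ("age", 0.3, 2, 3)]
print(pick_generalization(candidates, 5))
```

Capping the gain at K reflects the slide's point that generalizing beyond k-anonymity earns nothing further.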

  15. Domain-hierarchy based algorithms • Sweeney's algorithm • Bayardo's tree pruning algorithm • Wang's top-down and bottom-up algorithms • They are all dimension-by-dimension methods

  16. Multidimensional techniques • Categorical data? • Categories are mapped to numbers (numerize the categories) • Bayardo 05 paper • Does order matter? (no research on that) • Numerical data • K-anonymization → n-dim space partitioning • Many existing techniques can be applied

  17. Single-dimensional vs. multidimensional

  18. The evolving procedure • Categorical (domain hierarchy) [Sweeney, top-down/bottom-up] → numerized categories, single-dimensional [Bayardo05] → numerized/numerical, multidimensional [Mondrian, spatial indexing, …]

  19. Method 1: Mondrian • Numerize categorical data • Apply a top-down partitioning process • [Figure: partitioning steps 1, 2.1, 2.2]

  20. Allowable cut
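
A simplified sketch of top-down partitioning with an allowable-cut test, in the spirit of Method 1. It is not the published algorithm: it assumes numeric tuples, a median split, and raw (unnormalized) ranges for choosing the dimension, and it treats a cut as allowable only when both sides keep at least k records. The function name `mondrian` is hypothetical.

```python
def mondrian(points, k):
    """Recursively partition numeric tuples into groups of size >= k."""
    def spread(d):
        return max(p[d] for p in points) - min(p[d] for p in points)

    # Try dimensions from the widest value range to the narrowest.
    for d in sorted(range(len(points[0])), key=spread, reverse=True):
        values = sorted(p[d] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[d] < median]
        right = [p for p in points if p[d] >= median]
        if len(left) >= k and len(right) >= k:  # allowable cut
            return mondrian(left, k) + mondrian(right, k)
    return [points]  # no allowable cut remains

# Hypothetical (age, salary in thousands) records.
points = [(25, 30), (27, 32), (45, 80), (47, 85)]
print(mondrian(points, 2))
```

Each returned group would then be published with a generalized range per attribute instead of the exact values.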

  21. Method 2: spatial indexing • Multidimensional spatial techniques • Kd-tree (similar to the Mondrian algorithm) • R-tree and its variations • [Figure: R-tree and R+-tree, showing upper layer and leaf layer]

  22. Compacting bounds • Information is better preserved • Example: uncompacted: age [1-80], salary [10k-100k]; compacted: age [20-40], salary [10k-50k] • The original Mondrian does not consider compacting bounds • For the R+-tree, it is done automatically
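
Compacting can be sketched as replacing a partition's bounds with the minimum and maximum values actually present in the group. The function name `compact_bounds` and the sample records are hypothetical; the numbers mirror the slide's example.

```python
def compact_bounds(group, attributes):
    """Return the tightest (min, max) range per attribute for one group."""
    return {a: (min(r[a] for r in group), max(r[a] for r in group))
            for a in attributes}

# Hypothetical group whose partition bounds were age [1-80], salary [10k-100k].
group = [
    {"age": 20, "salary": 10_000},
    {"age": 31, "salary": 22_000},
    {"age": 40, "salary": 50_000},
]
print(compact_bounds(group, ["age", "salary"]))
```

The group membership is unchanged, so k-anonymity is unaffected; only the published ranges get tighter.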

  23. Benefits of using R+-Tree • Scalable: originally designed for indexing disk-based large data • Multi-granularity k-anonymity: layers • Better performance • Better quality

  24. Performance • Mondrian

  25. Utility • Metrics • Discernibility penalty • KL divergence: describes the difference between a pair of distributions (here, the original and the anonymized data distribution) • Certainty penalty • T: table, t: record, m: # of attributes, t.Ai: generalized range, T.Ai: total range
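
The certainty penalty formula is missing from the transcript. Based on the symbols the slide defines, a plausible unweighted form is the sum over all records t and attributes A_i of |t.Ai| / |T.Ai|; treat that form, the function name `certainty_penalty`, and the sample ranges as assumptions.

```python
def certainty_penalty(generalized, total_ranges):
    """Sum of (generalized range width / total range width) over records and attributes.

    `generalized` is a list of records, each mapping an attribute to its
    published (low, high) range; `total_ranges` maps each attribute to the
    (low, high) range of the whole table.
    """
    penalty = 0.0
    for record in generalized:
        for attr, (lo, hi) in record.items():
            t_lo, t_hi = total_ranges[attr]
            penalty += (hi - lo) / (t_hi - t_lo)
    return penalty

# Hypothetical table ranges and two records published with the same ranges.
total_ranges = {"age": (0, 80), "salary": (10, 100)}
published = [
    {"age": (20, 40), "salary": (10, 55)},
    {"age": (20, 40), "salary": (10, 55)},
]
print(certainty_penalty(published, total_ranges))
```

A record published with exact values contributes 0, and a record generalized to the full range on every attribute contributes m.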

  26. Other issues • Sparse high-dimensionality • Transactional data → boolean matrix • “On the anonymization of sparse high-dimensional data”, ICDE08 • Related to the clustering problem of transactional data! • The above paper uses matrix-based clustering • Item-based clustering (?)

  27. Other issues • Effect of numerizing categorical data • Ordering of categories may have some impact on quality • General-purpose utility metrics vs. task-oriented utility metrics • Attacks on the k-anonymity definition
