Exploring Linkability of User Reviews Mishari Almishari and Gene Tsudik University of California, Irvine
Roadmap Introduction Data Set & Problem Settings Linkability Results & Improvements Discussion Future Work & Conclusion
Motivation Increasing Popularity of Reviewing Sites Yelp, more than 39M visitors and 15M reviews in 2010
Example [Screenshot: a sample review, annotated with its category and rating]
Motivation Rising awareness of privacy
Motivation How is it applied? • Traceability/Linkability • Linkability of Ad hoc Reviews • Linkability of Several Accounts
Goal Assess the linkability in user reviews
Roadmap Introduction Data Set & Problem Settings Linkability Results & Improvements Discussion Future Work & Conclusion
Data Set • 1 Million Reviews • 2000 Users • Each user has more than 300 reviews
Problem Formulation • IR: Identified Record • AR: Anonymous Record [Diagram: a set of IRs and a set of ARs to be linked]
Problem Settings • An Anonymous Record (AR) and the Identified Records (IRs) are fed to a Matching Model • Anonymous Record size: 1, 5, 10, 20,…60 • Top-X Linkability, X: 1 and 10
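One way to read Top-X linkability is as the fraction of anonymous records whose true author appears among the first X candidates returned by the matching model. The sketch below assumes that reading; the function name `linkability_ratio` and the dictionary layout are illustrative choices, not taken from the slides.

```python
def linkability_ratio(rankings, true_authors, x):
    """Fraction of ARs whose true author is within the top-x ranked IRs.

    rankings:     {ar_id: [ir_id, ...]} ordered best match first (assumed layout)
    true_authors: {ar_id: ir_id} ground truth for each AR (assumed layout)
    """
    hits = sum(
        1 for ar_id, ranked in rankings.items()
        if true_authors[ar_id] in ranked[:x]
    )
    return hits / len(rankings)


# Toy data: two ARs, three candidate IRs (all names are hypothetical).
rankings = {"ar1": ["u1", "u2", "u3"], "ar2": ["u3", "u2", "u1"]}
truth = {"ar1": "u1", "ar2": "u2"}
print(linkability_ratio(rankings, truth, 1))
print(linkability_ratio(rankings, truth, 2))
```

With X = 1 only exact first-place hits count; a larger X measures how often the true author is at least narrowed down to a short candidate list.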
Methodologies (1) Naïve Bayesian (NB) Model: IRs sorted in decreasing order of likelihood (2) Kullback-Leibler Divergence (KLD): IRs sorted in increasing order of divergence • Maximum-Likelihood Estimation
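The two rankings can be sketched as follows. This is a minimal illustration, not the authors' implementation: the add-one smoothing (`alpha`), the direction of the divergence (AR relative to IR), and all function names are assumptions introduced here.

```python
import math
from collections import Counter

def distribution(counts, vocab, alpha=1.0):
    """Token probabilities from counts; alpha is an assumed smoothing constant."""
    total = sum(counts.get(t, 0) for t in vocab) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocab}

def nb_log_likelihood(ar_counts, ir_dist):
    """Log-probability of the AR's tokens under one IR's token distribution."""
    return sum(n * math.log(ir_dist[t]) for t, n in ar_counts.items() if t in ir_dist)

def kld(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def rank_nb(ar_counts, irs, vocab):
    """IRs sorted by decreasing likelihood of the AR."""
    scores = {u: nb_log_likelihood(ar_counts, distribution(c, vocab)) for u, c in irs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def rank_kld(ar_counts, irs, vocab):
    """IRs sorted by increasing divergence from the AR."""
    p = distribution(ar_counts, vocab)
    scores = {u: kld(p, distribution(c, vocab)) for u, c in irs.items()}
    return sorted(scores, key=scores.get)


# Toy example over a two-letter vocabulary (hypothetical users u1, u2).
vocab = "ab"
irs = {"u1": Counter("aaab"), "u2": Counter("abbb")}
ar = Counter("aab")
print(rank_nb(ar, irs, vocab))
print(rank_kld(ar, irs, vocab))
```

In both cases the candidate at the head of the list is the Top-1 guess, and the first ten entries form the Top-10 candidate set.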
Tokens • Unigram: • “privacy”: “p”, “r”, “i”, “v”, “a”, “c”, “y” • 26 values • Digram • “privacy”: “pr”, “ri”, “iv”, “va”, “ac”, “cy” • 676 values • Rating • 5 values • Category • 28 values
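The unigram and digram tokens above can be extracted with a few lines of code. The sketch assumes lowercase a-z letters only (matching the 26 and 676 value counts) and assumes that digrams do not span word boundaries, which the slide does not specify; the function names are illustrative.

```python
import re
from collections import Counter

def unigram_counts(text):
    """Count single letters a-z, ignoring case, digits, and punctuation."""
    return Counter(re.findall(r"[a-z]", text.lower()))

def digram_counts(text):
    """Count adjacent letter pairs inside each word.

    Assumption: pairs never span spaces or punctuation.
    """
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


print(sorted(unigram_counts("privacy")))
print(list(digram_counts("privacy")))  # ['pr', 'ri', 'iv', 'va', 'ac', 'cy']
```

Summing these counters over all reviews in a record gives the per-record token counts that a matching model would consume.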
Roadmap Introduction Data Set & Problem Settings Linkability Results & Improvements Discussion Future Work & Conclusion
Unigram Results (NB, Unigram) [Plot: Linkability Ratio vs. Anonymous Record Size] • Size 60: LR 83% (Top-1), 96% (Top-10)
Digram Results (NB, Digram) [Plot: Linkability Ratio vs. Anonymous Record Size] • Size 20: LR 97% (Top-1) • Size 10: LR 88% (Top-1)
Improvement (1): Combining Lexical and Non-Lexical Tokens (NB Model) [Plot: Linkability Ratio vs. Anonymous Record Size] • Gain of up to 20% • Size 30: 60% to 80% • Size 60: 83% to 96%
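Under the naive independence assumption, one simple way to combine several token types in an NB model is to add their log-likelihoods. The slides do not say how the combination was done, so the sketch below is only one plausible reading; `combined_score`, the feature names, and the toy probabilities are all hypothetical.

```python
import math

def log_likelihood(ar_counts, ir_probs):
    """Log-probability of one feature's AR counts under an IR's distribution."""
    return sum(n * math.log(ir_probs[t]) for t, n in ar_counts.items())

def combined_score(ar_features, ir_models):
    """Sum per-feature log-likelihoods (assumes features are independent)."""
    return sum(
        log_likelihood(ar_features[name], ir_models[name])
        for name in ar_features
    )


# Toy AR with a lexical feature (letters) and a non-lexical one (rating).
ar_features = {"letters": {"a": 2}, "rating": {5: 1}}
ir_models = {
    "letters": {"a": 0.5, "b": 0.5},
    "rating": {5: 0.25, 1: 0.75},
}
print(combined_score(ar_features, ir_models))
```

Each candidate IR would get its own `ir_models` dictionary, and the IRs would then be ranked by the combined score in decreasing order.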
What about Restricting Identified Record (IR) Size? [Plots: Linkability Ratio vs. Anonymous Record Size, one panel for the NB Model and one for the KLD Model] • Performed better for smaller IR • Affected by IR size • Size 20 or less, improved
Improvement (2): Matching All IRs at Once [Diagram: 4×4 grid of matching values v1 to v16, with entries marked ✔ or ✖]
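The slide's grid suggests that ARs and IRs are matched jointly, so that an IR already claimed by one AR is no longer available to the others. The exact procedure is not recoverable from the slide; the sketch below shows one possible greedy one-to-one matching, with `match_all` and the score layout being assumptions made for illustration.

```python
def match_all(scores):
    """Greedy one-to-one matching of ARs to IRs.

    scores: {ar_id: {ir_id: distance}}, lower distance = better match
            (assumed layout, e.g. KLD values).
    Repeatedly accepts the best remaining (AR, IR) pair and removes both.
    """
    pairs = sorted(
        (dist, ar, ir)
        for ar, row in scores.items()
        for ir, dist in row.items()
    )
    used_ars, used_irs, result = set(), set(), {}
    for dist, ar, ir in pairs:
        if ar not in used_ars and ir not in used_irs:
            result[ar] = ir
            used_ars.add(ar)
            used_irs.add(ir)
    return result


# Matched independently, both ARs would pick u1; joint matching resolves the clash.
scores = {
    "ar1": {"u1": 0.1, "u2": 0.5},
    "ar2": {"u1": 0.2, "u2": 0.3},
}
print(match_all(scores))
```

An optimal assignment (for example, the Hungarian algorithm) would be an alternative design; the greedy version is shown only because it is short and easy to follow.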
Matching All Results [Plots: Linkability Ratio vs. Anonymous Record Size, one panel for Restricted IR and one for Full IR] • Gain of up to 16% • Gain of up to 23% • Size 30: from 74% to 90% • Size 20: from 35% to 55%
Improvement (3): For Small IR Size • Changing it to: 0.5 + Review Length [Plot: Linkability Ratio vs. Anonymous Record Size] • Gain of up to 5% • Size 10: 89% to 92% • Size 7: 79% to 84%
Roadmap Introduction Data Set & Problem Settings Linkability Results & Improvements Discussion Future Work & Conclusion
Discussion • Unigram and Scalability • 26 vs. 676 • 59 vs. 676 • Less than 10% • Prolific Users • In the long run, users will become prolific • Anonymous Record Size • A set of 60 reviews is less than 20% of the minimum contribution • Detecting Spam Reviews
Roadmap Introduction Data Set & Problem Settings Linkability Results & Improvements Discussion Future Work & Conclusion
Future Work • Further Improvement for Small ARs • Other Probabilistic Models • Using Stylometry • Review Anonymization • Exploring Linkability in Other Preference Databases
Conclusion • Extensive Study to Assess Linkability of User Reviews • For a large set of users • Using very simple features • Users are very exposed, even with simple features and a large number of authors • Takeaway Point: Reviews can be accurately de-anonymized using alphabetical letter distributions