
SplitX: High-Performance Private Analytics

SplitX: High-Performance Private Analytics. Ruichuan Chen (Bell Labs / Alcatel-Lucent), Istemi Ekin Akkus (MPI-SWS), Paul Francis (MPI-SWS). Data analytics is important: evaluate system performance, understand user behavior, discover statistical patterns.


Presentation Transcript


  1. SplitX: High-Performance Private Analytics • Ruichuan Chen (Bell Labs / Alcatel-Lucent) • Istemi Ekin Akkus (MPI-SWS) • Paul Francis (MPI-SWS)

  2. Data analytics is important • Evaluate system performance • Understand user behavior • Discover statistical patterns

  3. Data exposure has become a major concern • Third-party trackers • Smartphone apps

  4. User-owned and operated • Data exposure has to be brought under control! • User-owned and operated principle: personal data should be stored in a local host under the user’s control.

  5. Motivation and problem • [Diagram: users each holding their own data; an analyst] • How to make aggregate queries over distributed private user data while still preserving user privacy?

  6. Outline • Related work • SplitX system • Key insights • System design • Performance comparison • Implementation & deployment • Conclusion

  7. A general approach • Based on differential privacy. • Differential privacy adds noise to the output of a computation (i.e., query). • Hide the presence or absence of a user. • [Diagram: Database of user data, Module (add noise), Query, Analyst]

  8. Previous systems • Servers aggregate answers without seeing individual user data. • Differentially private noise is added to the aggregate result. • [Diagram: users' data, Servers, Analysts] • Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11

  9. Primary technical problems • Scale poorly: require public-key operations or something even more expensive (Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11). • Suffer from answer pollution: even a single malicious user can substantially distort the aggregate result through a single answer (Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11).

  10. Outline • Related work • SplitX system • Key insights • System design • Performance comparison • Implementation & deployment • Conclusion

  11. SplitX • A high-performance private analytics system • 2 to 3 orders of magnitude more efficient in bandwidth • 3 to 5 orders of magnitude more efficient in computation • Resistant to answer pollution

  12. Components & assumptions • Clients are user devices. Clients are potentially malicious (distorting the final results). • Servers (1 aggregator and 2 mixes) are honest but curious: 1) follow the specified protocol; 2) try to exploit additional info that can be learned in so doing. • Analysts are potentially malicious (violating user privacy).

  13. Outline • Related work • SplitX system • Key insights • System design • Performance comparison • Implementation & deployment • Conclusion

  14. Key insights: XOR encryption • How to achieve high performance? • Client wants to send M to aggregator • Client splits M, and sends split messages to aggregator via mixes • Aggregator joins split messages to recreate M • [Diagram: Client generates a random R, sends M⊕R via Mix1 and R via Mix2; Aggregator recreates M from the two]

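  15. Key insights: XOR encryption • How to achieve high performance? • For clarity, a single message M drawn from Client to Aggregator denotes that the client sends two split messages of M to the aggregator via Mix1 and Mix2. • [Diagram: full view (Client generates R, split messages pass through Mix1 and Mix2, Aggregator recreates M) next to the simplified view (Client, Mix1, Mix2, Aggregator, labeled M)]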
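
A minimal sketch of the XOR splitting described on slides 14 and 15, assuming the message is a byte string and R is drawn with Python's `secrets` module. The function names `split_message` and `join_message` are hypothetical and are not taken from the SplitX implementation.

```python
import secrets


def split_message(message: bytes) -> tuple:
    """Split a message into two shares; either share alone looks random."""
    r = secrets.token_bytes(len(message))               # client generates R
    masked = bytes(m ^ k for m, k in zip(message, r))   # M XOR R
    return masked, r                                    # one share per mix


def join_message(share1: bytes, share2: bytes) -> bytes:
    """Recreate M at the aggregator: (M XOR R) XOR R == M."""
    return bytes(a ^ b for a, b in zip(share1, share2))


# Hypothetical usage: one share travels via Mix1, the other via Mix2.
share_for_mix1, share_for_mix2 = split_message(b"answer")
recovered = join_message(share_for_mix1, share_for_mix2)
```

Only XOR operations and random bytes are involved, which is consistent with the slides' point that the design avoids public-key operations.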

  16. Key insights: query buckets • How to limit answer pollution? • Solution: • Ensure that a client cannot arbitrarily manipulate answers. • Divide answer’s value range into buckets. • Enforce a binary answer in each bucket.

  17. Key insights: query buckets • Query: “SELECT age FROM splitx” • 4 buckets: 0~19, 20~39, 40~59, and ≥60. • Answers: a ‘1’ or ‘0’ per bucket. • 30-year-old → 0, 1, 0, 0 • Answers encoded in a bit-vector. • An answer from a malicious client cannot substantially distort the query result!
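
The bucket encoding on this slide can be sketched as follows. This is an illustrative example only: the `AGE_BUCKETS` representation (inclusive bounds, `None` for an open upper end) and the function name `encode_answer` are assumptions, not part of SplitX.

```python
# Buckets from the slide: 0~19, 20~39, 40~59, and >=60.
# Each bucket is (low, high) with inclusive bounds; high=None means open-ended.
AGE_BUCKETS = [(0, 19), (20, 39), (40, 59), (60, None)]


def encode_answer(value, buckets):
    """Return a bit-vector with a 1 for each bucket that contains value."""
    bits = []
    for low, high in buckets:
        in_bucket = value >= low and (high is None or value <= high)
        bits.append(1 if in_bucket else 0)
    return bits


answer = encode_answer(30, AGE_BUCKETS)  # [0, 1, 0, 0]
```

Because every position holds a single bit, one answer can change a bucket's count by at most 1, which is what limits answer pollution.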

  18. Outline • Related work • SplitX system • Key insights • System design • Performance comparison • Implementation & deployment • Conclusion

  19. System design 1) Query publish/subscribe • Analyst publishes its queries • Client subscribes to an analyst’s queries 2) Query answering • Client answers queries • Mixes add differentially private noise • Mixes shuffle answers • Aggregator generates query results

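  20. 1) Query publish/subscribe • [Diagram: Client, Mix1, Mix2, Aggregator, Analyst; messages shown: “Query1, Query2, …” (published by the analyst), “Analyst ID” (the client's subscription), and “Query1, Query2, …” (delivered to the client)]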

  21. 1) Query publish/subscribe • Query example: age distribution among male users? • QID: 123 • SQL: SELECT age FROM splitx WHERE gender=‘male’ • Buckets: 0~19, 20~39, 40~59, and ≥60 • DP parameter (ε): 1.0 • Tend: 11:59:59PM on Aug 16, 2013
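
One way to picture the query fields listed on this slide is as a small record. The sketch below is an assumption about representation only: the class name `Query`, the field names, and the use of plain-text bucket labels are hypothetical, and the actual wire format (the deck mentions RPCs defined in Thrift) is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class Query:
    qid: int                   # query ID
    sql: str                   # statement the client runs over its local data
    buckets: Tuple[str, ...]   # bucket labels, one answer bit per bucket
    epsilon: float             # differential privacy parameter
    t_end: datetime            # end time of the query (Tend on the slide)


# The example query from the slide.
example_query = Query(
    qid=123,
    sql="SELECT age FROM splitx WHERE gender='male'",
    buckets=("0~19", "20~39", "40~59", ">=60"),
    epsilon=1.0,
    t_end=datetime(2013, 8, 16, 23, 59, 59),
)
```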

  22. 2) Query answering • Client answers queries • Mixes add differentially private noise • Mixes shuffle answers • Aggregator generates query results

  23. Step 1: client answers queries • Client executes query over its local data and generates an answer • ‘1’ or ‘0’ per bucket • Encoded as a bit-vector

  24. Step 1: client answers queries • Client splits its answer, and sends the split answers with the query ID to the two mixes, respectively. • [Diagram: Client sends “QID, answer” toward the Aggregator via Mix1 and Mix2; Analyst] • Mix knows which query a client answered. Privacy violation!

  25. Step 2: mixes add DP noise • Each mix individually adds some random bit-vectors as the differentially private noise • How many bit-vectors needed? • c: # clients queried • ε: DP parameter • n random bit-vectors as noise

  26. Step 3: mixes shuffle split answers • Each mix maintains c+n split answers • Mixes shuffle the split answers for each column (i.e., bucket) in a synchronized way.
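
Steps 2 and 3 can be sketched together as below. Several things here are assumptions made for illustration: `n` is simply passed in as a parameter, the two mixes are assumed to agree on a `shared_seed` string so that they apply the same permutation to each column, and `random.Random` stands in for whatever secure shared randomness a real deployment would need. The function names are hypothetical.

```python
import random
import secrets


def add_noise_rows(split_answers, n, num_buckets):
    """Step 2: a mix appends n random bit-vectors to its c split answers."""
    noise = [[secrets.randbelow(2) for _ in range(num_buckets)] for _ in range(n)]
    return split_answers + noise


def shuffle_columns(split_answers, shared_seed):
    """Step 3: permute each column (bucket) independently.

    Both mixes call this with the same shared_seed, so row i of a column at
    Mix1 still pairs with row i of that column at Mix2 after shuffling.
    """
    rows, cols = len(split_answers), len(split_answers[0])
    shuffled = [[0] * cols for _ in range(rows)]
    for col in range(cols):
        rng = random.Random(f"{shared_seed}:{col}")  # NOT cryptographically secure
        perm = list(range(rows))
        rng.shuffle(perm)
        for new_row, old_row in enumerate(perm):
            shuffled[new_row][col] = split_answers[old_row][col]
    return shuffled


# Hypothetical demo: c = 3 clients, 4 buckets, n = 4 noise vectors per mix.
answers = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
mix1_rows, mix2_rows = [], []
for answer in answers:
    r = [secrets.randbelow(2) for _ in answer]
    mix1_rows.append([a ^ k for a, k in zip(answer, r)])  # share held by Mix1
    mix2_rows.append(r)                                   # share held by Mix2

mix1_out = shuffle_columns(add_noise_rows(mix1_rows, 4, 4), "demo-seed")
mix2_out = shuffle_columns(add_noise_rows(mix2_rows, 4, 4), "demo-seed")
```

Shuffling per column rather than per row breaks the link between the bits of a single client's answer, while the synchronized permutation keeps each pair of split bits aligned so the aggregator can still join them.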

  27. Mixes transmit shuffled answers • Each mix transmits the shuffled split answers to the aggregator. • [Diagram: Mix1 and Mix2 each send c+n shuffled split answers to the Aggregator; Client and Analyst also shown]

  28. Step 4: aggregator generates query result • Join each bit position in the two split answer arrays. • Sum up the values for each bucket. • Obtain the noisy count for each bucket.
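
A minimal sketch of step 4, assuming each mix delivers its split answers as a list of bit-vectors of equal shape. The function name `join_and_count` and the sample arrays are hypothetical.

```python
def join_and_count(mix1_rows, mix2_rows):
    """XOR matching bit positions from the two arrays, then sum per bucket."""
    num_buckets = len(mix1_rows[0])
    counts = [0] * num_buckets
    for row1, row2 in zip(mix1_rows, mix2_rows):
        for col in range(num_buckets):
            counts[col] += row1[col] ^ row2[col]  # joined bit is 0 or 1
    return counts


# Hypothetical split answer arrays as received from Mix1 and Mix2.
from_mix1 = [[1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
from_mix2 = [[1, 1, 1, 0], [1, 0, 1, 1], [1, 0, 0, 0]]

counts = join_and_count(from_mix1, from_mix2)  # [1, 2, 0, 0]
```

In the full protocol the arrays also contain the noise vectors added by the mixes, so the per-bucket sums are noisy counts rather than exact ones.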

  29. Privacy issue at the mixes • Client splits the answer, and sends the split answers with the query ID to the two mixes • Mix knows which query a specific client answered! • [Diagram: Client sends “QID, answer” toward the Aggregator via Mix1 and Mix2; Analyst]

  30. Solution: double-splitting • [Diagram: two panels showing “QID, answer” passing among Client, Mix1, Mix2, Aggregator, and Analyst]

  31. Duplicate answer detection • A client can answer a query many times! • How to detect and remove duplicate answers? • Triple-splitting is needed • See Section 5 in the paper.

  32. Outline • Related work • SplitX system • Key insights • System design • Performance comparison • Implementation & deployment • Conclusion

  33. Computational overhead • [Table: SplitX compared with PDDP [NSDI’12] and Akkus et al. [CCS’12]; “A” is #buckets that a client reports] • Three to five orders of magnitude more efficient in computation than previous systems

  34. Implementation • Client side • Google Chrome extension • Capture webpages browsed, searches made, extensions installed • Server side (mix + aggregator) • Web services on Jetty • RPCs defined in Thrift language

  35. Deployment • Query results from a 416-client deployment • Most visited websites: google, facebook, youtube • Most used apps: gmail, youtube, google drive • 91% of clients made ≤50 searches / day • 70% of clients visited >50 webpages / day • 97% of clients visited ≤100 websites / day

  36. Conclusion • SplitX: a high-performance private analytics system • Orders of magnitude more efficient than previous systems • Resistant to answer pollution • Key insights • XOR-based encryption • Query buckets
