Benjamin Nguyen, LIFO, INSA Centre Val de Loire Systems and Data Security Team (SDS)

SQL/AA : Private and Scalable Execution of SQL Aggregates on a Secure Decentralized Architecturein TODS 43(1), 2016 Benjamin Nguyen, LIFO, INSA Centre Val de Loire Systems and Data Security Team (SDS) Joint work with Cuong Quoc To & Philippe Pucheral Inria Secure and Mobile Information Systems (SMIS), UVSQ Séminaire Confiance Numérique Clermont-Ferrand 10th October 2016

St Peter's Place, Roma Pope Benedikt People listnening Pope Francis People recording Mobile0,60€ Localisation 1,00€ Shoe size 0,15€  An era of massive generation of (personal) data • Data sources have turned digital • Analog processes, eg., photo • Paper interactions, eg., banking • Mechanical interactions, eg., transport • Communications, e.g., email, skype, sms... • Good news for the economy • $2B/year spend by US companies to buy • data about individuals [Source: Forrester Report] • R-o-I of marketing campaigns • $44.25/$1 [source: Direct Marketers Association] • 77% avg. [source: Marketing Sherpa ] • Companies based on personal data boast high market values • Facebook: $50 / account; Google: $38 billion revenues based on adverts on internet searches; monetisation of buying intentions (Amazon), recruiting market (linkedIn, viadeo), love market (Meetic), ... 2 /41

Targeted ads Profiling Photos Social Buying Mail Bank Sensors Location ?? Quantified self Controlled data exposure Globalcomputing How is personal data currently managed ? • Current Web model • Delegation  privacy issue • Huge data stores  security issue • Fragmentation  issue of completeness • Consensus  more user control • Personal Web: give data back to citizen • Quantified-self • Controlled data exposure • Crowd participation to global computations Security = protect against illegitimate access (hackers) Privacy = protect against illegitimate use (cloud provider, NSA) No access  No use  3 /41

Incentives based on empowerment & monetization ?  Mobile 0,60€ • Personal data do have a commercial value • Monetized by the Web dominant actors • Answer: Vendor Relationship Management • Eg., Ye$Profile: rent personal data (recurring income for citizen) • /!\ Possible negative value of the data • Impact of monetization • Control vs. delegate… Hold vs. centralize… • Enrich vs. fragment • Other potential incentives based on nudges (Exp. Economics) • Interesting questions at the crossroads of • Economics: viable business models, incentives • Legal studies: data property, data status, new laws required (?) • Computer sciences:Secure* PDS, data sharing, distributed comp., etc. Geolocation 1,00€ email 0,15€ Shoe size 0,15€ 4 /41

Where should data be stored to return control to users ? • On personal computers ? • Self-administration, security • On a cloud service ? • May worsen the privacy problem • And let’s not forget the founding principles of privacy preserving architectures : • User centric • Preserving users’ data ownership • Compliant with legislation and worldwide privacy preservation directives • Preserving usage • Based on (new) incentives to favor acceptance 5 /41

Where should data be stored to return control to users ?  On PlugDB !  SD card Smartcard (data) (secrets) Bluetooth MCU (data managt) Fingerprint reader USB On personal computers ? Self-administration, security On a cloud service ? May worsen the privacy problem PlugDB: a secure personal Web which… is self-administered interfaces with user devices stores, indexes, queries, recovers data authenticates, evaluates access & usage control …and offers strong privacy and security guarantees 6 /41

The Secure (Trusted) Personal Data Server Approach [VLDB’10] : Securely Store and query data locally TRUSTED CELL • Personal database is • Well-organized • Tamper resistant • Controlled by the owner(sharing, retention, audit) • Accessible in disconnected mode • Approach characteristics : • Based on tamper-resistant HW • Well Structured World (R-DB, limited apps) • Uniform equipment /41 7

PART IITrusted Cells • Introduction • Trusted Cells • Global SQL Queries • Cost Model and Experiments • Conclusion

Why do we trust trusted cells? • They can be self-managed platform • No DBA attack • + trusted hardware • Tamper-resistant / strengthened hardware, open hardware • + trusted software • Single software, small footprint (potential formal proof), open code • Isolated from applications • + protected also from its owner (opens door to usage control) • Users do not have all privileges… •  Good basis : ratio cost/benefit of an attack is high ! 9 /41

The Trusted Cell Asymmetric Architecture [CIDR’13] : Distributed management of securely stored data • TC asymmetric architecture • Built using Secure Portable Tokens as Trusted Cells (called here Trusted Data Server or TDS) / Cloud as Supporting Server Infrastructure (SSI). • Challenges : Local (Embedded) data management (not my work : Anciaux, Bouganim, Pucheralet al.) Global querying (Part III) Data export management (MinExp Project with CG78 & LIX, Anonymisation with Allard & Michel Ph.D.s) ASYMMETRIC HIGH POWER / AVAILABILITYLOW / NO TRUST LOW POWER / AVAILABILITYHIGH TRUST Export Data Secure Computation Durability, Availability Encrypted Private Data Generated (e.g. sensor) /41 10

PART IIIGlobal SQL Queries on the Asymmetric Architecture • The New Oil • Trusted Cells • Global SQL Queries • Cost Model and Experiments • Conclusion

Example Trusted Cell : a Trusted Data Server (TDS) Average Salary in Clermont-Ferrand Authorized Querier Unauthorized Querier How to compute global queries over decentralized personal data stores while respecting users’ privacy? 12 /41 Token Characteristics : • High security: • High ratio Cost/Benefit of an attack; • Secure against its owner; • Modest computing resources (~10Kb of RAM, 50MHz CPU); • Low availability: physically controlled by its owner; connects and disconnects at it will

Secure Global Computation on TCs PROBLEM : How to perform global queries on the asymmetric architecture? (i.e. using data from many/all cells) THREAT MODEL : Infrastructure (SSI) can be : Honest but curious : illicit data access Malicious (Covert Adversary) : DoS • TC can be : • Unbreakable (honest) • Broken (Malicious) HBC + Unbreakable  “simple protocols” presented here (see EDBT’14) Mal (+ Broken)  Must be prevented ! (via security primitives, see TODS’16 and DAPD’13) The « classical » problem of Secure Global Computation (e.g SMC) is more general and makes no trust assumption. /41

Is this a new problem ? Several approaches are possible to securely perform global computations: Use only an untrusted server/cloud/P2P and use generic (and costly) algorithms. (e.g. Secure Multi-Party Computing [Yao82, GMW87, CKL06], fully homomorphic encryption [Gent09]) Problem = COST Use only an untrusted server/cloud/P2P and develop a specific algorithm for each specific class of queries or applications. (e.g. DataMining Toolkit [CKV+02]) Problem = GENERICITY Introduce a tangible element of trust, through the use of a trusted component and develop a generic methodology to execute any centralized algorithm in this context. ([Katz07, GIS+10, AAB+10])  Problem = TRUST /41

Attack Model 15 /41 Querier: • Shares the secret key with TDSs (for encrypt the query & decrypt result). • Classical Access control policy (e.g. RBAC): • Cannot get the raw data stored in TDSs (get only the final result) • Can obtain only authorized views of the dataset ( do not care about inferential attacks) Supporting Server Infrastructure: • Doesn’t know query (so, attributes in GROUP BY clause) b/c query is encrypted by Querier before sending to SSI. • Has prior knowledge about data distribution. • Honest-but-curious attacker: Frequency-based attack • Malicious attacker: detect to deter

Frequency based attacks : Example Frequency based attacks [DCJ+03, CDC+05] Assume a set of (encrypted) values 1- order by frequency :  2- compare to existing known distributions of attributes (1, 1, 1, 1, 1, 1) (2, 1, 1, 1, 1) (3, 1, 1, 1) 3- infer the attribute & value(s) concerned IC=customer, =Alice, IB=balance, =200 4- link to other values /41 16

Quantifying an attack : Information Exposure Analysis (DCJP+03) • To measure Information Exposure, we consider the probability that an attacker (here the Honnest but Curious SSI) can reconstruct the plaintext table (or part of the table) using the encrypted table and his prior knowledge about global distributions of plaintext attributes. • Information Exposure is noted : • n is the number of tuples • k is the number of attributes • ICi,jis the value in row i and column j of the inverse cardinality ( = 1/number of plaintext values that could correspond) • Nj is the number of distinct plaintext values in the global distribution of attribute in column j (i.e., Nj≤ n). 17 /41

Simple Solution Overview Paul, 100K 3) Aggregation phase SSI groups John, 35K Mary, 43K … + TDS Decypher Broadcast query to TDS 4) Aggregate Filtering phase 2) Collection and Filtering phase Supporting Server Infrastructure (SSI) (#x3Z, 34) (#x3Z, 15) ($&1z, 48) … ($&1z, 24) SSI performs grouping SELECT age, AVG(salary) FROM user WHERE town = “Clermont” GROUP BY age HAVING MIN(salary) > 0 SIZE SELECT <attribute(s) and/or aggregate function(s)> FROM <Table(s) / SPTs> [WHERE <condition(s)>] [GROUP BY <grouping attribute(s)>] [HAVING <grouping condition(s)>] [SIZE <size condition(s)>]; 5) Result 1) Query 18 Stop condition: max #tuples or max time /41

Proposed Solutions 19 /41 The main difficulty is with AGGREGATE QUERIES !! Solutions vary depending on which kind of encryption is used, how the SSI constructs the partitions, and what information is revealed to the SSI. • Secure aggregation solution (presented briefly here) • Noise-based solutions (see paper) • random (white) noise • noise controlled by the complementary domain • Histogram-based solutions (see paper) We investigate these solutions along the directions of performance and security.

Noise Based Protocols • Objective : • Det_Enc on AG frequency-based attack. • Idea : • Add noise (fake tuples) to hide distribution of AG. • How many fake tuples (nf) needed?  disparity in frequencies among AG • small nf: random noise (here nf=10) • big nf: white noise ( nf>|max(x)-min(x)|2 ) • nf = n-1 per tuple : controlled noise (here nf=10*55=550) • Efficiency: • Each TDS handles tuples belonging to one group (instead of large partial aggregation as in SAgg) • However, high cost of generating and processing the very large number of fake tuples /41 Secure Aggregation Efficiency problem : nDet_Enc on AG SSI cannot gather tuples belonging to the same group into same partition.

Nearly Equi-Depth Histogram Solution True Distribution Nearly equi-depth histogram • We do not generate & process too many fake tuples • We do not handle too large partial aggregation Problem : Distribution must be discovered  This can be done “offline” using secure aggregation ! 21 /41 • Distribution of AG is discovered and distributed to all TDSs. • TDS allocates its tuple to corresponding bucket. • TDS send to SSI: {h(bucketId),nDet_Enc(tuple)} Consequences :

Secure Aggregation Q: SELECT Age, AVG(Salary) WHERE city = Clermont GROUP BY Age HAVING Min(Salary) > 0 (#x3Z, aW4r) ($f2&, bG?3) ($&1z, kHa3) (25, 35K) (45, 43K) (45, 37K) (F!d2, s7@z) (ZL5=, w2^Z) (25, [35K,1]) (45, [40K,2]) Decrypt Qi Check AC rules Decrypt Qi Check AC rules Decrypt Qi Check AC rules Hold partial aggregation (Gij,AGGk) … (#i3Z, afWE) (T?f2, s!@a) ($f2&, bGa3) (45y, Clermont, 43K) (53y, Paris, 100K) (25y, Clermont, 35K) (?i6Z, af~E) (T?f2, s5@a) (5f2A, bG!3) Evaluate HAVING clause encrypts its data using non-deterministic encryption ($f2&, bG?3) Final Agg (#f4R, bZ_a) (Ye”H, fw%g) (@!fg, wZ4#) Final Result (#f4R, bZ_a) (Ye”H, fw%g) (#x3Z, aW4r) No answer ? Supporting Server Infrastructure (SSI) Form partitions (fit resource of a TDS) } (#x3Z, aW4r) ($f2&, bG?3) ($&1z, kHa3) … (T?f2, s5@a) 22 Qi= <EK1(Q),Cred,Size> (25, 29.5K) (45, 43.7K) … /41 Querier

Information Exposure Analysis (Damiani et al. CCS 2003) • n: the number of tuples, • k: the number of attributes, • ICi,j : IC for row i and column j • Nj: the number of distinct plaintext values in the global distribution of attribute in column j (i.e., Nj≤ n) SAgg: ICi,j = 1/Nj for all i,j EDHist: requires finding all possible partitions of the plaintext values such that the sum of their occurrences is the cardinality of the hashed value: NP-Hard multiple subset sum problem 23 Noise_based & ED_Hist have a uniform distribution of the AG: ɛED_Hist= ɛNoise_based Plaintext: ɛS_Agg≤ ɛED_Hist =ɛNoise_based <1

PART IVCost Model and experiments • The New Oil • Trusted Cells • Global SQL Queries • Cost Model and Experiments • Conclusion

Unit Test Calibration } • Eval Board • 32 bit RISC CPU: 120 MHz • Crypto-coprocessor: AES, SHA • 64KB RAM, 1GB NAND-Flash • USB full speed: 12 Mbps SMIS developed token (manufactured by SMEs) Same technical characteristics Price = 50-300 EUR (small series, depending on caracteristics) 25 /41 Internal time consumption

Parameters for cost model Dataset size Ttuple : varies from 5 to 65 million Number of groups G : varies from 1 to 106 Number of TDSs participating in the computation as a percentage of all TDSs connected at a given time Ttds : varies from 1% to 100%). We fix two parameters and vary the other, measuring : execution time, parallelism of the protocol, total load, maximum load on one TDS When the parameters are fixed : Ttuple=106, G=103, % of TDS connected = 10% of Ttuple. We also compute and use the optimal value for all reduction factors as well as for. In the figures, we plot two curves for Rnf_Noise protocols RN (nf = 2) and WN (nf = 1000) to capture the impact of the ratio of fake tuples. /41

EXECUTION TIME Ttuple=106; G=1-106 Ttuple=5.106 - 35.106; G=1000 • Naïve, noise-based, ED&EW: • G increases, Ttuple fixed  Number of tuples in each group decreases • Depend only on the total number of tuples in each group (because all groups are processed in parallel)  exeTime decreases when G increases. • Naïve, RN, ED&EW: • Ttuple increases, Ttds increases accordingly  not much changes • Secure Count: • Number of recursive steps increases when Ttuple increases.  exeTime increase • Secure Count: • G increases  time for processing the big partial aggregation increases accordingly. • Cannot fully deploy the parallel computation (cannot divide each group for TDSs in parallel, each TDS has to handle the whole G groups)  exeTime increases • WN,CN: • Number of fake tuples increases linearly with the number of true tuples. •  exeTime also increases linearly to handle the fake & true tuples 27 /41

NUMBER OF PARTICIPATING TDSS Ttuple=106; G=1-106 Ttuple=5.106 - 35.106; G=1000 • Secure Count: • G increases  level of convergence is low & the size of each aggregation is big •  need less participating TDSs to build the aggregations to gain the high convergence level • WN, CN: • When true Ttuple increases, the fake tuples increases as well  more TDSs are needed to process fake tuples • Secure Count: • Level of parallelism is less than other solutions  needs least TDS • Other solutions: • Since each group is processed in parallel and independently  when G increases, the level of parallelism increases •  more TDSs are needed to participate in the parallel computation 28 /41

TOTAL LOAD (NETWORK OVERHEAD) Ttuple=106; G=1-106 Ttuple=5.106 - 35.106; G=1000 • Noised-based: • Highest load because of the fake tuples • When G increases but Tpds does not change • number of tuples (both true and fake) do not change total load is the same • Others: • Lower load since handle only true data • Noised-based: • When true Ttuple increases, the fake tuples increases linearly • total load is highest and increases 29 /41

MAXIMUM LOAD Ttuple=106; G=1-106 Ttuple=5.106 - 35.106; G=1000 • WN, CN: • Use all available PDSs •  maxLoad increases linearly when Ttuple increases • Others: • when Ttuple increases, the number of participating PDSs also increase accordingly  in general, the maxLoad does not increase too much • Secure Count: • When G increases, size of each aggregation is big • each PDS process bigger aggregation • When G increases, number of participating PDSs decrease  each participating PDS incurs higher load • Others: • When G increases, number of participating PDSs decrease & number of tuples in each group decreases • each PDS process less tuples  maxLoad decrease 30 /41

AVERAGE LOAD Ttuple=5.106 - 35.106; G=1000 Ttuple=106; G=1-106 • Secure Count: • Total load is unchanged but the number of participating TDSs is reduced when G increases •  the average load increases. • WN,CN: • High total load is the same & all PTpds=10^5 participate in the computation •  every PDSs incur the same amount of load • Others: • G increase, more participating PDSs & total load unchanged  AvgLoad decreases Although: TotalLoad(CN) > TotalLoad(SC) PTpds(CN) >> PTpds(SC)  AvgLoad(CN) < AvgLoad(SC) 31 /41

CONSUMED MEMORY Actual RAM size of TDS • Noise-based: • Need to store only 1 group regardless of G •  Require least RAM. • Histogram-based: • Each PDS store h groups (h>1) regardless of G •  Require higher RAM • SC: • Each PDS store all G groups • When G increases, RAM needed increases •  Require highest RAM • Exceed actual RAM’s size  future work 32 /41

Experimental Scalability (experiments on LIPN cluster)

COMPARISON WITH OTHER STATE-OF-THE-ART METHODS Answering aggregation queries in a secure system model. (Ge & Zdonic, VLDB 2007) DES: each value is decrypted and the computation is performed on the plaintext. Server must have access to secret key & plaintext (violates security requirements) Paillier: perform computation directly on the ciphertext using a secure homomorphic encryption scheme: enc(a + b) = enc(a) + enc(b) Server performs computation without having access to the secret key or plaintext. In the end, ciphertext are passed back to the trusted agent (i.e., Key Holder) to perform a final decryption and simple calculation of the final result • Hardware: • Linux workstation; • AMD Athlon-64 2Ghz processor; • 512 MB memory • SC: depends mostly on G (slightly on Ttuple) • Others: not depends on G, but mostly on Ttuple 34 /41

Metrics for the evaluation of the proposed solutions Average Time/Load Query Response Time Throughput Information Exposure Resource Variation 35 /41 Total Load

Trade-off between criteria 36 /41 Select .. From .. Where .. Group By AG G = card (AG) Security: S_Agg > ED_Hist Performance: G > 10: ED_Hist faster than S_Agg G <= 10: ED_Hist slower than S_Agg

PART VConclusion and perspectives • The New Oil • Trusted Cells • Global SQL Queries • Cost Model and Experiments • Conclusion

Short/Middle term research :Data intensive Computing on an Asymmetric Architecture SQL (With SMIS) Queries here do not have joins ! Take into account more attack models (e.g. Broken Tokens) Field experiment on usability (with ISN / A. Katsouraki PhD thesis) Private/Secure MapReduce (With LIPN -- some results in Coopis’15) Investigate compatibility of our protocols. Develop new protocols. Check performance ! Secure Graph computations (With LIX) Study social networking applications Secure K-core and k-truss computations (Rossi PhD thesis) Anonymisation Axel Michel’s Ph.D. : participation in aggregate queries with user constraints (k-anon) /41

Current results • Important features already devised (Inria + INSA CVL) • Authentification, Relational DBMS + RBAC, Search engine, (global) anonymization, distributed SQL, secure data recovery… • Contact me for refs  • Many interesting perspectives • Usage control & enforcement, data sharing schemes, complex computation (sandboxes), other data types (eg, time series), global computing models (eg, Map Reduce), etc. • Integrated into a platform for education (PlugDB) • “Systèmes d’Information Privacy-by-Design” at ENSIIE, UVSQ, INSA CVL : Tutorials, exercises, project topics… 39 /26

Promoting the Trusted Cells vision Trusted Cells “Core” Important features already devised (Inria + INSA CVL) Authentification, Relational DBMS + RBAC, Search engine, (global) anonymization, distributed SQL, secure data recovery… Contact me for refs  Many interesting perspectives Usage control & enforcement, data sharing schemes, complex computation (sandboxes), other data types (eg, time series), global computing models (eg, Map Reduce), etc. Integrated into a platform for education (PlugDB) “Systèmes d’Information Privacy-by-Design” at ENSIIE, UVSQ, INSA CVL : Tutorials, exercises, project topics…  develop a community ! Beyond Tamper Resistant HW Results are useable even with lower trust elements. Include social trust / reputation. Use virtualization. /41

QUESTIONS ? 41

AVERAGE TIME FOR PDS TO CONNECT Ttuple=106; G=1-106 Ttuple=5.106 - 35.106; G=1000 • Secure Count: • The number of participating PDSs is reduced when G increases •  the average time increases. • WN,CN: • High total load is unchanged & all PTpds=10^5 participate in the computation •  every PDSs take the same amount of time to process data • Others: • G increase, more participating PDSs •  AvgTime decreases • High AvgTime: • WN,CN: because of too many fake tuples • SC: because of very few participating PDSs 43 /41

Theoretical Scalability Tpds = 1%Ttuple Tpds = 10%Ttuple Tpds = 100%Ttuple Secure Count: has a (low) maximum number of participants. Others: WN have higher scalability than others (in the sense that adding participants count) 44 /41

Benjamin Nguyen, LIFO, INSA Centre Val de Loire Systems and Data Security Team (SDS)