Mário S. Alvim Ph.D. Thesis Defense École Polytechnique – LIX Supervised by Catuscia Palamidessi

Formal approaches to information hiding: an analysis of interactive systems, statistical disclosure control, and refinement of specifications

Part I Introduction Ph.D. Defense - Mário S. Alvim

Information hiding
In many cases the broad and efficient dissemination of information is desirable. But in several situations it is undesirable, or even unacceptable, that part of the information be leaked. Information hiding deals with the problem of keeping secret part of the information processed by a computational system.

Subfields of information hiding
• Subfields of information hiding vary depending on:
  • What one wants to keep secret (an individual's identity? a message's contents? the link between an individual and an action?);
  • From which adversary or attacker (an external entity? a user of the system?);
  • How powerful the adversary is (can he only observe the system? can he interact with the system?).
• The subfields are not mutually exclusive.
• We observe an increasing convergence in the research.

Our focus
• Information flow: protecting the secret information w.r.t. what can be deduced from the observable behavior of the system.
  • Ex: Election system. Secrets: Alice -> X, Bob -> X, Cindy -> Y. Observables: X=2, Y=1, time, heating.
• Statistical disclosure control: protecting individual information within a statistical sample.

The qualitative approach
• By observing the system's behavior, the adversary cannot be sure of what the secret is.
• The principle of confusion: “For every observable output generated by a secret input value, there is another secret value that could also have generated the same output.”
• Does not take into consideration the adversary's level of (un)certainty about the secret.
• Noninterference: the secrets do not alter the observable behavior of the system.
  • Unachievable in practice.
[Figure: partitioning]

The quantitative approach
• Takes into consideration the level of (un)certainty of the adversary.
• Allows us to compare two systems w.r.t. the level of security they provide.
• Makes use of probabilities.
• Main approaches:
  • Bayes risk
  • Information theory (our focus in this thesis)

Plan of the presentation
Part II Information theory as a framework for information leakage
Part III Information flow in interactive systems
Part IV Differential privacy: the trade-off between privacy and utility
Part V Safe equivalences for security properties
Part VI Conclusion

Part II Information theory as a framework for information leakage

Information theory and communication
• Information theory originally focused on how to transmit information through unreliable (or noisy) channels.
• It allows us to reason about:
  • the degree of uncertainty of a random variable;
  • the amount of information one random variable carries about another random variable.

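Noisy channels
A noisy channel consists of:
• a finite input alphabet;
• a finite output alphabet;
• for each input and each output, the probability of that output given that input;
• the channel matrix, which collects these conditional probabilities (one row per input, one column per output).
In the security reading, the secrets are the channel input, the system's behavior is the channel, and the observables are the channel output.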
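
As a rough illustration of the channel-matrix view, the sketch below encodes a small channel as a list of rows and derives the distribution on observables from a prior on secrets. The names `channel`, `is_channel_matrix` and `output_distribution`, and all the numbers, are invented for this example and do not come from the slides.

```python
# Hypothetical channel with 2 secrets (rows) and 3 observables (columns).
# Entry channel[s][o] plays the role of p(o | s); the values are made up.
channel = [
    [0.5, 0.3, 0.2],  # distribution on observables when the secret is s1
    [0.1, 0.1, 0.8],  # distribution on observables when the secret is s2
]


def is_channel_matrix(matrix, tol=1e-9):
    """Check that every row is a probability distribution over the outputs."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def output_distribution(prior, matrix):
    """Compute p(o) = sum over s of p(s) * p(o | s)."""
    n_out = len(matrix[0])
    return [
        sum(prior[s] * matrix[s][o] for s in range(len(prior)))
        for o in range(n_out)
    ]


if __name__ == "__main__":
    print(is_channel_matrix(channel))
    print(output_distribution([0.5, 0.5], channel))
```

The matrix itself does not depend on the prior; only the resulting output distribution does.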

Information leakage
• General principle: leakage = initial uncertainty - remaining uncertainty.
• The uncertainty can be measured in different ways, corresponding to different models of attack.
• Models of guessing attacks (Köpf and Basin):
  • The adversary wants to determine the value of a random variable.
  • He can ask (adaptively) several yes/no questions to an oracle: a subsequent question may depend on the answer to a previous question.
  • The attacker knows the a priori distribution of the random variable.

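Shannon entropy
Leakage as mutual information: leakage (mutual information between secret and observable) = initial uncertainty (Shannon entropy of the secret) - remaining uncertainty (conditional entropy of the secret given the observable).
• Meaning in security:
  • The adversary can ask questions of the type “Does the secret belong to a given set?”
  • The entropy of the secret is a lower bound on the expected number of questions necessary to determine its value.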
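
A minimal sketch of Shannon leakage for a channel given as a list of rows, assuming the secret is called X and the observable Y (these names, and the function names, are assumptions made for the example). It computes I(X;Y) = H(X) - H(X|Y) directly from the prior and the channel matrix.

```python
from math import log2


def shannon_entropy(dist):
    """H(p) = -sum p_i * log2(p_i), skipping zero-probability entries."""
    return -sum(p * log2(p) for p in dist if p > 0)


def shannon_leakage(prior, channel):
    """Mutual information I(X;Y) = H(X) - H(X|Y) in bits."""
    n_in, n_out = len(channel), len(channel[0])
    # Joint distribution p(x, y) = p(x) * p(y | x).
    joint = [[prior[x] * channel[x][y] for y in range(n_out)] for x in range(n_in)]
    p_y = [sum(joint[x][y] for x in range(n_in)) for y in range(n_out)]
    # H(X|Y) = sum over y of p(y) * H(X | Y = y).
    remaining = 0.0
    for y in range(n_out):
        if p_y[y] > 0:
            posterior = [joint[x][y] / p_y[y] for x in range(n_in)]
            remaining += p_y[y] * shannon_entropy(posterior)
    return shannon_entropy(prior) - remaining


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # observable reveals the secret
    constant = [[0.5, 0.5], [0.5, 0.5]]  # observable is independent of the secret
    print(shannon_leakage([0.5, 0.5], identity))
    print(shannon_leakage([0.5, 0.5], constant))
```

The two test channels are the extreme cases: a channel that reveals the secret leaks the whole initial entropy, and a channel whose rows are identical leaks nothing.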

Rényi min-entropy
Leakage as min-entropy leakage (Smith): leakage = initial uncertainty (min-entropy of the secret) - remaining uncertainty (conditional min-entropy of the secret given the observable).
• Meaning in security:
  • One-try attack: “Is the secret equal to a given value?”
  • Closely related to the Bayes risk.
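
The one-try reading can be sketched in code, assuming Smith's formulation in which leakage is the logarithm of the ratio between the adversary's best guessing probability after and before the observation. The function and variable names are assumptions for this example.

```python
from math import log2


def min_entropy_leakage(prior, channel):
    """Min-entropy leakage in bits: log2(posterior vulnerability / prior vulnerability)."""
    n_in, n_out = len(channel), len(channel[0])
    # Best chance of guessing the secret in one try, before observing anything.
    prior_vulnerability = max(prior)
    # After observing y the adversary guesses the most likely secret;
    # summing max_x p(x, y) over y gives the overall success probability.
    posterior_vulnerability = sum(
        max(prior[x] * channel[x][y] for x in range(n_in))
        for y in range(n_out)
    )
    return log2(posterior_vulnerability / prior_vulnerability)


if __name__ == "__main__":
    uniform4 = [0.25, 0.25, 0.25, 0.25]
    identity4 = [[1.0 if x == y else 0.0 for y in range(4)] for x in range(4)]
    print(min_entropy_leakage(uniform4, identity4))
```

One minus the posterior vulnerability is the probability of error of the best one-try guess, which is how this measure connects to the Bayes risk.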

Part III Information flow in interactive systems

The problem of interactivity
• So far the information-theoretic approach has been applied only to systems where secrets do not depend on observables.
• In interactive systems secrets and observables can interleave and influence each other:
  • Auction protocols, web applications, command line programs, etc.
• In such systems the classic information-theoretic approach fails.

The problem of interactivity: an example
Web-based application:
• A seller can offer a cheap or an expensive product (observables).
• Two possible buyers: rich or poor (secrets).
• What is the channel matrix?
[Figure: interaction tree over seller offers (expensive, cheap) and buyer types (poor, rich), with branch probabilities 0.5, 0.5, s, t, s', t'. Two instances are shown: s=0.4, t=0.6 and s=0.1, t=0.3.]
• The channel matrix is not invariant w.r.t. the input distribution.
• Capacity can no longer be calculated.

Our contribution
• Extend the classic information-theoretic approach to interactive systems:
  • Modelling systems as Interactive Information-Hiding Systems (IIHSs);
  • Using channels with memory and feedback;
  • Re-interpreting the leakage in this more general scenario, finding a more adequate definition of leakage.
• Show that the capacity of the channels associated with IIHSs is a continuous function with respect to the Kantorovich metric.

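Some necessary technicalities
• We fix a set of symbols (an alphabet).
• In a sequence of symbols, each symbol is indexed by the time at which it occurs.
• The joint probability of the input and output sequences up to a given time contains all the information about the joint behavior of the sequences of inputs and outputs up to that time.
• By the laws of probability, this joint probability factors, at each time step, into two terms: the probability of the current input given the past inputs and outputs (feedback), and the probability of the current output given the inputs so far and the past outputs (memory).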
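
The factorization into a feedback term and a memory term can be written out as follows. The notation is an assumption, since the slide's symbols did not survive: $\alpha^t = \alpha_1 \ldots \alpha_t$ stands for the input sequence up to time $t$, $\beta^t$ for the output sequence, and $T$ for the final time. The identity itself is just the chain rule of probability applied in time order.

```latex
p(\alpha^T, \beta^T)
  = \prod_{t=1}^{T}
    \underbrace{p(\alpha_t \mid \alpha^{t-1}, \beta^{t-1})}_{\text{feedback}}
    \;
    \underbrace{p(\beta_t \mid \alpha^{t}, \beta^{t-1})}_{\text{memory}}
```

When the first factor does not depend on $\beta^{t-1}$ there is no feedback, and when the second depends only on $\alpha_t$ the channel is memoryless, which recovers the classic setting.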

Channels with memory and feedback
[Figure: block diagram with code-functions, the “interactor”, stochastic kernels, and a delay.]
• Mutual information can be split into its components:
  • directed information from input to output;
  • directed information from output to input.
• It can be shown that the mutual information is the sum of these two directed informations.
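
For reference, the split of mutual information can be sketched as below. The notation is assumed, not taken from the slide: $A^T$ and $B^T$ are the random input and output sequences up to time $T$, and the output-to-input direction uses outputs delayed by one step, matching the delay in the feedback loop. The last line follows from the chain rule for entropy.

```latex
\begin{align*}
I(A^T \to B^T) &= \sum_{t=1}^{T} I(A^{t}; B_t \mid B^{t-1}) \\
I(B^T \to A^T) &= \sum_{t=1}^{T} I(A_t; B^{t-1} \mid A^{t-1}) \\
I(A^T; B^T)    &= I(A^T \to B^T) + I(B^T \to A^T)
\end{align*}
```

Without feedback the second sum vanishes, so directed information from input to output coincides with ordinary mutual information.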

Modelling IIHSs as channels with memory and feedback
• Theorem: Given a fully probabilistic IIHS, it is always possible to construct a joint probability distribution such that the behavior of the channel always equals the behavior of the IIHS.
• And a corollary shows how to construct it.
[Figure: block diagram with code-functions, the “interactor”, stochastic kernels, and a delay. Annotations: “Comes directly from the IIHS”; “Deterministic: how to embed into it?”; “Combine altogether in a new joint probability distribution”.]

Leakage
• In the classical information-theoretic approach: leakage = a priori uncertainty of the input distribution - a posteriori uncertainty.
• In channels with memory and feedback: leakage = a priori uncertainty of the “reactor” - a posteriori uncertainty.
• The worst-case leakage is the capacity of the channel: the maximum leakage over the set of all possible input distributions.

Part IV Differential privacy: the trade-off between privacy and utility

Statistical databases
• A statistical database is a collection of data about several participants.
• Users of the database can ask statistical queries, such as:
  • Average height, maximum salary, most common disease.
• Usually we consider the global information relative to the database as public, while the individual information about a participant is private.

An example
• A statistical database contains the salaries of several employees.
• A user has some side information (previous knowledge from newspapers, common sense, previous queries, etc.):
  • There are 100 people in the database (counting query).
  • The average salary is 3,000 € (average query).
• Then Robert is included in the database. The user repeats the queries and finds out that the average salary is now 3,050 €.
• She can then conclude that Robert earns 8,050 €: privacy breach!
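
The inference in this example is plain arithmetic on the two query answers, sketched below. The function name `new_member_salary` is made up for the illustration.

```python
def new_member_salary(count_before, average_before, average_after):
    """Recover the value of a newly added record from two average queries."""
    total_before = count_before * average_before
    total_after = (count_before + 1) * average_after
    return total_after - total_before


if __name__ == "__main__":
    # 100 people at an average of 3,000, then 101 people at an average of 3,050.
    print(new_member_salary(100, 3000, 3050))  # prints 8050
```

The attack needs nothing beyond exact answers to a counting query and two average queries, which is why exact answers are replaced by noisy ones.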

General problem
• How to ensure that the queries provide statistical information about the whole sample without harming the privacy of the participants?
• Usually it is done by adding randomization: instead of reporting the real answer for the query, a noisy answer is reported to the user.
• The noise is carefully added to obfuscate the link between the values of participants in the database and the reported answer to the query.
• Yet the noise should avoid reporting answers that are “too far away” from the real answers.

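A model of utility and privacy
• Participants: a set of individuals.
• Values: the set of values a participant can take (absence is included as a special symbol, e.g. null).
• Universe of databases: all possible databases over these participants and values.
• Randomized function: maps each database to a reported answer.
[Figure: the ε-d.p. randomized function seen as a channel from dataset to reported answer.]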

Differential Privacy
• Differential privacy [Dwork]: the effect of the presence of any individual in a database will be negligible, even when an adversary has auxiliary knowledge.
• We can also consider presence/absence of any individual, or his value.
• It is a strong statistical guarantee.
• Formally (discrete case):
  • Two databases differing on the presence/value of at most one row are called neighbors or adjacent.
  • A randomized function provides ε-differential privacy if, for every pair of adjacent databases and for every possible answer to the query, the probability of reporting that answer on one database is at most e^ε times the probability of reporting it on the other.
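
The discrete definition can be checked mechanically when the randomized function is given as a matrix, with one row per database and one column per reported answer. The sketch below assumes that representation; the name `satisfies_dp` and the randomized-response example are illustrative choices, not taken from the slides.

```python
from math import exp, log


def satisfies_dp(mechanism, adjacent_pairs, epsilon, tol=1e-12):
    """Check Pr[answer | x] <= e^epsilon * Pr[answer | x'] for all adjacent x, x'.

    mechanism[x][z] is the probability of reporting answer z on database x.
    adjacent_pairs lists index pairs (x, x') of neighboring databases.
    """
    bound = exp(epsilon)
    for x, x_prime in adjacent_pairs:
        for p, q in zip(mechanism[x], mechanism[x_prime]):
            # The condition is symmetric, so test both directions.
            if p > bound * q + tol or q > bound * p + tol:
                return False
    return True


if __name__ == "__main__":
    # Hypothetical one-bit database: report the true bit with probability 0.75.
    randomized_response = [[0.75, 0.25], [0.25, 0.75]]
    print(satisfies_dp(randomized_response, [(0, 1)], log(3)))
    print(satisfies_dp(randomized_response, [(0, 1)], log(2)))
```

For this matrix the largest ratio between corresponding entries of the two rows is 3, so the check passes for ε = ln 3 and fails for ε = ln 2.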

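A model of utility and privacy
Oblivious mechanisms: the reported answer depends only on the real answer, and not on the database.
[Figure: dataset -> (query) -> real answer -> (randomization mechanism) -> reported answer. The composition of query and randomization mechanism is the ε-diff. priv. randomized function. Leakage relates the dataset to the reported answer; utility relates the real answer to the reported answer.]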

Our contribution
(1) Does ε-d.p. induce a bound on the information leakage of the randomized function?
(2) Does ε-d.p. induce a bound on the information leakage relative to an individual? (In the worst-case scenario where the attacker knows the values of all other participants.)
(3) Does ε-d.p. induce a bound on the utility?
(4) Given a query and a value of ε, can we construct a randomized function satisfying ε-d.p. and also presenting maximum utility?

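The adopted measures of utility and leakage
• Leakage is modeled as min-entropy leakage.
• Utility is modeled with gain functions: the utility is the expected gain of the user's guess with respect to the real answer.
  • Binary gain function: the gain is 1 if the guess equals the real answer, and 0 otherwise.
  • In the binary case the utility is the probability of a correct guess, i.e. one minus the Bayes risk.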

Methodology
• The adjacency relation on the database domain induces a graph.
• The relation can be extended to the domain of real answers: if two databases are adjacent, then their real answers are adjacent too. This also gives a graph.
• We consider two special types of graphs, one of which is the class of distance-regular graphs.

Some theorems
• Given a channel, we perform transformations which:
  • Are valid for the uniform input distribution;
  • Preserve the a posteriori min-entropy;
  • Provide ε-d.p.
• This allows us to find very regular matrices.
• And therefore a bound, stated both for any graph and for distance-regular graphs.

The proof technique
• The previous theorems can be applied to any channel.
• Leakage: we apply the theorems to the channel from datasets to reported answers.
• Utility: we apply the theorems to the channel from real answers to reported answers.
[Figure: dataset -> (query) -> real answer -> (randomization mechanism) -> reported answer, with leakage measured on the whole ε-diff. priv. randomized function and utility on the randomization mechanism.]

The bounds
• Leakage: we apply the theorems to the channel from databases to reported answers.
  • Proposition: the graph on databases is distance-regular and also of the second special type.
• Utility: we apply the theorems to the channel from real answers to reported answers.
  • The bound holds when the graph on real answers is distance-regular or of the second special type.

Our contribution
(1) Does ε-d.p. induce a bound on the information leakage of the randomized function? Yes.
(2) Does ε-d.p. induce a bound on the information leakage relative to an individual? Yes.
It works in every case, as the graph on databases is always distance-regular and also of the second special type.

Our contribution
(3) Does ε-d.p. induce a bound on the utility? Yes, but only when the graph on real answers is also distance-regular or of the second special type.
(4) Given a query and a value of ε, can we construct a randomized function satisfying ε-d.p. and also presenting maximum utility? Yes, under the same condition on the graph on real answers.

An example
• A database with tuples: voter id, voter city, candidate.
• There are 6 cities: A, B, C, D, E, F.
• Query: Which city had the most votes for a given candidate?
• Clearly the gain is binary.
• The graph on real answers (the 6 cities) is a clique, so an optimal mechanism can be constructed.
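
The matrix shown on the slide did not survive, so the sketch below is only an assumed shape for a mechanism on a clique of n answers: the true answer is reported with weight e^ε and every other answer with weight 1, normalized so each row sums to one. The function names and the choice ε = ln 2 are illustrative.

```python
from math import exp, log


def clique_mechanism(n, epsilon):
    """Assumed mechanism on a clique: weight e^epsilon on the true answer, 1 elsewhere."""
    e = exp(epsilon)
    norm = n - 1 + e
    return [[(e if y == z else 1.0) / norm for z in range(n)] for y in range(n)]


def binary_gain_utility(mechanism):
    """Expected binary gain under a uniform prior when the reported answer is taken as the guess."""
    n = len(mechanism)
    return sum(mechanism[y][y] for y in range(n)) / n


if __name__ == "__main__":
    cities = ["A", "B", "C", "D", "E", "F"]
    mechanism = clique_mechanism(len(cities), log(2))
    for city, row in zip(cities, mechanism):
        print(city, [round(p, 3) for p in row])
    print(binary_gain_utility(mechanism))
```

Every pair of rows differs by a factor of exactly e^ε in two columns and agrees elsewhere, so the ε-d.p. constraint is tight for all adjacent answers; whether this matches the matrix on the original slide cannot be confirmed from the transcript.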

Part V Safe equivalences for security properties

Equivalences in security
• Equivalence relations are often used to formalize information hiding properties.
• Examples:
  • Anonymity of two users in a system, formalized as a trace equivalence between the corresponding instances of the system.
  • Confidentiality of the votes of two users for two candidates in a system, formalized as a bisimulation between the corresponding instances of the system.

The role of nondeterminism
• In the presence of nondeterminism, there is a (dangerous) implicit assumption:
  • all the nondeterministic possibilities of the specification will be possible under every implementation of it (or at least that the adversary will believe so).
• Nondeterminism can have different natures:
  • Nondeterminism by design: preserved under refinement;
  • Underspecification: not necessarily preserved under refinement.

Nondeterminism by design: the example system is secure. The nondeterminism should be preserved in the implementation.
[Figure: example system built from Mix components.]

Underspecification: but this example system is not secure. The nondeterminism may be eliminated in the implementation.
[Figure: example system built from User components.]

Motivation
• Two types of nondeterminism:
  • Angelic: inherent to the system, like in the Mix example. The scheduler has freedom to help the system.
  • Demonic: underspecification, like in the User example. The design should guarantee that even in the worst-case choice (by the scheduler), the security is still preserved.
• Problem: in the equivalence approach the nondeterminism is considered only as angelic.

Contribution
• A formalism to handle both angelic and demonic nondeterminism.
• Notions of safe equivalences: safe trace-equivalence and safe bisimulation.
• We show that these notions of safe equivalences imply “no leakage”.

Admissible schedulers
• Global schedulers: global nondeterminism (implementation freedom)
  • Communication, interleaving
  • Cannot see the internal choices of the components
• Local schedulers: local nondeterminism (inherent to the system)
  • Randomness, noise
  • One for each component
  • Cannot see the internal choices of the other components

Safe bisimulation
• A safe bisimulation is a relation on systems such that, whenever two systems are related, the bisimulation conditions hold for all admissible global schedulers.

Safe trace-equivalence
• A safe trace-equivalence is a relation on systems defined by a condition on their traces.
• Theorem: safe bisimulation implies safe trace-equivalence.

Safe nondeterministic information hiding
Definition: A system is leakage-free if, for every observable and every pair of secrets, the probability of the observable is the same under both secrets.
• Example: binary secret.

Safe nondeterministic information hiding
• Theorem and corollary: a system satisfying safe trace-equivalence, or safe bisimulation, is leakage-free.
