250 likes | 370 Views
Privacy-oriented Data Mining by Proof Checking. Stan Matwin (joint work with Amy Felty ) SITE University of Ottawa, Canada stan@site.uottawa.ca. The TAMALE Group. 4 profs Some 30 graduate students Areas: machine learning, data mining, text mining, NLP, data warehousing Research in
E N D
Privacy-oriented Data Mining by Proof Checking Stan Matwin (joint work with Amy Felty ) SITE University of Ottawa, Canada stan@site.uottawa.ca
The TAMALE Group • 4 profs • Some 30 graduate students • Areas: machine learning, data mining, text mining, NLP, data warehousing • Research in • Inductive Logic Programming • Text mining • Learning in the presence of knowledge • Applications of ML/DM (e.g. in SE: tools for maintenance personnel) UBC 29/7/03
Why did I get into this research? • what is already being done… and why it ’s not enough • the main idea • its operation • discussion – ’correctness ’ • prototype - Coq and CIC • example • some technical challenges • acceptance? UBC 29/7/03
Some useful concepts... • opting outvs opting in • Use Limitation Principle: data should be used only for the explicit purpose for which it has been collected UBC 29/7/03
…and existing technical proposals On the web: P3P Platform for Privacy Preferences • W3C standard • XML specifications - on websites and in browsers - of what can be collected and for what purpose - ULP? • Handles cookies • Data exchange protocol more than privacy protocol: no provisions for opting out after an initial opt-in • the ULP part is in NL…not verifiable UBC 29/7/03
Agrawal ’s data perturbation transformations • data is perturbed by random distortion: xi xi+r r uniform or gaussian • a procedure to reconstruct a PAC-esitimation of the original distribution (but not the values) • a procedure to build an accurate decision tree on the perturbed distribution UBC 29/7/03
Agrawal ’s transformations cont’d • proposes a measure to quantify privacy –estimateintervals and their size • lately extended to non-numerical attributes, and to association rules • does not address the ULP • how do we know it is applied? UBC 29/7/03
the main idea: towards a verifiable ULP • User sets permissions: what can and cannot be done with her data • Any claim that a software respects these permissions is a proof of a theorem about the software • Verifying the claim is then checking that proof against the software UBC 29/7/03
Who are the players? • User C • Data miner Org • Data mining software developer Dev • Independent verifier Veri • …BUT no one owns the dataD UBC 29/7/03
D: database scheme A: given set of database and data mining operations S: source code for A PC(D,A): C’s permissions T(PC,S): theorem that S respects PC R(PC,S): proof of T(PC,S): B: binary code of S UBC 29/7/03
Discussion - properties • It can be proven that C ’s permissions are respected (or not): PC is in fact a verifiable ULP • PC can be negative (out) or positive (in) • proof construction needs to be done only once for a given PC ,D and A • Scheme is robust against cheating by Dev or Org UBC 29/7/03
Acceptance issues • No Org will give Veri access to S • Too much overhead to check R(PC,S) for each task, and each user • Too cumbersome for C • Based on all Orgs buying in UBC 29/7/03
Acceptance1:Veri’s operation- access • Veri needs • PC from C • R(S, PC) from Dev • S from Dev • B from Org • Veri could check R(S, PC) at Dev’s • Veri needs to verify that S(belonging normally to Dev) corresponds to Bthat Org runs. UBC 29/7/03
Acceptance2:overhead • Veri runs proof checking on a control basis • Org’s execution ovhd ? UBC 29/7/03
Issues • Naming the fields: XML or disclosure • restricted class of theorems for a given P-automating proof techniques for this class UBC 29/7/03
Acceptance3: C’s perspective • Building PCs must be easy for C, based on D and processing schema; initially a closed set? • permissions could be encoded on a credit card, smart card, in the electronic wallet • or in the CA; they can then be dynamically modified and revoked UBC 29/7/03
« Political » aspects: who is Veri? • generally trusted – • « consumer association »? • « Ralph Nader »? • « transparency international »? • IT expert at the level of instrumenting and running the proof checker – connection to Open Software Foundation? • theorem proving can be cast as « better testing » UBC 29/7/03
how to make Orgs buy in? • The first Org is needed to volunteer • a Green Data Mining logo will be granted and administered (verified) by Veri • other Orgs will have an incentive to join UBC 29/7/03
Future work • Build the tools • expand the prototype • extend from Weka to commercial data mining packages • Integrate with P3P? • find a willing Org UBC 29/7/03
Link between S and B • compilation not an option • watermarking solution: B is watermaked by a slightly modified compiler with MD5(tar(S)) =128 bytes • marks are inserted by a trusted makefile-and-compiler in locations in B given by Veri and unknown to Org UBC 29/7/03
Link… • Veri, given access to S, can verify that B corresponds toS • An attack by I requires hacking the compiler • An attack by Org requires knowing the locations of watermarks UBC 29/7/03
Example C restricts her Employee data from participating in a join with her Payroll data RecordPayroll : Set := mkPay{PID : nat; JoinInd : bool; Position : string; Salary: nat}. RecordEmployee : Set := mkEmp{Name : string; EID : nat; …}. RecordCombined : Set := mkComb{CID : nat; CName : string; Csalary: nat; …}. UBC 29/7/03
FixpointJoin [Ps: list Payroll]: (list Employee) (list Combined) := [Es : list Employee] CasesPs of nil(nil Combined) | (cons p ps) (app (check_JoinInd_and_find_employee_recordp Es) (Join ps Es)) end. (check_JoinInd_and_find_employee_recordp Es) if a record is found in Es whose EID matches Ps PID and JoinInd permits Join, then a list of length 1 with the result of Join is returned, otherwise empty UBC 29/7/03
Definition Pc:= [S:((list Payroll)(list Employee) (list Combined)) Prop] Ps:list Payroll}. Es:list Employee. (UniqueJoinInd Ps) P: Payroll.(In P Ps) ((JoinInd P)=false not C:Combined ((In C (S Ps Es)) ((CID C)=(PID P))) • PC(S) is written as (PC Join): Coq expands the definition of PC and provides the theorem • request to proof checking operator of Coq will check this proof i.e; it will check that the user permissions are encoded into the Join program given • Whole proof: 300 lines of Coq code; proof checking: 1 sec on a 600MHz machine UBC 29/7/03