
SPACE, WEB MINING and PRIVACY: foes or friends?
With special attention to location privacy
Bettina Berendt, Dept. Computer Science, K.U. Leuven

SPACE WEB MINING PRIVACY

BASICS

What is Web Mining? And who am I?
• Knowledge discovery (aka data mining): "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data."
• Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.
• Web mining areas: Web content mining, Web structure mining, Web usage mining
• Navigation, queries, content access & creation

Why Web / data mining? "The database of intentions" (J. Battelle)

Location-based services and augmented reality www.poynt.com

Semiotically augmented reality: semapedia and related ideas

Mobile Social Web

What's special about spatial information? 1. Interpreting
• Rich inferences from spatial position to personal properties and/or identity are possible:
  Pos(A,9-17) = P1 → workplace(A,P1)
  Pos(A,20-6) = P2 → home(A,P2)
• An even richer "database of intentions"?!
  Pos(A,now) = P3 & temp(P3,now,hot) → wants(A,ice-cream) (location-based services)
  Pos(A, t in 13-18) = Pos(Demonstration,13-18) → suspicious(A) (ex. Dresden phone surveillance case 2011)
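The first two rules above can be sketched in a few lines. This is only an illustration: the trace, the place labels and the helper most_common_place are hypothetical, and "most frequent place in the time window" is an assumed reading of Pos(A,9-17) = P1.

```python
from collections import Counter

def most_common_place(trace, hours):
    """Return the place seen most often during the given hours, or None."""
    places = [place for hour, place in trace if hour in hours]
    return Counter(places).most_common(1)[0][0] if places else None

# Hypothetical trace of (hour of day, place label) observations for person A.
trace_a = [(9, "P1"), (11, "P1"), (15, "P1"), (13, "cafe"),
           (21, "P2"), (23, "P2"), (3, "P2")]

WORK_HOURS = set(range(9, 17))                       # 9-17
NIGHT_HOURS = set(range(20, 24)) | set(range(0, 6))  # 20-6

workplace = most_common_place(trace_a, WORK_HOURS)   # rule: workplace(A, P1)
home = most_common_place(trace_a, NIGHT_HOURS)       # rule: home(A, P2)
print("workplace:", workplace, "home:", home)
```

Even this crude heuristic needs nothing but timestamps and positions, which is the point of the slide.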

What's special about spatial information? 2. Sending, or: Opt-out impossible?!
• Physically: You cannot be nowhere
  Corollary: You cannot be in two places at once → limits on identity-building
• Contractually: Rental car with tracking, ...
• Culturally I: Opt-out may preclude basics of identity construction
  No mobile phone/internet communication
• Culturally II: Opt-out considered suspicious in itself (ex. A. Holm surveillance case 2007)

FOES ?

Behaviour on the Web (and elsewhere) Data

(Web) data analysis and mining Data Privacy problems!

Technical background of the problem:
• The dataset allows for Web mining (e.g., which search queries lead to which site choices),
• it violates k-anonymity (e.g. "Lilburn" → a likely k = #inhabitants of Lilburn)
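To make the k-anonymity point concrete, here is a minimal sketch that computes k for a chosen set of quasi-identifiers. The records and attribute names are invented for illustration; k is the size of the smallest group of records that share the same quasi-identifier values.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical released records (not taken from any real dataset).
records = [
    {"town": "TownA", "age_band": "60-69"},
    {"town": "TownA", "age_band": "60-69"},
    {"town": "TownB", "age_band": "30-39"},
    {"town": "TownB", "age_band": "30-39"},
    {"town": "TownB", "age_band": "40-49"},
]

print(k_anonymity(records, ["town"]))
print(k_anonymity(records, ["town", "age_band"]))
```

Every additional attribute that can be matched shrinks the groups, so k can only stay the same or drop.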

Inferences
• Data mining / machine learning: inductive learning of models ("knowledge") from data
• Privacy-relevant:
  (Re-)identification: inferences towards identity
  Profiling: inferences towards properties
• Application of the inferred knowledge

What is identity merging? Or: Is this the same person?

Data integration: an example
• Paper published by the MovieLens team (collaborative-filtering movie ratings), who were considering publishing a ratings dataset, see http://movielens.umn.edu/
• Public dataset: users mention films in forum posts
• Private dataset (may be released e.g. for research purposes): users' ratings
• Film IDs can easily be extracted from the posts
• Observation: Every user will talk about items from a sparse relation space (those, generally few, films s/he has seen)
  [Frankowski, D., Cosley, D., Sen, S., Terveen, L., & Riedl, J. (2006). You are what you say: Privacy risks of public mentions. In Proc. SIGIR'06]
• Generalisation with more robust de-anonymization attacks and different data:
  [Narayanan A, Shmatikov V (2009) De-anonymizing social networks. In: Proc. 30th IEEE Symposium on Security and Privacy 2009]

Merging identities: the computational problem
• Given a target user t from the forum users, find similar users (in terms of which items they related to) in the ratings dataset
• Rank these users u by their likelihood of being t
• Evaluate: If t is in the top k of this list, then t is k-identified
• Count the percentage of users who are k-identified
• E.g. measure likelihood by TF.IDF (m: item)
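The ranking step can be sketched as follows. The scoring here is an assumed IDF-style weighting (rarely rated items count more), not necessarily the exact TF.IDF variant used by Frankowski et al.; user names, film labels and function names are hypothetical.

```python
import math

def idf_weights(ratings):
    """ratings: dict user -> set of rated items. Rare items get higher weight."""
    n_users = len(ratings)
    counts = {}
    for items in ratings.values():
        for m in items:
            counts[m] = counts.get(m, 0) + 1
    return {m: math.log(n_users / c) for m, c in counts.items()}

def rank_candidates(mentioned, ratings):
    """Rank users in the ratings dataset by summed weight of shared items."""
    idf = idf_weights(ratings)
    scores = {u: sum(idf.get(m, 0.0) for m in mentioned & items)
              for u, items in ratings.items()}
    return sorted(scores, key=lambda u: (-scores[u], u))

def is_k_identified(target, mentioned, ratings, k):
    """True if the target's true identity is among the top k candidates."""
    return target in rank_candidates(mentioned, ratings)[:k]

# Hypothetical private ratings dataset and public forum mentions of user t.
ratings = {
    "u1": {"blockbuster", "rare_film_a", "rare_film_b"},
    "u2": {"blockbuster", "rare_film_c"},
    "u3": {"blockbuster"},
}
mentions_of_t = {"blockbuster", "rare_film_a"}

print(rank_candidates(mentions_of_t, ratings))
print(is_k_identified("u1", mentions_of_t, ratings, k=1))
```

In this toy data the film everybody rated carries zero weight, so one mention of a rarely rated film is enough to put u1 at the top of the list.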

Results

What do you think helps?

What is classification (and prediction)?

Predicting political affiliation from Facebook profile and link data (1): Most Conservative Traits Lindamood et al. 09 & Heatherly et al. 09

Predicting political affiliation from Facebook profile and link data (2): Most Liberal Traits per Trait Name Lindamood et al. 09 & Heatherly et al. 09

What is collaborative filtering? "People like what people like them like"

User-based Collaborative Filtering
• Idea: People who agreed in the past are likely to agree again
• To predict a user’s opinion for an item, use the opinion of similar users
• Similarity between users is decided by looking at their overlap in opinions for other items

Example: User-based Collaborative Filtering

Similarity between users
• How similar are users 1 and 2?
• How similar are users 1 and 4?
• How do you calculate similarity?

Popular similarity measures
• Cosine-based similarity
• Adjusted cosine-based similarity
• Correlation-based similarity
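A minimal sketch of two of these measures, computed over the items both users have rated. Representing a user as a dict from item to rating and restricting to co-rated items are assumptions of this sketch; the adjusted cosine variant, which centres the ratings before taking the cosine, is omitted.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two users' ratings on co-rated items."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm = (math.sqrt(sum(a[i] ** 2 for i in common))
            * math.sqrt(sum(b[i] ** 2 for i in common)))
    return dot / norm if norm else 0.0

def pearson_similarity(a, b):
    """Pearson correlation of two users' ratings on co-rated items."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    da = [a[i] - mean_a for i in common]
    db = [b[i] - mean_b for i in common]
    denom = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

# Hypothetical ratings: user2 rates like user1 on a larger scale, user3 the opposite way.
user1 = {"a": 4, "b": 2}
user2 = {"a": 8, "b": 4}
user3 = {"a": 2, "b": 4}

print(cosine_similarity(user1, user2), pearson_similarity(user1, user2))
print(cosine_similarity(user1, user3), pearson_similarity(user1, user3))
```

Cosine still rates users 1 and 3 as fairly similar because all ratings are positive, while correlation exposes that their preferences are reversed.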

Algorithm 1: using entire matrix
• Aggregation function: often weighted sum
• Weight depends on similarity

Algorithm 2: K-Nearest-Neighbour
• Neighbours are people who have historically had the same taste as our user
• Aggregation function: often weighted sum
• Weight depends on similarity
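Both algorithms can be sketched with one function: with k=None every user who rated the item contributes (Algorithm 1), otherwise only the k most similar ones do (Algorithm 2). The rating matrix, the choice of cosine similarity and the normalised weighted sum are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two users' ratings on co-rated items."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm = (math.sqrt(sum(a[i] ** 2 for i in common))
            * math.sqrt(sum(b[i] ** 2 for i in common)))
    return dot / norm if norm else 0.0

def predict(ratings, user, item, k=None):
    """Similarity-weighted average of other users' ratings for the item."""
    candidates = [(cosine(ratings[user], r), r[item])
                  for other, r in ratings.items()
                  if other != user and item in r]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    if k is not None:
        candidates = candidates[:k]      # keep only the k nearest neighbours
    total = sum(sim for sim, _ in candidates)
    if total == 0:
        return None                      # no usable neighbour
    return sum(sim * rating for sim, rating in candidates) / total

# Hypothetical user-item rating matrix; u1 has not rated item "c".
ratings = {
    "u1": {"a": 5, "b": 3},
    "u2": {"a": 5, "b": 3, "c": 4},
    "u3": {"a": 1, "b": 5, "c": 2},
    "u4": {"a": 4, "b": 3, "c": 5},
}

print(predict(ratings, "u1", "c"))        # entire matrix
print(predict(ratings, "u1", "c", k=1))   # single nearest neighbour
```

Restricting to the nearest neighbours keeps dissimilar users such as u3 from pulling the prediction towards their own taste.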

Summary: Lots of data → lots of privacy threats (and opportunities)
• The Web incites one of the semiotically richest (and often machine-processable) types of interaction
• Space incites data-rich types of interaction
→ two rich sources of "the database of intentions"

How many people see an ad?
• Television: sample viewers, extrapolate to population
• Web: count viewers/clickers through clickstream
• City streets: count pedestrians / motorists? Too many streets!
→ Solution intuition: sample streets, predict

Fraunhofer IAIS (2007): predict frequencies based on similar streets
• Street segments modelled as vectors
  Spatial / geometric information
  Type of street, direction, speed class, …
  Demographic, socio-economic data about vicinity
  Nearby points of interest (buffer around segment, count #POI)
• KNN algorithm
  Frequency of a street segment = weighted sum of frequencies from most similar k segments in sample
  Dynamic + selective calculation of distance to counter the huge numbers of segments and measurements
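The KNN step can be illustrated as below. This is a sketch under assumptions, not the Fraunhofer IAIS implementation: the two features (speed class, POI count), Euclidean distance and inverse-distance weights are all chosen for illustration, and the dynamic and selective distance calculation mentioned on the slide is not shown.

```python
import math

def predict_frequency(target, sample, k=2):
    """Inverse-distance-weighted mean frequency of the k most similar segments.

    sample: list of (feature_vector, measured_frequency) pairs.
    """
    nearest = sorted(sample, key=lambda s: math.dist(target, s[0]))[:k]
    # Small constant avoids division by zero for identical feature vectors.
    weights = [1.0 / (math.dist(target, features) + 1e-9)
               for features, _ in nearest]
    weighted = sum(w * freq for w, (_, freq) in zip(weights, nearest))
    return weighted / sum(weights)

# Hypothetical measured segments: (speed class, POI count) -> passers-by per day.
sample = [
    ((1.0, 10.0), 1000.0),
    ((1.0, 12.0), 1200.0),
    ((3.0, 0.0), 100.0),
]

unmeasured_segment = (1.0, 11.0)
print(predict_frequency(unmeasured_segment, sample, k=2))
```

In practice the features would need to be scaled to comparable ranges before distances are computed; otherwise the attribute with the largest numeric range dominates.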

IP filtering: a deterministic classification model
IP → country
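Such a deterministic model is essentially a lookup table from address ranges to countries. The sketch below uses Python's standard ipaddress module; the ranges are reserved documentation networks and the country labels attached to them are invented.

```python
import ipaddress

# Hypothetical range table; a real one would come from an IP-geolocation database.
RANGES = [
    (ipaddress.ip_network("192.0.2.0/24"), "BE"),
    (ipaddress.ip_network("198.51.100.0/24"), "DE"),
]

def country_of(ip):
    """Return the country label of the first range containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for network, country in RANGES:
        if addr in network:
            return country
    return None

print(country_of("192.0.2.17"))
print(country_of("203.0.113.5"))
```

No learning is involved: the same address always maps to the same class, which is what makes the model deterministic.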

Where do people live who will buy the Koran soon?
Technical background of the problem:
• A mashup of different data sources
  Amazon wishlists
  Yahoo! People (addresses)
  Google Maps
• each with insufficient k-anonymity, allows for attribute matching and thereby inferences

Traffic Prediction: space data + Web data + ...
• Multiple views on traffic: weather, major events, incident reports
• Example incident report: Operator ID: Nick; Heading: INCIDENT; Message: INCIDENT INFORMATION Cleared 1637: I-405 SB JS I-90 ACC BLK RL CCTV 1623 – WSP, FIR ON SCENE
• Event store
• Learning
• Reasoning
E.g. LARKC project: I. Celino, D. Dell'Aglio, E. Della Valle, R. Grothmann, F. Steinke and V. Tresp: Integrating Machine Learning in a Semantic Web Platform for Traffic Forecasting and Routing. IRMLeS 2011 Workshop at ESWC 2011.

PEACEFUL COEXISTENCE ?

Recall (a simple view): Cryptographic privacy solutions Data not all !

"Privacy-preserving data mining" Data not all !

Privacy-preserving data mining (PPDM)
• Database inference problem: "The problem that arises when confidential information can be derived from released data by unauthorized users"
• Objective of PPDM: "develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process."
• Approaches:
  Data distribution
    Decentralized holding of data
  Data modification
    Aggregation/merging into coarser categories
    Perturbation, blocking of attribute values
    Swapping values of individual records
    Sampling
  Data or rule hiding
    Push the support of sensitive patterns below a threshold
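Two of the data-modification approaches, aggregation into coarser categories and perturbation, can be sketched as follows. The function names, the band width and the uniform noise model are assumptions for illustration; published PPDM methods choose the noise distribution with more care.

```python
import random

def generalise_age(age, width=10):
    """Aggregation: replace an exact age by a coarser band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def perturb(value, max_noise, rng):
    """Perturbation: add bounded uniform noise to a numeric attribute."""
    return value + rng.uniform(-max_noise, max_noise)

rng = random.Random(0)  # fixed seed only to make the example repeatable

# Hypothetical record before release.
record = {"age": 34, "income": 42000.0}
released = {
    "age_band": generalise_age(record["age"]),
    "income": perturb(record["income"], 1000.0, rng),
}
print(released)
```

Aggregates over many perturbed records remain approximately correct, while any single released value no longer reveals the exact original.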

Example 1: Collaborative filtering

Collaborative filtering: idea and architecture
• Basic idea of collaborative filtering: "Users who liked this also liked ..." → generalize from "similar profiles"
• Standard solution:
  At the community site / centralized: compute, from all users and their ratings/purchases, etc., a global model
  To derive a recommendation for a given user: find "similar profiles" in this model and derive a prediction
• Mathematically: depends on simple vector computations in the user-item space
