
Data Mining Over a Large, Dynamic Network


Presentation Transcript


  1. Data Mining Over a Large, Dynamic Network An Overview and Algorithm for Outlier Detection

  2. Talk Outline • Introduction • Data mining over a large, dynamic network -- what and why? • Some recent research on “epidemic-style” algorithms • Example problem • Outlier detection over a large, dynamic network • An algorithm and experiments in a wireless sensor network • Wrap-up summary

  3. Large, Dynamic Networks • Internet, peer-to-peer file sharing networks • Kazaa: 5M simultaneously on-line users (Jan 2004) • Wireless sensor networks • 800-node WSN built at Berkeley (2001) • 10,000+ node WSNs are expected in the near future

  4. Large, Dynamic Networks • Large number of network nodes, each with a local data set Di • Dynamic environment: • nodes may enter/leave at will • nodes’ local data, Di, may change at will • Distributed (time-varying) dataset: D = ∪ Di

  5. Why Mine a Large, Dynamic Network? • Goal: learn global properties of the dataset, D • Example: does the fraction of nodes in the WSN reporting a problem exceed a threshold? • Example: download patterns in an Internet P2P file-sharing network

  6. The Challenges • Large network size ⇒ no global operators/synchronizations ⇒ be communication-efficient • Dynamic environment (changing local data, node joins, departures) ⇒ algorithms robust to data/network change

  7. Epidemic-Style Algorithms Each node in the network • maintains an estimate of the desired global property (e.g. global average, majority) • selectively communicates with neighboring nodes in the network All nodes’ estimates eventually converge on the true global property.
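
To make the pattern concrete, here is a minimal sketch (not from the cited work) of epidemic-style estimation of a global average: each node starts from its local value and repeatedly averages its estimate with a randomly chosen neighbor. The GossipNode class and the ring topology are illustrative assumptions.

```python
import random

class GossipNode:
    """Illustrative node: holds a local value and an estimate of the global average."""

    def __init__(self, local_value):
        self.estimate = float(local_value)  # initial estimate = local data
        self.neighbors = []                 # filled in from the network topology

    def exchange(self):
        """Pairwise averaging with one randomly chosen neighbor (one gossip step)."""
        if not self.neighbors:
            return
        peer = random.choice(self.neighbors)
        avg = (self.estimate + peer.estimate) / 2.0
        self.estimate = peer.estimate = avg


# Toy run: a ring of four nodes whose estimates converge to the global average (5.0).
nodes = [GossipNode(v) for v in (2.0, 4.0, 6.0, 8.0)]
for i, node in enumerate(nodes):
    node.neighbors = [nodes[i - 1], nodes[(i + 1) % len(nodes)]]

for _ in range(50):  # a few dozen rounds suffice for this tiny network
    for node in nodes:
        node.exchange()

print([round(node.estimate, 3) for node in nodes])
```

Pairwise averaging preserves the global sum, which is why every node's estimate settles on the true average; the majority-voting and outlier-detection algorithms discussed next follow the same event-driven, neighbor-only communication pattern.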

  8. Epidemic-Style Algorithms A good choice for data mining over large, dynamic networks: • highly decentralized • asynchronous • robust to data/network change

  9. Epidemic-Style Algorithm Examples • Majority voting • A useful basic primitive • Facility location, decision tree induction • All using majority voting as a building block • Outlier detection ⇒ this talk

  10. Outlier Detection Over a Large, Dynamic Network • Applications: • data cleaning in wireless sensor networks • detecting misconfigured machines in grid networks But first, what do I mean by an “outlier”?

  11. Overview of Outlier Detection on Centralized Data • Outlier – A point in a dataset which appears to be inconsistent with the remainder of the dataset. • Outlier detection is studied widely • Statistics • Machine Learning • Data Mining

  12. Centralized Outlier Detection (cont.) • Outlier detection problem: • Given dataset D, score parameter k, and integer n, find the top n scored points in D.

  13. Outlier Detection Over a Large, Dynamic Network • Problem: • Each node Ni has: a local data set Di (time varying), and parameters k and n. • Solve the unsupervised outlier detection problem over D = ∪ Di. • At the end, each node has O[D], the top n outliers over D.

  14. Algorithm • Assumptions: • Reliable message delivery • Reliable immediate neighborhood knowledge • Choice of score function • Algorithm works for any scoring function satisfying • “anti-monotonicity” and “smoothness”

  15. Nodes send data points to their immediate neighbors based on events: • Local data change (birth or death of points in Di) • Receipt of data points from immediate neighbors • Change in immediate neighborhood

  16. Some Notation Given neighboring nodes N1 and N2, • Let P1→2 denote all the points sent by N1 to N2 • Let P2→1 denote all the points sent by N2 to N1 • Let P1 denote the set of points currently held by N1: D1 ∪ P2→1

  17. Node Event Response: Basic Idea • Each node maintains an estimate of O[D], the global outliers. • N1 maintains O[P1]; N2 maintains O[P2] • N1 sends any point to N2 that might change N2's estimate. • Call these the necessary points from P1. How does N1 compute its necessary points?

  18. N1 approximates N2’s estimate O[P2] as O[P1→2 ∪ P2→1] The following are necessary points from P1: 1. O[P1] and its k nearest neighbors 2. k nearest neighbors of O[P1→2 ∪ P2→1] Z0 = union of 1. and 2. If Z0 were sent to N2, its estimate might change.

  19. N1 would approximate N2’s new estimate as O[Z0 ∪ P1→2 ∪ P2→1] The following would be necessary points from P1: 1. Z0 2. k-nn of O[Z0 ∪ P1→2 ∪ P2→1] Z1 = union of 1. and 2. If Z1 were sent to N2, its estimate might change.

  20. N1 would approximate N2’s new estimate as O[Z1 ∪ P1→2 ∪ P2→1] Z2 = Z1 ∪ k-nn of O[Z1 ∪ P1→2 ∪ P2→1] Etc. Let Z* = Zi, where i is the smallest index with Zi+1 = Zi (a fixed point).

  21. Z* is the set of necessary points from P1. N1 actually sends Z* \ (P1→2 ∪ P2→1). N1 knows N2 already has P1→2 ∪ P2→1.
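
A hedged sketch of the fixed-point computation on slides 18-21, under my reading of them. The helpers outliers(S) (the node's O[·] routine) and knn(S, pts) (the k nearest neighbours in S of the given points) are assumed to exist locally and are passed in as parameters; the function name necessary_points is illustrative, and point sets are assumed to be Python sets of hashable points (e.g. coordinate tuples).

```python
def necessary_points(P1, sent_to_2, recv_from_2, outliers, knn):
    """Sketch of Z* from slides 18-21 (assumed helpers, not the paper's exact code).

    P1          -- points currently held by N1 (D1 plus points received from N2)
    sent_to_2   -- P1->2, points N1 has already sent to N2
    recv_from_2 -- P2->1, points N1 has already received from N2
    outliers(S) -- O[S], the top-n outlier estimate over the point set S
    knn(S, pts) -- the k nearest neighbours, within S, of the points in pts
    """
    known_to_2 = sent_to_2 | recv_from_2

    # Z0: N1's own outlier estimate, its k-NN, and the k-NN of N1's guess of N2's estimate.
    Z = set(outliers(P1)) | knn(P1, outliers(P1)) | knn(P1, outliers(known_to_2))

    while True:
        # If Z were sent, N1 would guess N2's new estimate as O[Z ∪ P1->2 ∪ P2->1].
        guess = outliers(Z | known_to_2)
        Z_next = Z | knn(P1, guess)
        if Z_next == Z:                  # fixed point reached: Z* = Z
            return Z - known_to_2        # ship only Z* \ (P1->2 ∪ P2->1)
        Z = Z_next
```

The loop terminates because Z only grows and is bounded by the finite set P1, and the returned set is exactly what slide 21 says N1 actually ships.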

  22. Some Key Algorithm Features • Highly decentralized • Asynchronous • Robust to network/data change Each node’s estimate is guaranteed to converge to O[D] (the global outliers).

  23. Experiments in a Wireless Sensor Network

  24. Wrap-Up Summary • Data mining over a large, dynamic network • Motivation (wireless sensor networks, internet P2P information sharing networks) • Some recent research on epidemic-style algorithms • Outlier detection over a large, dynamic network • An algorithm and experiments in a wireless sensor network

  25. THE END Thank you for your attention. Chris Giannella: cgiannel@acm.org Slides Available From: www.cs.nmsu.edu/~cgiannel/Mining_Dynamic_Network2008.ppt

  26. Selected Research Contributions in Distributed Data Mining • DM over a large, dynamic network • K-means clustering (SDM 2006) • Outlier detection (ICDCS 2006) • Decision tree induction (Statistical Analysis and Data Mining 2008) • DDM on astronomy data (NASA funded) • Principal component analysis and outlier detection (SDM 2007) • Privacy preserving data mining • Distance-preserving data perturbation (PKDD 2006)

  27. Mining a Large-Scale, Distributed Dataset The problem: • Each node Ni has: a data set Di (time varying). • Compute some function, F, of the global dataset D = ∪ Di. • Each node learns F(D).

  28. Two Main Centralization Problems • Communication Cost • Centralization creates a communication bottleneck • Privacy Loss • Data from one source may be too sensitive to reveal to others

  29. DDM on Sensitive Data • Privacy Preserving Data Mining (PPDM) • Example: multiple banks • Detect fraudulent transactions based on all the banks' data → Secure Multi-party Computation (cryptography) • Example: census data • Release the data to other organizations for analysis → Data transformation

  30. DDM on Non-Sensitive Data • Problem: data centralization creates a communication bottleneck • Static environment, data and network do not change • Parallel computing • Dynamic environment • Remainder of the talk

  31. Some Common PPDM Techniques • Secure multi-party computation (cryptography) • Build complicated distributed algorithms from simple, secure primitives (e.g. secure sum) • Data transformation • Release a transformed dataset

  32. DDM Introduction Summary • Basic DDM framework • DDM analysis of sensitive data (Privacy Preserving Data Mining) • Data transformation vs. cryptographic approaches • Analysis of non-sensitive data • Static environment (parallel computing) • Dynamic environment (remainder of the talk)

  33. Centralized Outlier Detection (cont.) • Define a function, S, such that for any set A of data points in ℝ^n, and a data point x in ℝ^n, • S(A,x) defines the outlier score of x with respect to A • The larger the score, the more x is deemed to be inconsistent with A. • Example: S(A,x) = average Euclidean distance of x to its k nearest neighbours in A (k a fixed constant)
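
A minimal sketch of this example score and the top-n problem from slide 12, assuming plain Euclidean points; knn_score and top_n_outliers are illustrative names, not from the cited papers (math.dist needs Python 3.8+).

```python
import math

def knn_score(A, x, k):
    """S(A, x): average Euclidean distance from x to its k nearest neighbours in A."""
    dists = sorted(math.dist(x, a) for a in A if a is not x)
    return sum(dists[:k]) / k

def top_n_outliers(D, k, n):
    """The centralized problem: the n highest-scored points of dataset D."""
    return sorted(D, key=lambda x: knn_score(D, x, k), reverse=True)[:n]

# Toy example: two tight clusters plus one clearly inconsistent point.
D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
print(top_n_outliers(D, k=2, n=1))   # -> [(5.0, 5.0)]
```

The larger a point's average distance to its k nearest neighbours, the higher its score, so the isolated point (5.0, 5.0) comes out on top.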

  34. Epidemic-Style Algorithms Each node in the network • maintains an estimate of the desired global property (e.g. global average, majority) • selectively communicates with immediately neighboring nodes upon detecting an event • local data/network change, message receipt. All nodes’ estimates eventually converge on the true global property

  35. Basic Idea • O[P1] is N1's current estimate of O[D]. • N1 sends any point to N2 that might change N2's outlier estimate, O[P2].

  36. N1 should send its outlier estimate and their k nearest neighbours • B0 = O[P1] ∪ KNN(O[P1]) • Assuming B0 is sent • N1 should also send the k nearest neighbours of N2’s outlier estimate, O[P2] • N1 guesses O[P2] to be O[B0 ∪ P1→2 ∪ P2→1] • N1 should send • B1 = B0 ∪ KNN(O[B0 ∪ P1→2 ∪ P2→1])

  37. Assuming B1 is sent, N1 should send • B2 = B0 ∪ KNN(O[B1 ∪ P1→2 ∪ P2→1]) • Assuming B2 is sent, N1 should send • B3 = B0 ∪ KNN(O[B2 ∪ P1→2 ∪ P2→1]) • Etc. • N1 sends a fixed point B*: • B* = B0 ∪ KNN(O[B* ∪ P1→2 ∪ P2→1])

  38. Selected Research Contributions • Privacy preserving data mining • Distance preserving data transformation (PKDD 2006) • Outlier detection using secure multi-party computation (in preparation) • DDM in dynamic environments • K-means clustering (SDM 2006) • Outlier detection (ICDCS 2006) ===> Remainder of the talk

  39. Theorem • If data and network are fixed, then • The algorithm terminates • i.e. All nodes stop sending messages • Upon termination, O[Pi] = O[D] for every node Ni • i.e. The outlier estimate for every node equals the true outliers. • The algorithm is guaranteed to converge on the correct result provided enough time elapses between changes

  40. References • “K-Means Clustering Over a Large, Dynamic Network”, SIAM Conf. Data Mining (SDM) 2006 • “In-Network Outlier Detection in Wireless Sensor Networks”, Int. Conf. Distributed Computing Systems (ICDCS) 2006 • “Distributed Decision Tree Induction in Peer-to-Peer Systems”, Statistical Analysis and Data Mining (to appear) • “Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System”, SIAM Conf. Data Mining (SDM) 2007 • “An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining”, European Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD) 2006
