
Tools and Algorithms for Querying and Mining Large Graphs

Hanghang Tong, Machine Learning Department, Carnegie Mellon University (htong@cs.cmu.edu, http://www.cs.cmu.edu/~htong). Thesis Committee: Christos Faloutsos, William Cohen, Jeff Schneider, Philip S. Yu. Graphs are everywhere!


Presentation Transcript


  1. Tools and Algorithms for Querying and Mining Large Graphs Hanghang Tong Machine Learning Department Carnegie Mellon University htong@cs.cmu.edu http://www.cs.cmu.edu/~htong

  2. Thesis Committee • Christos Faloutsos • William Cohen • Jeff Schneider • Philip S. Yu

  3. Graphs are everywhere!

  4. Motivating Questions (high level) • Given a large graph, we want to do: • Task A: Querying, e.g., CePS on DBLP [Tong+ KDD 06] • Task B: Mining, e.g., T3 on CIKM [Tong+ CIKM 08] • (Will return to this later…)

  5. Motivating Questions (in detail) • Querying [Goal: query complex relationships] • Q.1. Find complex user-specific patterns; • Q.2. Link Prediction & Proximity Tracking; • Q.3. Answer all the above questions quickly. • Mining [Goal: find interesting patterns] • M.1. Spot Anomalies; • M.2. Mine time & space; • M.3. Detect communities.

  6. Thesis Overview [Figure: grid of the querying questions Q1, Q2, Q3 and the mining questions M1, M2, M3]

  7. Questions That We Ask Thesis Overview Completed Proposed Q1 CePS, G-Ray, ProSIN (KDD06, KDD07 a, ICDM08) pTrack/cTrack (SDM08, SAM08) Q2 Q2 DAP (KDD07 b) Q3 FastProx (ICDM06, KAIS07, KDD07 b, ICDM08) P3 Q3 FastProx (SDM08, SAM08) M1 M1 Colibri-S (KDD08 b) P3 Colibri-D (KDD08 b) M2 M2 P2 P3 T3/MT3 (CIKM08) M3 P1 P3 P1 M3

  8. Thesis Overview: Impact • Querying • Mining • Footnote: Our work for Q1 has been transferred into an IBM product (Cyano)

  9. Roadmap • Introduction • Completed Work • Querying (Preliminary, Q1, Q2, Q3) • Mining • Proposed Work

  10. Preliminary: Proximity Measurement a.k.a Relevance, Closeness, ‘Similarity’…

  11. Thesis Overview (same overview grid as slide 7)

  12. Completed work on Q1 • Goal: Find complex user-specific patterns • Q1.1. Center-Piece Subgraph Discovery • e.g., who is the master-mind criminal, given some suspects X, Y and Z? • Q1.2. Best-Effort Pattern Match • e.g., a money-laundering ring • Q1.3. Interactive querying (e.g., negation) • e.g., find the conferences most similar to KDD, but not like ICML

  13. Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06] • Input: original graph with the query nodes (black) • Output: CePS, containing the CePS node (red) • Q: How to find a hub for the black nodes? • A: Red = the node maximizing Prox(A, Red) × Prox(B, Red) × Prox(C, Red)
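
The product score on this slide can be sketched in a few lines. This is a minimal illustration, not the CePS implementation: it assumes random walk with restart (introduced later in the talk) as the proximity measure, uses a dense linear solve, and runs on a hypothetical 5-node toy graph. The names `rwr_scores` and `center_piece` are made up for this sketch.

```python
import numpy as np

def rwr_scores(adj, source, c=0.9):
    """Random-walk-with-restart scores from `source` to every node (dense solve)."""
    n = adj.shape[0]
    col_sums = adj.sum(axis=0)
    w = adj / np.where(col_sums == 0, 1.0, col_sums)  # column-normalize
    e = np.zeros(n)
    e[source] = 1.0
    return (1.0 - c) * np.linalg.solve(np.eye(n) - c * w, e)

def center_piece(adj, queries, c=0.9):
    """Return the non-query node with the largest product of proximities (AND query)."""
    score = np.ones(adj.shape[0])
    for q in queries:
        score *= rwr_scores(adj, q, c)
    score[list(queries)] = -np.inf  # the query nodes themselves are not candidates
    return int(np.argmax(score)), score

# Hypothetical toy graph: node 0 links to the query nodes 1, 2, 3; node 4 hangs off node 1
adj = np.zeros((5, 5))
for u, v in [(0, 1), (0, 2), (0, 3), (1, 4)]:
    adj[u, v] = adj[v, u] = 1.0
best, score = center_piece(adj, queries=[1, 2, 3])
```

Multiplying the scores means a candidate must be close to every query node at once, which is why node 4 (close to node 1 only) loses to node 0.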

  14. CePS: Example (AND Query) • DBLP co-authorship network: • - 400,000 authors, 2,000,000 edges

  15. K_SoftAND: Relaxation of AND • Asking an AND query? No answer! • Causes: disconnected communities, noise
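
One way to read the relaxation is "close to at least k of the query nodes" instead of "close to all of them". The sketch below is an assumption of this note, not a formula from the slide: it treats each per-query proximity as an independent probability and computes the chance that at least k of them "fire". The function name `k_soft_and` is hypothetical.

```python
def k_soft_and(probs, k):
    """P(at least k of the independent events with probabilities `probs` occur)."""
    # dist[j] = probability that exactly j events have occurred so far
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, d in enumerate(dist):
            nxt[j] += d * (1.0 - p)   # this event does not occur
            nxt[j + 1] += d * p       # this event occurs
        dist = nxt
    return sum(dist[k:])

# With k equal to the number of queries this reduces to the plain AND product
and_score = k_soft_and([0.5, 0.5], k=2)
soft_score = k_soft_and([0.5, 0.5], k=1)
```

Under this reading, a node that is far from one noisy or disconnected query node can still score well for k smaller than the number of queries.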

  16. CePS: 2 SoftAND DB Stat.

  17. Q1.2. Best-Effort Pattern Match [Tong+ KDD 2007 b] • Input: Query Graph and Data Graph • Output: Matching Subgraph • Q: How to find the matching subgraph? [Figure label: Interception]

  18. G-Ray: How to? (details) • Goodness = Prox(12, 4) × Prox(4, 12) × Prox(7, 4) × Prox(4, 7) × Prox(11, 7) × Prox(7, 11) × Prox(12, 11) × Prox(11, 12) • Observation: [formula not preserved in the transcript], etc. [Figure: four matching nodes 12, 4, 7, 11]
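
The goodness score above multiplies the proximity in both directions along every edge of the query pattern. A minimal sketch, assuming a precomputed proximity matrix `prox` (random numbers here, purely for illustration) and the loop query 12-4-7-11-12 read off the slide; the function name `goodness` is hypothetical:

```python
import numpy as np

def goodness(prox, query_edges, match):
    """Product of Prox(u, v) * Prox(v, u) over all query edges, under the node mapping `match`."""
    g = 1.0
    for a, b in query_edges:
        u, v = match[a], match[b]
        g *= prox[u, v] * prox[v, u]
    return g

rng = np.random.default_rng(0)
prox = rng.random((13, 13))                   # placeholder proximity matrix, nodes 0..12
query_edges = [(0, 1), (2, 1), (3, 2), (0, 3)]  # 4-node loop query
match = {0: 12, 1: 4, 2: 7, 3: 11}            # query node -> matching data node
g = goodness(prox, query_edges, match)
```

A candidate match with a missing direct edge can still score well if its nodes remain close in the data graph, which is the "best-effort" part.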

  19. Effectiveness: star-query [Figure: query and result; node labels: Databases, Intelligent Agent, Bio-medical]

  20. Effectiveness: line-query [Figure: query and result; node labels: Theory, Databases, Learning, Bio-medical]

  21. Q1.3: Interactive Querying [Figure: successive rounds of user feedback]

  22. Q1.3 ProSIN for Interactive Querying [Tong+ ICDM 08] • What are the most related conferences w.r.t. KDD? (DBLP author-conference bipartite graph)

  25. Thesis Overview (same overview grid as slide 7)

  26. Q2.1 Link Prediction: direction [Tong+ KDD 07 a] • Q: Given the existence of the link, what is the direction of the link? • A: (DAP) Compare Prox(i→j) and Prox(j→i) • Data: Web Link, 4,000 nodes, 10,000 edges • [Figure: density of Prox(i→j) − Prox(j→i); annotation: >70%]
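
The comparison rule can be sketched as follows. DAP's own proximity definition is not given on the slide, so this sketch substitutes random walk with restart on the directed graph as a stand-in; that substitution, the tie-breaking rule, and the names `directed_rwr` and `predict_direction` are assumptions of this note.

```python
import numpy as np

def directed_rwr(adj, source, c=0.9):
    """RWR scores on a directed graph; adj[u, v] > 0 means an edge u -> v."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    p = adj / np.where(out_deg == 0, 1.0, out_deg)  # row-normalize (dangling rows stay zero)
    e = np.zeros(n)
    e[source] = 1.0
    return (1.0 - c) * np.linalg.solve(np.eye(n) - c * p.T, e)

def predict_direction(adj, i, j, c=0.9):
    """Guess i -> j if Prox(i -> j) >= Prox(j -> i), else j -> i."""
    prox_ij = directed_rwr(adj, i, c)[j]
    prox_ji = directed_rwr(adj, j, c)[i]
    return (i, j) if prox_ij >= prox_ji else (j, i)

# Hypothetical graph: 0 -> 2 -> 1 and 0 -> 3 -> 1; the link between 0 and 1 is hidden
adj = np.zeros((4, 4))
for u, v in [(0, 2), (2, 1), (0, 3), (3, 1)]:
    adj[u, v] = 1.0
guess = predict_direction(adj, 0, 1)
```

Here every directed path runs from node 0 towards node 1, so the proximity is asymmetric and the rule picks 0 → 1.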

  27. Q2.2 pTrack/cTrack: Challenge [Tong+ SDM 08] • Observations (CePS, G-Ray, ProSIN…) • All for static graphs • Proximity: main tool • Graphs are evolving over time! • New nodes/edges show up; • Existing nodes/edges die out; • Edge weights change… • Q: How to make everything incremental? • A: Track Proximity!

  28. pTrack/cTrack: Trend analysis on graph level [Figure: rank of influence vs. year for T. Sejnowski, G. Hinton, C. Koch, M. Jordan]

  29. pTrack: Problem Definitions • [Given] • (1) a large, skewed, time-evolving bipartite graph, • (2) the query nodes of interest • [Track] • (1) the top-k most related nodes for each query node at each time step t; • (2) the proximity score (or rank of proximity) between any two query nodes at each time step t

  30. pTrack: Philip S. Yu’s Top-5 conferences up to each year • DBLP (author × conference): 400k authors, 3.5k conferences, 20 years • [Figure: top-5 lists over time; topic labels: Databases, Performance, Distributed Sys., Databases, Data Mining]

  31. KDD’s Rank w.r.t. VLDB over years • Data Mining and Databases are getting closer & closer • [Figure: proximity rank (closer) vs. year]

  32. cTrack: 10 most influential authors in NIPS community up to each year • Author-paper bipartite graph from NIPS 1987-1999: 1740 papers, 2037 authors, spread over 13 years • [Figure highlights: T. Sejnowski, M. Jordan]

  33. Thesis Overview (same overview grid as slide 7)

  34. Proximity is the main tool (a.k.a. Relevance, Closeness, ‘Similarity’…) • Q.1: CePS, G-Ray, ProSIN • Q.2: DAP, pTrack/cTrack • Q: What is a ‘good’ score?

  35. Random walk with restart [Pan+ KDD 2004] • Nearby nodes, higher scores • [Figure: 12-node example graph with its ranking vector (scores between 0.02 and 0.13); more red, more relevant]

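  36. Why RWR is a good score? • $Q = (I - c\tilde{W})^{-1} = I + c\tilde{W} + c^2\tilde{W}^2 + c^3\tilde{W}^3 + \cdots$ • $\tilde{W}$: adjacency matrix; $c$: damping factor • $\tilde{W}$: all paths from i to j with length 1 • $\tilde{W}^2$: all paths from i to j with length 2 • $\tilde{W}^3$: all paths from i to j with length 3 • RWR summarizes all the weighted paths from i to j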
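
The path-summation view can be checked numerically: the matrix inverse and the truncated power series agree once enough terms are added. A minimal sketch on a hypothetical 4-node path graph with a column-normalized adjacency matrix (the normalization choice and the function names are assumptions of this note):

```python
import numpy as np

def rwr_matrix_inverse(w, c):
    """Q = (I - c*W)^(-1), computed directly."""
    return np.linalg.inv(np.eye(w.shape[0]) - c * w)

def rwr_matrix_series(w, c, num_terms):
    """Truncated sum of c^k * W^k; term k weights all length-k paths by c^k."""
    total = np.zeros_like(w)
    term = np.eye(w.shape[0])
    for _ in range(num_terms):
        total += term
        term = c * (term @ w)
    return total

# Hypothetical undirected path graph 0-1-2-3
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
w = adj / adj.sum(axis=0)  # column-normalize
gap = np.abs(rwr_matrix_inverse(w, 0.9) - rwr_matrix_series(w, 0.9, 500)).max()
```

Because c < 1, long paths are discounted geometrically, so short paths dominate the score while long ones still contribute.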

  37. Computing RWR • On the fly • No pre-computation; • Light storage cost ($\tilde{W}$) • Slow on-line response: $O(mE)$ • Pre-compute • Fast on-line response • Prohibitive pre-compute cost: $O(n^3)$ • Prohibitive storage cost: $O(n^2)$
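
The two extremes on this slide can be sketched side by side: iterating the walk per query versus inverting once and reading off a column. This is an illustrative sketch on a hypothetical 4-node graph; the restart formulation r = c·W·r + (1 − c)·e and the function names are assumptions of this note.

```python
import numpy as np

def rwr_on_the_fly(w, source, c=0.9, tol=1e-12, max_iter=10_000):
    """Power iteration: no pre-computation, but many matrix-vector products per query."""
    e = np.zeros(w.shape[0])
    e[source] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        nxt = c * (w @ r) + (1.0 - c) * e
        if np.abs(nxt - r).sum() < tol:
            return nxt
        r = nxt
    return r

def rwr_precompute(w, c=0.9):
    """One O(n^3) inversion up front; every later query is a column lookup."""
    return (1.0 - c) * np.linalg.inv(np.eye(w.shape[0]) - c * w)

adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
w = adj / adj.sum(axis=0)           # column-normalize
r_iter = rwr_on_the_fly(w, source=0)
r_lookup = rwr_precompute(w)[:, 0]  # dense n x n matrix kept in memory
```

Both routes give the same ranking vector; they differ only in where the cost is paid, which is the trade-off the next slides try to balance.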

  38. Q: How to Balance? • On-line vs. off-line cost • Goal: efficiently get (elements of) $(I - c\tilde{W})^{-1}$

  39. B_Lin: Basic Idea [Tong+ ICDM 2006] • Find Community • Fix the remaining • Combine • [Figure: the 12-node example graph shown at each step, ending with the ranking vector]

  40. B_Lin: details • $\tilde{W} = \tilde{W}_1 + \tilde{W}_2$ • $\tilde{W}_1$: within community • $\tilde{W}_2$: cross community

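  41. B_Lin: details • $(I - c\tilde{W})^{-1} \approx (I - c\tilde{W}_1 - cUSV)^{-1}$ • $I - c\tilde{W}_1$: easy to invert • $USV$: low-rank approximation (LRA) of the difference $\tilde{W} - \tilde{W}_1$ • Combine with the Sherman–Morrison Lemma!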
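
The "If / Then" statement of the lemma did not survive in the transcript. As a hedged reconstruction, the standard matrix-inversion (Sherman–Morrison–Woodbury) identity says that if $X$, $S$ and $S^{-1} - VX^{-1}U$ are invertible, then $(X - USV)^{-1} = X^{-1} + X^{-1}U(S^{-1} - VX^{-1}U)^{-1}VX^{-1}$. Applying it with $X = I - c\tilde{W}_1$ and $cS$ in place of $S$ gives the form below. The symbols $\tilde{W}_2$, $Q_1$ and $\Lambda$ are notation assumed for this note.

```latex
% Assumed notation: \tilde{W} = \tilde{W}_1 + \tilde{W}_2, with \tilde{W}_2 \approx U S V
\begin{align*}
Q &= (I - c\tilde{W})^{-1} \approx (I - c\tilde{W}_1 - c\,U S V)^{-1}, \\
Q_1 &= (I - c\tilde{W}_1)^{-1}, \qquad \Lambda = (S^{-1} - c\,V Q_1 U)^{-1}, \\
Q &\approx Q_1 + c\,Q_1 U \Lambda V Q_1 .
\end{align*}
```

$Q_1$ is block diagonal when the communities are disjoint, so it can be inverted block by block, and $\Lambda$ is only as large as the rank of the approximation.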

  42. B_Lin: summary • Pre-Compute Stage • Q: Efficiently compute and store Q • A: A few small, instead of ONE BIG, matrix inversions • On-Line Stage • Q: Efficiently recover one column of Q • A: A few, instead of MANY, matrix-vector multiplications
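
The two stages can be sketched end to end. This is a simplified illustration under stated assumptions, not the published implementation: community labels are given rather than found by a partitioner, everything is dense, the low-rank step uses a truncated SVD, and the names `b_lin_precompute` and `b_lin_query` are hypothetical.

```python
import numpy as np

def b_lin_precompute(w, labels, c=0.9, rank=2):
    """Off-line stage: a few small inversions instead of one big one."""
    n = w.shape[0]
    same = labels[:, None] == labels[None, :]
    w1 = np.where(same, w, 0.0)   # within-community links
    w2 = w - w1                   # cross-community links
    q1 = np.zeros((n, n))
    for lab in np.unique(labels):  # invert I - c*W1 one block at a time
        idx = np.flatnonzero(labels == lab)
        block = np.eye(len(idx)) - c * w1[np.ix_(idx, idx)]
        q1[np.ix_(idx, idx)] = np.linalg.inv(block)
    u, s, vt = np.linalg.svd(w2)   # low-rank approximation W2 ~ U S V
    u, s, vt = u[:, :rank], s[:rank], vt[:rank, :]
    lam = np.linalg.inv(np.diag(1.0 / s) - c * vt @ q1 @ u)
    return q1, u, lam, vt

def b_lin_query(pre, source, c=0.9):
    """On-line stage: a few matrix-vector products recover one ranking vector."""
    q1, u, lam, vt = pre
    q1e = q1[:, source]
    return (1.0 - c) * (q1e + c * (q1 @ (u @ (lam @ (vt @ q1e)))))

# Hypothetical graph: two 3-node cliques joined by weak cross links
labels = np.array([0, 0, 0, 1, 1, 1])
adj = np.zeros((6, 6))
adj[:3, :3] = adj[3:, 3:] = 1.0
np.fill_diagonal(adj, 0.0)
cross = 0.1 * np.diag([1.0, 2.0, 3.0])
adj[:3, 3:] = cross
adj[3:, :3] = cross.T
w = adj / adj.sum(axis=0)          # column-normalize
approx = b_lin_query(b_lin_precompute(w, labels, rank=3), source=0)
exact = 0.1 * np.linalg.solve(np.eye(6) - 0.9 * w, np.eye(6)[:, 0])
```

With `rank` equal to the full rank of the cross-community part the result matches the exact solve; smaller ranks trade accuracy for storage and speed.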

  43. Query Time vs. Pre-Compute Time • Quality: 90%+ • On-line: up to 150x speedup • Pre-computation: two orders of magnitude saving • [Figure: log query time vs. log pre-compute time, with "Our Results" marked]

  44. More on Scalability Issues for Querying (the spectrum of "FastProx") • B_Lin: one large linear system • [Tong+ ICDM06, KAIS08] • BB_Lin: the intrinsic complexity is small • [Tong+ KAIS08] • FastUpdate: time-evolving linear system • [Tong+ SDM08, SAM08] • FastAllDAP: multiple linear systems • [Tong+ KDD07 a] • Fast-ProSIN: dealing w/ on-line feedback • [Tong+ ICDM 2008]

  45. Roadmap • Introduction • Completed Work • Querying • Mining (M1: Spotting Anomalies, M2: Mining Time) • Proposed Work

  46. Thesis Overview (same overview grid as slide 7)

  47. Motivation [Tong+ KDD 08 b] • Q: How to find patterns? • e.g., communities, anomalies, etc. • A: Low-Rank Approximation (LRA) for the adjacency matrix of the graph: $A \approx L \times M \times R$

  48. LRA for Graph Mining: Example • Adjacency matrix (author × conference): $A \approx L \times M \times R$ • Authors: John, Tom, Bob, Carl, Van, Roy • Conferences: ICDM, KDD, ISMB, RECOMB • L: author clusters; M: interaction; R: conference clusters • Reconstruction error is high → ‘Carl’ is abnormal
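
The anomaly test on this slide can be sketched with a truncated SVD standing in for the (L, M, R) factorization. The publication counts below are made up for illustration (the slide's actual matrix is not in the transcript), and `low_rank_row_errors` is a hypothetical helper.

```python
import numpy as np

authors = ["John", "Tom", "Bob", "Carl", "Van", "Roy"]
confs = ["ICDM", "KDD", "ISMB", "RECOMB"]
# Hypothetical author x conference matrix: two clean groups, Carl straddles both
a = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

def low_rank_row_errors(a, rank):
    """Per-row reconstruction error of the best rank-`rank` approximation."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return np.linalg.norm(a - approx, axis=1)

errors = low_rank_row_errors(a, rank=2)
flagged = authors[int(np.argmax(errors))]
```

Two components are enough to describe the two author groups, so the row that fits neither group is the one left with the largest residual.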

  49. Challenges: How to get (L, M, R)? • Efficiently • both time and space • Intuitively • easy for interpretation • Dynamically • track patterns over time None of Existing Methods Fully Meets Our Wish List!

  50. Why Not SVD and CUR/CX? • SVD: optimal in $L_2$ and $L_F$ (Frobenius) norms • Efficiency • Time: [not preserved in the transcript] • Space: (L, R) are dense • Interpretation • Linear combination of many columns • Dynamic: not easy • CUR: example-based • Efficiency • Better than SVD • Redundancy in L • Interpretation • Actual columns from A • Dynamic: not easy
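
The example-based idea can be sketched as a CX-style projection onto a few actual columns of A. This is a generic illustration under assumptions of this note (sampling proportional to squared column norm, projection via the pseudo-inverse), not a specific published variant. The matrix is hypothetical.

```python
import numpy as np

def sample_columns(a, k, rng):
    """Sample k column indices with probability proportional to squared column norm."""
    weights = (a ** 2).sum(axis=0)
    return rng.choice(a.shape[1], size=k, replace=True, p=weights / weights.sum())

def cx_approx(a, cols):
    """Project A onto the span of the chosen actual columns: A ~ C C^+ A."""
    c = a[:, cols]
    return c @ np.linalg.pinv(c) @ a

# Hypothetical 6 x 4 matrix of rank 3 (column 3 = column 0 - column 1 + column 2)
a = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
sampled = sample_columns(a, k=3, rng=np.random.default_rng(0))
exact_recon = cx_approx(a, [0, 1, 2])
```

Sampling with replacement can pick the same column more than once, which is one source of the redundancy in L that the slide mentions.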
