
GraphLab Tutorial



Presentation Transcript


  1. GraphLab Tutorial Yucheng Low

  2. GraphLab Team Yucheng Low Joseph Gonzalez Aapo Kyrola Danny Bickson Carlos Guestrin Jay Gu

  3. Development History. GraphLab 0.5 (2010): internal experimental code; insanely templatized. First open source release (LGPL before June 2011, APL from June 2011 onward). GraphLab 1 (2011): nearly everything is templatized; shared memory: Jan 2012; distributed: May 2012. GraphLab 2 (2012): many things are templatized.

  4. GraphLab 2 Technical Design Goals • Improved usability • Decreased compile time • As good or better performance than GraphLab 1 • Improved distributed scalability • … other abstraction changes … (come to the talk!)

  5. Development History • Ever since GraphLab 1.0, all active development has been open source (APL): code.google.com/p/graphlabapi/ (even current experimental code, activated with an --experimental flag on ./configure)

  6. Guaranteed Target Platforms • Any x86 Linux system with gcc >= 4.2 • Any x86 Mac system with gcc 4.2.1 (OS X 10.5??) • Other platforms? … We welcome contributors.

  7. Tutorial Outline • GraphLab in a few slides + PageRank • Checking out GraphLab v2 • Implementing PageRank in GraphLab v2 • Overview of different GraphLab schedulers • Preview of Distributed GraphLab v2 (may not work in your checkout!) • Ongoing work… (as time allows)

  8. Warning • A preview of code still in intensive development! • Things may or may not work for you! • The interface may still change! • GraphLab 1 → GraphLab 2 still has a number of performance regressions we are ironing out.

  9. PageRank Example • Iterate: R[i] = α + (1 − α) Σ_{j ∈ InNbrs(i)} R[j] / L[j] • Where: α is the random reset probability, and L[j] is the number of links on page j. [Figure: a small link graph over six pages, 1–6.]
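To make the iteration concrete, here is a minimal, self-contained C++ sketch of a synchronous PageRank sweep over a toy four-page graph. It is independent of GraphLab, and every name in it (ALPHA, in_nbrs, out_degree) is illustrative only:

    #include <cstdio>
    #include <vector>

    int main() {
      const double ALPHA = 0.15;  // random reset probability
      // in_nbrs[i] lists the pages j that link to page i (toy 4-page web).
      std::vector<std::vector<int>> in_nbrs = {{1, 2}, {0}, {0, 3}, {1, 2}};
      std::vector<int> out_degree = {2, 2, 2, 1};  // L[j]: links on page j
      std::vector<double> rank(4, 1.0), next(4);

      for (int iter = 0; iter < 30; ++iter) {      // fixed sweeps, for brevity
        for (int i = 0; i < 4; ++i) {
          double sum = 0;
          for (int j : in_nbrs[i]) sum += rank[j] / out_degree[j];
          next[i] = ALPHA + (1 - ALPHA) * sum;     // R[i] = α + (1-α) Σ R[j]/L[j]
        }
        rank.swap(next);
      }
      for (int i = 0; i < 4; ++i) std::printf("R[%d] = %.4f\n", i, rank[i]);
    }

GraphLab's engines run this computation asynchronously with dynamic scheduling rather than in fixed sweeps, but the per-vertex arithmetic is the same.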

  10. The GraphLab Framework. [Diagram: Graph-Based Data Representation; Update Functions (User Computation); Scheduler; Consistency Model.]

  11. Data Graph. A graph with arbitrary data (C++ objects) associated with each vertex and edge. • Graph: link graph • Vertex data: webpage, webpage features • Edge data: link weight
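A minimal sketch of how such a data graph can be declared. The struct members are invented for this example, and while graphlab::graph<VertexData, EdgeData> with add_vertex/add_edge matches the shared-memory API as presented in this tutorial, verify the exact headers and signatures against your checkout:

    #include <string>
    #include <graphlab.hpp>

    struct vertex_data {
      std::string url;   // webpage identity / features
      double rank;       // current PageRank value
    };

    struct edge_data {
      double weight;     // link weight, e.g. 1 / L[source]
    };

    typedef graphlab::graph<vertex_data, edge_data> graph_type;

    void build_toy_graph(graph_type& g) {
      vertex_data a = {"a.html", 1.0}, b = {"b.html", 1.0};
      g.add_vertex(a);         // becomes vertex 0
      g.add_vertex(b);         // becomes vertex 1
      edge_data link = {1.0};
      g.add_edge(0, 1, link);  // a.html -> b.html
    }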

  12. The GraphLab Framework. [Diagram: Graph-Based Data Representation; Update Functions (User Computation); Scheduler; Consistency Model.]

  13. Update Functions. An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

    pagerank(i, scope) {
      // Get neighborhood data (R[i], w_ji, R[j]) from the scope
      // Update the vertex data
      R[i] = α + (1 − α) Σ_{j ∈ InNbrs(i)} w_ji · R[j]
      // Reschedule neighbors if needed
      if R[i] changes then reschedule_neighbors_of(i);
    }

  14. Dynamic Schedule. [Figure: a shared scheduler queue of vertices (a, b, c, …) drained concurrently by CPU 1 and CPU 2; executing an update may push new vertices back onto the queue.] The process repeats until the scheduler is empty.
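The execution model behind this picture can be sketched as a work-queue loop. The toy below is a single-threaded simulation for illustration only; none of these names come from the GraphLab engine:

    #include <cstdio>
    #include <deque>
    #include <vector>

    int main() {
      // Toy graph: vertex -> out-neighbors.
      std::vector<std::vector<int>> nbrs = {{1, 2}, {2}, {0}};
      // Pretend each vertex "changes" a fixed number of times before converging.
      std::vector<int> changes_left = {2, 1, 1};
      std::deque<int> scheduler = {0};  // initial schedule

      while (!scheduler.empty()) {      // workers drain the queue until empty
        int v = scheduler.front();
        scheduler.pop_front();
        std::printf("update(%d)\n", v);
        if (changes_left[v]-- > 0)      // value changed: reschedule neighbors
          for (int u : nbrs[v]) scheduler.push_back(u);
      }
    }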

  15. Source Code Interjection 1: graph, update functions, and schedulers

  16. Demo command-line options: --scope=edge, --scope=vertex (select the consistency model at run time)
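Assuming the demo application builds to a binary (the name pagerank_demo below is hypothetical), the scope flags from the slide are passed at launch:

    ./pagerank_demo --scope=vertex   # weakest guarantees, most parallelism
    ./pagerank_demo --scope=edge     # safe when updates only touch adjacent edges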

  17. Consistency vs. "Throughput": a False Trade-off. [Figure: trade-off curve of consistency against # "iterations" per second.] The goal of an ML algorithm is to converge; raw update throughput alone is the wrong objective.

  18. Ensuring Race-Free Code • How much can computation overlap?

  19. The GraphLab Framework Scheduler Graph Based Data Representation Update Functions User Computation Consistency Model

  20. Importance of Consistency. Fast ML algorithm development cycle: build → test → debug → tweak model. The framework must behave predictably and consistently, avoiding problems caused by non-determinism; otherwise, is the execution wrong, or is the model wrong?

  21. Full Consistency. Guaranteed safety for all update functions.

  22. Full Consistency. Parallel updates are only allowed on vertices at least two hops apart → reduced opportunities for parallelism.

  23. Obtaining More Parallelism. Not all update functions will modify the entire scope! Full Consistency → Edge Consistency. Belief propagation: only uses edge data. Gibbs sampling: only needs to read adjacent vertices.

  24. Edge Consistency.

  25. Obtaining More Parallelism. Full Consistency → Edge Consistency → Vertex Consistency. Vertex consistency suits "map" operations, e.g. feature extraction on vertex data.

  26. Vertex Consistency.

  27. The GraphLab Framework. [Diagram: Graph-Based Data Representation; Update Functions (User Computation); Scheduler; Consistency Model.]

  28. Shared Variables • Global aggregation through the Sync operation • A global parallel reduction over the graph data • Synced variables are recomputed at defined intervals while update functions are running • Examples: Sync: log-likelihood; Sync: highest PageRank
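As an illustration of what a Sync computes, here is a self-contained reduction corresponding to "Sync: highest PageRank". The registration API (e.g. an add_sync-style call) differs across GraphLab versions, so only the reduction itself is sketched, with illustrative names:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct vertex_data { double rank; };

    // "Sync: highest PageRank": a fold over all vertex data. The engine
    // runs this as a parallel reduction, merging partial maxima, and
    // recomputes it at a fixed interval while updates are running.
    double highest_rank(const std::vector<vertex_data>& vertices) {
      double best = 0;
      for (const vertex_data& v : vertices) best = std::max(best, v.rank);
      return best;
    }

    int main() {
      std::vector<vertex_data> vertices = {{0.2}, {1.7}, {0.9}};
      std::printf("highest rank = %.2f\n", highest_rank(vertices));
    }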

  29. Source Code Interjection 2: shared variables

  30. What can we do with these primitives? …many, many things…

  31. Matrix Factorization • Netflix collaborative filtering • Alternating least squares (ALS) matrix factorization • Model: 0.5 million nodes, 99 million edges. [Figure: bipartite graph of Users × Movies with latent dimension d.]
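For reference, the model being fit is standard ALS matrix factorization. The objective below is the textbook form; the exact regularization used in the GraphLab application is an assumption here and may differ:

    % ALS objective: latent factors u_i (users) and v_j (movies), both in R^d.
    % The L2 regularization term is assumed, not taken from the slides.
    \min_{U,V} \sum_{(i,j) \in \text{observed ratings}}
        \left( r_{ij} - u_i^\top v_j \right)^2
      + \lambda \Big( \sum_i \|u_i\|^2 + \sum_j \|v_j\|^2 \Big)

In the graph formulation, each user and each movie is a vertex holding its factor vector, each observed rating is an edge, and updating a vertex solves a small least-squares problem over its adjacent edges, exactly the kind of local computation an update function expresses.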

  32. Netflix Speedup. [Plot: speedup as the size d of the matrix factorization increases.]

  33. Video Co-Segmentation • Discover "coherent" segment types across a video (extends Batra et al. '10) • 1. Form super-voxels from the video • 2. EM & inference in a Markov random field • Large model: 23 million nodes, 390 million edges. [Plot: speedup, GraphLab vs. ideal.]

  34. Many More • Tensor Factorization • Bayesian Matrix Factorization • Graphical Model Inference/Learning • Linear SVM • EM clustering • Linear Solvers using GaBP • SVD • Etc.

  35. Distributed Preview

  36. GraphLab 2 Abstraction Changes (an overview of a couple of them) (Come to the talk for the rest!)

  37. Exploiting Update Functors (for the greater good)

  38. Exploiting Update Functors (for the greater good) • Update functors store state • The scheduler schedules update functor instances • We can use update functors as a controlled asynchronous message-passing mechanism to communicate between vertices!

  39. Delta-Based Update Functors

    struct pagerank : public iupdate_functor<graph, pagerank> {
      double delta;
      pagerank(double d) : delta(d) { }
      // If a pagerank functor is already scheduled on the target vertex,
      // the scheduler merges the new instance into it via operator+=.
      void operator+=(pagerank& other) { delta += other.delta; }
      void operator()(icontext_type& context) {
        vertex_data& vdata = context.vertex_data();
        vdata.rank += delta;
        if (abs(delta) > EPSILON) {
          // Spread this vertex's change evenly across its out-edges.
          double out_delta = delta * (1 - RESET_PROB) /
                             context.num_out_edges();
          context.schedule_out_neighbors(pagerank(out_delta));
        }
      }
    };
    // Initial rank:     R[i] = 0;
    // Initial schedule: pagerank(RESET_PROB);

  40. Asynchronous Message Passing • Obviously not all computation can be written this way, but when it can, it can be extremely fast.

  41. Factorized Updates

  42. PageRank in GraphLab

    struct pagerank : public iupdate_functor<graph, pagerank> {
      void operator()(icontext_type& context) {
        vertex_data& vdata = context.vertex_data();
        double sum = 0;
        foreach(edge_type edge, context.in_edges())
          sum += context.const_edge_data(edge).weight *
                 context.const_vertex_data(edge.source()).rank;
        double old_rank = vdata.rank;
        vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
        double residual = abs(vdata.rank - old_rank) / context.num_out_edges();
        if (residual > EPSILON)
          context.reschedule_out_neighbors(pagerank());
      }
    };

  43. PageRank in GraphLab. The same functor, annotated with its three implicit phases:

    struct pagerank : public iupdate_functor<graph, pagerank> {
      void operator()(icontext_type& context) {
        vertex_data& vdata = context.vertex_data();
        // Parallel "Sum" Gather
        double sum = 0;
        foreach(edge_type edge, context.in_edges())
          sum += context.const_edge_data(edge).weight *
                 context.const_vertex_data(edge.source()).rank;
        // Atomic Single-Vertex Apply
        double old_rank = vdata.rank;
        vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
        // Parallel Scatter [Reschedule]
        double residual = abs(vdata.rank - old_rank) / context.num_out_edges();
        if (residual > EPSILON)
          context.reschedule_out_neighbors(pagerank());
      }
    };
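Running such a functor requires a driver. The sketch below assumes the shared-memory graphlab::core engine and reuses graph_type and the pagerank functor defined in the slides above; the option strings and exact method names are from memory and may differ in your checkout:

    #include <iostream>
    #include <graphlab.hpp>

    int main() {
      graphlab::core<graph_type, pagerank> core;
      // ... populate core.graph() with pages and weighted links ...
      core.set_scheduler_type("fifo");  // one of the available schedulers
      core.set_scope_type("edge");      // consistency model, as in the demo
      core.schedule_all(pagerank());    // initial schedule: every vertex
      double runtime = core.start();    // run until the scheduler is empty
      std::cout << "finished in " << runtime << " seconds\n";
      return 0;
    }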

  44. Decomposable Update Functors. Decompose the update function into 3 phases over the scope: Gather, Apply, Scatter. Gather (user-defined, Gather(edge) → Δ) runs on each in-edge, and the partial results are combined with a parallel sum: Δ1 + Δ2 + … → Δ. Apply (user-defined, Apply(vertex, Δ)) atomically applies the accumulated value to the center vertex. Scatter (user-defined) runs on each out-edge to update adjacent edges and vertices, optionally rescheduling them.

  45. Factorized PageRank

    struct pagerank : public iupdate_functor<graph, pagerank> {
      double accum = 0, residual = 0;
      void gather(icontext_type& context, const edge_type& edge) {
        accum += context.const_edge_data(edge).weight *
                 context.const_vertex_data(edge.source()).rank;
      }
      void merge(const pagerank& other) { accum += other.accum; }
      void apply(icontext_type& context) {
        vertex_data& vdata = context.vertex_data();
        double old_value = vdata.rank;
        vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
        residual = fabs(vdata.rank - old_value) / context.num_out_edges();
      }
      void scatter(icontext_type& context, const edge_type& edge) {
        if (residual > EPSILON)
          context.schedule(edge.target(), pagerank());
      }
    };

  46. Demo of *everything*: PageRank

  47. Ongoing Work • Extensions to improve performance on large graphs (see the GraphLab talk later!) • Better distributed graph representation methods • Possibly better graph partitioning • Out-of-core graph storage • Continually changing graphs • An all-new rewrite of distributed GraphLab (come back in May!)

