Beehive: A Framework for Graph Data Analytics on Cloud Computing Platforms. Anand Tripathi, Vinit Padhye, Tara Sasank Sunkara Department of Computer Science University of Minnesota Presentation by Tara Sasank Sunkara eBay Inc.
Acknowledgements: This work was partly supported by NSF award 1319333 and by the computing resources of Minnesota Supercomputing Institute (MSI)
Outline • Project Goals • Beehive Computation Model • Beehive System Architecture • Beehive Programming Framework • Architectural mechanisms and optimizations • Experimental evaluation • Algorithmic techniques for performance improvement • Conclusion and future work
Project Goals • Many data analytics applications require processing of large-scale graph data. • Analysis of such large-scale graph data requires parallel processing utilizing a cluster computing environment. • Parallelism in many graph problems tends to be fine-grained and irregular, and it is not easy to extract parallelism through static analysis and data partitioning. • This is called amorphous parallelism.
Project Goals • Problem: How to extract amorphous parallelism in large-scale graph problems? • Graph problems with amorphous parallelism cannot be easily partitioned for programming using the MapReduce model. • The Beehive framework has been developed to address this problem, providing an alternate programming model.
Project Goals The design of the Beehive framework has been driven by the following goals: • Provide a programming model which enables extraction of amorphous parallelism using a speculative execution model based on optimistic concurrency control. • Provide simple abstractions and programming primitives that eliminate complex message-passing paradigms. • Provide support for fault-tolerance and recovery. • This aspect is the focus of our ongoing work.
Beehive Computation Model It has three key elements: • A distributed key-value based storage system which maintains graph data in the memory of cluster computing nodes. • A task-pool model for parallel execution of tasks on cluster nodes • Worker threads executing tasks as atomic transactions in parallel. • A transaction model which ensures atomicity and isolation of the tasks • In case of any read-write or write-write conflicts among parallel tasks, one of them commits and the others are aborted. • Speculatively harness amorphous parallelism using optimistic concurrency control techniques.
Beehive System Architecture • The Beehive system executes on a collection of computing nodes in a cluster • A Beehive process (called a Beehive Node) executing on a cluster node contains the following components: • Local workpool of tasks to be executed • A pool of worker threads • A component of the global key-value based data storage service • The system contains a Global Transaction Validation Service for optimistic concurrency control
Beehive computation model • Computation Model: Task and Transaction • Task: Computation for a task is specific to the application problem and the algorithm • A task reads and updates some vertices • A task can create new tasks on its completion • Transaction: Every task is executed as a transaction. • The transaction is validated by the ‘global validator’ • On an abort, the task is re-executed as a new transaction • On commit, the updates are written to the Beehive storage
Distributed Key-Value Based Storage • Graph data is stored as a collection of key-value based items in a distributed storage across cluster nodes • Typically each vertex is stored with vertex-id as key • Data is maintained in-memory at cluster nodes • A task can access any item with location-transparency • Key-value items can be relocated dynamically, for example for graph clustering, to improve locality of data with tasks • This relieves the programmer of the burden of explicitly using message-passing primitives.
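The location-transparent access described above can be illustrated with a minimal single-process sketch. Everything in it is an assumption made for illustration: the class name `DistributedStore`, the modulo placement rule, and the `relocate` method are hypothetical and are not Beehive's actual API.

```python
class DistributedStore:
    """Toy stand-in for a key-value store partitioned across cluster nodes."""

    def __init__(self, num_nodes):
        # One in-memory dict per simulated cluster node.
        self.partitions = [dict() for _ in range(num_nodes)]
        # Placement overrides for items that have been relocated.
        self.relocated = {}

    def home_node(self, key):
        # Default placement: vertex-id modulo node count (an assumption).
        return self.relocated.get(key, key % len(self.partitions))

    def put(self, key, value):
        self.partitions[self.home_node(key)][key] = value

    def get(self, key):
        # The caller never names a node: this is the location transparency.
        return self.partitions[self.home_node(key)][key]

    def relocate(self, key, new_node):
        # Move an item, for example to co-locate a cluster of vertices.
        value = self.partitions[self.home_node(key)].pop(key)
        self.relocated[key] = new_node
        self.partitions[new_node][key] = value


store = DistributedStore(num_nodes=4)
store.put(7, {"height": 0, "excess": 0})
store.relocate(7, 0)
vertex = store.get(7)  # same call before and after relocation
```

In a real deployment the `get` and `put` calls would turn into remote requests when the home node is not the local one; the sketch only shows that the task code does not change when an item moves.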
Task-pool Model • A distributed pool of ready-to-run tasks is maintained across the cluster nodes. • Each cluster node contains a pool of worker threads • The size of this pool is declared by the application program • A worker thread’s function is to repeatedly pick a task from the local pool and execute it as a transaction using optimistic concurrency control methods: • On commit of the transaction-task, it updates the global storage and possibly creates new tasks • On abort, the worker repeats the task execution as a new transaction
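The worker loop of a single node can be sketched as follows, assuming a thread-safe queue as the local pool. The names `run_task_pool` and `make_task` are hypothetical, and the transactional execution and retry-on-abort steps are deliberately left out here so that only the pick, execute, and create-new-tasks cycle is shown.

```python
import queue
import threading


def run_task_pool(initial_tasks, num_workers):
    """Run tasks until the pool drains; each task returns a list of new tasks."""
    pool = queue.Queue()
    for task in initial_tasks:
        pool.put(task)

    def worker():
        while True:
            task = pool.get()
            if task is None:            # sentinel: shut this worker down
                pool.task_done()
                return
            for new_task in task():     # a finished task may create new tasks
                pool.put(new_task)
            pool.task_done()            # mark done only after children are queued

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pool.join()                         # waits for all tasks and their children
    for _ in threads:
        pool.put(None)
    for t in threads:
        t.join()


results = []
results_lock = threading.Lock()


def make_task(n):
    def task():
        with results_lock:
            results.append(n)
        return [make_task(n - 1)] if n > 0 else []
    return task


run_task_pool([make_task(3)], num_workers=2)
```

New tasks are enqueued before the parent is marked done, so the pool never looks empty while work is still being generated.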
Transactional Model of Task Execution • Computation tasks in a graph analytics program are executed as transactions. • The transaction execution model is based on optimistic concurrency control methods [Kung-Robinson]: • A transaction (task) reads the required graph data from the key-value storage system into its local buffer • It performs all updates on the buffered data • After the computation phase, it enters the validation phase to detect any read-write and write-write conflicts with other concurrent transactional tasks • On commit, it writes the updated data items into the key-value storage, and it may create new tasks which are inserted into the task pool
Transactional Model of Task Execution Execution phases of a transactional task: • Computation Phase: Read data from storage into local buffers; compute and modify data in the local buffers • Validation Phase: Check for read-write and write-write conflicts with other parallel tasks • On commit, Write Phase: Write the updated data in the local buffers to the storage system; add new tasks to the Task-Pool • On abort: Re-execute the task as a new transaction
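The three phases can be sketched in a few lines, assuming a plain dictionary as the storage system and a caller-supplied `validate` function standing in for the validation service. The names `Transaction`, `execute_task`, and `increment` are hypothetical and chosen only for this illustration.

```python
class Transaction:
    """Buffers reads and writes locally; nothing reaches the store until commit."""

    def __init__(self, store):
        self.store = store
        self.buffer = {}          # local copies of data items
        self.read_set = set()
        self.write_set = set()

    def read(self, key):
        if key not in self.buffer:
            self.buffer[key] = self.store[key]
        self.read_set.add(key)
        return self.buffer[key]

    def write(self, key, value):
        self.buffer[key] = value  # update the buffered copy only
        self.write_set.add(key)

    def write_back(self):
        for key in self.write_set:
            self.store[key] = self.buffer[key]


def execute_task(store, task_body, validate, task_pool):
    """Run task_body as a transaction, re-executing it until validation succeeds."""
    attempts = 0
    while True:
        attempts += 1
        txn = Transaction(store)
        new_tasks = task_body(txn)                   # computation phase
        if validate(txn.read_set, txn.write_set):    # validation phase
            txn.write_back()                         # write phase
            task_pool.extend(new_tasks)
            return attempts
        # Aborted: buffered updates are dropped and the task runs again.


def increment(txn):
    txn.write("x", txn.read("x") + 1)
    return []


store = {"x": 1}
outcomes = iter([False, True])   # stub validator: abort once, then commit
attempts = execute_task(store, increment, lambda r, w: next(outcomes), [])
```

Because the aborted attempt never touched the store, the retry starts from the committed value and the item is incremented exactly once.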
Why optimistic model? • Initially we investigated a conflict-free transactional task scheduling model • No two tasks with overlapping working sets (read/write set items) can be executed concurrently • Major disadvantages of the conflict-free scheduling approach: • The read/write sets of the tasks may not be known a priori. • It is highly pessimistic. • We also considered a locking-based approach, but it was not adopted due to the complexity of issues such as lock management and deadlocks.
Transaction Model • A transaction (task) acquires a ‘Start-Timestamp’ when it begins execution • Read and Compute Phase • The validation service checks that no concurrent transaction committed after the start-timestamp has any read-write or write-write conflicts with it. • The validation service commits the transaction and assigns it a Commit-Timestamp. • The transaction writes the updates to the global key-value storage. • It reports completion to the global validation service.
Transaction Validation Model Validation Service maintains two counters: • Last assigned Commit Timestamp (CTS) • Once a transaction is validated, it is assigned a commit timestamp from this counter • Stable Timestamp (STS) • Updates of all committed transactions up to this commit timestamp value have been pushed to the global storage. • STS is used as the start timestamp of any new transaction. [Figure: timeline of commit timestamps 100 to 106 with STS and CTS markers; updates up to STS are written to the global storage, later updates up to CTS are NOT yet written.]
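The interplay of the two counters can be sketched as below. This is a simplified, single-threaded illustration under stated assumptions: the class and method names are hypothetical, write sets are kept forever, and no locking is shown.

```python
class Validator:
    """Toy validation service with a commit counter (CTS) and a stable counter (STS)."""

    def __init__(self):
        self.cts = 0             # last assigned commit timestamp
        self.sts = 0             # all commits up to here are in global storage
        self.write_sets = {}     # commit timestamp -> keys written by that commit
        self.completed = set()   # commits whose writes finished out of order

    def start_timestamp(self):
        return self.sts          # new transactions start at the stable timestamp

    def validate(self, start_ts, read_set, write_set):
        touched = set(read_set) | set(write_set)
        # Check every transaction that committed after our start timestamp.
        for ts in range(start_ts + 1, self.cts + 1):
            if self.write_sets[ts] & touched:
                return None      # read-write or write-write conflict: abort
        self.cts += 1
        self.write_sets[self.cts] = set(write_set)
        return self.cts

    def report_completion(self, commit_ts):
        self.completed.add(commit_ts)
        # Advance STS over every contiguous completed commit.
        while self.sts + 1 in self.completed:
            self.sts += 1
            self.completed.discard(self.sts)


v = Validator()
t1 = v.start_timestamp()
t2 = v.start_timestamp()
c1 = v.validate(t1, {"a"}, {"a"})        # commits
c2 = v.validate(t2, {"a", "b"}, {"b"})   # aborts: it read "a", written by c1
c3 = v.validate(t2, {"c"}, {"c"})        # commits, no overlap
v.report_completion(c3)
sts_before = v.sts                       # c1 has not finished writing yet
v.report_completion(c1)
```

STS only moves once every earlier commit has reported completion, which is why finishing `c3` first does not advance it.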
Example problem • Max-flow problem: Pre-flow Push algorithm • For each vertex with excess flow, push the excess flow to neighboring vertices that are at a lower height. • If there is no neighboring vertex of lower height with available edge capacity, lift the height of the vertex. • Keep doing this until the flow at all vertices except the source and the sink is balanced.
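For reference, the push and lift steps can be written as a small sequential program. This is a textbook-style sketch, not the parallel Beehive implementation: the function name and the edge-dictionary input format are assumptions, and a push is allowed only to a neighbor exactly one level lower, which is the usual admissibility rule for this algorithm.

```python
from collections import deque


def max_flow_preflow_push(capacity, source, sink):
    """capacity maps (u, v) edges to capacities; returns the max-flow value."""
    nodes = {u for edge in capacity for u in edge}
    residual = {}
    neighbors = {u: set() for u in nodes}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)
        neighbors[u].add(v)
        neighbors[v].add(u)

    height = {u: 0 for u in nodes}
    excess = {u: 0 for u in nodes}
    height[source] = len(nodes)
    active = deque()             # vertices with excess flow, except source/sink

    def push(u, v):
        amount = min(excess[u], residual[(u, v)])
        residual[(u, v)] -= amount
        residual[(v, u)] += amount
        excess[u] -= amount
        excess[v] += amount
        if v not in (source, sink) and excess[v] == amount:
            active.append(v)     # v had no excess before, so it just became active

    # Initial preflow: saturate every edge leaving the source.
    excess[source] = sum(residual[(source, v)] for v in neighbors[source])
    for v in neighbors[source]:
        if residual[(source, v)] > 0:
            push(source, v)

    while active:
        u = active.popleft()
        while excess[u] > 0:
            pushed = False
            for v in neighbors[u]:
                if residual[(u, v)] > 0 and height[u] == height[v] + 1:
                    push(u, v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Lift: no lower neighbor with spare capacity.
                height[u] = 1 + min(height[v] for v in neighbors[u]
                                    if residual[(u, v)] > 0)
    return excess[sink]


flow = max_flow_preflow_push(
    {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3},
    "s", "t")
```

In the Beehive setting each discharge of an active vertex would be a task, and the reads and writes of `height`, `excess`, and edge flows would go through a transaction.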
Max-flow algorithm [Figure: example graph showing tasks (T) and vertices labeled with their heights (H=3 to H=6).]
Beehive Programming Framework • The framework provides a Worker thread class. • An application-defined worker class can inherit from this class. • A worker thread picks a task from the local workpool and executes the doTask() method. • An application can override this method when inheriting from the worker class. • The framework provides mechanisms for executing a task’s computation as a transaction.
Research Problems • Architectural mechanisms • Task distribution strategies: sender-initiated vs. receiver-initiated • Task placement: locality-aware vs. load-aware • Task validation: single global validation vs. hierarchical validation • Support for the barrier-synchronous model for phased execution • Non-transactional task execution • Algorithmic techniques for performance improvement by reducing remote data access costs • Task granularity • Caching
Task Distribution Model • A task completion may result in creation of new tasks • The new tasks are distributed across different Beehive nodes in two ways: • Locality-aware: Affinity of a task to execute at a particular Beehive node, based either on data locality or task function. Affinity may be one of the following three types: • Strong: Must execute at a designated node • For example, some initialization task • Weak: Prefer to execute at the designated node. • No-Affinity: Can be executed at any node • Load-aware: Balancing of load at different Beehive nodes
Load Distribution Models Load distribution strategies for newly created tasks: • K-way split: The local work-pool invokes the load distributor on every task completion and splits the newly generated tasks among K peers (including the local node). • Random: any K-1 other peers • Round Robin: the next K-1 peers • Load Aware: the K-1 least loaded peers • The Beehive framework provides a mechanism to obtain load information of other Beehive nodes.
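The three peer-selection strategies can be sketched as follows. The function names, the integer node ids, and the choice to deal tasks out evenly over the K targets are assumptions made for illustration; the slides do not say how tasks are divided among the chosen peers.

```python
import random


def choose_peers(strategy, local, num_nodes, k, loads=None, rng=random):
    """Pick K target nodes (local node first) for newly created tasks."""
    others = [n for n in range(num_nodes) if n != local]
    if strategy == "random":
        picked = rng.sample(others, k - 1)            # any K-1 other peers
    elif strategy == "round_robin":
        picked = [(local + i) % num_nodes for i in range(1, k)]  # next K-1 peers
    elif strategy == "load_aware":
        picked = sorted(others, key=lambda n: loads[n])[:k - 1]  # least loaded
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [local] + picked


def k_way_split(tasks, targets):
    """Deal the tasks out evenly over the chosen target nodes (an assumption)."""
    assignment = {t: [] for t in targets}
    for i, task in enumerate(tasks):
        assignment[targets[i % len(targets)]].append(task)
    return assignment


rr_targets = choose_peers("round_robin", local=2, num_nodes=4, k=3)
la_targets = choose_peers("load_aware", local=0, num_nodes=4, k=3,
                          loads=[5, 1, 9, 0])
split = k_way_split(["a", "b", "c", "d", "e"], la_targets)
```

The sketch requires K to be no larger than the number of nodes; the load vector would come from whatever load-information mechanism the framework exposes.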
Task Validation Approaches • Single Global Validation: • Global validator at Global Task Management Service • Every transaction has to get validated to commit and update the shared storage. • Hierarchical Validation: • A local validator at every Beehive node, in addition to the global validator. • Filters requests to the global validator by aborting transactions that conflict with locally executed concurrent transactions • Reduced the load on the global validator by more than 60% in our experiments.
Hierarchical Validation • Used the Max-Flow problem for this evaluation • 30%-60% of validation requests filtered at local validator • More significant gains in bigger graphs with more threads
Vertices | Beehive nodes/threads | Local aborts | Global aborts | Global commits | Total validation requests
100 | 10/10 | 1321 | 1475 | 821 | 3617
100 | 5/10 | 3605 | 4088 | 2603 | 10296
1600 | 10/10 | 284677 | 185135 | 194600 | 664412
1600 | 10/20 | 366410 | 96643 | 163325 | 626378
1600 | 10/40 | 181287 | 58079 | 68925 | 308291
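The filtered fraction can be recomputed from the counts on this slide: a request is filtered when it ends as a local abort, so the fraction is local aborts divided by total validation requests. The short script below does that arithmetic; the variable names are illustrative only.

```python
rows = [
    # (vertices, nodes/threads, local aborts, global aborts, global commits, total)
    (100,  "10/10",   1321,   1475,    821,   3617),
    (100,  "5/10",    3605,   4088,   2603,  10296),
    (1600, "10/10", 284677, 185135, 194600, 664412),
    (1600, "10/20", 366410,  96643, 163325, 626378),
    (1600, "10/40", 181287,  58079,  68925, 308291),
]

filtered_percent = []
for vertices, config, local, g_abort, g_commit, total in rows:
    # The three outcome counts should add up to the reported total.
    assert local + g_abort + g_commit == total
    percent = 100.0 * local / total
    filtered_percent.append(percent)
    print(f"{vertices:>5} vertices, {config:>5}: {percent:.1f}% filtered locally")
```

The per-row values fall between roughly 35% and 59%, with the highest fractions in the 1600-vertex runs that use 20 or 40 threads per node.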
Two models of parallel execution • Many problems with structured parallelism can be executed using the Barrier synchronization model, without requiring the transactional task execution model. • An application can specify the execution mode as either TRANSACTION MODE or BARRIER MODE • Barrier model is useful for problems with structured parallelism and BSP based programming models. • Example: PageRank
Experimental Evaluation • We programmed several graph problems to evaluate the performance of the Beehive framework and its mechanisms • Max-Flow Problem using the Preflow-Push Algorithm • Minimum Weight Spanning Tree problem using the Gallager-Humblet-Spira Algorithm • Graph-Coloring problem • PageRank problem • This problem was programmed using the Barrier model of execution • Experiments were conducted on the Itasca cluster of Minnesota Supercomputing Institute: • Each cluster node has 8 cores, 2.8 GHz, 22 GB memory
Max-Flow Problem • Implemented Preflow-Push Algorithm • Evaluation with graphs of different sizes and edge capacities • Graphs generated using Washington Graph Generator • Used Random-Level Graphs
Impact of Affinity levels • We evaluated graphs under different affinity settings. • With strong affinity set, the execution took more time. • Weak affinity and no affinity performed almost the same.
Task granularity [Figure: fine-grained tasks versus coarse-grained tasks; legend: Task, Vertex, Task Vertex Set.]
Performance improvement techniques • Increased task granularity • In the Max-flow problem • Increasing the task size to a vertex and its neighborhood • This may increase the number of aborts per transaction because the read/write sets are bigger. • Advantages: • Reduces number of tasks • Reduces network access costs through parallel reads and writes
Improvement with increased task granularity • Max-flow problem for a 1600 vertex graph. • Number of tasks reduced to one third • No significant increase in the fraction of aborts • Data below is for a graph of 1600 vertices
Performance improvement through caching • When a task is re-executed because of an abort, we avoid re-fetching the working set data items which have not been modified. • This required us to include additional functionality in the validator: • Validator indicates which data items have been modified. • Task re-fetches only those modified items.
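A minimal sketch of the re-fetch step, assuming the validator's abort response carries the set of modified keys and that the task keeps its previous working set in a local cache. The function name and the dictionary-based store are hypothetical.

```python
def refetch_after_abort(store, cached, modified_keys):
    """Refresh only the cached items reported as modified; return the fetch count."""
    fetched = 0
    for key in modified_keys:
        if key in cached:
            cached[key] = store[key]   # stands in for one remote read
            fetched += 1
    return fetched


store = {1: "a-new", 2: "b", 3: "c"}    # item 1 changed since the first attempt
cached = {1: "a-old", 2: "b", 3: "c"}   # working set kept from the aborted run
remote_reads = refetch_after_abort(store, cached, modified_keys={1})
```

Here one remote read replaces the three that a full re-fetch of the working set would need.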
Minimum Spanning Tree Problem • Given an undirected graph with edge weights. • Implemented the Gallager-Humblet-Spira Algorithm • A vertex merges with its nearest neighbor to form a cluster, and becomes cluster-head. • Successively, a cluster merges with its nearest node outside its cluster or nearest other cluster. • Computation stops when no more merging is possible. • The clusters left at the end are the connected components of the graph.
Data access patterns Problem in merging clusters: • Identifying the cluster head of the target cluster may require following cluster head pointers on a chain of vertices. • This may introduce significant remote data access cost Solution: • Update the cluster head pointers of vertices in a cluster to directly point to the cluster head while merging. • This can be performed asynchronously as a background task • Push some of this computation into the storage service.
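The pointer-chasing problem and its fix can be shown on a toy example. The sketch assumes cluster membership is stored as one head pointer per vertex in a dictionary; `find_head`, `merge`, and `flatten` are hypothetical names, and `flatten` plays the role of the background task.

```python
def find_head(head, v):
    """Follow cluster-head pointers from v; return (cluster head, hops taken)."""
    hops = 0
    while head[v] != v:
        v = head[v]
        hops += 1        # each hop may be a remote read in a distributed store
    return v, hops


def merge(head, a, b):
    """Merge the cluster containing a into the cluster containing b."""
    head_a, _ = find_head(head, a)
    head_b, _ = find_head(head, b)
    head[head_a] = head_b
    return head_b


def flatten(head):
    """Background step: point every vertex directly at its cluster head."""
    for v in head:
        root, _ = find_head(head, v)
        head[v] = root


head = {v: v for v in range(4)}   # four singleton clusters
merge(head, 0, 1)
merge(head, 1, 2)
merge(head, 2, 3)
_, hops_before = find_head(head, 0)   # walks the chain 0 -> 1 -> 2 -> 3
flatten(head)
_, hops_after = find_head(head, 0)    # a single hop after flattening
```

After flattening, every lookup costs at most one hop, which is the reduction in remote accesses the slide is aiming for.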
Graph Coloring • A coloring task is executed for each vertex. • It reads the colors, if any are assigned, of all its neighbors. • It chooses the lowest-numbered unused color for the vertex.
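A per-vertex coloring task can be sketched as below, run sequentially here for clarity. The adjacency-list input, the `color_vertex` name, and numbering colors from 0 are assumptions; in Beehive the reads of neighbor colors and the write of the chosen color would happen inside a transaction, so two adjacent vertices colored concurrently would conflict and one would retry.

```python
def color_vertex(v, neighbors, color):
    """Assign v the lowest-numbered color not used by any colored neighbor."""
    used = {color[u] for u in neighbors[v] if u in color}
    c = 0
    while c in used:
        c += 1
    color[v] = c
    return c


# A triangle (0, 1, 2) with one extra vertex 3 attached to vertex 2.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
color = {}
for v in neighbors:          # one coloring task per vertex
    color_vertex(v, neighbors, color)
```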
PageRank Problem • Barrier model for phased execution. • Non-transactional execution.
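Phased, non-transactional execution can be sketched with a thread barrier: every worker computes new ranks for its own vertices, waits, publishes them, and waits again. The function name, the damping factor of 0.85, the fixed iteration count, and the requirement that every vertex has at least one out-link are assumptions of this sketch, not details from the slides.

```python
import threading


def pagerank_barrier(out_links, num_workers=2, iterations=20, damping=0.85):
    """Synchronous PageRank; assumes every vertex is a key with >= 1 out-link."""
    vertices = sorted(out_links)
    n = len(vertices)
    in_links = {v: [] for v in vertices}
    for u in vertices:
        for v in out_links[u]:
            in_links[v].append(u)

    rank = {v: 1.0 / n for v in vertices}
    new_rank = dict(rank)
    barrier = threading.Barrier(num_workers)

    def worker(my_vertices):
        for _ in range(iterations):
            for v in my_vertices:
                incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
                new_rank[v] = (1 - damping) / n + damping * incoming
            barrier.wait()       # all new ranks of this phase are computed
            for v in my_vertices:
                rank[v] = new_rank[v]
            barrier.wait()       # all new ranks are published

    threads = [threading.Thread(target=worker, args=(vertices[i::num_workers],))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rank


ranks = pagerank_barrier({0: [1], 1: [2], 2: [0]})
```

No validation is needed because within a phase each worker only reads the old ranks and only writes entries for its own vertices.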
Amount of parallelism • Abort rates for a 10000 vertex graph. • Ratio of abort/commit close to 7.3 for the max-flow problem. • Signifies low parallelism achievable for this particular problem. • Graph coloring problem has just 10003 tasks. • One task per vertex • Three bookkeeping tasks
Related Work • Distributed GraphLab [Low et al] is closest to our work, but that system does not support an optimistic execution model or dynamic graph structures. It either expects the graph to be colored for parallel execution or provides a locking engine to acquire locks on a vertex and its neighborhood. • Piccolo [Power] provides a programming model based on a shared data store but does not provide transactional semantics for multi-item updates. Its runtime resolves conflicts using user-defined accumulation functions. • Pregel [Malewicz] provides a bulk synchronous message-passing abstraction, with messages between vertices for communication. It may not be suitable for all types of graph processing. • Dryad [Isard] is based on a data-flow model. • Parallel BGL [Gregor] is a C++ based library for distributed memory multi-processors; it uses the notion of active messages and executes in BSP-like phases.
Conclusion • Optimistic task scheduling methods can be effectively used for exploiting amorphous parallelism in graph problems. • This relieves the programmer of the burden of explicit message passing and synchronization, • But implementation of the algorithm should be driven towards amortizing or reducing remote data access costs. • Hierarchical validation helps filter around 30%-60% of validation requests • Performance improvement can be achieved using data caching, increasing task granularity, and algorithm re-design to reduce remote data access costs. • Load-aware task placement is more efficient than locality-aware task placement. • There is an optimal cluster size for performance, because remote data latencies start to dominate execution times.
Current and Future Work • Fault tolerance • Checkpointing and recovery on failures • Efficient clustering methods and initial loading of data • This can significantly improve data locality for tasks • Adaptive methods to control the degree of optimistic execution to reduce the abort rate. • Hybrid scheduling mechanisms to shift dynamically from optimistic execution to conflict-free scheduling. • Optimizing algorithm implementation to reduce data access/computations if possible. • Programming of application problems from social networking domain, ML/DM algorithms.