Herodotos Herodotou Shivnath Babu Duke University

Presentation Transcript


  1. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs • Herodotos Herodotou, Shivnath Babu • Duke University

  2. Abstract • MapReduce has emerged as a viable competitor to database systems in big data analytics. • MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. • We introduce the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs.

  3. Outline • Introduction • Profiler • What-if engine • Cost-based optimizer • Experimental evaluation • Conclusion

  4. Introduction • MapReduce job J = <p, d, r, c> • p: MapReduce program, written as map(k1, v1) → list(k2, v2) and reduce(k2, list(v2)) → list(v3) • d: Input data • r: Cluster resources • c: Configuration parameter settings
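
The job abstraction j = <p, d, r, c> can be sketched as a small data structure. All field names and sample values below are illustrative, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class MapReduceJob:
    """A MapReduce job j = <p, d, r, c> (field names are illustrative)."""
    program: str      # p: the MapReduce program
    input_data: dict  # d: properties of the input data
    resources: dict   # r: cluster resources
    config: dict = field(default_factory=dict)  # c: configuration settings

# Hypothetical example job.
job = MapReduceJob(
    program="wordcount",
    input_data={"size_bytes": 10 * 2**30, "records": 50_000_000},
    resources={"nodes": 16, "map_slots": 2, "reduce_slots": 2},
    config={"mapred.reduce.tasks": 32, "io.sort.mb": 100},
)
```

The optimizer described later varies only the `config` component while `program`, `input_data`, and `resources` stay fixed.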

  5. Introduction • Phases of Map Task Execution • Read, Map, Collect, Spill, Merge • Phases of Reduce Task Execution • Shuffle, Merge, Reduce, Write

  6. Introduction job j = <program p, data d, resources r, configuration c> • Space of configuration choices: • Number of map tasks • Number of reduce tasks • Partitioning of map outputs to reduce tasks • Memory allocation to task-level buffers • Multiphase external sorting in the tasks • Whether output data from tasks should be compressed • Whether the combine function should be used

  7. Introduction

  8. Introduction • Today: use defaults or set parameters manually (rules of thumb)

  9. Introduction • Cost-based Optimization to Select Configuration Parameter Settings Automatically • perf = F(p, d, r, c) • perf is some performance metric of interest for jobs • Optimizing the performance of program p for given input data d and cluster resources r requires finding configuration parameter settings that give near-optimal values of perf.
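
A minimal sketch of this formulation follows, with a toy stand-in for the cost model F. In the paper, F is realized by the What-if Engine; the formula and parameter names below are assumptions for illustration only:

```python
def F(p, d, r, c):
    """Toy stand-in for perf = F(p, d, r, c): estimated job running time
    in seconds. A real What-if Engine derives this from a job profile;
    the constants here are made up."""
    n_maps = max(1, d["size_bytes"] // c["split_size"])
    n_reduces = c["num_reduces"]
    # Assume fixed per-task times and wave-based scheduling on the cluster.
    map_time = 10.0 * n_maps / (r["nodes"] * r["map_slots"])
    reduce_time = 20.0 * n_reduces / (r["nodes"] * r["reduce_slots"])
    return map_time + reduce_time

def best_config(p, d, r, candidates):
    """Pick the configuration c that minimizes F over a candidate list."""
    return min(candidates, key=lambda c: F(p, d, r, c))

d = {"size_bytes": 10 * 2**30}
r = {"nodes": 16, "map_slots": 2, "reduce_slots": 2}
candidates = [{"split_size": 128 * 2**20, "num_reduces": n} for n in (8, 16, 32)]
c_best = best_config("wordcount", d, r, candidates)
```

Here the candidate list plays the role of the space S; the later slides replace this exhaustive scan with subspace enumeration plus Recursive Random Search.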

  10. Applying Cost-based Optimization • Goal: find the setting c_opt in the space S of configuration parameter settings that minimizes perf = F(p, d, r, c) • Just-in-Time Optimizer: searches through the space S of parameter settings • What-if Engine: estimates perf using properties of p, d, r, and c

  11. Job Profile • Concise representation of program execution as a job • Records information at the level of “task phases” • Generated by the Profiler through measurement, or by the What-if Engine through estimation • (figure: map-task pipeline: split → DFS Read → Map → Collect (serialize, partition, sort in memory buffer, [Combine], [Compress]) → Spill → Merge)

  12. Job Profile Fields

  13. Generating Profiles by Measurement • Dynamic instrumentation • Monitors task phases of MapReduce job execution • Event-condition-action rules are specified, leading to run-time instrumentation of Hadoop internals • We currently use BTrace (Hadoop internals are in Java)

  14. Generating Profiles by Measurement

  15. Profiler • Using Profiles to Analyze Job Behavior

  16. What-if Engine • A what-if question has the following form: • Given the profile of a job j = <p, d1, r1, c1> that runs a MapReduce program p over input data d1 and cluster resources r1 using configuration c1, what will the performance of program p be if p is run over input data d2 and cluster resources r2 using configuration c2? That is, how will job j' = <p, d2, r2, c2> perform? • The What-if Engine executes the following two steps to answer a what-if question: • Estimating a virtual job profile for the hypothetical job j' • Using the virtual profile to simulate how j' will execute • We will discuss these steps in turn.

  17. What-if Engine • Estimating Dataflow and Cost fields • A detailed set of analytical (white-box) models estimates the Dataflow and Cost fields in the virtual job profile for j' • Estimating Dataflow Statistics fields • Dataflow proportionality assumption • Estimating Cost Statistics fields • Cluster node homogeneity assumption • Simulating the Job Execution • Task Scheduler Simulator
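
The dataflow proportionality assumption can be sketched as follows: ratio statistics observed in the profile (e.g., map selectivity) are assumed to hold for the new input, so absolute dataflow fields scale linearly with input size. Field names and values here are illustrative:

```python
def scale_dataflow(profile_dataflow, d1_bytes, d2_bytes):
    """Dataflow proportionality assumption: absolute dataflow fields grow
    linearly with input size, because ratio statistics (e.g., map
    selectivity) are assumed unchanged from the profiled run."""
    ratio = d2_bytes / d1_bytes
    return {name: value * ratio for name, value in profile_dataflow.items()}

# Profile measured on a 1 GB run; predict dataflow for a 3 GB run.
measured = {"map_output_bytes": 500_000_000, "map_output_records": 5_000_000}
predicted = scale_dataflow(measured, 1 * 2**30, 3 * 2**30)
```

The cluster node homogeneity assumption plays the analogous role for Cost Statistics: per-phase costs measured on r1's nodes are carried over to r2's nodes when the hardware is comparable.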

  18. What-if Engine

  19. Virtual Profile Estimation • Given profile for job j = <p, d1, r1, c1> • Estimate profile for job j' = <p, d2, r2, c2>

  20. White-box Models • Detailed set of equations for Hadoop • Example: combine input data properties, dataflow statistics, and configuration parameters to calculate the dataflow in each phase of a map task
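
A toy version of such an equation set is sketched below: it combines input data properties, dataflow statistics, and configuration parameters to compute dataflow in the Collect and Spill phases of one map task. The formulas are illustrative simplifications, not Hadoop's exact equations:

```python
import math

def map_collect_spill_dataflow(input_records, map_selectivity,
                               avg_record_width, sort_buffer_bytes,
                               spill_percent):
    """Toy white-box model: derive per-phase dataflow of one map task from
    input data properties (input_records), dataflow statistics
    (map_selectivity, avg_record_width), and configuration parameters
    (sort_buffer_bytes, spill_percent). Illustrative, not Hadoop's exact math."""
    out_records = input_records * map_selectivity   # records emitted by map()
    out_bytes = out_records * avg_record_width      # bytes entering the buffer
    spill_threshold = sort_buffer_bytes * spill_percent
    num_spills = math.ceil(out_bytes / spill_threshold)  # buffer flushes to disk
    return {"map_output_records": out_records,
            "map_output_bytes": out_bytes,
            "num_spills": num_spills}

# One map task: 1M input records, selectivity 1.0, 100-byte output records,
# and a 100 MB sort buffer that spills at 80% capacity.
flow = map_collect_spill_dataflow(1_000_000, 1.0, 100, 100 * 2**20, 0.8)
```

Because every quantity is computed from profile statistics and configuration settings, the same equations answer what-if questions for hypothetical configurations without running the job.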

  21. Cost-based Optimizer (CBO) • MapReduce program optimization can be defined as: given a MapReduce program p to be run on input data d and cluster resources r, find the configuration parameter setting c_opt = argmin over c in S of F(p, d, r, c), where F is the cost model represented by the What-if Engine and S is the full space of configuration parameter settings • The CBO addresses this problem by making what-if calls with settings c of the configuration parameters selected through an enumeration and search over S • Once a job profile to input to the What-if Engine is available, the CBO uses a two-step process, discussed next

  22. Cost-based Optimizer • Subspace Enumeration • More efficient search techniques can be developed if the individual parameters in c can be grouped into clusters • Equation 2 states that the globally optimal setting c_opt can be found using a divide-and-conquer approach by: • breaking the higher-dimensional space S into lower-dimensional subspaces S(i) • considering an independent optimization problem in each smaller subspace • composing the optimal parameter settings found per subspace to give the setting c_opt
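
The divide-and-conquer composition above can be sketched as follows. The parameter clusters, parameter names, and the separable cost function are hypothetical examples, not the paper's actual groupings:

```python
def optimize_by_subspaces(subspaces, cost):
    """Divide and conquer over configuration subspaces: optimize each
    lower-dimensional subspace S(i) independently, then compose the
    per-subspace optima into the global setting c_opt. Valid when the
    parameter clusters influence cost (near-)independently."""
    c_opt = {}
    for candidates in subspaces:
        best = min(candidates, key=cost)  # independent search within S(i)
        c_opt.update(best)                # compose into the global setting
    return c_opt

# Hypothetical clusters: a map-side and a reduce-side parameter.
subspaces = [
    [{"io.sort.mb": m} for m in (50, 100, 200)],
    [{"mapred.reduce.tasks": n} for n in (8, 16, 32)],
]

def cost(c):
    # Toy separable cost that prefers io.sort.mb=200 and 16 reduce tasks.
    return (abs(c.get("io.sort.mb", 200) - 200)
            + abs(c.get("mapred.reduce.tasks", 16) - 16))

c_opt = optimize_by_subspaces(subspaces, cost)
```

Searching two 3-point subspaces costs 6 what-if calls instead of the 9 needed for their cross product, and the gap widens quickly with more clusters.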

  23. Cost-based Optimizer • Search Strategy within a Subspace • Search within each enumerated subspace to find the optimal configuration in that subspace • Recursive Random Search (RRS) • RRS provides probabilistic guarantees on how close the setting it finds is to the optimal setting • RRS is fairly robust to deviations of estimated costs from actual performance • RRS scales to a large number of dimensions
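
A sketch of RRS's sample-then-shrink structure is shown below. It captures only the basic idea (uniform exploration followed by re-sampling in a shrinking box around the best point); the constants, shrink schedule, and restart policy are illustrative, not the published algorithm's exact parameters:

```python
import random

def recursive_random_search(cost, bounds, n_samples=50, shrink=0.5,
                            min_width=1e-3, n_restarts=3, seed=0):
    """Sketch of Recursive Random Search: sample the space uniformly at
    random, then recursively re-sample inside a shrinking box centered on
    the best point found, restarting a few times to escape local minima."""
    rng = random.Random(seed)
    best_x, best_cost = None, float("inf")
    for _ in range(n_restarts):
        box = list(bounds)  # current search box, one (lo, hi) per dimension
        while max(hi - lo for lo, hi in box) > min_width:
            points = [tuple(rng.uniform(lo, hi) for lo, hi in box)
                      for _ in range(n_samples)]
            x = min(points, key=cost)
            if cost(x) < best_cost:
                best_x, best_cost = x, cost(x)
            # Shrink the box around the current best point (exploitation),
            # clipping against the original bounds.
            box = [(max(lo, xi - shrink * (hi - lo) / 2),
                    min(hi, xi + shrink * (hi - lo) / 2))
                   for xi, (lo, hi) in zip(x, box)]
    return best_x, best_cost

# Minimize a 2-D quadratic with optimum at (2, -1).
x_opt, f_opt = recursive_random_search(
    lambda x: (x[0] - 2) ** 2 + (x[1] + 1) ** 2,
    bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

In the CBO, `cost` would be a what-if call, so each iteration spends a bounded number of What-if Engine invocations regardless of the subspace's dimensionality.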

  24. Just-in-Time Optimizer

  25. Experimental Evaluation

  26. Experimental Evaluation

  27. Thank you!
