
Hadoop System simulation with Mumak

Fei Dong, Tianyu Feng, Hong Zhang. Dec 8, 2010.


Presentation Transcript


  1. Hadoop System simulation with Mumak Fei Dong, Tianyu Feng, Hong Zhang Dec 8, 2010

  2. Agenda • Objective • Comparison between MRPerf and Mumak • Modifications to Mumak • Results and discussion • Conclusion

  3. Objective • A large-scale distributed system has an enormous number of parameters. • The running time of a user program depends non-linearly on these parameters. • Predict the running time under various settings to help the user choose the “optimal” setting. • We start by varying the most basic parameter: cluster size.

  4. MRPerf and Mumak • MRPerf • Built upon a network simulator • Calculates the task running time and network delay from physical parameters • Implements the Hadoop system in Tcl • Flexible in simulation

  5. MRPerf and Mumak • [Figure: running time versus map slots per node and reduce slots per node, for a 4-node, double-rack data center (chunk size = 64 MB), simulated by MRPerf]

  6. MRPerf and Mumak • [Figure: 4 nodes (chunk size = 64 MB), simulated by Mumak]

  7. MRPerf and Mumak • Mumak • Inherits the JobTracker class from Hadoop and only defines the simulation interface • Uses a trace file to build the cluster topology and job story, then feeds them into the simulator • Can only reproduce previously finished experiments • Designed to verify/debug Hadoop system design • Only simulates the Map/Reduce tasks; no sort phase or shuffle phase

  8. MRPerf and Mumak • The approach taken by MRPerf is better • Takes in parameters to estimate running time • Can make predictions • However, MRPerf simulates its own implementation of Hadoop • The design of Mumak is better • Inherits source code from Hadoop • Easy to understand and to extend • We decided to take the good parts of MRPerf and implement them in the framework of Mumak • Modify the Rumen log to change the parameters • Modify the Mumak source code to add a network simulator

  9. Implementation • Simulate a different cluster size • Edit the Rumen log to change the data replication factor / locality • Modify the topology by adding or deleting nodes, for example going from 2 slave nodes to 6 slave nodes • The job tracker will then assign the tasks to different nodes.
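
The topology change on slide 9 can be illustrated with a minimal Python sketch. It assumes the topology is a nested tree of `name` / `children` entries (root, then racks, then hosts); the actual layout of a Rumen topology file may differ, and the function `resize_rack`, the rack name `rack0`, and the host names `slaveN` are hypothetical names chosen for illustration only.

```python
import copy


def resize_rack(topology, rack_name, n_slaves, prefix="slave"):
    """Return a copy of `topology` whose rack `rack_name` has exactly n_slaves hosts.

    Assumed layout (hypothetical): {"name": ..., "children": [racks]},
    where each rack is {"name": ..., "children": [hosts]}.
    """
    resized = copy.deepcopy(topology)  # leave the original trace data untouched
    for rack in resized["children"]:
        if rack["name"] == rack_name:
            hosts = list(rack["children"] or [])[:n_slaves]  # delete extra nodes
            while len(hosts) < n_slaves:  # add missing nodes
                hosts.append({"name": f"{prefix}{len(hosts) + 1}", "children": None})
            rack["children"] = hosts
            return resized
    raise KeyError(f"rack not found: {rack_name}")


# Example: grow a single-rack cluster from 2 slave nodes to 6.
topology = {
    "name": "<root>",
    "children": [
        {
            "name": "rack0",
            "children": [
                {"name": "slave1", "children": None},
                {"name": "slave2", "children": None},
            ],
        }
    ],
}
bigger = resize_rack(topology, "rack0", 6)
```

Working on a deep copy keeps the recorded topology available, so the same trace can be replayed against several cluster sizes.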

  10. Implementation • Simulate network delay • We defined a simple network simulator interface • Modified the source code of Mumak to add in the network delay • In practice, the network delay can be ignored
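
As a rough illustration of what a "simple network simulator interface" could look like, here is a minimal Python sketch. It is not the interface the authors added to Mumak (which is written in Java); every name here (`NetworkSimulator`, `NoDelay`, `FixedBandwidth`, `simulated_task_ms`) and the latency-plus-bandwidth delay model are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class NetworkSimulator(ABC):
    """Hypothetical interface: estimate the delay of moving data between hosts."""

    @abstractmethod
    def transfer_delay_ms(self, src, dst, n_bytes):
        """Return the estimated transfer time in milliseconds."""


class NoDelay(NetworkSimulator):
    """Models the case where the network delay is ignored."""

    def transfer_delay_ms(self, src, dst, n_bytes):
        return 0.0


class FixedBandwidth(NetworkSimulator):
    """Assumed model: constant latency plus size / bandwidth for remote transfers."""

    def __init__(self, bytes_per_ms, latency_ms=0.0):
        self.bytes_per_ms = bytes_per_ms
        self.latency_ms = latency_ms

    def transfer_delay_ms(self, src, dst, n_bytes):
        if src == dst:  # data-local task: nothing crosses the network
            return 0.0
        return self.latency_ms + n_bytes / self.bytes_per_ms


def simulated_task_ms(trace_runtime_ms, net, src, dst, n_bytes):
    """Task time = runtime recorded in the trace + simulated network delay."""
    return trace_runtime_ms + net.transfer_delay_ms(src, dst, n_bytes)


net = FixedBandwidth(bytes_per_ms=1000.0, latency_ms=1.0)
remote = simulated_task_ms(5000.0, net, "slave1", "slave2", 2000)
local = simulated_task_ms(5000.0, net, "slave1", "slave1", 2000)
ignored = simulated_task_ms(5000.0, NoDelay(), "slave1", "slave2", 2000)
```

Keeping the delay model behind one interface lets the simulator swap between ignoring the network and a more detailed model without touching the task-scheduling code.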

  11. Results and Discussion

  12. Results and Discussion

  13. Results and Discussion

  14. Results and Discussion • Limitations and future work • Sort phase time is not included • Only a single-rack topology was used • Prediction is not always consistent for the same job with the same configuration

  15. Conclusion • Our objective is to predict the running time with different parameters • We took the methods of MRPerf and implemented them on Mumak • For more flexible and accurate prediction, more modifications to Mumak are needed • Become independent of the trace file • Solve the instability problem

  16. Questions?
