MapReduce By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe
Paper MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04). Also appears in Communications of the ACM (2008).
Authors – Jeffrey Dean • Ph.D. in Computer Science – University of Washington • Google Fellow in Systems and Infrastructure Group • ACM Fellow • Research Areas: Distributed Systems and Parallel Computing
Authors – Sanjay Ghemawat • Ph.D. in Computer Science – Massachusetts Institute of Technology • Google Fellow • Research Areas: Distributed Systems and Parallel Computing
Large Computations • Calculate 30*50. Easy? • 30*50 + 31*51 + 32*52 + 33*53 + .... + 40*60. A little harder?
Large Computations • Simple computation, but a huge data set • Real-world example of a large computation: • 20+ billion web pages * ~20 kB per web page • One computer reads 30-35 MB/sec from disk • Nearly four months to read the web
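A quick back-of-the-envelope sketch of these numbers (assuming 32.5 MB/sec, the midpoint of the range above):

# Back-of-the-envelope check of the numbers on this slide.
pages = 20e9             # 20+ billion web pages
page_kb = 20             # roughly 20 kB per page
mb_per_sec = 32.5        # assumed midpoint of the 30-35 MB/sec read speed

total_mb = pages * page_kb / 1024        # about 4 * 10^8 MB, i.e. ~400 TB
days = total_mb / mb_per_sec / 86400     # seconds -> days
print(f"~{total_mb / 1024**2:.0f} TB, ~{days:.0f} days")   # ~373 TB, ~139 days

Around four and a half months of single-machine reading, which matches the slide's "nearly four months" order of magnitude.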
Good News: Distributed Computation • Parallelize tasks across a distributed computing environment • The web-reading problem above is solved in about 3 hours with 1,000 machines
Though, the bad news is... • Complexities in Distributed Computing • How to parallelize the computation? • Coordinate with other nodes • Handling failures • Preserve bandwidth • Load balancing
MapReduce to the Rescue • A platform that hides the messy details of distributed computing: parallelization, fault tolerance, data distribution, and load balancing • A programming model • An implementation
MapReduce: Programming Model • Example: Word count
Document: "the quick | brown fox | the fox ate | the mouse"
Mapped: (the, 1), (quick, 1), (brown, 1), (fox, 1), (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1)
Reduced: (the, 3), (quick, 1), (brown, 1), (fox, 2), (ate, 1), (mouse, 1)
Programming Model: Example • Word count using MapReduce
Map input (one split per Map task): "the quick brown fox" | "the fox ate" | "the mouse"
Map output (intermediate pairs): (the, 1), (quick, 1), (brown, 1), (fox, 1) | (the, 1), (fox, 1), (ate, 1) | (the, 1), (mouse, 1)
Map output, grouped by key, becomes the Reduce input
Reduce output: (the, 3), (quick, 1), (brown, 1), (fox, 2), (ate, 1), (mouse, 1)
The Map Operation • Input: a text file (key: document name, value: document contents) • Output: intermediate key/value pairs, e.g. ("fox", "1")

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
The Reduce Operation • Input: a word and its list of counts from Map, e.g. ("fox", {"1", "1"}) • Output: the accumulated count, e.g. ("fox", "2")

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
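To make the two operations concrete, here is a minimal single-process Python sketch of the word-count example; the names map_fn, reduce_fn, and run_word_count, and the in-memory grouping step, are illustrative stand-ins for what the real distributed runtime does:

from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit an intermediate (word, "1") pair for every word in the document.
    for word in contents.split():
        yield (word, "1")

def reduce_fn(word, counts):
    # Sum all the counts emitted for one word.
    return str(sum(int(c) for c in counts))

def run_word_count(documents):
    # Shuffle step: group intermediate values by key, as the runtime would.
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            grouped[word].append(count)
    # Reduce step: one call per distinct intermediate key.
    return {word: reduce_fn(word, counts) for word, counts in grouped.items()}

docs = {"d1": "the quick brown fox", "d2": "the fox ate the mouse"}
print(run_word_count(docs))   # {'the': '3', 'quick': '1', ..., 'fox': '2', ...}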
MapReduce in Practice • Reverse Web-Link Graph • [Figure: five source web pages, where Sources 1, 3, 4, and 5 link to one target page (my web page)]
MapReduce in Practice Contd. • Reverse Web-Link Graph
Map emits a (target, source) pair for each link found in the source web pages: ("My Web", "Source 1"), ("Not My Web", "Source 2"), ("My Web", "Source 3"), ("My Web", "Source 4"), ("My Web", "Source 5")
Reduce concatenates the sources pointing to each target: ("My Web", {"Source 1", "Source 3", .....})
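The same single-process sketch style works for this example too; map_fn and reduce_fn below are illustrative names, not the paper's code:

from collections import defaultdict

def map_fn(source_page, outgoing_links):
    # For each link, invert the edge: emit (target, source).
    for target in outgoing_links:
        yield (target, source_page)

def reduce_fn(target, sources):
    # Concatenate all sources that point at this target.
    return (target, sorted(sources))

links = {"Source 1": ["My Web"], "Source 2": ["Not My Web"],
         "Source 3": ["My Web"], "Source 4": ["My Web"], "Source 5": ["My Web"]}
grouped = defaultdict(list)
for source, targets in links.items():
    for target, src in map_fn(source, targets):
        grouped[target].append(src)
print([reduce_fn(t, s) for t, s in grouped.items()])
# [('My Web', ['Source 1', 'Source 3', 'Source 4', 'Source 5']), ('Not My Web', ['Source 2'])]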
Implementation: Execution Overview
(1) The user program forks the master and the worker processes
(2) The master assigns map tasks and reduce tasks to idle workers
(3) Each map worker reads its input split (Split 0 ... Split 4)
(4) Map workers write intermediate files to their local disks
(5) Reduce workers remote-read the intermediate files
(6) Reduce workers write the final output files (O/P File 0, O/P File 1)
MapReduce to the Rescue • Complexities in Distributed Computing, to be solved • Automatic parallelization using Map & Reduce • Coordinate with other nodes • Handling failures • Preserve bandwidth • Load balancing
Implementation: Parallelization • Restricted programming model • User-specified Map & Reduce functions • 1000s of workers run the same user-defined Map/Reduce instructions on different parts of the data
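A toy illustration of this idea; a sketch only, in which multiprocessing.Pool stands in for the cluster of workers and map_word_count is an assumed user-defined Map function:

from collections import Counter
from multiprocessing import Pool

def map_word_count(split):
    # The same user-defined Map logic, applied to one input split.
    return Counter(split.split())

if __name__ == "__main__":
    splits = ["the quick brown fox", "the fox ate the mouse"]
    with Pool(processes=2) as pool:
        partials = pool.map(map_word_count, splits)   # Map phase, in parallel
    print(sum(partials, Counter()))                   # Reduce phase, merged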
MapReduce to the Rescue • Complexities in Distributed Computing, solving.. • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Handling failures • Preserve bandwidth • Load balancing
Implementation: Coordination • Master data structure keeps the state of every map and reduce task • The master pushes information (meta-data), such as the locations of intermediate map output, from map workers to reduce workers
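A minimal sketch of the bookkeeping such a master keeps; the class and method names below are illustrative, not the paper's code. Per the paper, the master stores each task's state and worker, plus the locations of completed map output that it forwards to reduce workers:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    state: str = "idle"                 # idle | in-progress | completed
    worker: Optional[str] = None        # id of the worker running the task
    output_locations: List[str] = field(default_factory=list)

class Master:
    def __init__(self, n_map, n_reduce):
        self.map_tasks = [Task() for _ in range(n_map)]
        self.reduce_tasks = [Task() for _ in range(n_reduce)]

    def assign(self, task, worker_id):
        task.state, task.worker = "in-progress", worker_id

    def complete_map(self, task, locations):
        # This meta-data is what gets pushed on to the reduce workers.
        task.state, task.output_locations = "completed", locations

master = Master(n_map=5, n_reduce=2)
master.assign(master.map_tasks[0], "worker-1")
master.complete_map(master.map_tasks[0], ["/local/disk/map-0-part-0"])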
MapReduce to the Rescue • Complexities in Distributed Computing, solving.. • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Fault tolerance (re-execution) & backup tasks • Preserve bandwidth • Load balancing
Implementation: Fault Tolerance • No response from a worker? • An in-progress Map or Reduce task: re-execute • A completed Map task: re-execute, because its output is stored on the failed machine's local disk • A completed Reduce task: leave untouched, because its output is already in the global file system • Master failure (unlikely): restart the computation
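A sketch of that re-execution rule, assuming each task is tracked as a dict with illustrative "kind", "state", and "worker" fields:

def handle_worker_failure(tasks, failed_worker):
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" and task["state"] in ("in-progress", "completed"):
            # Completed map output lives on the dead machine's local disk,
            # so completed and in-progress map tasks both go back to idle.
            task["state"], task["worker"] = "idle", None
        elif task["kind"] == "reduce" and task["state"] == "in-progress":
            # Completed reduce output is already in the global file system,
            # so only in-progress reduce tasks are rescheduled.
            task["state"], task["worker"] = "idle", None

tasks = [{"kind": "map", "state": "completed", "worker": "w1"},
         {"kind": "reduce", "state": "completed", "worker": "w1"}]
handle_worker_failure(tasks, "w1")
print(tasks)   # the map task is idle again; the reduce task stays completed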
Implementation: Back Up Tasks • "Straggler": a machine that takes an unusually long time to complete one of the last tasks in the computation • Solution: redundant execution • Near the end of a phase, spawn backup copies of the remaining in-progress tasks • The copy that finishes first "wins"
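A toy sketch of the "first copy wins" idea, using two thread-pool futures to stand in for a primary and a backup copy of one task (all names here are illustrative):

import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(copy_name):
    # One copy may be a straggler and sleep much longer than the other.
    time.sleep(random.uniform(0.1, 1.0))
    return f"result from {copy_name}"

with ThreadPoolExecutor(max_workers=2) as pool:
    copies = [pool.submit(run_task, name) for name in ("primary", "backup")]
    done, _ = wait(copies, return_when=FIRST_COMPLETED)
    print(next(iter(done)).result())   # whichever copy finishes first wins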
MapReduce to the Rescue • Complexities in Distributed Computing, solving.. • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Fault tolerance (re-execution) & backup tasks • Saves bandwidth through locality • Load balancing
Implementation: Optimize Locality • The same input data is replicated on different machines • If a task's input data is stored locally, the worker does not need to fetch it from other nodes
MapReduce to the Rescue • Complexities in Distributed Computing, solved • Automatic parallelization using Map & Reduce • Coordinate nodes using a master node • Fault tolerance & backup tasks • Saves bandwidth through locality • Load balancing through granularity
Implementation: Load Balancing • Fine-granularity tasks: many more map tasks than machines • One worker runs many tasks • Idle workers are quickly assigned new work, as sketched below
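A toy sketch of that queue-based balancing, with illustrative names: twenty fine-grained tasks shared by three workers, so a fast worker simply pulls more tasks:

from queue import Empty, Queue
from threading import Thread

tasks = Queue()
for i in range(20):                       # many more tasks ...
    tasks.put(f"split-{i}")

def worker(name):
    while True:
        try:
            split = tasks.get_nowait()    # an idle worker grabs the next task
        except Empty:
            return
        print(f"{name} processing {split}")

threads = [Thread(target=worker, args=(f"w{i}",)) for i in range(3)]  # ... than workers
for t in threads:
    t.start()
for t in threads:
    t.join()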
Extensions • Partitioning • Combining • Skipping bad records • Debugging via local execution • Counters
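Two of these extensions are easy to sketch. The paper's default partitioning function is hash(key) mod R, and a combiner is a local reduce run on each map worker's output before it crosses the network (the Python function names below are illustrative):

from collections import Counter

R = 4   # number of reduce tasks

def partition(key, R):
    # Decides which of the R reduce tasks receives this intermediate key.
    return hash(key) % R

def combine(pairs):
    # Local pre-aggregation on the map worker, e.g. merging ("the", 1)
    # pairs so fewer bytes are shuffled to the reducers.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

print(partition("fox", R))                            # e.g. 3
print(combine([("the", 1), ("the", 1), ("fox", 1)]))  # [('the', 2), ('fox', 1)]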
Performance – Back Up Tasks • Normal execution: 891 s • Without backup tasks: 1283 s, a 44% increase in completion time • Very long tail: the straggler tasks take over 300 s to finish
Performance – Fault Tolerance • Normal execution: 891 s • With 200 worker processes killed: 933 s, only a 5% increase in completion time • Quick failure recovery
MapReduce at Google • Clustering for Google News and Google Product Search • Google Maps • Locating addresses • Rendering map tiles • Google PageRank • Localized search
Current Trends – Hadoop MapReduce • Apache Hadoop MapReduce • Hadoop Distributed File System (HDFS) • Used in, among others, • Yahoo! Search • Facebook • Amazon • Twitter
Current Trends – Hadoop MapReduce • Higher-level languages/systems based on Hadoop, e.g. Pig and Hive • Amazon Elastic MapReduce • Available to the general public • Processes data in the cloud
Conclusion • A large variety of problems can be expressed as Map & Reduce computations • The restricted programming model makes it easy to hide the details of distributed computing • Achieves scalability & programming efficiency