MapReduce: Simplified Data Processing on Large Clusters
Google’s Experience and Large Scale Indexing
Presented by Chris Moore
Contents
• Abstract
• Introduction
• Programming Model
• Implementation
• Refinements
• Performance
• Experience
• Google’s Experience with MapReduce
• Improvements with MapReduce to Search Indexing
• Related Work
• Conclusion
Abstract
• MapReduce: programming model for processing and generating large data sets
• Map: processes a key / value pair to generate a set of intermediate key / value pairs
• Reduce: merges all intermediate values associated with the same intermediate key
• Easy utilization of parallel and distributed computing
• Hundreds of MapReduce programs implemented; upwards of a thousand jobs run on Google’s clusters every day
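The Map and Reduce roles above can be sketched with a word count. This is a minimal single-process illustration, not Google's C++ library: the names `map_fn`, `reduce_fn`, and `map_reduce` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (document name, document text) -> list of intermediate (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: merge all intermediate counts that share the same word."""
    return sum(values)


def map_reduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)  # group values by intermediate key
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}


# Made-up sample input: (document name, document text) pairs.
docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
counts = map_reduce(docs, map_fn, reduce_fn)
```

Here `counts["the"]` is 3 and every other word maps to 1. The user writes only `map_fn` and `reduce_fn`; everything else belongs to the driver, which is the part a real library would parallelize.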
Introduction
• Issue: how to handle very large data sets?
• Must parallelize the computation, distribute the data, and handle failures
• Map / Reduce allows for easier programming while a library handles the above issues
• Simple, powerful interface
• Automatic parallelization and distribution of large-scale computations
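To make "a library handles the above issues" concrete, the sketch below runs tasks in parallel and re-executes any that fail, so the user-supplied function never deals with failures itself. It uses a local thread pool as a stand-in for a cluster; `run_tasks`, `flaky_word_count`, and the simulated failure are hypothetical names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasks(task_fn, splits, workers=4, max_attempts=3):
    """Run task_fn on every input split in parallel, re-executing failed tasks."""
    results = {}
    pending = list(range(len(splits)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_attempts):
            futures = {i: pool.submit(task_fn, splits[i]) for i in pending}
            pending = []
            for i, future in futures.items():
                try:
                    results[i] = future.result()
                except Exception:
                    pending.append(i)  # task failed: schedule it again
            if not pending:
                break
    if pending:
        raise RuntimeError(f"tasks {pending} failed after {max_attempts} attempts")
    return [results[i] for i in range(len(splits))]


attempts = {}


def flaky_word_count(split):
    """User code: counts words, and fails once on one split to simulate a bad machine."""
    attempts[split] = attempts.get(split, 0) + 1
    if split == "b c" and attempts[split] == 1:
        raise IOError("simulated machine failure")
    return len(split.split())


totals = run_tasks(flaky_word_count, ["a", "b c", "d e f"])
```

The failed split is simply run a second time, and the caller only ever sees the complete result list.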
Google’s Experience With MapReduce
• Extraction of data for reports on popular queries
  • Google Zeitgeist
• Extracting properties of web pages
  • Geographical locations of web pages for localized search
• Clustering problems for Google News and Shopping
• Large-scale machine learning problems and graph computations
Google Search: Large Scale Indexing
• Production indexing system
  • Produces the data structures used for searches
  • Completely rewritten with MapReduce
• What it does:
  • Takes as input the documents gathered by the crawler: more than 20 TB of raw data
  • Indexing process: a sequence of 5-10 MapReduce operations
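The idea of indexing as a sequence of MapReduce operations can be sketched by chaining two passes, where the output of the first is the input of the second. This is a toy illustration only, not Google's actual pipeline: the two passes, all function names, and the three tiny "crawled" documents are assumptions.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver; returns (key, reduced value) pairs sorted by key."""
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    return [(k, reduce_fn(k, vs)) for k, vs in sorted(groups.items())]


# Pass 1: build an inverted index, word -> sorted list of document ids.
def index_map(doc_id, text):
    return [(word, doc_id) for word in set(text.lower().split())]


def index_reduce(word, doc_ids):
    return sorted(doc_ids)


# Pass 2: consume pass 1's output and group words by how many documents contain them.
def freq_map(word, doc_ids):
    return [(len(doc_ids), word)]


def freq_reduce(n_docs, words):
    return sorted(words)


# Made-up stand-in for crawled documents.
crawled = [("d1", "map reduce"), ("d2", "reduce index"), ("d3", "index reduce")]
inverted = map_reduce(crawled, index_map, index_reduce)
by_frequency = map_reduce(inverted, freq_map, freq_reduce)
```

Each pass stays small and does one job, which is the property the simpler indexing code relies on.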
Improvements on the Indexing System
• Indexing code is simpler
  • One phase of the computation dropped from approx. 3800 lines of C++ to approx. 700 with MapReduce
• Good performance
  • Fast enough to keep conceptually unrelated computations separate
  • No need to mix them together just to avoid extra passes over the data
• Easier to operate
  • MapReduce handles issues without operator intervention
  • Machine failures, slow machines, networking hiccups
Conclusion of Part Six
• Google has seen many benefits and improvements from MapReduce since 2003
• MapReduce has completely changed the way Google handles its search indexing
• Source: MapReduce: Simplified Data Processing on Large Clusters, by Jeffrey Dean and Sanjay Ghemawat