Ghana Map Reduce - an overview
AGENDA • Understanding MapReduce • Map Reduce - An Introduction • Word count – default • Word count – custom
Map Reduce • Programming model to process large datasets • Supported languages for MR • Java • Ruby • Python • C++ • MapReduce programs are inherently parallel. • More data means more machines to analyze it. • No need to change anything in the code.
Understanding MapReduce • Start with WORDCOUNT example • “Do as I say, not as I do”
Understanding MapReduce -pseudo code
define wordCount as Map<String,long>;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
• This works as long as the number of documents to process is not very large
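The single-machine pseudo code can be sketched in Python roughly as follows. The whitespace tokenizer and the lowercasing are assumptions made for illustration; the slides do not say how `tokenize` works.

```python
from collections import Counter


def tokenize(document):
    """Assumed tokenizer: lowercase the text and split on whitespace."""
    return document.lower().split()


def word_count(document_set):
    """Count every token across all documents on a single machine."""
    counts = Counter()
    for document in document_set:
        for token in tokenize(document):
            counts[token] += 1
    return counts


documents = ["do as I say", "not as I do"]
print(word_count(documents))
```

Everything lives in one in-memory `Counter`, which is exactly what stops scaling once the document set grows.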
Understanding MapReduce -pseudo code • Spam filter • Millions of emails • Word count for analysis • Working on a single computer is time consuming • Rewrite the program to count across multiple machines
Understanding MapReduce -pseudo code • How do we attain parallel computing? • Each machine processes a fraction of the documents • Combine the results from all the machines
Understanding MapReduce -pseudo code STAGE 1
define wordCount as Map<String,long>;
for each document in documentSUBSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
Understanding MapReduce -pseudo code STAGE 2
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
display(totalWordCount);
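A minimal Python sketch of the two stages, with the "machines" simulated as plain lists in one process. The function names `stage_one` and `stage_two` and the sample documents are hypothetical; `Counter.update` stands in for `multisetAdd`.

```python
from collections import Counter


def stage_one(document_subset):
    """Runs on each machine: count words in that machine's share of documents."""
    counts = Counter()
    for document in document_subset:
        counts.update(document.split())  # adds 1 per word occurrence
    return counts


def stage_two(partial_counts):
    """Runs on a single machine: add up the per-machine counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # adds counts together, like multisetAdd
    return total


# Three simulated machines, each holding a subset of the documents.
subsets = [["spam spam eggs"], ["eggs ham"], ["spam ham ham"]]
partials = [stage_one(subset) for subset in subsets]
total_word_count = stage_two(partials)
print(total_word_count)
```

In a real cluster the `partials` would travel over the network to the stage-2 machine, which is where the problems discussed next come from.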
Understanding MapReduce -pseudo code [Diagram: Master, Documents, and machines Comp-1, Comp-2, Comp-3, Comp-4]
Understanding MapReduce -pseudo code • Problems • STAGE 1 • Document segregation has to be well defined • Bottleneck in network transfer • Data-intensive processing • Not computationally intensive • So it is better to store the files on the processing machines • BIGGEST FLAW • Storing the words and counts in memory • Disk-based hash table implementation needed [Diagram: Master, Documents, and machines Comp-1, Comp-2, Comp-3, Comp-4]
Understanding MapReduce -pseudo code • Problems • STAGE 2 • Phase 2 has only one machine • Bottleneck • Phase 1 is highly distributed though • Make phase 2 distributed as well • Needs changes in phase 1 • Partition the phase-1 output (say, based on the first character of the word) • We then have 26 machines in phase 2 • The single disk-based hash table now becomes 26 disk-based hash tables • wordcount-a, wordcount-b, wordcount-c, ... [Diagram: Master, Documents, and machines Comp-1, Comp-2, Comp-3, Comp-4]
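Partitioning by first character might look like the sketch below. The names `partition` and `partition_counts` are hypothetical, and sending words that do not start with a-z to bucket 0 is an assumption; the slides do not say what happens to such words.

```python
import string
from collections import Counter


def partition(word):
    """Return a bucket index 0..25 from the word's first letter.

    Assumption: words not starting with a-z all go to bucket 0.
    """
    index = string.ascii_lowercase.find(word[0].lower())
    return index if index >= 0 else 0


def partition_counts(word_count, buckets=26):
    """Split one phase-1 result into one partial count per phase-2 machine."""
    parts = [Counter() for _ in range(buckets)]
    for word, count in word_count.items():
        parts[partition(word)][word] += count
    return parts


parts = partition_counts(Counter({"apple": 2, "avocado": 1, "banana": 3}))
print(parts[0], parts[1])
```

First-letter partitioning is easy to explain but uneven, since far more words start with "s" than with "x"; it is used here only because the slides suggest it.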
Understanding MapReduce -pseudo code [Diagram: Master and Documents; phase-1 machines Comp-1, Comp-2, Comp-3, Comp-4; phase-2 machines Comp-10, Comp-20, Comp-30, ..., Comp-40]
Understanding MapReduce -pseudo code • After phase-1 • From comp-1 • WordCount-A -> comp-10 • WordCount-B -> comp-20 • . • . • . • Each machine in phase 1 shuffles its output to different machines in phase 2
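The shuffle step can be simulated in one process as below. This is a sketch only: the `inbox` dictionary stands in for network transfer, and `shuffle`, `phase_two`, `first_letter` and the sample counts are hypothetical names and data.

```python
from collections import Counter, defaultdict


def shuffle(phase_one_outputs, partition):
    """Route every (word, count) pair to the phase-2 machine that owns it."""
    inbox = defaultdict(list)  # phase-2 machine id -> list of (word, count)
    for word_count in phase_one_outputs:
        for word, count in word_count.items():
            inbox[partition(word)].append((word, count))
    return inbox


def phase_two(pairs):
    """Runs on each phase-2 machine: total the counts it received."""
    total = Counter()
    for word, count in pairs:
        total[word] += count
    return total


def first_letter(word):
    """Partition key: the lowercase first character of the word."""
    return word[0].lower()


comp_1 = Counter({"apple": 2, "bat": 1})
comp_2 = Counter({"apple": 1, "ball": 4})
inbox = shuffle([comp_1, comp_2], first_letter)
results = {machine: phase_two(pairs) for machine, pairs in inbox.items()}
print(results)
```

Because every occurrence of a given word lands on the same phase-2 machine, each machine can finish its totals without talking to the others.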
Word Count -- retrospection • This is getting complicated • Store files where they are being processed • Write a disk-based hash table to get around RAM limitations • Partition the phase-1 output • Shuffle the phase-1 output and send it to the appropriate reducer
Word Count -- retrospection • This is a lot of work for a word count • We haven’t even touched fault tolerance • What if comp-1 or comp-10 fails? • So we need a framework to take care of all these things • We concentrate only on the business logic
Understanding MapReduce -pseudo code [Diagram: Master; Documents stored in HDFS; MAPPER machines Comp-1, Comp-2, Comp-3, Comp-4 produce interim output; Partitioning and Shuffling send it to REDUCER machines Comp-10, Comp-20, Comp-30, ..., Comp-40]
MapReduce • Mapper • Reducer • The Mapper filters and transforms the input • The Reducer collects the Mapper output and aggregates it • Extensive research went into arriving at this two-phase strategy
MapReduce • Mapper, Reducer, Partitioner, Shuffling • Together they form a common structure for data processing
MapReduce - WordCount • Mapper • <key,words_per_line> : Input • <word,1> : output • Reducer • <word,list(1)> : Input • <word,count(list(1))> : Output
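The key/value shapes above can be tried out with a small in-process simulation. This is not Hadoop code: sorting and grouping stand in for the framework's shuffle, the input key is simply the line index (an assumption), and `run_word_count` is a hypothetical driver.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, line):
    """Input <key, line>; output one <word, 1> pair per word. The key is ignored."""
    for word in line.split():
        yield word, 1


def reducer(word, ones):
    """Input <word, list(1)>; output <word, count>."""
    return word, sum(ones)


def run_word_count(lines):
    """Hypothetical driver: map every line, group by word, then reduce."""
    mapped = [pair for key, line in enumerate(lines) for pair in mapper(key, line)]
    mapped.sort(key=itemgetter(0))  # stand-in for the framework's shuffle/sort
    return dict(
        reducer(word, [one for _, one in group])
        for word, group in groupby(mapped, key=itemgetter(0))
    )


print(run_word_count(["do as I say", "not as I do"]))
```

Note that the mapper never counts anything itself; it only emits `1` per word, and all the adding happens in the reducer.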
MapReduce • As said, don’t store the data in memory • So keys and values regularly have to be written to disk. • They must be serialized. • Hadoop provides its own way of serializing and deserializing • Any class used as a key or value has to implement the Writable interface.
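To show what "serialized" means here, the sketch below is a rough Python analogue, not Hadoop's API. Hadoop's `Writable` is a Java interface with `write` and `readFields` methods; the class `WordCountRecord`, its byte layout, and the method name `read_fields` are hypothetical and chosen only to mirror that idea.

```python
import io
import struct


class WordCountRecord:
    """Hypothetical key/value record that can write itself to, and read itself from, bytes."""

    def __init__(self, word="", count=0):
        self.word = word
        self.count = count

    def write(self, out):
        """Serialize: 4-byte length, UTF-8 word bytes, then an 8-byte count."""
        data = self.word.encode("utf-8")
        out.write(struct.pack(">I", len(data)))
        out.write(data)
        out.write(struct.pack(">q", self.count))

    def read_fields(self, inp):
        """Deserialize the fields in the same order they were written."""
        (length,) = struct.unpack(">I", inp.read(4))
        self.word = inp.read(length).decode("utf-8")
        (self.count,) = struct.unpack(">q", inp.read(8))


buffer = io.BytesIO()  # stands in for a file on disk
WordCountRecord("hadoop", 3).write(buffer)
buffer.seek(0)
restored = WordCountRecord()
restored.read_fields(buffer)
print(restored.word, restored.count)
```

The point is the round trip: a key or value must be able to turn itself into bytes and back so the framework can spill it to disk or send it to another machine.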
Word Count – default • Let’s try to execute the following command • hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount • hadoop jar hadoop-examples-0.20.2-cdh3u4.jar wordcount <input> <output> • What does this code do?
CUSTOM WORD-COUNT • Switch to eclipse