
Cloud Computing MapReduce (2)



Presentation Transcript


  1. Cloud Computing MapReduce (2) • Keke Chen

  2. Outline • Hadoop streaming example • Hadoop Java API • Framework • Important APIs • Mini-project

  3. A nice book • Hadoop: The Definitive Guide • You can read it online from the campus network: OhioLINK ebook center, Safari online

  4. Hadoop streaming • A simple and powerful programming interface • Application developers do not need to learn the Hadoop Java APIs • Good for simple, ad hoc tasks

  5. Note: • Map/Reduce uses the local Linux file system for processing and for hosting temporary data • HDFS is used to host application data • [Figure: a node holding both HDFS and the local file system]

  6. Hadoop streaming
     • http://hadoop.apache.org/common/docs/current/streaming.html
     • /usr/local/hadoop/bin/hadoop jar \
           /usr/local/hadoop/hadoop-streaming-1.0.3.jar \
           -input myInputDirs -output myOutputDir \
           -mapper myMapper -reducer myReducer
     • The reducer can be omitted: -reducer NONE
     • myMapper and myReducer can be any executable
     • The mapper and the reducer read from stdin and write to stdout
     • Files in myInputDirs are fed to the mapper on stdin
     • The mapper's output, sorted by key, becomes the reducer's input
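The stdin/stdout data flow on this slide can be tried without a cluster. The sketch below is only a local approximation: `run_streaming_locally`, `lowercase_words`, and `count_lines` are hypothetical names, an in-memory sort stands in for Hadoop's shuffle, and a single reducer is assumed.

```python
def lowercase_words(lines):
    """Hypothetical mapper: emit one lowercase word per output line."""
    for line in lines:
        for word in line.split():
            yield word.lower()


def count_lines(lines):
    """Hypothetical reducer: emit the number of input lines, like `wc -l`."""
    yield str(sum(1 for _ in lines))


def run_streaming_locally(lines, mapper, reducer=None):
    """Approximate a streaming job with a single reducer, in memory."""
    map_output = list(mapper(lines))
    if reducer is None:
        # Map-only job: the mapper output is the final output.
        return map_output
    # Hadoop sorts the map output by key before the reducer reads it.
    return list(reducer(sorted(map_output)))


sample = ["Hello world", "hello Hadoop"]
print(run_streaming_locally(sample, lowercase_words))
print(run_streaming_locally(sample, lowercase_words, count_lines))
```

In a real job the two functions would be separate executables reading `sys.stdin`, and each reducer would see only its own partition of the keys.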

  7. Packaging files with job submission
     • /usr/local/hadoop/bin/hadoop jar \
           /usr/local/hadoop/hadoop-streaming-1.0.3.jar \
           -input "/user/hadoop/inputdata" \
           -output "/user/hadoop/outputdata" \
           -mapper "python myPythonScript.py myDictionary.txt" \
           -reducer "/bin/wc" \
           -file myPythonScript.py \
           -file myDictionary.txt
     • myDictionary.txt is the input parameter for the script
     • -file is good for small files
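The slide does not show what myPythonScript.py does with myDictionary.txt. As one possible illustration, the sketch below assumes the dictionary holds one word per line and that the mapper keeps only words found in it; `load_dictionary` and `map_lines` are hypothetical names, and the filtering behavior is an assumption, not the course's script.

```python
import os
import tempfile


def load_dictionary(path):
    """Read one dictionary word per line into a set."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def map_lines(lines, dictionary):
    """Emit 'word<TAB>1' only for words found in the dictionary."""
    for line in lines:
        for word in line.split():
            if word in dictionary:
                yield "%s\t1" % word


# Local demonstration with a temporary dictionary file.
# In a streaming task the path would come from sys.argv[1]
# and the lines from sys.stdin.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hadoop\nhdfs\n")
dictionary = load_dictionary(tmp.name)
os.remove(tmp.name)

for record in map_lines(["hadoop stores data in hdfs"], dictionary):
    print(record)
```

Because `-file` ships the dictionary into the task's working directory, the script can open it by its bare file name.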

  8. Using Hadoop library classes
     • hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
           -D mapred.reduce.tasks=12 \
           -input myInputDirs \
           -output myOutputDir \
           -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
           -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
           -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

  9. Large files and archives • Upload large files to HDFS first • Use the -files option in streaming, which downloads the files to the local working directory • -files hdfs://host:fs_port/user/testfile.txt#testlink • -archives hdfs://host:fs_port/user/testfile.jar#testlink • If cache1.txt and cache2.txt are in testfile.jar, they are then available locally as testlink/cache1.txt and testlink/cache2.txt

  10. Wordcount • Problem: count word frequencies over a large document collection • Implement the mapper and the reducer as separate Python scripts • Some good Python tutorials are at http://wiki.python.org/

  11. Mapper.py
        import sys

        # Emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            line = line.strip()
            words = line.split()
            for word in words:
                print('%s\t1' % word)

  12. Reducer.py
        import sys

        word2count = {}
        for line in sys.stdin:
            line = line.strip()
            try:
                # Both the unpacking and int() raise ValueError on malformed lines.
                word, count = line.split('\t', 1)
                word2count[word] = word2count.get(word, 0) + int(count)
            except ValueError:
                pass

        for word in word2count:
            print('%s\t%s' % (word, word2count[word]))
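The reducer on this slide keeps every word in a dictionary until the input ends. Since streaming delivers the reducer's input sorted by key, an alternative is to sum one word at a time with `itertools.groupby`. This is a sketch of that alternative technique, not the slide's method; `parse` and `reduce_sorted` are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter


def parse(records):
    """Yield (word, count) pairs, skipping malformed records."""
    for record in records:
        word, sep, count = record.rstrip("\n").partition("\t")
        if not sep:
            continue
        try:
            yield word, int(count)
        except ValueError:
            continue


def reduce_sorted(records):
    """Sum counts per word; assumes records are already sorted by word."""
    for word, group in groupby(parse(records), key=itemgetter(0)):
        yield "%s\t%d" % (word, sum(count for _, count in group))


sorted_records = ["a\t1", "a\t1", "b\t1", "bad line", "c\tx", "c\t1"]
print(list(reduce_sorted(sorted_records)))
```

In a streaming job, `reduce_sorted(sys.stdin)` would be iterated and each result printed. Memory use then depends on one group at a time instead of the whole vocabulary.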

  13. Running wordcount
     • hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
           -mapper "python mapper.py" \
           -reducer "python reducer.py" \
           -input text -output output2 \
           -file /localpath/mapper.py -file /localpath/reducer.py

  14. Running wordcount
     • hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
           -mapper "python mapper.py" \
           -reducer "python reducer.py" \
           -input text -output output2 \
           -file mapper.py -file reducer.py \
           -jobconf mapred.reduce.tasks=2 \
           -jobconf mapred.map.tasks=4

  15. If the mapper/reducer takes files as parameters
     • hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
           -mapper "python mapper.py" \
           -reducer "python reducer.py myfile" \
           -input text -output output2 \
           -file /localpath/mapper.py -file /localpath/reducer.py \
           -file /localpath/myfile

  16. Hadoop Java APIs • hadoop.apache.org/common/docs/current/api/ • Benefits • Java code is more efficient than streaming • More parameters for control and tuning • Better for iterative MR programs

  17. Important base classes • Mapper<keyIn, valueIn, keyOut, valueOut> • Function map(Object, Writable, Context) • Reducer<keyIn, valueIn, keyOut, valueOut> • Function reduce(WritableComparable, Iterable, Context) • Combiner • Partitioner

  18. The framework
        public class Wordcount {
            public static class MapClass extends Mapper<Object, Text, Text, LongWritable> {
                public void setup(Mapper.Context context) {…}
                public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {…}
            }
            public static class ReduceClass extends Reducer<Text, LongWritable, Text, LongWritable> {
                public void setup(Reducer.Context context) {…}
                // The new API passes the values as an Iterable, not an Iterator.
                public void reduce(Text key, Iterable<LongWritable> values, Context context)
                        throws IOException, InterruptedException {…}
            }
            public static void main(String[] args) throws Exception {}
        }

  19. The wordcount example in Java • http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0 • Old/new framework • The old framework applies to versions prior to 0.20

  20. Mapper of wordcount
        public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

  21. WordCount Reducer
        public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

  22. Function parameters • Define the map/reduce parameters according to your application • You have to use Writable classes from org.apache.hadoop.io • E.g. Text, LongWritable, IntWritable, etc. • The template parameters and the function parameters must match • The map's output types and the reduce's input types must match

  23. Configuring map/reduce
     • Passing global parameter settings to each map/reduce process
     • In the main function, set parameters in a Configuration object
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cloudvista");
        job.setJarByClass(Wordcount.class);
        job.setOutputKeyClass(Text.class);
        // Must match the value type emitted by WCMapper and WCReducer.
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WCMapper.class);
        //job.setCombinerClass(WCReducer.class);
        job.setReducerClass(WCReducer.class);
        //job.setPartitionerClass(WCPartitioner.class);
        job.setNumReduceTasks(num_reduce);
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, new Path(output_path));
        System.exit(job.waitForCompletion(true) ? 0 : 1);

  24. How to run your app • 1. Compile to a jar file • 2. Run from the command line: hadoop jar your_jar your_parameters • Normally you need to pass in • Number of reducers • Input files • Output directory • Any other application-specific parameters

  25. Access files in HDFS? Example: in the mapper's setup function
        public void setup(Mapper.Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            String filename = conf.get("yourfile");
            Path p = new Path(filename);           // Path is used for opening the file.
            FileSystem fs = FileSystem.get(conf);  // determines local or HDFS
            FSDataInputStream file = fs.open(p);
            while (file.available() > 0) {
                …
            }
            file.close();
        }

  26. Combiner • Apply the reduce function to the intermediate results locally, after the map generates its output • [Figure: Map1 emits (Key1, value1), (Key2, value2), …, (KeyN, valueN); a combine step on the map's local node merges the values of each key before they are sent to the reduces]
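The effect of a combiner can be illustrated in plain Python. The sketch below simulates two map tasks for word counting and compares how many records would be sent to the reduces with and without local combining. All function names are hypothetical, and the simulation runs in one process rather than on Hadoop.

```python
from collections import Counter


def map_words(line):
    """Map step: one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Sum counts per word locally; same logic as the reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs):
    """Reduce step: global sum per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


map_outputs = [map_words("a b a a"), map_words("b b c")]
without_combiner = [pair for output in map_outputs for pair in output]
with_combiner = [pair for output in map_outputs for pair in combine(output)]

print(len(without_combiner), len(with_combiner))
print(reduce_counts(without_combiner) == reduce_counts(with_combiner))
```

Reusing the reduce logic as a combiner is safe here because addition is associative and commutative. A reduce function without those properties (a mean, for example) would need a different combiner.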

  27. Partitioner • Suppose the map output contains N distinct keys (N > R, where R is the number of reduces) • By default, the keys are distributed to the R reduces by hashing: each key goes to reduce hash(key) mod R • You can use a partitioner to define how the keys are distributed to the reduces
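Hadoop's default partitioner assigns a key to a reduce by taking the key's hash modulo the number of reduces. The Python sketch below mimics that idea and contrasts it with a custom rule. `zlib.crc32` is only a deterministic stand-in for Java's `hashCode()`, and `first_letter_partition` is a hypothetical custom partitioner, not a Hadoop class.

```python
import zlib


def hash_partition(key, num_reduces):
    """Hash-based assignment; crc32 stands in for Java's hashCode()."""
    return zlib.crc32(key.encode("utf-8")) % num_reduces


def first_letter_partition(key, num_reduces):
    """Hypothetical custom rule: split the alphabet into equal ranges."""
    letter = key[:1].lower()
    if not ("a" <= letter <= "z"):
        return 0
    return (ord(letter) - ord("a")) * num_reduces // 26


keys = ["apple", "banana", "mango", "zebra"]
for key in keys:
    print(key, hash_partition(key, 3), first_letter_partition(key, 2))
```

Either way, every occurrence of a key lands on the same reduce. A range-based rule like the second one additionally keeps the output ordered across reduces, at the risk of uneven load.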

  28. Mini project 1 • Learn to use HDFS • Read and run the wordcount example: http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html • Write an MR program that builds an inverted index for /user/hadoop/prj1.txt • Implement two versions • Script/exe + streaming • Hadoop Java API • The file has "docID \t docContent" per line • The generated inverted index has "word \t list of docID:position" per line
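As a starting point for the streaming version, the map and reduce logic might look like the sketch below. It rests on assumptions the slide does not settle: position is taken to be the 0-based word offset within the document, postings are joined with commas, and `map_document` and `reduce_postings` are hypothetical names.

```python
from itertools import groupby


def map_document(line):
    """Turn 'docID<TAB>docContent' into 'word<TAB>docID:position' records."""
    doc_id, _, content = line.rstrip("\n").partition("\t")
    for position, word in enumerate(content.split()):
        yield "%s\t%s:%d" % (word, doc_id, position)


def reduce_postings(sorted_records):
    """Group sorted records by word and join their postings with commas."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%s" % (word, ",".join(posting for _, posting in group))


docs = ["d1\thello world hello", "d2\tworld"]
# sorted() stands in for the sort Hadoop performs between map and reduce.
records = sorted(r for doc in docs for r in map_document(doc))
for line in reduce_postings(records):
    print(line)
```

For an actual job, the two functions would go into separate mapper and reducer scripts that read `sys.stdin` and print each record.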
