IDS594 Special Topics in Big Data Analytics

IDS594 Special Topics in Big Data Analytics, Week 3

What is Hadoop?
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• Hadoop is based on a simple data model: any data will fit
• The Hadoop framework consists of two main layers
  • Distributed file system (HDFS)
  • Execution engine (MapReduce)

Hadoop Infrastructure
• Hadoop is a distributed system, like a distributed database
• However, there are several key differences between the two infrastructures
  • Data model
  • Computing model
  • Cost model
  • Design objectives

Hadoop: Big Picture

HDFS: Hadoop Distributed File System
• HDFS is a master-slave architecture
  • Master: namenode
  • Slave: datanode (100s or 1000s of nodes)
• Single namenode and many datanodes
• The namenode maintains the file system metadata
• Files are split into fixed-size blocks and stored on datanodes (default block size: 64 MB)
• Data blocks are replicated for fault tolerance and fast access (default replication factor: 3)
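To make the block and replication defaults concrete, here is a minimal arithmetic sketch. The class and method names (HdfsBlockMath, blockCount, rawStorageBytes) and the 1000 MB file are hypothetical and chosen for illustration; they are not part of the Hadoop API. It assumes the last block of a file only occupies as much space as the data it holds.

```java
public class HdfsBlockMath {
    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    public static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    // Assumes the final, partially filled block stores only its actual data.
    public static long rawStorageBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        long fileSize = 1000 * MB;   // hypothetical 1000 MB input file
        long blockSize = 64 * MB;    // default block size from the slide
        int replication = 3;         // default replication factor from the slide

        long blocks = blockCount(fileSize, blockSize);
        System.out.println("blocks: " + blocks);                          // blocks: 16
        System.out.println("block replicas: " + blocks * replication);    // block replicas: 48
        System.out.println("raw MB stored: "
                + rawStorageBytes(fileSize, replication) / MB);           // raw MB stored: 3000
    }
}
```

In this example the first 15 blocks are full (64 MB each) and the 16th holds the remaining 40 MB.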

HDFS Architecture
• Default placement policy: where to put a given block?
  • First copy is written to the node creating the file (write affinity)
  • Second copy is written to a datanode within the same rack
  • Third copy is written to a datanode in a different rack
• Objectives: load balancing, fast access, fault tolerance

MapReduce: Hadoop Execution Layer
• The JobTracker knows everything about submitted jobs
  • Divides jobs into tasks and decides where to run each task
  • Continuously communicates with the TaskTrackers
• TaskTrackers execute tasks (multiple per node)
  • Monitor the execution of each task
  • Continuously send feedback to the JobTracker
• MapReduce is a master-slave architecture
  • Master: JobTracker
  • Slave: TaskTrackers (100s or 1000s of TaskTrackers)
  • Every datanode runs a TaskTracker

High-level MapReduce Pipeline

Hadoop MapReduce Data Flow

Hadoop Computing Model
• Mappers and Reducers consume and produce (key, value) pairs
• Users define the data types of the key and the value
• Shuffling and sorting phase
  • Map output is shuffled so that all records with the same key go to the same reducer
  • Each reducer may receive multiple key sets
  • Each reducer sorts its records to group identical keys, then processes each group
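The shuffle and sort phase can be imitated in plain Java, without Hadoop, to see why all records with the same key end up in one group on one reducer. This is a sketch under stated assumptions: ShuffleSortSketch, Pair, partition, and shuffleAndSort are hypothetical names, String and Integer stand in for Hadoop's Text and IntWritable, and the partitioning rule is a simple hash-modulo scheme used here for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShuffleSortSketch {
    // One (key, value) record emitted by a mapper.
    public static class Pair {
        final String key;
        final int value;
        public Pair(String key, int value) { this.key = key; this.value = value; }
    }

    // Choose a reducer for a key: non-negative hash modulo the reducer count.
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Shuffle: route each record to its reducer.
    // Sort and group: a TreeMap keeps each reducer's keys ordered and
    // collects all values of one key into a single list.
    public static List<SortedMap<String, List<Integer>>> shuffleAndSort(
            List<Pair> mapOutput, int numReducers) {
        List<SortedMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (Pair p : mapOutput) {
            reducers.get(partition(p.key, numReducers))
                    .computeIfAbsent(p.key, k -> new ArrayList<>())
                    .add(p.value);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Pair> mapOutput = Arrays.asList(
                new Pair("hadoop", 1), new Pair("hdfs", 1),
                new Pair("hadoop", 1), new Pair("map", 1));
        List<SortedMap<String, List<Integer>>> reducers = shuffleAndSort(mapOutput, 2);
        for (int i = 0; i < reducers.size(); i++) {
            // Each reducer may hold several keys; each key appears on exactly one reducer.
            System.out.println("reducer " + i + " -> " + reducers.get(i));
        }
    }
}
```

Which reducer a given key lands on depends on its hash, so the sketch makes no claim about the exact assignment, only that each key is grouped in one place.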

Hadoop Configuration and Installation

conf/hadoop-env.sh
• Configures the environment of the Hadoop daemons
• JAVA_HOME - The Java installation to use.
• HADOOP_*_OPTS - Configuration options for the individual daemons.
• HADOOP_LOG_DIR - The directory where the daemons' log files are stored. They are automatically created if they don't exist.
• HADOOP_HEAPSIZE - The maximum heap size to use, in MB (e.g. 1000). This sets the heap size for the Hadoop daemons. The default is 1000 MB.

conf/core-site.xml
• Configures the Hadoop daemons

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/${user.name}/hadoop-store</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml
• dfs.name.dir
  • Path on the local file system where the NameNode persistently stores the namespace and transaction logs.
  • If this is a comma-delimited list of directories, the name table is replicated in all of the directories for redundancy.
• dfs.data.dir
  • Comma-separated list of paths on the local file system of a DataNode where it should store its blocks.
  • If this is a comma-delimited list of directories, data is stored in all named directories, typically on different devices.
• dfs.replication
  • The number of replicas kept of each data block, for redundancy.

conf/mapred-site.xml
• mapred.job.tracker
  • Host or IP and port of the JobTracker.
  • host:port pair.
• mapred.tasktracker.{map|reduce}.tasks.maximum
  • The maximum number of map tasks and of reduce tasks (set individually) that run simultaneously on a given TaskTracker.
  • Defaults to 2 (2 maps and 2 reduces), but vary it depending on your hardware.
• Others
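The per-TaskTracker maximum determines how many tasks the whole cluster can run at once. The following arithmetic sketch shows the relationship; SlotMath, totalSlots, and waves are hypothetical names, and the 10-node cluster and task counts are made-up figures for illustration.

```java
public class SlotMath {
    // Cluster-wide capacity: every TaskTracker contributes its per-node maximum.
    public static int totalSlots(int taskTrackers, int maxTasksPerTracker) {
        return taskTrackers * maxTasksPerTracker;
    }

    // Number of rounds ("waves") needed when tasks outnumber slots:
    // ceiling of tasks / slots.
    public static int waves(int tasks, int slots) {
        return (tasks + slots - 1) / slots;
    }

    public static void main(String[] args) {
        int taskTrackers = 10;       // hypothetical cluster size
        int maxMapsPerTracker = 2;   // default from the slide

        int mapSlots = totalSlots(taskTrackers, maxMapsPerTracker);
        System.out.println("map slots: " + mapSlots);                   // map slots: 20
        System.out.println("waves for 16 maps: " + waves(16, mapSlots)); // waves for 16 maps: 1
        System.out.println("waves for 50 maps: " + waves(50, mapSlots)); // waves for 50 maps: 3
    }
}
```

Raising the per-node maximum reduces the number of waves only if the hardware can actually run that many tasks in parallel, which is why the slide suggests tuning it per machine.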

slaves file
• Typically you choose one machine in the cluster to act exclusively as the NameNode and one machine to act exclusively as the JobTracker. The rest of the machines act as both DataNode and TaskTracker and are referred to as slaves.
• List all slave hostnames or IP addresses in your conf/slaves file, one per line.

Start/Stop Hadoop
• Format a new distributed file system (first-time setup only):
  • $ bin/hadoop namenode -format
• Start the Hadoop daemons:
  • $ bin/start-all.sh
• Stop the Hadoop daemons:
  • $ bin/stop-all.sh

Online Documentation
• Windows: http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html
• Mac: http://hadoop.apache.org/docs/stable/single_node_setup.html

MapReduce
• Map function
• Reduce function
• Job configuration
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.htm

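Map Function
• In Mapper<LongWritable, Text, Text, IntWritable>, the first two type parameters are the input key and value types; the last two are the output key and value types.

// Imports go at the top of the file that contains this class.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input line: emits (word, 1) for every token in the line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}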

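Reduce Function
• In Reducer<Text, IntWritable, Text, IntWritable>, the first two type parameters are the input key and value types; the last two are the output key and value types.

// Imports go at the top of the file that contains this class.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per key: sums all counts collected for that word.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}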

Job Configuration

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Types of the (key, value) pairs the job writes out.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // Reduce doubles as the combiner.
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // args[0]: input path, args[1]: output path.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
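The job configuration registers Reduce as the combiner as well as the reducer. A plain-Java sketch, without Hadoop, suggests why that is safe for word count: summing partial sums gives the same totals as summing the individual 1s, while fewer records cross the network. CombinerSketch, combine, and sum are hypothetical names, and the two input splits are made-up data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Combiner step: pre-sum the (word, 1) records of ONE map task locally.
    public static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : words) {
            local.merge(w, 1, Integer::sum);
        }
        return local;
    }

    // Reduce step: sum the partial counts arriving from all map tasks.
    public static Map<String, Integer> sum(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two hypothetical input splits, one per map task.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("b", "b", "c"));

        List<Map<String, Integer>> partials = new ArrayList<>();
        int recordsWithoutCombiner = 0;
        int recordsWithCombiner = 0;
        for (List<String> split : splits) {
            recordsWithoutCombiner += split.size();   // one (word, 1) per token
            Map<String, Integer> local = combine(split);
            recordsWithCombiner += local.size();      // one (word, n) per distinct word
            partials.add(local);
        }

        System.out.println(sum(partials));            // {a=2, b=3, c=1}
        System.out.println(recordsWithoutCombiner + " records vs "
                + recordsWithCombiner);               // 6 records vs 4
    }
}
```

Reusing the reducer as a combiner works here because addition is associative and commutative; a reduce function without that property (for example, computing an average directly) would generally need a separate combiner or none at all.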

HDFS Commands https://hadoop.apache.org/docs/current2/hadoop-project-dist/hadoop-common/FileSystemShell.html
