Hadoop: Introduction, Installation and Configuration
数据挖掘研究组 (Data Mining Group) @ Xiamen University
A Distributed, Data-Intensive Programming Framework
Hadoop combines distributed storage with parallel computing.
Introducing HDFS • Hadoop Distributed File System (HDFS) • an open-source implementation of GFS • shares many similarities with existing distributed file systems, but also differs from them in important ways. • HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. • HDFS provides high-throughput access to application data and is suitable for applications with large data sets.
A key design feature • Data is never moved through the NameNode. • Instead, all data transfer occurs directly between clients and DataNodes.
MapReduce? Let's talk about it next time…
What does "running Hadoop" mean? "Running Hadoop" means running a set of daemons: • NameNode • DataNode • Secondary NameNode • JobTracker • TaskTracker
Who works for whom? • NameNode • Secondary NameNode • JobTracker • DataNode • TaskTracker Hadoop
NameNode • Hadoop employs a master/slave architecture for both distributed storage and distributed computation. • The NameNode is the master of HDFS; it directs the slave DataNode daemons to perform the low-level I/O tasks. • The NameNode is the bookkeeper of HDFS: • keeps track of how your files are broken down into file blocks • keeps track of the overall health of the distributed filesystem
DataNode • reads and writes HDFS blocks for clients • communicates with other DataNodes to replicate its data blocks for redundancy
Secondary NameNode • The SNN is an assistant daemon that monitors the state of the cluster's HDFS. • It differs from the NameNode in that it doesn't receive or record any real-time changes to HDFS. • Instead, it communicates with the NameNode to take snapshots of the HDFS metadata. • Recovery: if the NameNode fails, we reconfigure the cluster to use the SNN as the primary NameNode.
JobTracker • the liaison between your application and Hadoop • once you submit your code to the cluster, the JobTracker determines the execution plan: • determines which files to process • assigns nodes to different tasks • monitors all tasks as they're running • if a task fails, the JobTracker will relaunch it on a different node
TaskTracker • Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns.
Installation and Configuration • Pseudo-distributed mode: all daemons run on one machine • Fully distributed mode What's the difference?
Installation for Pseudo-distributed Mode • Prerequisites • Ubuntu Linux • Hadoop 0.20.2 • Sun Java 6 $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner" $ sudo apt-get update $ sudo apt-get install sun-java6-jdk
Configuring SSH • Hadoop requires SSH access to manage its nodes: remote machines, plus your local machine if you want to run Hadoop on it. • $ sudo apt-get install openssh-server • $ ssh-keygen -t rsa -P "" • The second command creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction, since you don't want to enter the passphrase every time Hadoop interacts with its nodes.
Configuring SSH • $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys • $ ssh localhost • The authenticity of host 'localhost (::1)' can't be established. RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'localhost' (RSA) to the list of known hosts. Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux Ubuntu 10.04 LTS [...snipp...]
Extract the Hadoop package • $ cd /usr/local • $ sudo tar xzf hadoop-0.20.2.tar.gz • $ sudo chown -R dm:dm hadoop-0.20.2
Update ~/.bashrc • $ vim ~/.bashrc • # Set Hadoop-related environment variables export HADOOP_HOME=/usr/local/hadoop • # Set JAVA_HOME export JAVA_HOME=/usr/lib/jvm/java-6-sun • # Add Hadoop bin/ directory to PATH export PATH=$PATH:$HADOOP_HOME/bin
hadoop.tmp.dir • Create /app/hadoop/tmp. • Hadoop's default configuration uses hadoop.tmp.dir as the base temporary directory, both for the local file system and for HDFS. • $ sudo mkdir -p /app/hadoop/tmp • $ sudo chown dm:dm /app/hadoop/tmp
Configuring hadoop-env.sh • Configure the JAVA_HOME environment variable for Hadoop. • Change • # The java implementation to use. Required. # export JAVA_HOME=/usr/lib/j2sdk1.5-sun • to • # The java implementation to use. Required. export JAVA_HOME=/usr/lib/jvm/java-6-sun
Key stage • Configure the key properties for the Hadoop daemons. • These properties should be set in XML files, which are located in /usr/local/hadoop-0.20.2/conf: core-site.xml mapred-site.xml hdfs-site.xml
Key properties for the Hadoop daemons • fs.default.name (core-site.xml) • hadoop.tmp.dir (core-site.xml) • mapred.job.tracker (mapred-site.xml) • dfs.data.dir (hdfs-site.xml) • dfs.replication (hdfs-site.xml)
Configuring core-site.xml • Add the following lines between the <configuration> ... </configuration> tags:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme
  and authority determine the FileSystem implementation. The URI's scheme
  determines the config property naming the FileSystem implementation
  class. The URI's authority is used to determine the host, port, etc.
  for a filesystem.</description>
</property>
Configuring mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.
  If "local", then jobs are run in-process as a single map and reduce
  task.</description>
</property>
Configuring hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of
  replications can be specified when the file is created. The default is
  used if replication is not specified at create time.</description>
</property>
Formatting the NameNode • This formats the Hadoop filesystem, which is implemented on top of the local filesystem of your "cluster". • $ bin/hadoop namenode -format Installation done!
Networking • Assign a static IP to every host. • Update /etc/hosts on both machines with the following lines (for master AND slaves): 192.168.0.1 master 192.168.0.2 slave
SSH access • Add hduser@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hduser@slave (in that user's $HOME/.ssh/authorized_keys).
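One way to do this in a single step is ssh-copy-id, assuming the slave is reachable under the host name slave defined in /etc/hosts above:

```shell
# From the master, as hduser: append the master's public key
# to hduser@slave's ~/.ssh/authorized_keys.
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave

# Verify that passwordless login now works:
$ ssh hduser@slave
```

(Alternatively, copy id_rsa.pub over manually and append it with cat, exactly as was done for localhost earlier.)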
Masters vs. Slaves • One machine in the cluster is designated as the NameNode and another machine (possibly the same one) as the JobTracker. These are the actual "masters". • The rest of the machines in the cluster act as both DataNode and TaskTracker. These we call "slaves".
Masters vs. Slaves • conf/masters (master only) master • conf/slaves (master only) master slave
conf/*-site.xml (all machines) How?
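The key change from the pseudo-distributed setup is to point the default filesystem and the JobTracker at the master host instead of localhost, on every machine. A sketch, reusing the ports from the pseudo-distributed configuration (the replication value of 2 is only an example for a two-node cluster):

```xml
<!-- conf/core-site.xml (all machines) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

<!-- conf/mapred-site.xml (all machines) -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

<!-- conf/hdfs-site.xml (all machines) -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```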
Formatting the NameNode $ bin/hadoop namenode -format $ bin/start-all.sh $ jps $ bin/stop-all.sh
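After start-all.sh, jps lists the running Java processes; on a healthy single-node setup all five daemons from earlier should appear (the PIDs below are only illustrative):

```shell
$ jps
4825 NameNode
4932 DataNode
5051 SecondaryNameNode
5146 JobTracker
5260 TaskTracker
5361 Jps
```

If a daemon is missing, check its log file under $HADOOP_HOME/logs before retrying.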
Thank you! Any questions?