
Presentation Transcript


  1. Hadoop - Elad Eldor, Tikal

  2. Agenda Introduction Assumptions & goals NameNode and DataNode The File system namespace Data replication Staging & Replication pipelining Persistence of file system metadata Robustness Space Reclamation

  3. Introduction HDFS is the primary distributed storage used by Hadoop applications. It is part of the Apache Hadoop Core project and was originally built as infrastructure for the Apache Nutch web search engine project. It enables applications to work with thousands of nodes and petabytes of data, and was inspired by Google's MapReduce and the Google File System (GFS).

  4. Assumptions & goals

  5. Hardware Failures The norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With such a huge number of components, each having a non-trivial probability of failure, some component of HDFS is always non-functional. Detection of faults and quick, automatic recovery from them is therefore a core design goal.

  6. Streaming Data Access Applications that run on HDFS need streaming access to their data sets. HDFS is designed for batch processing rather than interactive use by users.

  7. Large Data Sets HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets; they write their data only once but read it one or more times, and they require these reads to be satisfied at streaming speeds. A typical file in HDFS is gigabytes to terabytes in size, and a typical block size is 64 MB, so an HDFS file is chopped up into 64 MB chunks. HDFS should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files.
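
As a rough, hedged sketch of what the 64 MB block size means on the client side, the Java snippet below creates a file with an explicit block size through the classic org.apache.hadoop.fs.FileSystem API. The path, buffer size, and replication value are illustrative assumptions, not values mandated by HDFS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // client for the default file system

        long blockSize = 64L * 1024 * 1024;                   // 64 MB, the typical HDFS block size
        short replication = 3;                                // typical replication factor
        Path path = new Path("/user/example/large-file.dat"); // illustrative path

        // create(Path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(path, true, 4096, replication, blockSize);
        out.writeUTF("data...");   // a 10 GB file written this way would be split into ~160 such blocks
        out.close();
      }
    }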

  8. Simple Coherency Model HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed. There is a plan to support appending writes to files in the future.

  9. “Moving Computation is Cheaper than Moving Data” A computation requested by an application is much more efficient if it is executed near the data it operates on; this is especially true when the size of the data set is huge. It minimizes network congestion and increases the overall throughput of the system. It is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running.

  10. NameNode & DataNode

  11. NameNode and DataNode HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, which is a master server, and multiple DataNodes, usually one per node in the cluster. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes.

  12. NameNode Executes file system namespace operations: opening, closing, and renaming files and directories. Determines the mapping of blocks to DataNodes. A typical deployment has a dedicated machine that runs only the NameNode instance; each of the other machines in the cluster runs one DataNode instance. User data never flows through the NameNode.

  13. DataNode(s) Serve read and write requests from the file system's clients, and perform block creation, deletion, and replication upon instruction from the NameNode.

  14. The File System namespace

  15. The File System Namespace HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside them, remove files, move files from one directory to another, or rename a file. HDFS does not support access permissions, hard links, or soft links. The NameNode maintains the file system namespace and records any change to the namespace.
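
The namespace operations listed above map directly onto the FileSystem client API; the sketch below shows mkdirs, rename, and delete calls with illustrative paths.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/example/reports"));              // create a directory
        fs.rename(new Path("/user/example/tmp/out.txt"),           // move a file between directories
                  new Path("/user/example/reports/out.txt"));
        fs.delete(new Path("/user/example/old-reports"), true);    // recursive delete
      }
    }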

  16. Data Replication

  17. Data Replication Very large files are stored across machines in a large cluster. Each file is stored as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. Block size and replication factor are configurable per file: an application can specify the number of replicas of a file (the file's replication factor), which can be set at file creation time and changed later.
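
A minimal sketch of working with the per-file replication factor through FileSystem.setReplication(); the path and the new factor are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationFactor {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/example/important.dat");

        // Raise the replication factor of an existing file from the default (3) to 5.
        // The NameNode schedules the extra replicas asynchronously.
        boolean accepted = fs.setReplication(path, (short) 5);
        System.out.println("replication change accepted: " + accepted);
      }
    }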

  18. NameNode & Data Replication All data-replication information is stored by the NameNode, which makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all blocks on a DataNode.

  19. Replica Placement - Assumptions Large HDFS instances run on a cluster of computers that commonly spreads across many racks. Communication between two nodes in different racks has to go through switches, so the bandwidth between machines in the same rack is greater than the bandwidth between machines in different racks. The chance of rack failure is far less than that of node failure. So how is replication performed?

  20. Replica Placement - Solution HDFS uses a rack-aware replica placement policy. In the common case (replication factor == 3): put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This cuts the inter-rack write traffic, improves write performance, and reduces the bandwidth used when reading data, without compromising data reliability and availability. A minimal sketch of this policy appears below.
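
The sketch below illustrates that 3-replica policy. The Node class and the pickNodeOnRack/pickNodeOffRack helpers are hypothetical stand-ins, not the actual HDFS BlockPlacementPolicy implementation.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class RackAwarePlacementSketch {
      static class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
      }

      // Choose targets for one block with replication factor 3.
      static List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                                          // replica 1: the local node
        targets.add(pickNodeOnRack(cluster, writer.rack, targets));   // replica 2: another node, same rack
        targets.add(pickNodeOffRack(cluster, writer.rack, targets));  // replica 3: a node on a different rack
        return targets;
      }

      static Node pickNodeOnRack(List<Node> cluster, String rack, List<Node> used) {
        for (Node n : cluster)
          if (n.rack.equals(rack) && !used.contains(n)) return n;
        throw new IllegalStateException("no free node on rack " + rack);
      }

      static Node pickNodeOffRack(List<Node> cluster, String rack, List<Node> used) {
        for (Node n : cluster)
          if (!n.rack.equals(rack) && !used.contains(n)) return n;
        throw new IllegalStateException("no node outside rack " + rack);
      }

      public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
            new Node("dn1", "rack1"), new Node("dn2", "rack1"), new Node("dn3", "rack2"));
        for (Node n : chooseTargets(cluster.get(0), cluster))
          System.out.println(n.name + " on " + n.rack);
      }
    }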

  21. Replica Placement - Solution (cont') 1/3 of replicas are on one node, 2/3 of replicas are on one rack, and the other 1/3 are evenly distributed across the remaining racks.

  22. Replica Selection HDFS tries to satisfy a read request from the replica that is closest to the reader, preferably a replica on the same rack as the reader node. If the HDFS cluster spans multiple data centers, a replica in the local data center is preferred.

  23. Replication during HDFS Safemode On startup, the NameNode enters a special state called Safemode; no replication occurs in this state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes (a Blockreport is a list of the data blocks that a DataNode hosts). Each block has a specified minimum number of replicas, and a block is considered safely replicated when that minimum number of replicas has checked in with the NameNode. After a configurable percentage of safely replicated data blocks has checked in, the NameNode exits the Safemode state. It then determines the list of data blocks that still have fewer than the specified number of replicas and replicates those blocks on other DataNodes.
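
A minimal sketch of the Safemode exit check, assuming the configurable percentage corresponds to the dfs.safemode.threshold.pct setting; the method and variable names are hypothetical, not the NameNode's actual code.

    public class SafemodeCheckSketch {
      static boolean canLeaveSafemode(long blocksWithMinReplicas,
                                      long totalBlocks,
                                      double thresholdPct) {   // e.g. 0.999
        if (totalBlocks == 0) return true;
        double safeFraction = (double) blocksWithMinReplicas / totalBlocks;
        return safeFraction >= thresholdPct;                   // enough blocks have checked in
      }

      public static void main(String[] args) {
        // 999,200 of 1,000,000 blocks have reported at least the minimum replica count.
        System.out.println(canLeaveSafemode(999_200, 1_000_000, 0.999)); // true
      }
    }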

  24. Staging & Replication pipelining

  25. Staging A client request to create a file doesn't reach the NameNode immediately. First, the HDFS client caches the file data into a temporary local file, and application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the NameNode. The NameNode inserts the file name into the file system hierarchy, allocates a data block for it, and responds to the client request with the identity of the DataNode and the destination data block. The client then flushes the block of data from the local temporary file to the specified DataNode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode, and the client tells the NameNode that the file is closed. At this point, the NameNode commits the file creation operation into a persistent store.
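
From the client's point of view this staging is hidden behind an ordinary output stream; the sketch below uses the standard FileSystem.create() API with an illustrative path.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClientWriteExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/user/example/log.txt"));

        // Writes are buffered on the client side first; the NameNode is contacted and a
        // block is allocated once a full block's worth of data has accumulated.
        for (int i = 0; i < 1000; i++) {
          out.writeBytes("record " + i + "\n");
        }
        out.close();   // remaining buffered data is flushed and the file creation is committed
      }
    }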

  26. Replication Pipelining When a client is writing data to an HDFS file, the data is first written to a local file, as described above. Suppose the HDFS file has a replication factor of 3. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode; this list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode, which starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes it to its repository, and flushes it to the third DataNode. Finally, the third DataNode writes the data to its local repository. A DataNode thus receives data from the previous one in the pipeline and at the same time forwards data to the next one in the pipeline.
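
A minimal sketch of the forwarding step in that pipeline, with plain Java streams standing in for the DataNode connections; this illustrates the idea only and is not the actual DataNode transfer protocol.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class PipelineSketch {
      static final int PORTION = 4 * 1024;   // 4 KB portions, as described above

      // Receive a block from upstream, persist it locally, and forward it downstream
      // (downstream is null for the last DataNode in the pipeline).
      static void relay(InputStream upstream, OutputStream localStore,
                        OutputStream downstream) throws IOException {
        byte[] buf = new byte[PORTION];
        int n;
        while ((n = upstream.read(buf)) > 0) {
          localStore.write(buf, 0, n);        // write the portion to the local repository
          if (downstream != null) {
            downstream.write(buf, 0, n);      // and forward it to the next DataNode at the same time
          }
        }
        localStore.flush();
        if (downstream != null) downstream.flush();
      }
    }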

  27. Persistence of File System Metadata

  28. EditLog and FsImage The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata, such as creating a new file in HDFS or changing the replication factor of a file. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and the file system properties, is stored in a file called the FsImage. The FsImage is also stored as a file in the NameNode's local file system.

  29. Checkpoint The NameNode keeps an image of the entire file system namespace and file Blockmap in memory; a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts, it performs the checkpoint process: it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, flushes out this new version into a new FsImage on disk, and truncates the old EditLog (because its transactions have been applied to the persistent FsImage). A checkpoint only occurs when the NameNode starts up.
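
A minimal sketch of that checkpoint sequence; the FsImageFile, EditLogFile, and NamespaceImage types are hypothetical stand-ins for the NameNode's internal classes.

    public class CheckpointSketch {
      interface FsImageFile    { NamespaceImage load(); void save(NamespaceImage img); }
      interface EditLogFile    { Iterable<Transaction> transactions(); void truncate(); }
      interface NamespaceImage { void apply(Transaction tx); }
      interface Transaction    { }

      static void checkpoint(FsImageFile fsImage, EditLogFile editLog) {
        NamespaceImage inMemory = fsImage.load();          // 1. read the FsImage from disk
        for (Transaction tx : editLog.transactions()) {    // 2. replay every EditLog transaction
          inMemory.apply(tx);
        }
        fsImage.save(inMemory);                            // 3. flush the merged image back to disk
        editLog.truncate();                                // 4. the old edits are now redundant
      }
    }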

  30. DataNode Storage A DataNode stores HDFS data in files in its local file system, with each block of data in a separate file. When a DataNode starts up, it performs a Blockreport: it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode.
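
A minimal sketch of building such a Blockreport by scanning a local storage directory. The blk_ file-name prefix follows how DataNodes name block files, but the directory path and the report format here are illustrative assumptions.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class BlockReportSketch {
      static List<String> scanBlocks(File dataDir) {
        List<String> blockIds = new ArrayList<>();
        File[] files = dataDir.listFiles();
        if (files == null) return blockIds;
        for (File f : files) {
          // Block data files are named blk_<id>; skip checksum/metadata files.
          if (f.isFile() && f.getName().startsWith("blk_") && !f.getName().endsWith(".meta")) {
            blockIds.add(f.getName());
          }
        }
        return blockIds;
      }

      public static void main(String[] args) {
        List<String> report = scanBlocks(new File("/data/hdfs/dn/current"));  // illustrative path
        System.out.println("blocks to report: " + report.size());
      }
    }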

  31. Robustness

  32. HDFS Robustness The primary objective of HDFS is to store data reliably even in the presence of failures. Three common types of failures are NameNode failures, DataNode failures, and network partitions. Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode, which detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode isn't available to HDFS anymore, and DataNode death may cause the replication factor of some blocks to fall below their specified value.
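
A minimal sketch of the Heartbeat-based dead-node detection described above; the timeout value and the bookkeeping map are illustrative, not the NameNode's actual implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HeartbeatMonitorSketch {
      static final long TIMEOUT_MS = 10 * 60 * 1000;   // illustrative: 10 minutes without a Heartbeat

      private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

      void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
      }

      // DataNodes that have not checked in recently: no new IO is routed to them,
      // and their blocks may need re-replication.
      List<String> deadNodes() {
        long now = System.currentTimeMillis();
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
          if (now - e.getValue() > TIMEOUT_MS) dead.add(e.getKey());
        }
        return dead;
      }
    }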

  33. Re-replication The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. A separate Re-Balancer tool can also redistribute blocks when disk usage across DataNodes becomes uneven.

  34. Space Reclamation

  35. File Deletes and Undeletes When a file is deleted by a user or an application, it isn't immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time; after the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace, and the deletion causes the blocks associated with the file to be freed.
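
A minimal sketch of that delete-to-trash behaviour expressed with the public FileSystem API: the file is only renamed under a trash directory, so it can still be restored. The /user/example/.Trash/Current layout is an illustrative assumption.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrashSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file  = new Path("/user/example/data/old.csv");
        Path trash = new Path("/user/example/.Trash/Current/data/old.csv");

        fs.mkdirs(trash.getParent());   // make sure the trash subdirectory exists
        fs.rename(file, trash);         // "delete": the file is only moved, so it can be restored
      }
    }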

  36. Decrease Replication Factor The replication factor of a file can be reduced. The NameNode selects replicas that can be deleted, and the next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks, and the corresponding free space appears in the cluster.

  37. HDFS Communication Protocols All HDFS protocols are layered on top of TCP/IP. A client establishes a connection to a configurable TCP port on the NameNode machine and talks to the NameNode using the ClientProtocol; the DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs; it only responds to RPC requests issued by DataNodes or clients.

  38. Hadoop Map/Reduce

  39. Hadoop Map/Reduce - Intro Hadoop Map/Reduce is an open source implementation of the MapReduce programming model. It relies on its own distributed file system, HDFS, which replicates data blocks in a reliable manner and places them on different nodes; computation is then performed by Hadoop on these nodes. It is used by Yahoo! for processing large data sets.

  40. Hadoop Map/Reduce - cont' Provides an API for writing applications which process vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner. A Map/Reduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The outputs of the maps are sorted and given as input to the reduce tasks. Both the input and the output of the job are stored in a file system. Typically, the compute nodes and the storage nodes are the same (DataNodes): DataNodes run both Hadoop Map/Reduce and HDFS, and Map/Reduce tasks are scheduled on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.

  41. JobTracker and TaskTracker The Map/Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master (the JobTracker) is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks. The slaves (the TaskTrackers, running on the DataNodes) execute the tasks as directed by the master. Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes.

  42. Inputs and Outputs The Map/Reduce framework operates exclusively on <key, value> pairs: the input to the job is a set of <key, value> pairs, the output of the job is a set of <key, value> pairs, and the two can be of different types. The key and value classes have to implement the Writable interface, because the framework needs to serialize them; the key classes additionally have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a Map/Reduce job are: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output).
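
As a sketch of those requirements, the class below is a custom key type that implements WritableComparable (and therefore Writable); the class and field names are illustrative.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    public class WordLengthKey implements WritableComparable<WordLengthKey> {
      private int length;

      public WordLengthKey() { }                        // Writables need a no-arg constructor
      public WordLengthKey(int length) { this.length = length; }

      public void write(DataOutput out) throws IOException { out.writeInt(length); }   // serialize
      public void readFields(DataInput in) throws IOException { length = in.readInt(); } // deserialize

      // Lets the framework sort map outputs by this key before the reduce phase.
      public int compareTo(WordLengthKey other) {
        return Integer.compare(length, other.length);
      }

      @Override public int hashCode() { return length; }   // used by the default HashPartitioner
      @Override public boolean equals(Object o) {
        return o instanceof WordLengthKey && ((WordLengthKey) o).length == length;
      }
    }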

  43. Deploying Hadoop - 3 Possible Modes Standalone (default): everything is run as a single Java process. Pseudo-distributed: Hadoop is configured to run on a single machine, with the different Hadoop daemons running as different Java processes. Fully distributed (cluster mode): one machine in the cluster is the NameNode and another machine is the JobTracker. There is exactly one NameNode in each cluster, and a SecondaryNameNode is optional. The rest of the machines within the cluster act as both DataNodes and TaskTrackers: the DataNode holds the system data and manages its local hard disk, while the TaskTrackers carry out map and reduce operations.

  44. User Interfaces The Mapper and Reducer interfaces provide the map and reduce methods. A Mapper maps input key/value pairs to a set of intermediate key/value pairs; maps are the individual tasks that transform input records into intermediate records, and the transformed intermediate records do not need to be of the same type as the input records. Mapper implementations are passed the JobConf for the job via the JobConfigurable.configure(JobConf) method and can override it to initialize themselves. The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task. Applications can also override the Closeable.close() method to perform any required cleanup. Output pairs do not need to be of the same types as input pairs, and a given input pair may map to zero or many output pairs. Output pairs are collected with calls to OutputCollector.collect(WritableComparable, Writable).
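
A minimal sketch of a Mapper in the old org.apache.hadoop.mapred API that overrides the configure() and close() lifecycle hooks mentioned above; the class name and the example.min.line.length property are illustrative assumptions.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LineLengthMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private int minLength;

      @Override
      public void configure(JobConf job) {
        // Called once per task before any map() calls; read per-job settings here.
        minLength = job.getInt("example.min.line.length", 0);   // illustrative property name
      }

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        if (line.getLength() >= minLength) {
          // One input pair may produce zero or many output pairs.
          output.collect(new Text("lines"), new IntWritable(1));
        }
      }

      @Override
      public void close() throws IOException {
        // Called once after the last map() call; release any resources here.
      }
    }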

  45. Search Problem Search for the total number of occurrences of the word 'ABC'. Input: 100,000 elements of data to be processed. Divide the data into smaller chunks of 10,000 elements; these chunks are inserted into 10 buckets, each containing 10,000 elements. Apply a function named map() that executes the search algorithm on a single bucket and repeats it concurrently for all the buckets in parallel, storing the results in another set of buckets (result buckets). Apply a function named reduce() that iterates (concurrently) through the result buckets, takes in each value, and then performs some kind of processing if needed: aggregate the individual values, apply some kind of business logic, and return the expected result.

  46. Step 1: Dividing the Data The buckets (blocks) are created for you and may be on a single machine (DataNode) or on multiple machines. Petabytes of data could be segmented into thousands of buckets, placed on different machines in the cluster, so processing can be performed in parallel by the DataNodes.

  47. Step 2: The map() Function The map() function understands exactly where it should go to process the data: the local disk or another node in the cluster. A non-MapReduce application processes data on multiple threads, fetches data from a data source (such as a remote DB server), and executes on the machine where it's running. In a MapReduce implementation, computation happens on the distributed nodes: instead of bringing data to the place where the map() function resides, map() executes at the place where the data resides.
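
A sketch of a map() for the 'ABC' search from slide 45, written against the old org.apache.hadoop.mapred API; the class name is illustrative.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class AbcOccurrenceMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final Text KEY = new Text("ABC");

      // Each map task runs against one input split (one "bucket"), wherever that
      // split's data happens to live.
      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int count = 0;
        String text = line.toString();
        for (int i = text.indexOf("ABC"); i >= 0; i = text.indexOf("ABC", i + 1)) {
          count++;                                     // count occurrences within this record
        }
        if (count > 0) {
          output.collect(KEY, new IntWritable(count)); // intermediate result for this record
        }
      }
    }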

  48. Step 3: The reduce() Function The reduce() function operates on one or more lists of intermediate results, fetching each of them from memory, disk, or a network transfer, and performs a function on each element of each list.
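
A matching sketch of the reduce() step, summing the per-record counts emitted by the mapper above; again the class name is illustrative.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int total = 0;
        while (values.hasNext()) {
          total += values.next().get();                // aggregate the intermediate counts
        }
        output.collect(key, new IntWritable(total));   // ("ABC", total occurrences)
      }
    }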

  49. Step 4: Get the Final Output The final result of the complete operation is produced by collating and interpreting the results from all processes running reduce() operations. The final output is either empty or one or more data elements.
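
Finally, a sketch of a driver that wires the mapper and reducer sketched above into one job using the old JobConf/JobClient API; the input/output paths and job name are illustrative assumptions.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class AbcSearchJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(AbcSearchJob.class);
        conf.setJobName("abc-occurrence-count");

        conf.setMapperClass(AbcOccurrenceMapper.class);
        conf.setCombinerClass(SumReducer.class);     // optional local aggregation on the map side
        conf.setReducerClass(SumReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path("/user/example/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/example/output"));

        JobClient.runJob(conf);   // the output directory ends up holding the single ("ABC", N) result
      }
    }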
