Introduction to Hadoop

Introduction to Hadoop Capabilities, Accelerators and Solutions

Big Data *** MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113, Jeffrey Dean and Sanjay Ghemawat • Google processed over 400 PB of data on datacenters composed of thousands of machines in September 2007 alone *** • Today, every organization has it’s own big data problem and most are using Hadoop to solve it.

Where is Big Data? Big Data Has Reached Every Market Sector Source – McKinsey & Company report. Big data: The next frontier for innovation, competition and productivity. May 2011.

Big Data Value Creation Opportunities

What is Hadoop? • Hadoop is an open-source project overseen by the Apache Software Foundation • Originally based on papers published by Google in 2003 and 2004 • Hadoop is an ecosystem, not a single product • Hadoop committers work at several different organizations • – Including Facebook, Yahoo!, Twitter, Cloudera, Hortonworks

Hadoop - Inspiration You Say, “tomato…” Google was awarded a patent for “map reduce – a system for large scale data processing” in 2010, but blessed Apache Hadoop by granting a license.

Hadoop Timeline • Started for Nutch at Yahoo! by Doug Cutting in early 2006 • Hadoop 2.x, released in 2012, is basis for all current, stable Hadoop distributions • Apache Hadoop 2.0.xx • CDH4.* • HDP2.*

Typical Data Strategy

How Hadoop fits in? Hadoop can complement the existing DW environment as well replace some of the components in a traditional data strategy.

How Hadoop fits in? • Storage • HDFS – It’s a file system, not a DBMS • HBase - Columnar storage that serves low-latency read / write request • Extract / Load • Source / Target is RDBMS - Sqoop, hiho • Stream processing - Flume, Scribe, Chukwa, S4, Storm • Transformation • Map-reduce (Java or any other language), Pig, Hive, Oozie etc. • Talend and Informatica have built products to abstract complexity of map-reduce • Analytics • RHadoop, Mahout • BI – All existing players are coming up with Hadoop connectors

Hadoop Ecosystem

Hadoop Ecosystem Continued…

Map-reduce – Programming model Single map task and a single reduce task - Multiple map tasks with a single reduce task -

Map-reduce – Programming model

Hadoop Map Reduce • What happens during a Map-reduce job’s lifetime? • Clients submit MapReduce jobs to the JobTracker, a daemon that resides on “master node” • The JobTracker assigns Map and Reduce tasks to other nodes on the cluster • These nodes each run a software daemon known as the TaskTracker • The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker • Terminology – • A job is a ‘full program’ – a complete execution of Mappers and Reducers over a dataset • A task is the execution of a single Mapper or Reducer over a slice of data • A task attempt is a particular instance of an attempt to execute a task • There will be at least as many task attempts as there are tasks • If a task attempt fails, another will be started by the JobTracker • Speculative execution can also result in more task attempts than completed tasks

Pig Latin • Client-side utility • Pig interpreter converts the pig-script to Java map-reduce jobs and submits it to JobTracker • No additional installs needed on Hadoop Cluster • Pig performance ~ 1.4x Java MapReduce jobs, but lines of code needed ~ 1/10th • Developed at Yahoo! • Data-flow oriented language • High-level language for routing data, allows easy integration of Java for complex tasks • Data-types include sets, associative arrays, tuples

Hive • SQL-based data warehousing app • Feature set is similar to Pig • Language is more strictly SQL-esque • Supports SELECT, JOIN, GROUP BY, etc. • Uses “Schema on Read” philosophy • Features for analyzing very large data sets • Partition columns • Sampling • Buckets • Requires install of metastore on Hadoop cluster • Developed at Facebook

HBase • Distributed, versioned, column-oriented store on top of HDFS • Goal - To store tables with billion rows and million columns • Provides an option of “low-latency” (OLTP) reads/writes along with support for batch-processing model of map-reduce • HBase cluster consists of a single “HBase Master” and multiple “RegionServers” • Facebook uses HBase to drive its messaging infrastructure • Stats - Chat service supports over 300 million users who send over 120 billion messages per month • Nulls are not stored by design and typical table storage looks like –

Sqoop • RDBMS to Hadoop • Command-line tool to import any JDBC supported database into Hadoop • And also export data from Hadoop to any database • Generates map-only jobs to connect to database and read/write records • DB specific connectors contributed by vendors – • Oraoop for Oracle by Quest software • Teradata connector from Teradata • Netezza connector from IBM • Developed at Cloudera • Oracle has come up with “Oracle Loader for Hadoop” and claim that it is optimized for “Oracle Database 11g”

InformaticaHParser • Graphical interface to design data transformation jobs • Converts designed DT jobs to Hadoop Map-reduce jobs • Out-of-the-box Hadoop parsing support for industry-standard formats, including Bloomberg, SWIFT, NACHA, HIPAA, HL7, ACORD, EDI X12, and EDIFACT etc.

Flume • Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced • Developed at Cloudera

Machine Learning • Apache Mahout • Scalable machine learning library most of the algorithms implemented on top Apache Hadoop using map/reduce paradigm • Supported Algorithms – • Recommendation mining - takes users’ behavior and find items said specified user might like. • Clustering - takes e.g. text documents and groups them based on related document topics. • Classification - learns from existing categorized documents what specific category documents look like and is able to assign unlabeled documents to the appropriate category. • Frequent item set mining - takes a set of item groups (e.g. terms in a query session, shopping cart content) and identifies, which individual items typically appear together. • RHadoop (from Revolution Analytics) and RHIPE (from Purdue University) allows executing R programs over Apache Hadoop

Graph Implementations Graph implementations follow the bulk-synchronous parallel model, popularized by Google’s Pregel – Giraph (submitted to Apache Incubator) GoldernOrb Apache Hama More – http://www.quora.com/What-are-some-good-MapReduce-implementations-for-graphs

Hadoop Distributions

Hadoop Variants / Flavors / Distributions • Apache Hadoop – • Completely open and up-to-date version of Hadoop • Cloudera’s distribution including Hadoop (CDH) • Open source Hadoop tools packaged with “closed” management suite (SCM) • Profits by providing support (Cost-model is per node in Cluster) & Trainings • Hortonworks Data Platform • Spun-off in 2011 from Yahoo!’s core Hadoop team • Open source Hadoop tools packaged with “open” management suite (Apache Ambari) • Profits by providing support (Cost-model is per node in Cluster) &Trainings • Signed a deal with Microsoft to develop Hadoop for Windows • MapR • Claims to have developed faster version of HDFS • MapR’s distribution powers EMC’s Greenplum products • Oracle Big Data Appliance & IBM BigInsights • Powered by CDH • More may exist……..

Hadoop - Key Contributors

References • Hadoop: The Definitive Guide • by Tom White (Cloudera Inc.) • Hadoop in Action • by Chuck Lam () • HBase: The Definitive Guide • by Lars George (Cloudera Inc.) • Mahout in Action • by Sean Owen, Robin Anil, • Ted Dunning, and Ellen Friedman • Programming Pig • by Alan Gates (Hortonworks)

Thank You .

Introduction to Hadoop