
Introduction to Hadoop

Introduction to Hadoop: Capabilities, Accelerators and Solutions. Big Data. Reference: MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113.

tarala

Presentation Transcript


  1. Introduction to Hadoop: Capabilities, Accelerators and Solutions

  2. Big Data
     • Google processed over 400 PB of data on datacenters composed of thousands of machines in September 2007 alone ***
     • Today, every organization has its own big data problem, and most are using Hadoop to solve it.
     *** MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113.

  3. Where is Big Data?
     Big Data Has Reached Every Market Sector
     Source: McKinsey & Company report, “Big data: The next frontier for innovation, competition, and productivity”, May 2011.

  4. Big Data Value Creation Opportunities

  5. What is Hadoop?
     • Hadoop is an open-source project overseen by the Apache Software Foundation
     • Originally based on papers published by Google in 2003 and 2004
     • Hadoop is an ecosystem, not a single product
     • Hadoop committers work at several different organizations
       • Including Facebook, Yahoo!, Twitter, Cloudera, Hortonworks

  6. Hadoop - Inspiration
     You Say, “tomato…”
     Google was awarded a patent for “map reduce – a system for large scale data processing” in 2010, but blessed Apache Hadoop by granting it a license.

  7. Hadoop Timeline
     • Spun out of the Nutch project by Doug Cutting, who joined Yahoo! in early 2006
     • Hadoop 2.x, first released in 2012, is the basis for all current, stable Hadoop distributions
       • Apache Hadoop 2.0.xx
       • CDH4.*
       • HDP2.*

  8. Typical Data Strategy

  9. How does Hadoop fit in?
     Hadoop can complement the existing DW environment as well as replace some of the components in a traditional data strategy.

  10. How does Hadoop fit in?
      • Storage
        • HDFS – It’s a file system, not a DBMS
        • HBase - Columnar storage that serves low-latency read / write requests
      • Extract / Load
        • Source / Target is RDBMS - Sqoop, hiho
        • Stream processing - Flume, Scribe, Chukwa, S4, Storm
      • Transformation
        • Map-reduce (Java or any other language), Pig, Hive, Oozie etc.
        • Talend and Informatica have built products to abstract the complexity of map-reduce
      • Analytics
        • RHadoop, Mahout
      • BI – All existing players are coming up with Hadoop connectors

  11. Hadoop Ecosystem

  12. Hadoop Ecosystem Continued…

  13. Map-reduce – Programming model
      • Single map task and a single reduce task (diagram)
      • Multiple map tasks with a single reduce task (diagram)

  14. Map-reduce – Programming model
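The programming model on the two slides above can be illustrated with a small in-memory word count. This is only a sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and each inner list stands in for one input split handled by one map task, with a single reduce step at the end.

```python
from collections import defaultdict


def map_phase(split):
    """One map task: emit a (word, 1) pair for every word in its split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all map tasks by key."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """A single reduce task: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    """Run one map task per split, then shuffle and reduce once."""
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Two splits, so two (simulated) map tasks feed one reduce task.
    splits = [["big data big clusters"], ["data on clusters"]]
    print(word_count(splits))
```

Passing a single split corresponds to the "single map task" picture; passing several splits corresponds to "multiple map tasks with a single reduce task".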

  15. Hadoop Map Reduce
      • What happens during a Map-reduce job’s lifetime?
        • Clients submit MapReduce jobs to the JobTracker, a daemon that resides on the “master node”
        • The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
        • These nodes each run a software daemon known as the TaskTracker
        • The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
      • Terminology –
        • A job is a ‘full program’ – a complete execution of Mappers and Reducers over a dataset
        • A task is the execution of a single Mapper or Reducer over a slice of data
        • A task attempt is a particular instance of an attempt to execute a task
      • There will be at least as many task attempts as there are tasks
        • If a task attempt fails, another will be started by the JobTracker
        • Speculative execution can also result in more task attempts than completed tasks
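The relationship between tasks and task attempts can be sketched with a toy scheduler. This is a simplified illustration, not the JobTracker's actual logic: `run_job`, the `failing_attempts` set used to inject failures, and the retry limit of 4 are all assumptions made for the example, and speculative execution is not modelled.

```python
def run_job(task_ids, failing_attempts, max_attempts=4):
    """Run every task, starting a new attempt whenever one fails.

    failing_attempts is a set of (task_id, attempt_number) pairs that
    should fail; it exists only to make the simulation deterministic.
    Returns the log of all attempts as (task_id, attempt_number, succeeded).
    """
    attempts = []
    for task_id in task_ids:
        for attempt_no in range(max_attempts):
            succeeded = (task_id, attempt_no) not in failing_attempts
            attempts.append((task_id, attempt_no, succeeded))
            if succeeded:
                break  # the task is complete, no further attempts needed
        else:
            # Every allowed attempt failed, so the whole job fails.
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return attempts


if __name__ == "__main__":
    # Three tasks; the first attempt of map task m1 fails and is retried.
    log = run_job(["m0", "m1", "r0"], failing_attempts={("m1", 0)})
    for entry in log:
        print(entry)
```

With three tasks and one injected failure, the log holds four attempts, which matches the slide's point that there are at least as many task attempts as tasks.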

  16. Pig Latin
      • Client-side utility
      • Pig interpreter converts the Pig script to Java map-reduce jobs and submits them to the JobTracker
      • No additional installs needed on Hadoop Cluster
      • Pig jobs take roughly 1.4x as long to run as hand-written Java MapReduce jobs, but need only about 1/10th the lines of code
      • Developed at Yahoo!
      • Data-flow oriented language
      • High-level language for routing data, allows easy integration of Java for complex tasks
      • Data types include bags (collections of tuples), maps (associative arrays), and tuples
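To give a feel for the data-flow style described above, here is a rough Python analogue of a load, filter, group, count pipeline. It is not Pig Latin and does not use any Pig API: the helper names (`load`, `filter_rows`, `group_by`, `foreach_count`) and the sample records are hypothetical, chosen only to mirror the shape of such a script, where each step consumes the relation produced by the previous one.

```python
from collections import defaultdict


def load(lines, sep=","):
    """Parse delimited text lines into tuples (a relation)."""
    return [tuple(line.split(sep)) for line in lines]


def filter_rows(rows, predicate):
    """Keep only the tuples for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key_index):
    """Collect tuples sharing a key into one bag per key."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return dict(bags)


def foreach_count(groups):
    """Produce one output value per group: the size of its bag."""
    return {key: len(bag) for key, bag in groups.items()}


if __name__ == "__main__":
    raw = ["alice,home", "bob,cart", "alice,cart", "bob,home", "alice,home"]
    visits = load(raw)                                  # (user, page) tuples
    home_visits = filter_rows(visits, lambda r: r[1] == "home")
    by_user = group_by(home_visits, key_index=0)
    print(foreach_count(by_user))
```

The tuple and bag terminology in the comments follows the data types listed on the slide.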

  17. Hive
      • SQL-based data warehousing app
      • Feature set is similar to Pig
      • Language is more strictly SQL-esque
        • Supports SELECT, JOIN, GROUP BY, etc.
      • Uses “Schema on Read” philosophy
      • Features for analyzing very large data sets
        • Partition columns
        • Sampling
        • Buckets
      • Requires install of metastore on Hadoop cluster
      • Developed at Facebook
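"Schema on read" and partition columns can be sketched in a few lines of Python. This is an illustration of the idea only, not Hive's implementation: `RAW_FILES`, `SCHEMA`, `read_with_schema`, and `scan` are hypothetical names, and the choice to turn an unparseable field into `None` (standing in for NULL) is an assumption of the sketch.

```python
# Raw text lines are stored as-is, keyed by a partition value (here a date).
RAW_FILES = {
    "dt=2012-01-01": ["1\talice\t30", "2\tbob\tnot_a_number"],
    "dt=2012-01-02": ["3\tcarol\t25"],
}

# The schema lives apart from the data and is applied only when reading.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(line, schema):
    """Interpret one raw line using the schema at read time."""
    row = {}
    for (name, cast), raw_value in zip(schema, line.split("\t")):
        try:
            row[name] = cast(raw_value)
        except ValueError:
            row[name] = None  # a bad field becomes NULL instead of failing
    return row


def scan(files, schema, partition=None):
    """Yield rows, skipping whole partitions that the query does not need."""
    for partition_name, lines in files.items():
        if partition is not None and partition_name != partition:
            continue  # partition pruning: this data is never read
        for line in lines:
            yield read_with_schema(line, schema)


if __name__ == "__main__":
    for row in scan(RAW_FILES, SCHEMA, partition="dt=2012-01-02"):
        print(row)
```

Nothing is validated when the data is written; a malformed value only surfaces when a query reads it, which is the trade-off the "Schema on Read" bullet refers to.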

  18. HBase
      • Distributed, versioned, column-oriented store on top of HDFS
      • Goal - To store tables with billions of rows and millions of columns
      • Provides an option of “low-latency” (OLTP) reads/writes along with support for the batch-processing model of map-reduce
      • HBase cluster consists of a single “HBase Master” and multiple “RegionServers”
      • Facebook uses HBase to drive its messaging infrastructure
        • Stats - Chat service supports over 300 million users who send over 120 billion messages per month
      • Nulls are not stored by design; typical table storage is shown on the slide (illustration not included in this transcript)
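The "versioned" and "nulls are not stored" points can be illustrated with a small in-memory model. This is a conceptual sketch, not the HBase client API: the `SparseTable` class, its methods, and the `family:qualifier` column strings are assumptions made for the example.

```python
class SparseTable:
    """Toy model of a sparse, versioned table: row -> column -> timestamp -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value, timestamp):
        """Store one cell version; columns a row never wrote simply do not exist."""
        self._rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get(self, row_key, column, timestamp=None):
        """Return the newest version at or before timestamp, or None if absent."""
        versions = self._rows.get(row_key, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def stored_cells(self):
        """Count the cell versions that physically exist in the table."""
        return sum(len(versions)
                   for columns in self._rows.values()
                   for versions in columns.values())


if __name__ == "__main__":
    table = SparseTable()
    table.put("user1", "info:name", "Ann", timestamp=1)
    table.put("user1", "info:name", "Anna", timestamp=2)
    table.put("user2", "info:email", "b@example.com", timestamp=1)
    print(table.get("user1", "info:name"))       # latest version
    print(table.get("user1", "info:name", 1))    # version as of timestamp 1
    print(table.stored_cells())
```

Here `user2` has no `info:name` cell at all, so nothing is stored for it; a lookup just finds no entry.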

  19. Sqoop
      • RDBMS to Hadoop
      • Command-line tool to import data from any JDBC-supported database into Hadoop
      • And also export data from Hadoop to any database
      • Generates map-only jobs to connect to the database and read/write records
      • DB-specific connectors contributed by vendors –
        • Oraoop for Oracle by Quest Software
        • Teradata connector from Teradata
        • Netezza connector from IBM
      • Developed at Cloudera
      • Oracle has come up with “Oracle Loader for Hadoop” and claims that it is optimized for “Oracle Database 11g”
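One way to picture how a map-only import can be parallelised is to divide a numeric key range among the mappers, so each one reads its own slice of the table. The sketch below assumes that approach and is not Sqoop's code: `split_ranges`, `split_queries`, and the `orders` / `order_id` names are hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous slices."""
    if max_id < min_id:
        raise ValueError("max_id must be >= min_id")
    total = max_id - min_id + 1
    num_mappers = max(1, min(num_mappers, total))  # never more mappers than rows
    base, extra = divmod(total, num_mappers)
    ranges, low = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((low, low + size - 1))
        low += size
    return ranges


def split_queries(table, column, ranges):
    """Build the query each map task would run for its slice."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {low} AND {column} <= {high}"
        for low, high in ranges
    ]


if __name__ == "__main__":
    for query in split_queries("orders", "order_id", split_ranges(1, 10, 4)):
        print(query)
```

Because the slices do not overlap and together cover the whole range, no reduce step is needed to combine them, which fits the "map-only jobs" bullet.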

  20. Informatica HParser
      • Graphical interface to design data transformation jobs
      • Converts designed DT jobs to Hadoop Map-reduce jobs
      • Out-of-the-box Hadoop parsing support for industry-standard formats, including Bloomberg, SWIFT, NACHA, HIPAA, HL7, ACORD, EDI X12, and EDIFACT

  21. Flume
      • Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced
      • Developed at Cloudera

  22. Machine Learning
      • Apache Mahout
        • Scalable machine learning library; most of its algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm
        • Supported Algorithms –
          • Recommendation mining - takes users’ behavior and from it tries to find items a specified user might like.
          • Clustering - takes e.g. text documents and groups them based on related document topics.
          • Classification - learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the appropriate category.
          • Frequent item set mining - takes a set of item groups (e.g. terms in a query session, shopping cart content) and identifies which individual items typically appear together.
      • RHadoop (from Revolution Analytics) and RHIPE (from Purdue University) allow executing R programs over Apache Hadoop
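Frequent item set mining, as described above, can be illustrated with a deliberately simple pair counter. This is not Mahout's API or its algorithm: it is plain in-memory co-occurrence counting restricted to pairs, and `frequent_pairs`, `min_support`, and the shopping baskets are made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    baskets = [
        ["milk", "bread", "eggs"],
        ["bread", "milk"],
        ["eggs", "milk"],
        ["bread"],
    ]
    print(frequent_pairs(baskets, min_support=2))
```

The counting step is naturally parallel: each basket can be processed independently and the per-pair counts summed afterwards, which is why this kind of problem maps well onto map/reduce.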

  23. Graph Implementations
      Graph implementations follow the bulk-synchronous parallel model, popularized by Google’s Pregel –
      • Giraph (submitted to Apache Incubator)
      • GoldenOrb
      • Apache Hama
      • More – http://www.quora.com/What-are-some-good-MapReduce-implementations-for-graphs
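The bulk-synchronous parallel model can be sketched as a loop of supersteps in which every vertex processes the messages sent to it in the previous step, and all vertices wait at a barrier before the next step begins. The example below propagates the maximum value through a graph in that style. It is a single-process illustration, not the API of Giraph, GoldenOrb, or Hama; `propagate_max` and the sample graph are hypothetical, and every neighbour is assumed to appear as a key in `graph`.

```python
def propagate_max(graph, values):
    """Spread the largest vertex value through the graph, superstep by superstep.

    graph maps each vertex to the list of vertices it sends messages to.
    Returns (final_values, number_of_supersteps).
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1

    # Later supersteps: a vertex only sends again if its value changed.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex in graph:
            if inbox[vertex] and max(inbox[vertex]) > values[vertex]:
                values[vertex] = max(inbox[vertex])
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(values[vertex])
        inbox = outbox  # barrier: messages are delivered in the next superstep
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    final_values, steps = propagate_max(chain, {"a": 3, "b": 6, "c": 1})
    print(final_values, steps)
```

The computation stops once a superstep produces no messages, meaning no vertex changed its value.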

  24. Hadoop Distributions

  25. Hadoop Variants / Flavors / Distributions
      • Apache Hadoop
        • Completely open and up-to-date version of Hadoop
      • Cloudera’s distribution including Hadoop (CDH)
        • Open source Hadoop tools packaged with “closed” management suite (SCM)
        • Profits by providing support (cost model is per node in cluster) and training
      • Hortonworks Data Platform
        • Spun off in 2011 from Yahoo!’s core Hadoop team
        • Open source Hadoop tools packaged with “open” management suite (Apache Ambari)
        • Profits by providing support (cost model is per node in cluster) and training
        • Signed a deal with Microsoft to develop Hadoop for Windows
      • MapR
        • Claims to have developed a faster version of HDFS
        • MapR’s distribution powers EMC’s Greenplum products
      • Oracle Big Data Appliance & IBM BigInsights
        • Powered by CDH
      • More may exist…

  26. Hadoop - Key Contributors

  27. Hadoop - Key Contributors

  28. Hadoop - Key Contributors

  29. References
      • Hadoop: The Definitive Guide, by Tom White (Cloudera Inc.)
      • Hadoop in Action, by Chuck Lam
      • HBase: The Definitive Guide, by Lars George (Cloudera Inc.)
      • Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman
      • Programming Pig, by Alan Gates (Hortonworks)

  30. Thank You.
