HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Presentation Transcript


  1. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads By: Muhammad Mudassar MS-IT-8

  2. What is going on • Data analysis techniques are changing • Enterprises are moving to cheaper commodity hardware • MPP (Massively Parallel Processing) architectures inside "clouds" • Analytical data is exploding • Which technology for data analysis? • Parallel databases • MapReduce-based systems

  3. The two technologies • Parallel databases • High performance and efficiency • Score poorly on fault tolerance and on running in heterogeneous environments • Few known deployments over 100 nodes • MapReduce-based systems • Designed to scale to thousands of nodes • Fault tolerant and able to run in heterogeneous environments • Biggest issue with MapReduce is performance

  4. HadoopDB • A hybrid system to handle the demands of data-intensive applications • Advantages • Scalability of MapReduce • Performance and efficiency of parallel databases • Built entirely from open-source, free-to-use components • PostgreSQL as the database layer • Hadoop MapReduce is used • Amazon's EC2 cloud is used

  5. Desired Properties • Performance • A primary characteristic that commercial database systems use to distinguish themselves • Fault tolerance • Measured differently for analytical DBMSs and transactional DBMSs • For an analytical DBMS, query restarts are to be avoided • Ability to run in a heterogeneous environment • Nearly impossible to get homogeneous performance from 100 or 1,000 nodes • Flexible query interface • Allows users to write user-defined functions (UDFs) and queries that are parallelized automatically

  6. Architecture of HadoopDB

  7. The Hadoop framework • Hadoop consists of 2 layers • Data storage layer: the Hadoop Distributed File System (HDFS) • Data processing layer: the MapReduce framework • HDFS • Block-structured file system managed by a NameNode • Data handled by DataNodes • MapReduce framework • Master-slave architecture based on JobTracker & TaskTrackers • JobTracker handles job assignment, job tracking, and load balancing • TaskTrackers perform the Map or Reduce tasks assigned to them
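
The MapReduce layer on this slide can be illustrated with a small single-process sketch. It is not Hadoop code: each inner list plays the role of one HDFS block handled by one Map task, and all function names (`run_job`, `shuffle`, `word_count_map`, `word_count_reduce`) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict


def word_count_map(line):
    """Map function: emit one (word, 1) pair per word in the input line."""
    return [(word, 1) for word in line.split()]


def word_count_reduce(key, values):
    """Reduce function: sum all counts collected for one word."""
    return sum(values)


def shuffle(pairs):
    """Group the emitted values by key, as happens between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(blocks, map_fn, reduce_fn):
    """Run one Map task per block, then shuffle, then reduce each key group."""
    pairs = []
    for block in blocks:          # in Hadoop, a TaskTracker would run each of these
        for record in block:
            pairs.extend(map_fn(record))
    groups = shuffle(pairs)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


if __name__ == "__main__":
    blocks = [["a b a"], ["b c"]]  # two "blocks" with one line each
    print(run_job(blocks, word_count_map, word_count_reduce))
```

In a real cluster the JobTracker decides which TaskTracker runs each Map or Reduce task; here the loop in `run_job` simply runs them one after another.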

  8. The HadoopDB’s components • HadoopDB extends the Hadoop framework with four components • Database connector • Interface between the DBMS and the TaskTrackers • A database is treated as a data source, similar to data blocks in HDFS • Catalog • Maintains information about the databases • Database location, driver class, and metadata such as replica locations and partitioning properties • Data Loader • Globally partitions the data on a given key • Breaks single-node data into chunks • Loads the chunks into the database
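
The first two Data Loader steps can be sketched as follows. This is a minimal illustration under stated assumptions, not HadoopDB's loader: the hash function (`zlib.crc32`), the fixed-size chunking rule, and the names `node_for_key`, `global_partition`, `split_into_chunks`, and `plan_load` are all choices made for this example. The final step, bulk-loading each chunk into the node's database, is omitted.

```python
import zlib


def node_for_key(key, num_nodes):
    """Pick a node for a partitioning key.

    crc32 is used because it is deterministic across runs, unlike the
    built-in hash() for strings (an assumption of this sketch).
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def global_partition(rows, key_fn, num_nodes):
    """Step 1: route every row to a node based on its partitioning key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[node_for_key(key_fn(row), num_nodes)].append(row)
    return partitions


def split_into_chunks(rows, chunk_size):
    """Step 2: break one node's rows into chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def plan_load(rows, key_fn, num_nodes, chunk_size):
    """Return {node index: list of chunks} for the whole data set."""
    partitions = global_partition(rows, key_fn, num_nodes)
    return {node: split_into_chunks(part, chunk_size)
            for node, part in enumerate(partitions)}


if __name__ == "__main__":
    rows = [{"id": i, "value": i * 2} for i in range(20)]
    plan = plan_load(rows, key_fn=lambda r: r["id"], num_nodes=3, chunk_size=4)
    for node, chunks in plan.items():
        print(node, [len(c) for c in chunks])
```

Because rows with the same key always land on the same node, operations keyed on that attribute can be answered without moving data between nodes.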

  9. The HadoopDB’s components • SQL to MapReduce to SQL (SMS) Planner • HadoopDB provides a front end to process SQL queries • The SMS planner extends Hive • Parser transforms the query into an abstract syntax tree • Gets table schema information from the catalog • Logical plan generator creates a query plan • Optimizer breaks up the plan into Map and Reduce phases • Executable plan is generated for one or more MapReduce jobs • SMS tries to push as much work as possible into the database layer
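
The idea of pushing work into the database layer can be sketched for a simple GROUP BY aggregate. This is an illustration under assumptions, not output of the SMS planner: Python's built-in `sqlite3` stands in for the per-node PostgreSQL instances, and the `sales` table, its columns, and the function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# SQL that each Map task hands to its local database (pushed-down work).
LOCAL_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region"


def make_node_db(rows):
    """Create an in-memory database holding one node's partition of the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Map phase: run the local SQL and emit (region, partial_sum) pairs."""
    return conn.execute(LOCAL_SQL).fetchall()


def reduce_task(partials):
    """Reduce phase: merge the partial sums from all nodes per region."""
    totals = defaultdict(int)
    for region, subtotal in partials:
        totals[region] += subtotal
    return dict(totals)


def run_query(node_dbs):
    """Run one Map task per node database, then a single Reduce step."""
    partials = []
    for conn in node_dbs:
        partials.extend(map_task(conn))
    return reduce_task(partials)


if __name__ == "__main__":
    nodes = [
        make_node_db([("east", 10), ("west", 5)]),
        make_node_db([("east", 7)]),
    ]
    print(run_query(nodes))
```

Each database does the scan and local aggregation, so only one small row per region and node crosses the network; the Reduce step just combines the partial sums.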

  10. Evaluating HadoopDB • Compare HadoopDB to • Hadoop • Parallel databases (Vertica, DBMS-X) • Features • Performance: HadoopDB is expected to approach the performance of parallel databases • Scalability: HadoopDB is expected to scale

  11. Data Load

  12. Query Results

  13. Scalability • HadoopDB and Hadoop take advantage of run-time scheduling by splitting the data • Parallel databases restart the entire query on node failure or wait for the slowest node
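
The effect of a slow node can be illustrated with a toy simulation. It is a simplified model, not the scheduler of Hadoop or of any parallel database: every task costs one unit of work, node speeds are fixed, and the names `static_makespan` and `dynamic_makespan` are hypothetical. The static case models a plan fixed in advance that must wait for the slowest node; the dynamic case models run-time scheduling, where an idle node pulls the next task.

```python
import heapq


def static_makespan(num_tasks, speeds):
    """Each node gets an equal share up front; the job ends with the slowest node."""
    share = num_tasks / len(speeds)
    return max(share / speed for speed in speeds)


def dynamic_makespan(num_tasks, speeds):
    """Whichever node becomes idle first takes the next task."""
    # Heap of (time at which the node becomes free, node index).
    free_at = [(0.0, node) for node in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        now, node = heapq.heappop(free_at)
        done = now + 1.0 / speeds[node]   # one unit of work at this node's speed
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish


if __name__ == "__main__":
    speeds = [1.0, 1.0, 1.0, 0.25]        # the last node runs at quarter speed
    print("static :", static_makespan(40, speeds))
    print("dynamic:", dynamic_makespan(40, speeds))
```

With many small tasks the slow node simply ends up doing fewer of them, which is why splitting the data finely matters in this model.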

  14. Conclusion • HadoopDB • Is a hybrid system • Scales better than parallel databases • Is fault tolerant • Approaches the performance of parallel databases • Is free and open source
