
The Elephant in the Room


hu-alvarado

Presentation Transcript


  1. The Elephant in the Room A DBA’s Guide to Hadoop & Big Data

  2. Purpose • Rosetta Stone presentation • High-level overview of Hadoop & Big Data • NOT a deep dive • NOT a demo session • Mostly theory & vocabulary • Where to learn more

  3. Caveats Focus is vendor-specific: • Hortonworks Hadoop • Microsoft SQL Server. I don’t consider myself a Hadoop expert (yet).

  4. About Me • Manage DBAs for a financial services company • Former Data Architect, DBA, developer • Linchpin People teammate • AtlantaMDF Chapter Leader • Infrequent blogger: http://codegumbo.com

  5. About You I assume that you have: • SQL experience • exposure to database admin & architecture • little to no experience with Big Data

  6. Challenges... ..for the SQL Server DBA

  7. Rapid Evolution • SQL Server new version => every 2-4 years (new functionality; deprecations) • Hadoop “official” release => every 6 months (new functionality; deprecations) • Different components on separate cycles

  8. DEVELOPERS DBAS

  9. Ecosystems, not product Open-source; vendors add enhancements. Official Hadoop is only four modules: • HDFS • Hadoop MapReduce • Hadoop YARN • Hadoop Common

  10. Hadoop Ecosystem (Hortonworks) Hortonworks

  11. “Big Data”

  12. Big Data is like teenage sex... Everyone talks about it, Nobody really knows how to do it, Everyone thinks everyone else is doing it, So everyone claims they are doing it… -Dan Ariely

  13. The Four V’s of Big Data • Volume - data is too big to scale out • Velocity - decision window is small • Variety - multiple formats challenge integration • Variability - same data, different interpretations http://goo.gl/6icouZ

  14. RDBMS versus Big Data RDBMS: • Primarily Scale-Up • Strong Typing • Normalization • Default Mutable • Mature. Big Data: • Primarily Scale-Out • Schemaless • Default Immutable • Evolving

  15. Foundations “Gentlemen, this is a football…” - Vince Lombardi

  16. Hadoop Scalable, distributed processing framework. Official Hadoop is only four modules: • HDFS • Hadoop MapReduce • Hadoop YARN • Hadoop Common

  17. HDFS Hadoop Distributed File System • Inspired by Google FileSystem (2002-2003) • Cluster storage of large files across servers • Yahoo - 10,000 core Hadoop cluster(s) • Facebook - 100 PB+ (June, 2012) http://goo.gl/SpSN

  18. HDFS

  19. HDFS • File permissions and authentication • Rack aware • fsck: find missing files or blocks • Scheduled rebalancing • Redundancy & replication • Built around MapReduce
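The redundancy and rack awareness listed on this slide can be illustrated with a small sketch. This is not HDFS code: the 128 MB block size and replication factor of 3 are assumed values (the slide gives neither), the function names and rack layout are hypothetical, and the placement rule is a simplification of the idea that copies of a block should not all sit on one rack.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (not stated on the slide)
REPLICATION = 3                 # assumed replication factor (not stated on the slide)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(block_index, racks):
    """Choose three nodes for one block: one on a "local" rack, two on another rack.

    racks maps a rack name to its node names (a hypothetical topology).
    The sketch expects at least two racks with at least two nodes each.
    """
    names = sorted(racks)
    local = names[block_index % len(names)]
    remote = names[(block_index + 1) % len(names)]
    first = racks[local][block_index % len(racks[local])]
    remote_nodes = racks[remote]
    i = block_index % len(remote_nodes)
    second = remote_nodes[i]
    third = remote_nodes[(i + 1) % len(remote_nodes)]
    return [first, second, third]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
for index, size in enumerate(split_into_blocks(300 * 1024 * 1024)):
    print(index, size, place_replicas(index, topology))
```

Losing a single node, or a whole rack, still leaves at least one copy of every block, which is the point of combining replication with rack awareness.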

  20. Hadoop MapReduce • “Developed” by Google; patent filed in 2004 (granted in 2010) • Map - filtering and sorting • Reduce - summarization • Inherently distributed
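The map and reduce roles described on this slide can be sketched in a few lines of plain Python. This is a single-process word count, not the Hadoop API; the function names are hypothetical, and the `shuffle` step stands in for the grouping and sorting that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs))


print(word_count(["big data big cluster", "data node"]))
# {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

Because each call to `map_phase` sees one line and each call to `reduce_phase` sees one key, both phases can be spread across many machines, which is what makes the model inherently distributed.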

  21. Hadoop MapReduce

  22. Hadoop YARN Yet Another Resource Negotiator • Splits resource management out of MapReduce • Allows for the use of other processing types (e.g., graph, stream)

  23. Hadoop YARN

  24. Hadoop Common Shared libraries for Hadoop components (and vendor enhancements). Security objects are best example • Superusers • Service Level Authorization • HTTP Authentication

  25. But Wait… There’s More! Hortonworks

  26. Sqoop • Data connector between RDBMS and HDFS • Command line interface • JDBC driver; BCP-like syntax • Tutorial

  27. Hive • HiveQL - SQL-like syntax • DDL scripts define tables • Query transformed into MapReduce jobs • Performance increases with scalability • Stinger initiative - Microsoft/Hortonworks

  28. Hive

  29. Hive
      create external table price_data (
        stock_exchange string, symbol string, trade_date string,
        open float, high float, low float, close float,
        volume int, adj_close float)
      row format delimited fields terminated by ','
      stored as textfile
      location '/user/hue/nyse/nyse_prices';

      select * from price_data where symbol = 'IBM';
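For readers coming from SQL Server, the key idea in the external table above is that the files stay as plain comma-delimited text and the column definitions are applied when the data is read. The Python sketch below mimics that behavior in memory. It is not Hive, the function names are hypothetical, and the two sample rows are made up purely for illustration.

```python
import csv
import io

# Column names and types copied from the price_data DDL on the slide.
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float), ("close", float),
    ("volume", int), ("adj_close", float),
]


def read_table(text):
    """Apply the schema to comma-delimited text at read time."""
    for fields in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def select_symbol(rows, symbol):
    """Rough equivalent of: select * from price_data where symbol = '<symbol>'."""
    return [row for row in rows if row["symbol"] == symbol]


# Made-up rows for illustration only; these are not real prices.
SAMPLE = (
    "NYSE,IBM,2000-01-03,100.0,110.0,95.0,105.0,1000,105.0\n"
    "NYSE,GE,2000-01-03,50.0,55.0,45.0,52.0,2000,52.0\n"
)

for row in select_symbol(read_table(SAMPLE), "IBM"):
    print(row["symbol"], row["trade_date"], row["close"])
```

One difference worth noting: this sketch raises an error on a field that does not match its declared type, whereas Hive returns NULL for that column and carries on.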

  30. Hive

  31. HCatalog • Tight integration with Hive, but supports all Hadoop data access protocols • Define relational view into data (DDL) • “Tables” can be reused by Hive, Pig, Storm... • Tutorial

  32. Pig • Data abstraction language; Yahoo (2006) • Based on Java; supports Python & Ruby • Procedural (SQL is declarative) • Allows for ETL • Lazy evaluation

  33. Pig

  34. Pig

  35. Pig ETL service; useful as “duct tape”. Typical scenario: (1) Load data into HDFS, (2) Use Pig to scrub data, (3) Pump to another “db” (e.g., MongoDB), (4) Web service reads from destination
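The load, scrub, and pump scenario above, together with Pig's lazy evaluation, can be approximated with Python generators. This is an analogy rather than Pig Latin: the function names, the two-column tab-delimited layout, and the list standing in for the destination database are all assumptions made for the sketch.

```python
def load(lines):
    """Load: split each tab-delimited line into fields (nothing runs yet)."""
    for line in lines:
        yield line.rstrip("\n").split("\t")


def scrub(records):
    """Scrub: drop records with a blank name and normalize the rest."""
    for name, amount in records:
        if name.strip():
            yield {"name": name.strip().title(), "amount": float(amount)}


def store(records, destination):
    """Store: append cleaned records to the destination; this step drives the pipeline."""
    for record in records:
        destination.append(record)
    return len(destination)


raw = ["alice\t10\n", "  \t5\n", "BOB\t2.5\n"]
pipeline = scrub(load(raw))  # only a plan so far; no line has been read
target = []                  # stand-in for the destination "db"
print(store(pipeline, target), target)
```

Building `pipeline` does no work until `store` starts pulling records through it, which is similar in spirit to Pig deferring execution until a script asks for output.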

  36. But Wait… There’s Too Much! Hortonworks

  37. Big Data Administration The possession of facts is knowledge, the use of them is wisdom. – Thomas Jefferson

  38. Big Data Use Cases Massive Size: • PB of info • Data Warehouse • Large clusters • Increased Cost. Complex Analytics: • Schemaless • Investigational • Single-node • Low Cost

  39. RDBMS [Chart: performance vs. application growth]

  40. Big Data [Chart: performance vs. application growth]

  41. [Chart: performance vs. application growth]

  42. Scale-Up Costs (SQL Server) Single Server: • Maximum RAM • SAN. Licenses: • Windows • SQL Server • Microsoft Support. Personnel: • Developers • DBA • SAN Admin • Network Admin. Facilities: • Minimum Footprint

  43. Scale-Out Costs (Hortonworks HDP) Multiple Servers: • Commodity. Licenses: • Windows ($$$) • Linux (0/Support $) • HDP Support. Personnel: • Developer • HDP Admin • Network Admin. Facilities: • Power • Space • Air

  44. Performance Tuning [Diagram: RDBMS vs. Hadoop, each labeled System and Code] Performance Tuning Tips

  45. Hadoop Ecosystem (Hortonworks) Hortonworks

  46. Performance Architecture • Nathan Marz - Twitter, Storm • Lambda Architecture
