
Spark and Shark


Presentation Transcript


  1. Spark and Shark High-Speed In-Memory Analytics over Hadoop and Hive Data Reynold Xin 辛湜 (shi2) UC Berkeley

  2. Prof. Michael Franklin

  3. My Background • PhD student in the AMP Lab at UC Berkeley • 50-person lab focusing on big data • Works on the Berkeley Data Analytics Stack (BDAS) • Work/intern experience at Google Research, IBM DB2, Altera

  4. AMPLab @ Berkeley • NSF & DARPA funded • Sponsors: Amazon Web Services / Google / SAP • Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies) • Collaboration across databases (Mike Franklin, sitting right here), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), and machine learning (Michael Jordan).

  5. What is Spark? • Not a modified version of Hadoop • Separate, fast, MapReduce-like engine • In-memory data storage for very fast iterative queries • General execution graphs and powerful optimizations • Up to 100x faster than Hadoop MapReduce • Compatible with Hadoop’s storage APIs • Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc
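  As a rough illustration of the Hadoop-compatible storage APIs, here is a minimal Scala sketch (the master URL, paths, and application name are placeholders; the spark.* package names are from the pre-Apache releases of that era, later renamed org.apache.spark.*):

  // Minimal sketch: reading Hadoop-supported storage into Spark.
  import spark.SparkContext
  import spark.SparkContext._

  val sc = new SparkContext("spark://master:7077", "ExampleApp")

  // Text files in HDFS become an RDD of lines.
  val lines = sc.textFile("hdfs://namenode:9000/logs/part-00000")

  // SequenceFiles are read through the same Hadoop input formats.
  val pairs = sc.sequenceFile[String, Int]("hdfs://namenode:9000/counts")

  // Mark the RDD for in-memory reuse across queries.
  val cached = lines.cache()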

  6. What is Shark? • A SQL analytics engine built on top of Spark • Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc) • Similar speedups of up to 100x

  7. Project History • Spark project started in 2009, open sourced 2010 • Shark started summer 2011, open sourced 2012 • Spark 0.7 released Feb 2013 • Streaming alpha in Spark 0.7 • GraphLab support soon

  8. Adoption • In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease) • 600+ member meetup, 800+ watchers on GitHub • 30+ contributors (including Intel Shanghai) • AMP Camp: 150 attendees on site, 5,000 online • AMP Camp on Mar 15 at ECNU, Shanghai

  9. This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark

  10. Why go Beyond MapReduce? • MapReduce simplified big data analysis by providing a reliable programming model for large clusters • But as soon as it got popular, users wanted more: • More complex, multi-stage applications • More interactive ad-hoc queries

  11. Why go Beyond MapReduce? • Complex jobs and interactive queries both need one thing that MapReduce lacks: • Efficient primitives for data sharing (diagram: a multi-stage iterative algorithm and an interactive data-mining session issuing queries 1-3, both needing to share data between steps)

  12. Why go Beyond MapReduce? (build of the previous slide, adding the callout:) In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS) -> slow!

  13. Examples (diagram: an iterative job performs an HDFS read and an HDFS write around every iteration, and an interactive session re-reads the input from HDFS for each of queries 1-3) I/O and serialization can take 90% of the time

  14. Goal: In-Memory Data Sharing (diagram: after one-time processing of the input into distributed memory, iterations and queries read directly from memory) 10-100× faster than network and disk

  15. Solution: Resilient Distributed Datasets (RDDs) • Distributed collections of objects that can be stored in memory for fast reuse • Automatically recover lost data on failure • Support a wide range of applications

  16. Programming Model Resilient distributed datasets (RDDs) • Immutable, partitioned collections of objects • Can be cached in memory for efficient reuse • Transformations (e.g. map, filter, groupBy, join) build RDDs from other RDDs • Actions (e.g. count, collect, save) return a result or write it to storage
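  A minimal sketch of the transformation/action split (assuming a SparkContext named sc; the paths are placeholders):

  // Transformations are lazy: they only describe how to build new RDDs.
  val words  = sc.textFile("hdfs://...").flatMap(_.split(" "))
  val longer = words.filter(_.length > 3)
  val cached = longer.cache()               // mark for in-memory reuse

  // Actions trigger execution and return a result (or write to storage).
  val total  = cached.count()               // number of long words
  val sample = cached.take(10)              // a few elements back to the driver
  cached.saveAsTextFile("hdfs://.../out")   // write back to HDFS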

  17. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
  lines = spark.textFile("hdfs://...")              // base RDD
  errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()
  cachedMsgs.filter(_.contains("foo")).count        // action
  cachedMsgs.filter(_.contains("bar")).count
  (diagram: the driver sends tasks to workers; each worker reads one block of the file, keeps its filtered messages in a local cache, and returns results to the driver)
  Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data). Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data).

  18. (Hadoop MapReduce word count, for comparison with the next slide)
  public static class WordCountMapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class WordCountReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  19. Word Count
  • myrdd.flatMap { doc => doc.split("\\s") }
         .map { word => (word, 1) }
         .reduceByKey { case (v1, v2) => v1 + v2 }
  • Or, more concisely:
    myrdd.flatMap(_.split("\\s"))
         .map((_, 1))
         .reduceByKey(_ + _)

  20. Fault Tolerance RDDs track the series of transformations used to build them (their lineage) to recompute lost data. E.g.:
  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
  Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
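  A small sketch of how lineage can be seen from user code (toDebugString exists on RDDs in modern Spark; its availability in the 0.7-era API is an assumption here):

  val messages = sc.textFile("hdfs://...")
                   .filter(_.contains("error"))
                   .map(_.split('\t')(2))
  // Prints the chain of parent RDDs, which is exactly what Spark replays
  // on lost partitions after a failure.
  println(messages.toDebugString)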

  21. Fault Recovery Results (chart: per-iteration running times, marking the iteration in which a failure happens)

  22. Tradeoff Space (chart: granularity of updates [fine to coarse] on one axis vs. write throughput [low to high] on the other; transactional workloads such as K-V stores, databases and RAMCloud sit at fine granularity, while batch analytics over HDFS and RDDs sit at coarse granularity; network and memory bandwidth mark the throughput bounds)

  23. Behavior with Not Enough RAM

  24. Example: Logistic Regression
  // Load data in memory once
  val data = spark.textFile(...).map(readPoint).cache()
  // Initial parameter vector
  var w = Vector.random(D)
  // Repeated MapReduce steps to do gradient descent
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }
  println("Final w: " + w)
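  The slide's code relies on helpers it does not show: readPoint, the Point fields x and y, and a small vector type with dot products and arithmetic (early Spark releases shipped a utility Vector class for this). Below is a minimal, self-contained sketch of illustrative stand-ins; every name and the input format here are assumptions, not part of the talk:

  import scala.math.exp

  // Illustrative stand-in for the vector type the slide assumes.
  case class Vec(v: Array[Double]) {
    def dot(o: Vec): Double = v.zip(o.v).map { case (a, b) => a * b }.sum
    def *(s: Double): Vec   = Vec(v.map(_ * s))
    def +(o: Vec): Vec      = Vec(v.zip(o.v).map { case (a, b) => a + b })
    def -(o: Vec): Vec      = Vec(v.zip(o.v).map { case (a, b) => a - b })
  }
  object Vec {
    def random(d: Int): Vec = Vec(Array.fill(d)(2 * scala.util.Random.nextDouble - 1))
  }
  // Lets the slide's "scalar * vector" expression type-check.
  implicit class Scalar(s: Double) { def *(vec: Vec): Vec = vec * s }

  case class Point(x: Vec, y: Double)

  // Parse a line of the form "label f1 f2 ... fD" (format is assumed).
  def readPoint(line: String): Point = {
    val nums = line.split(" ").map(_.toDouble)
    Point(Vec(nums.tail), nums.head)
  }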

  25. Logistic Regression Performance (chart: Hadoop takes about 110 s per iteration; Spark takes 80 s for the first iteration and about 1 s for further iterations once the data is cached)

  26. Supported Operators • map • filter • groupBy • sort • join • leftOuterJoin • rightOuterJoin • reduce • count • reduceByKey • groupByKey • first • union • cross • sample • cogroup • take • partitionBy • pipe • save • ...
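  A few of these combined on pair RDDs (a sketch; the data is made up, and the pair-RDD operations assume the implicit conversions from SparkContext._ are in scope):

  // Hypothetical pair RDDs for illustration.
  val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
  val prices = sc.parallelize(Seq(("apples", 1.0), ("pears", 2.5)))

  val totals  = sales.reduceByKey(_ + _)              // ("apples", 8), ("pears", 2)
  val joined  = totals.join(prices)                   // ("apples", (8, 1.0)), ...
  val revenue = joined.map { case (k, (n, p)) => (k, n * p) }
  revenue.collect().foreach(println)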

  27. Scheduler • Dryad-like task DAG • Pipelines functions within a stage • Cache-aware data locality & reuse • Partitioning-aware to avoid shuffles (see the sketch below) (diagram: an example DAG over RDDs A-G cut into three stages around groupBy, map, join and union; previously computed partitions are shaded and skipped)
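  A hedged sketch of what partitioning-awareness buys: if two pair RDDs are hash-partitioned the same way and one is cached, a join between them needs no extra shuffle. The RDD names are hypothetical, and the HashPartitioner import path is the pre-Apache one (org.apache.spark.HashPartitioner in later releases):

  import spark.HashPartitioner

  // Partition both pair RDDs identically; cache the one that is reused.
  val part   = new HashPartitioner(8)
  val visits = rawVisits.partitionBy(part).cache()   // (userId, url)
  val users  = rawUsers.partitionBy(part)            // (userId, profile)

  // Because the partitionings match, each output partition of the join is
  // built from co-located input partitions, with no shuffle in between.
  val joined = visits.join(users)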

  28. Implementation • Uses Mesos / YARN to share resources with Hadoop & other frameworks • Can access any Hadoop input source (HDFS, S3, …) • Core engine is only ~20k lines of code (diagram: Spark, Hadoop, MPI and other frameworks running over Mesos or YARN across the cluster's nodes)

  29. User Applications • In-memory analytics & anomaly detection (Conviva) • Interactive queries on data streams (Quantifind) • Exploratory log analysis (Foursquare) • Traffic estimation w/ GPS data (Mobile Millennium) • Twitter spam classification (Monarch) • ...

  30. Conviva GeoReport • Group aggregations on many keys w/ the same filter • 40× gain over Hive from avoiding repeated reading, deserialization and filtering (chart: running time in hours, Hive vs. Spark; see the sketch below)
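  A rough sketch of the pattern behind that gain: apply the shared filter once, cache the result, and run the per-key aggregations against the cached data. The record fields and aggregations here are made up for illustration:

  // Hypothetical RDD of log records with date, country, city and a metric.
  val filtered = logs.filter(r => r.date == targetDate).cache()

  // Many aggregations reuse the same filtered, in-memory data set:
  val byCountry = filtered.map(r => (r.country, r.metric)).reduceByKey(_ + _)
  val byCity    = filtered.map(r => (r.city, r.metric)).reduceByKey(_ + _)
  // ...further groupings read the cached records instead of re-reading,
  // re-deserializing and re-filtering the raw input each time.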

  31. Mobile Millennium Project • Estimate city traffic from crowdsourced GPS data • Iterative EM algorithm scaling to 160 nodes • Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

  32. This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark

  33. Challenges • Data volumes are expanding. • Faults and stragglers complicate parallel database design. • Complexity of analysis: machine learning, graph algorithms, etc. • Low latency and interactivity.

  34. MPP Databases • Vertica, SAP HANA, Teradata, Google Dremel... • Fast! • Generally not fault-tolerant; long-running queries become challenging as clusters scale up. • Lack rich analytics such as machine learning and graph algorithms.

  35. MapReduce • Apache Hive, Google Tenzing, Turn Cheetah … • Deterministic, idempotent tasks enable fine-grained fault tolerance and resource sharing. • Expressive enough for machine learning algorithms. • High latency; often dismissed for interactive workloads.

  36. Shark • A data warehouse that • builds on Spark, • scales out and is fault-tolerant, • supports low-latency, interactive queries through in-memory computation, • supports both SQL and complex analytics, • is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).

  37. Hive Architecture (diagram: clients [CLI, JDBC] talk to a Driver containing the SQL Parser, Query Optimizer and Physical Plan Execution, backed by a Metastore; queries execute as MapReduce jobs over HDFS)

  38. Shark Architecture (diagram: the same client, Metastore and Driver structure as Hive, with a Cache Manager added and Spark replacing MapReduce as the execution engine over HDFS)

  39. Engine Features • Columnar Memory Store • Machine Learning Integration • Partial DAG Execution • Hash-based Shuffle vs Sort-based Shuffle • Data Co-partitioning • Partition Pruning based on Range Statistics • ...

  40. Efficient In-Memory Storage • Simply caching Hive records as Java objects is inefficient due to high per-object overhead • Instead, Shark employs column-oriented storage using arrays of primitive types. Row storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4). Column storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4].

  41. Efficient In-Memory Storage (build of the previous slide, adding:) Benefit: similarly compact size to serialized data, but >5x faster to access.
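  A minimal sketch of the idea (these are not Shark's actual classes): keeping each column in a primitive array avoids one Java object per field and keeps values of the same type contiguous.

  // Row-oriented: one object per record, with per-object and boxing overhead.
  case class UserRow(id: Int, name: String, score: Double)
  val rows = Array(UserRow(1, "john", 4.1), UserRow(2, "mike", 3.5), UserRow(3, "sally", 6.4))

  // Column-oriented: one array per column for a whole partition.
  case class UserColumns(ids: Array[Int], names: Array[String], scores: Array[Double])
  val cols = UserColumns(Array(1, 2, 3), Array("john", "mike", "sally"), Array(4.1, 3.5, 6.4))

  // Scanning a single column touches only that array (no boxing, cache-friendly):
  val avgScore = cols.scores.sum / cols.scores.length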

  42. Machine Learning Integration • Unified system for query processing and machine learning • Query processing and ML share the same set of workers and caches
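  A rough sketch of the idea in plain Spark terms (Shark's actual API for this is not shown in the talk; the data layout and the "learning" step below are made up): one cached RDD backs both a query-style aggregation and an iterative pass, on the same workers and caches.

  // One cached data set of (label, value) records (hypothetical CSV layout).
  val records = sc.textFile("hdfs://.../data")
                  .map { line => val f = line.split(","); (f(0), f(1).toDouble) }
                  .cache()

  // "Query" side: an aggregation over the cached data.
  val countPerLabel = records.map { case (label, _) => (label, 1) }
                             .reduceByKey(_ + _)

  // "ML" side: an iterative loop reusing the same cached partitions, so
  // nothing is re-read or re-deserialized between the two workloads.
  // (The update is a trivial stand-in, not a real learning algorithm.)
  var w = 0.0
  for (i <- 1 to 10) {
    val grad = records.map { case (_, v) => v - w }.reduce(_ + _)
    w += 0.01 * grad
  }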

  43. Performance (chart: query times on 1.7 TB of real warehouse data on 100 EC2 nodes)

  44. This Talk • Introduction • Spark • Shark: SQL on Spark • Why is Hadoop MapReduce slow? • Streaming Spark

  45. Why are previous MR-based systems slow? • Disk-based intermediate outputs. • Inferior data format and layout (no control of data co-partitioning). • Execution strategies (lack of optimization based on data statistics). • Task scheduling and launch overhead!

  46. Task Launch Overhead • Hadoop uses heartbeats to communicate scheduling decisions. • Task launch delay of 5 to 10 seconds. • Spark uses an event-driven architecture and can launch tasks in 5 ms, enabling: • better parallelism • easier straggler mitigation • elasticity • multi-tenant resource sharing
