
Real-Time Stream Processing


Presentation Transcript


  1. Real-Time Stream Processing CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Agenda • Apache Storm • Apache Spark

  3. Traditional Data Processing • Diagram: !!!ALL!!! the data → Batch Pre-Computation (aka MapReduce) → Indexes, which serve Queries

  4. Traditional Data Processing • Slow... and views are out of date • Timeline diagram: data already absorbed into batch views vs. data not yet absorbed as of now

  5. Compensating for the real-time stuff • Need some kind of stream processing system to supplement our batch views • Applications can then merge the batch and real-time views together!

  6. How do we do that?

  7. Twitter Storm

  8. Enter: Storm • Open-Source project originally built by Twitter • Now lives in the Apache Incubator • Enables distributed, fault-tolerant real-time computation

  9. A History Lesson on Twitter Metrics • Diagram: the Twitter Firehose feeding a hand-built metrics pipeline

  10. A History Lesson on Metrics • Diagram: the Twitter Firehose pipeline growing as volume increases

  11. Problems! • Scaling is painful • Fault-tolerance is practically non-existent • Coding for it is awful

  12. Wanted to Address • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”

  13. Storm Delivers • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”

  14. Use Cases • Stream Processing • Distributed RPC • Continuous Computation

  15. Storm Architecture • Diagram: a single Nimbus master coordinates many Supervisor worker nodes through a ZooKeeper cluster

  16. Glossary • Streams • Constant pump of data as Tuples • Spouts • Source of streams • Bolts • Process input streams and produce new streams • Functions, Filters, Aggregation, Joins, Talk to databases, etc. • Topologies • Network of spouts and bolts
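The SentenceSpout used in the later examples never appears in the deck; for reference, a minimal spout might look like the following hypothetical sketch (Storm 0.9-era backtype.storm API, ignoring the boolean flag the later slides pass to its constructor):

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical minimal spout: emits a fixed sentence as a one-field tuple
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Called repeatedly by Storm; each call may emit zero or more tuples
    public void nextTuple() {
        collector.emit(new Values("the cow jumped over the moon"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}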

  17. Tasks and Topologies

  18. Grouping • When a Tuple is emitted from a Spout or Bolt, where does it go? • Shuffle Grouping • Pick a random task • Fields Grouping • Consistent hashing on a subset of tuple fields • All Grouping • Send to all tasks • Global Grouping • Pick task with lowest ID
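A hedged sketch of how these groupings are declared, using the integer-ID TopologyBuilder API that the deck's word-count example uses; MetricsBolt and TotalsBolt are hypothetical stand-ins:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);

// Shuffle: each tuple goes to a random task of bolt 2
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);

// Fields: tuples with the same "word" value always hit the same task
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));

// All: every task of bolt 4 receives a copy of every tuple
builder.setBolt(4, new MetricsBolt(), 2).allGrouping(3);

// Global: the entire stream goes to the single lowest-ID task of bolt 5
builder.setBolt(5, new TotalsBolt(), 1).globalGrouping(3);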

  19. Topology • Diagram: a spout feeding bolts wired with groupings: fields ["id1", "id2"], shuffle, shuffle, fields ["url"], shuffle, all

  20. Guaranteed Message Processing • A tuple has not been fully processed until all tuples in its "tuple tree" have been completed • If the tree is not completed within a timeout, it is replayed • Programmers need to use the API to 'ack' a tuple as completed
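A minimal sketch of anchoring and acking with the standard backtype.storm bolt API; the bolt class itself is hypothetical:

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt showing anchoring and acking
public class AnchoredSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        for (String word : input.getString(0).split(" ")) {
            // Anchor the new tuple to its parent so it joins the tuple tree
            collector.emit(input, new Values(word));
        }
        // Ack marks this node of the tree complete; a fail() or a timeout
        // causes the spout to replay the original tuple
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}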

  21. Stream Processing Example: Word Count
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 5);
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());

  22.
public static class SplitSentence extends ShellBolt implements IRichBolt {
  public SplitSentence() {
    super("python", "splitsentence.py");
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}

#!/usr/bin/python
import storm

class SplitSentenceBolt(storm.BasicBolt):
  def process(self, tup):
    words = tup.values[0].split(" ")
    for word in words:
      storm.emit([word])

  23.
public static class WordCount implements IBasicBolt {
  Map<String, Integer> counts = new HashMap<String, Integer>();

  public void prepare(Map conf, TopologyContext context) {}

  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null) { count = 0; }
    ++count;
    counts.put(word, count);
    collector.emit(new Values(word, count));
  }

  public void cleanup() {}

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}

  24. Local Mode!
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 5);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();

  25. Command Line Interface • Starting a topology: storm jar mycode.jar twitter.storm.MyTopology demo • Stopping a topology: storm kill demo

  26. Distributed RPC

  27. DRPC Example: Reach • Reach is the number of unique people exposed to a specific URL on Twitter • Diagram: URL → Tweeters → Followers → Distinct Followers → Count → Reach

  28. Reach Topology • Diagram: Spout → (shuffle) GetTweeters → (shuffle) GetFollowers → (fields ["follower-id"]) Distinct → (global) CountAggregator
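A sketch of how this pipeline might be wired with Storm's LinearDRPCTopologyBuilder; the bolt implementations (GetTweeters, GetFollowers, Distinct, CountAggregator) are assumed here, matching only the names on the slide:

import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.tuple.Fields;

LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
builder.addBolt(new GetTweeters(), 4);                      // url -> tweeters
builder.addBolt(new GetFollowers(), 12).shuffleGrouping();  // tweeter -> follower-ids
builder.addBolt(new Distinct(), 6).fieldsGrouping(new Fields("follower-id"));
builder.addBolt(new CountAggregator(), 1).globalGrouping(); // count uniques -> reach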

  29. Storm Review • Distributed code and configurations • Robust process management • Monitors topologies and reassigns failed tasks • Provides reliability by tracking tuple trees • Routing and partitioning of streams • Serialization • Fine-grained performance stats of topologies

  30. Apache Spark

  31. Concern! • Say I have an application that involves many iterations... • Graph Algorithms • K-Means Clustering • Six Degrees of Bieber Fever • What's wrong with Hadoop MapReduce?

  32. New Frameworks! • Researchers have developed new frameworks to keep intermediate data in-memory • Only support specific computation patterns (Map...Reduce... repeat) • No abstractions for general re-use of data

  33. Enter: RDDs • Or Resilient Distributed Datasets • Fault-tolerant parallel data structures that enable: • Persisting data in memory • Specifying partitioning schemes for optimal placement • Manipulating them with a rich set of operators

  34. Apache Spark: Lightning-Fast Cluster Computation • Open-source top-level Apache project that came out of Berkeley in 2010 • General-purpose cluster computation system • High-level APIs in Scala, Java, and Python • Higher-level tools: • Shark for HiveQL on Spark • MLlib for machine learning • GraphX for graph processing • Spark Streaming

  35. Glossary

  36. RDD Persistence and Partitioning • Persistence • Users can control which RDDs will be reused and choose a storage strategy • Partitioning • What we know and love! • Hash-partitioning based on some key for efficient joins
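Spark exposes both controls through its Java API as well; a minimal sketch, assuming a Spark 1.x JavaSparkContext named sc and a hypothetical HDFS path:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("hdfs://..."); // hypothetical path

// Persistence: mark this RDD for reuse across actions
lines.persist(StorageLevel.MEMORY_ONLY());

// Partitioning: hash-partition by key so later joins on the same key avoid a shuffle
JavaPairRDD<String, Integer> pairs = lines
    .mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String line) {
            return new Tuple2<String, Integer>(line, 1);
        }
    })
    .partitionBy(new HashPartitioner(8))
    .persist(StorageLevel.MEMORY_ONLY());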

  37. RDD Fault-Tolerance • Replicating data in-flight is costly and hard • Instead of replicating every data set, let's just log the transformations of each data set to keep its lineage • Loss of an RDD partition can be rebuilt by replaying the transformations • Only the lost partitions need to be rebuilt!
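A small sketch of inspecting that lineage through the Java API (sc and the path are assumed as above); toDebugString prints the chain of parent RDDs:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaRDD<String> errors = sc.textFile("hdfs://...") // hypothetical path
    .filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.startsWith("ERROR"); }
    });

// Prints the lineage (roughly: a filtered RDD whose parent is the HDFS input);
// a lost partition is rebuilt by replaying just this chain for that partition
System.out.println(errors.toDebugString());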

  38. RDD Storage • Transformations are lazy operations • No computations occur until an action • RDDs can be persisted in-memory, but are spilled to disk if necessary • Users can specify a number of flags to persist the data • Only on disk • Partitioning schemes • Persistence priorities
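These flags map to StorageLevel constants in the API; a sketch using the Java API, where rdd is any JavaRDD and each line shows an alternative (an RDD accepts only one level):

import org.apache.spark.storage.StorageLevel;

rdd.persist(StorageLevel.DISK_ONLY());        // only on disk
rdd.persist(StorageLevel.MEMORY_ONLY_SER());  // in memory, serialized (compact)
rdd.persist(StorageLevel.MEMORY_AND_DISK());  // memory first, spill to disk
rdd.cache();                                  // shorthand for MEMORY_ONLY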

  39. RDD Eviction Policy • LRU policy at an RDD level • New RDD partition is computed, but not enough space? • Evict partition from the least recently accessed RDD • Unless it is the same RDD as the one with the new partition

  40. Example! Log Mining • Say you want to search through terabytes of log files stored in HDFS for errors and play around with them
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

  41. Example! Log Mining
// Count the number of error logs
errors.count()
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning
// HDFS as an array
errors.filter(_.contains("HDFS"))
  .map(_.split('\t')(3))
  .collect()

  42. Spark Execution Flow • Nothing happens to errors until an action occurs • The original HDFS file is not stored in-memory, only the final RDD • This greatly speeds up all future actions on the RDD

  43. Architecture

  44. PageRank

  45. Spark PageRank
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  // (a is the damping factor, N the number of pages)
  ranks = contribs.reduceByKey((x, y) => x + y)
    .mapValues(sum => a/N + (1-a)*sum)
}

  46. Spark PageRank Lineage

  47. Tracking Lineage: Narrow vs. Wide Dependencies

  48. Scheduler DAGs

  49. Spark API • Every data set is an object, and transformations are invoked on these objects • Start with a data set, then transform it using operators like map, filter, and join • Then, do some actions like count, collect, or save
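The same transform-then-act pattern through the Java API; a minimal sketch assuming a JavaSparkContext named sc and hypothetical paths:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Transformations are lazy: nothing runs yet
JavaRDD<Integer> lengths = sc.textFile("hdfs://...") // hypothetical path
    .filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return !s.isEmpty(); }
    })
    .map(new Function<String, Integer>() {
        public Integer call(String s) { return s.length(); }
    });

// Actions trigger the computation
long n = lengths.count();
lengths.saveAsTextFile("hdfs://.../out"); // hypothetical output path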

  50. Spark API
