
Real-Time Stream Processing


Presentation Transcript


  1. Real-Time Stream Processing CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

  2. Agenda • Apache Storm • Apache Spark

  3. Traditional Data Processing • Diagram: !!!ALL!!! the data → Batch Pre-Computation (aka MapReduce) → Indexes, which serve Queries

  4. Traditional Data Processing • Slow... and views are out of date • Timeline diagram: data already absorbed into batch views vs. data not yet absorbed as of now

  5. Compensating for the real-time stuff • Need some kind of stream processing system to supplement our batch views • Applications can then merge the batch and real-time views together!

  6. How do we do that?

  7. Twitter Storm

  8. Enter: Storm • Open-Source project originally built by Twitter • Now lives in the Apache Incubator • Enables distributed, fault-tolerant real-time computation

  9. A History Lesson on Twitter Metrics • Diagram: the Twitter Firehose feeding a hand-built metrics pipeline

  10. A History Lesson on Metrics • Diagram: the Twitter Firehose pipeline growing as volume increases

  11. Problems! • Scaling is painful • Fault-tolerance is practically non-existent • Coding for it is awful

  12. Wanted to Address • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”

  13. Storm Delivers • Guaranteed data processing • Horizontal Scalability • Fault-tolerance • No intermediate message brokers • Higher level abstraction than message passing • “Just works”

  14. Use Cases • Stream Processing • Distributed RPC • Continuous Computation

  15. Storm Architecture • Diagram: a single Nimbus master coordinates many Supervisor worker nodes through a ZooKeeper cluster

  16. Glossary • Streams • Constant pump of data as Tuples • Spouts • Source of streams • Bolts • Process input streams and produce new streams • Functions, Filters, Aggregation, Joins, Talk to databases, etc. • Topologies • Network of spouts and bolts
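The SentenceSpout used in the later examples never appears in the deck; for reference, a minimal spout might look like the following hypothetical sketch (Storm 0.9-era backtype.storm API, ignoring the boolean flag the later slides pass to its constructor):

import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical minimal spout: emits a fixed sentence as a one-field tuple
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Called repeatedly by Storm; each call may emit zero or more tuples
    public void nextTuple() {
        collector.emit(new Values("the cow jumped over the moon"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}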

  17. Tasks and Topologies

  18. Grouping • When a Tuple is emitted from a Spout or Bolt, where does it go? • Shuffle Grouping • Pick a random task • Fields Grouping • Consistent hashing on a subset of tuple fields • All Grouping • Send to all tasks • Global Grouping • Pick task with lowest ID
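A hedged sketch of how these groupings are declared, using the integer-ID TopologyBuilder API that the deck's word-count example uses; MetricsBolt and TotalsBolt are hypothetical stand-ins:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);

// Shuffle: each tuple goes to a random task of bolt 2
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);

// Fields: tuples with the same "word" value always hit the same task
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));

// All: every task of bolt 4 receives a copy of every tuple
builder.setBolt(4, new MetricsBolt(), 2).allGrouping(3);

// Global: the entire stream goes to the single lowest-ID task of bolt 5
builder.setBolt(5, new TotalsBolt(), 1).globalGrouping(3);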

  19. Topology • Diagram: a spout feeding bolts wired with groupings: fields ["id1", "id2"], shuffle, shuffle, fields ["url"], shuffle, all

  20. Guaranteed Message Processing • A tuple has not been fully processed until all tuples in its "tuple tree" have been completed • If the tree is not completed within a timeout, it is replayed • Programmers need to use the API to 'ack' a tuple as completed
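A minimal sketch of anchoring and acking with the standard backtype.storm bolt API; the bolt class itself is hypothetical:

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt showing anchoring and acking
public class AnchoredSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        for (String word : input.getString(0).split(" ")) {
            // Anchor the new tuple to its parent so it joins the tuple tree
            collector.emit(input, new Values(word));
        }
        // Ack marks this node of the tree complete; a fail() or a timeout
        // causes the spout to replay the original tuple
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}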

  21. Stream Processing Example: Word Count
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 5);
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());

  22.
public static class SplitSentence extends ShellBolt implements IRichBolt {
  public SplitSentence() {
    super("python", "splitsentence.py");
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}

#!/usr/bin/python
import storm

class SplitSentenceBolt(storm.BasicBolt):
  def process(self, tup):
    words = tup.values[0].split(" ")
    for word in words:
      storm.emit([word])

  23.
public static class WordCount implements IBasicBolt {
  Map<String, Integer> counts = new HashMap<String, Integer>();

  public void prepare(Map conf, TopologyContext context) {}

  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Integer count = counts.get(word);
    if (count == null) { count = 0; }
    ++count;
    counts.put(word, count);
    collector.emit(new Values(word, count));
  }

  public void cleanup() {}

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}

  24. Local Mode!
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout(true), 5);
builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);
builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));
Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 5);
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();

  25. Command Line Interface • Starting a topology: storm jar mycode.jar twitter.storm.MyTopology demo • Stopping a topology: storm kill demo

  26. Distributed RPC

  27. DRPC Example: Reach • Reach is the number of unique people exposed to a specific URL on Twitter • Diagram: URL → Tweeters → Followers → Distinct Followers → Count → Reach

  28. Reach Topology • Diagram: Spout → (shuffle) GetTweeters → (shuffle) GetFollowers → (fields ["follower-id"]) Distinct → (global) CountAggregator
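A sketch of how this pipeline might be wired with Storm's LinearDRPCTopologyBuilder; the bolt implementations (GetTweeters, GetFollowers, Distinct, CountAggregator) are assumed here, matching only the names on the slide:

import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.tuple.Fields;

LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
builder.addBolt(new GetTweeters(), 4);                      // url -> tweeters
builder.addBolt(new GetFollowers(), 12).shuffleGrouping();  // tweeter -> follower-ids
builder.addBolt(new Distinct(), 6).fieldsGrouping(new Fields("follower-id"));
builder.addBolt(new CountAggregator(), 1).globalGrouping(); // count uniques -> reach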

  29. Storm Review • Distributed code and configurations • Robust process management • Monitors topologies and reassigns failed tasks • Provides reliability by tracking tuple trees • Routing and partitioning of streams • Serialization • Fine-grained performance stats of topologies

  30. Apache Spark

  31. Concern! • Say I have an application that involves many iterations... • Graph Algorithms • K-Means Clustering • Six Degrees of Bieber Fever • What's wrong with Hadoop MapReduce?

  32. New Frameworks! • Researchers have developed new frameworks to keep intermediate data in-memory • Only support specific computation patterns (Map...Reduce... repeat) • No abstractions for general re-use of data

  33. Enter: RDDs • Or Resilient Distributed Datasets • Fault-tolerant parallel data structures that enable: • Persisting data in memory • Specifying partitioning schemes for optimal placement • Manipulating them with a rich set of operators

  34. Apache Spark: Lightning-Fast Cluster Computation • Open-source top-level Apache project that came out of Berkeley in 2010 • General-purpose cluster computation system • High-level APIs in Scala, Java, and Python • Higher-level tools: • Shark for HiveQL on Spark • MLlib for machine learning • GraphX for graph processing • Spark Streaming

  35. Glossary

  36. RDD Persistence and Partitioning • Persistence • Users can control which RDDs will be reused and choose a storage strategy • Partitioning • What we know and love! • Hash-partitioning based on some key for efficient joins
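Spark exposes both controls through its Java API as well; a minimal sketch, assuming a Spark 1.x JavaSparkContext named sc and a hypothetical HDFS path:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

JavaRDD<String> lines = sc.textFile("hdfs://..."); // hypothetical path

// Persistence: mark this RDD for reuse across actions
lines.persist(StorageLevel.MEMORY_ONLY());

// Partitioning: hash-partition by key so later joins on the same key avoid a shuffle
JavaPairRDD<String, Integer> pairs = lines
    .mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String line) {
            return new Tuple2<String, Integer>(line, 1);
        }
    })
    .partitionBy(new HashPartitioner(8))
    .persist(StorageLevel.MEMORY_ONLY());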

  37. RDD Fault-Tolerance • Replicating data in-flight is costly and hard • Instead of replicating every data set, let's just log the transformations of each data set to keep its lineage • Loss of an RDD partition can be rebuilt by replaying the transformations • Only the lost partitions need to be rebuilt!
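A small sketch of inspecting that lineage through the Java API (sc and the path are assumed as above); toDebugString prints the chain of parent RDDs:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaRDD<String> errors = sc.textFile("hdfs://...") // hypothetical path
    .filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.startsWith("ERROR"); }
    });

// Prints the lineage (roughly: a filtered RDD whose parent is the HDFS input);
// a lost partition is rebuilt by replaying just this chain for that partition
System.out.println(errors.toDebugString());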

  38. RDD Storage • Transformations are lazy operations • No computations occur until an action • RDDs can be persisted in-memory, but are spilled to disk if necessary • Users can specify a number of flags to persist the data • Only on disk • Partitioning schemes • Persistence priorities
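These flags map to StorageLevel constants in the API; a sketch using the Java API, where rdd is any JavaRDD and each line shows an alternative (an RDD accepts only one level):

import org.apache.spark.storage.StorageLevel;

rdd.persist(StorageLevel.DISK_ONLY());        // only on disk
rdd.persist(StorageLevel.MEMORY_ONLY_SER());  // in memory, serialized (compact)
rdd.persist(StorageLevel.MEMORY_AND_DISK());  // memory first, spill to disk
rdd.cache();                                  // shorthand for MEMORY_ONLY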

  39. RDD Eviction Policy • LRU policy at an RDD level • New RDD partition is computed, but not enough space? • Evict partition from the least recently accessed RDD • Unless it is the same RDD as the one with the new partition

  40. Example! Log Mining • Say you want to search through terabytes of log files stored in HDFS for errors and play around with them
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()

  41. Example! Log Mining
// Count the number of error logs
errors.count()
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning
// HDFS as an array
errors.filter(_.contains("HDFS"))
  .map(_.split('\t')(3))
  .collect()

  42. Spark Execution Flow • Nothing happens to errors until an action occurs • The original HDFS file is not stored in-memory, only the final RDD • This greatly speeds up all future actions on the RDD

  43. Architecture

  44. PageRank

  45. Spark PageRank
// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // Sum contributions by URL and get new ranks
  // (a is the damping factor, N the number of pages)
  ranks = contribs.reduceByKey((x, y) => x + y)
    .mapValues(sum => a/N + (1-a)*sum)
}

  46. Spark PageRank Lineage

  47. Tracking Lineage: Narrow vs. Wide Dependencies

  48. Scheduler DAGs

  49. Spark API • Every data set is an object, and transformations are invoked on these objects • Start with a data set, then transform it using operators like map, filter, and join • Then, do some actions like count, collect, or save
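The same transform-then-act pattern through the Java API; a minimal sketch assuming a JavaSparkContext named sc and hypothetical paths:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Transformations are lazy: nothing runs yet
JavaRDD<Integer> lengths = sc.textFile("hdfs://...") // hypothetical path
    .filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return !s.isEmpty(); }
    })
    .map(new Function<String, Integer>() {
        public Integer call(String s) { return s.length(); }
    });

// Actions trigger the computation
long n = lengths.count();
lengths.saveAsTextFile("hdfs://.../out"); // hypothetical output path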

  50. Spark API
