
Making Fly



Presentation Transcript


  1. Making Fly Parviz Deyhim http://bit.ly/sparkemr

  2. Obsession Scalable, Highly-available and Secure

  3. Obsession Scalable, Highly-available and Secure

  4. Amazon Elastic MapReduce

  5. Hadoop-as-a-service What is EMR? Map-Reduce engine Integrated with tools Integrated to AWS services Massively parallel Cost effective AWS wrapper

  6. Integration HDFS Amazon EMR

  7. Integration HDFS Amazon EMR Amazon DynamoDB Amazon S3

  8. Data management Analytics languages Integration HDFS Amazon EMR Amazon DynamoDB Amazon S3

  9. Data management Analytics languages Integration HDFS Amazon EMR Amazon DynamoDB Amazon RDS Amazon S3

  10. Data management Analytics languages Integration HDFS Amazon EMR Amazon DynamoDB Data Pipeline Amazon RedShift Amazon RDS Amazon S3

  11. Data management Analytics languages Integration HDFS Amazon EMR Amazon DynamoDB Data Pipeline Amazon RedShift Amazon RDS Amazon S3

  12. Amazon EMR Concepts • Master Node • Core Nodes • Task Nodes

  13. Core Nodes Amazon EMR cluster DataNode (HDFS) Master instance group • Master Node HDFS HDFS Core instance group

  14. Core Nodes Amazon EMR cluster Can Add Core Nodes: More CPU More Memory More HDFS Space Master instance group • Master Node HDFS HDFS HDFS Core instance group

  15. Core Nodes Amazon EMR cluster Can’t remove core nodes: HDFS corruption Master instance group • Master Node HDFS HDFS HDFS Core instance group

  16. Task Nodes Amazon EMR cluster No HDFS Provides compute resources: CPU Memory Master instance group • Master Node HDFS HDFS Core instance group

  17. Task Nodes Amazon EMR cluster Can add and remove task nodes Master instance group • Master Node HDFS HDFS Core instance group

  18. Spark On Amazon EMR

  19. Bootstrap Actions • Ability to run or install additional packages/software on EMR nodes • Simple bash script stored on S3 • Script gets executed during node/instance boot time • Script gets executed on every node that gets added to the cluster
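A minimal sketch of how a bootstrap action like the one described above might be declared. It builds one entry in the shape that boto3's EMR `run_job_flow` call accepts for its `BootstrapActions` parameter. The bucket name, script path, and helper name are hypothetical placeholders, and nothing is sent to AWS.

```python
def bootstrap_action(name, script_s3_path, args=None):
    """Describe one bootstrap action: a script on S3 that EMR runs on each node at boot."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,      # S3 location of the bash script
            "Args": list(args or []),    # optional arguments passed to the script
        },
    }


# Hypothetical script location; replace with your own bucket and script.
install_spark = bootstrap_action(
    "Install Spark",
    "s3://my-example-bucket/bootstrap/install-spark.sh",
)
print(install_spark)
```

A list of such entries would be passed as `BootstrapActions=[install_spark]` when the cluster is created, so the script also runs on nodes added later.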

  20. Spark on Amazon EMR • Bootstrap action installs Spark on EMR nodes • Currently on Spark 0.7.3 & upgrading to 0.8 very soon http://bit.ly/sparkemr

  21. Why Spark on Amazon EMR? • Deploy small and large Spark clusters in minutes • EMR handles node recovery in case of failures • Integration with EC2 Spot Market, Amazon Redshift, Amazon Data Pipeline, Amazon Cloudwatch, etc.

  22. Why Spark on Amazon EMR? • Shipping Spark logs to S3 for debugging • Define S3 bucket at cluster deploy time

  23. Spark on EC2 Spot Market • Bid on unused EC2 capacity • Spark is memory hungry • Bid on large-memory instances at a fraction of the cost

  24. Spark on EC2 Spot Market 1TB Memory Cluster

  25. Spark on EC2 Spot Market Amazon EMR cluster Master instance group • Master Node • Launch initial Spark cluster with core nodes • HDFS to store and checkpoint RDDs HDFS HDFS 32GB Memory

  26. Spark on EC2 Spot Market Amazon EMR cluster Master instance group • Master Node • Add Task nodes in spot market to increase memory capacity HDFS HDFS 32GB Memory 256GB Memory
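One way to picture the "add task nodes in the spot market" step is the sketch below. It sizes a task instance group for a target amount of extra memory and builds an entry in the shape that boto3's EMR `add_instance_groups` call accepts in `InstanceGroups`. The instance type, per-node memory figure, bid price, and group name are illustrative assumptions, not values from the talk; check them against current instance specifications.

```python
import math


def spot_task_group(extra_memory_gb, memory_per_node_gb, instance_type, bid_price_usd):
    """Build a spot-market task instance group sized to add `extra_memory_gb` of RAM."""
    # Round up so the group provides at least the requested memory.
    count = math.ceil(extra_memory_gb / memory_per_node_gb)
    return {
        "Name": "spark-spot-task-nodes",      # hypothetical group name
        "Market": "SPOT",                      # bid on unused capacity
        "InstanceRole": "TASK",                # task nodes hold no HDFS data
        "BidPrice": f"{bid_price_usd:.2f}",    # the API expects a string
        "InstanceType": instance_type,
        "InstanceCount": count,
    }


# Assumed figures: 256 GB of extra memory from nodes with about 68.4 GB each.
group = spot_task_group(256, 68.4, "m2.4xlarge", 0.10)
print(group["InstanceCount"], group["BidPrice"])
```

Because task nodes carry no HDFS blocks, a group like this can be removed again when the job finishes without risking data loss.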

  27. Spark on EC2 Spot Market Amazon EMR cluster • Create RDDs from HDFS or Amazon S3 with: • sc.textFile • OR • sc.sequenceFile • Run Computation on RDDs Master instance group • Master Node Amazon S3 HDFS HDFS 32GB Memory 256GB Memory

  28. Spark on EC2 Spot Market Amazon EMR cluster Master instance group • Save the resulting RDDs to HDFS or S3 with: • rdd.saveAsSequenceFile • OR • rdd.saveAsObjectFile • Master Node Amazon S3 HDFS HDFS 32GB Memory 256GB Memory

  29. Spark on EC2 Spot Market Amazon EMR cluster Master instance group • Master Node • Shut down task nodes when your job is done HDFS HDFS 32GB Memory

  30. Elastic Spark With Amazon EMR

  31. Autoscaling Spark Amazon EMR cluster • Master Node HDFS HDFS 32GB Memory

  32. Autoscaling Spark Amazon EMR cluster • Master Node HDFS HDFS 32GB Memory 256GB Memory

  33. Elastic Spark • When to Scale? • Depends on your job • CPU-bound or memory-intensive? • Probably both for Spark jobs • Use CPU/Memory util. metrics to decide when to scale

  34. Amazon EMR Cloudwatch metrics • EMR integrates with Cloudwatch • Provides many metrics. • Examples: • Load Metrics • HDFS Metrics • S3 Metrics

  35. Amazon EMR Cloudwatch metrics

  36. Basics on Cloudwatch Metrics • Pick any Cloudwatch metric • Pick a threshold at which you would like to be notified when it is breached • Set up Cloudwatch alarms based on your thresholds • Receive SNS notification in forms of: • Email • SNS • HTTP API Call

  37. Basics on Cloudwatch Metrics Receive Email Notification Monitor With Cloudwatch Take Manual Action Such As Adding More Task Nodes

  38. Basics on Cloudwatch Metrics Monitor With Cloudwatch HTTP API Calls Take Automated Actions
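The automated path above could start with an HTTP endpoint that inspects the notification SNS delivers. The sketch below assumes the usual SNS envelope (a JSON body whose `Message` field is itself a JSON string describing the alarm) and only decides whether an alarm fired. The function name and sample alarm name are hypothetical, and the scaling action itself is left out.

```python
import json


def alarm_fired(sns_http_body):
    """Return the alarm name if the SNS notification reports an ALARM state, else None."""
    envelope = json.loads(sns_http_body)
    if envelope.get("Type") != "Notification":
        return None  # e.g. a subscription confirmation request, not an alarm
    alarm = json.loads(envelope["Message"])  # the alarm details are nested JSON
    if alarm.get("NewStateValue") == "ALARM":
        return alarm.get("AlarmName")
    return None


# Hand-built sample body, trimmed to the fields this sketch reads.
sample_body = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "AlarmName": "spark-total-load-high",
        "NewStateValue": "ALARM",
    }),
})
print(alarm_fired(sample_body))
```

A handler built this way would call the cluster-resizing logic only when a name is returned, and ignore `OK` transitions and confirmation messages.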

  39. Spark Autoscaling Based on Load • Set up a Cloudwatch alarm on the EMR “TotalLoad” metric • Receive Email/SNS/HTTP notification • Add more worker nodes by adding EMR task nodes
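As a rough sketch of the alarm setup, the function below assembles keyword arguments in the shape of boto3's CloudWatch `put_metric_alarm` call. The metric name `TotalLoad` comes from the slide. The namespace `AWS/ElasticMapReduce`, the `JobFlowId` dimension, the period, the threshold, and the topic ARN are assumptions to verify against your own account.

```python
def total_load_alarm(cluster_id, sns_topic_arn, threshold):
    """Build alarm parameters that notify an SNS topic when cluster load stays high."""
    return {
        "AlarmName": f"{cluster_id}-total-load-high",
        "Namespace": "AWS/ElasticMapReduce",            # assumed EMR metric namespace
        "MetricName": "TotalLoad",                       # metric named on the slide
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,                                   # seconds per evaluation window
        "EvaluationPeriods": 2,                          # require two breaching windows
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],                 # where the notification goes
    }


# Hypothetical cluster id and topic ARN for illustration only.
params = total_load_alarm(
    "j-EXAMPLE12345",
    "arn:aws:sns:us-east-1:123456789012:spark-scaling",
    threshold=8,
)
print(params["AlarmName"])
```

Requiring more than one evaluation period is a design choice that avoids adding task nodes in response to a brief spike.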

  40. Spark Autoscaling Based on Memory • Spark needs memory • Lots of it!! • How to scale based on memory usage?

  41. Spark Metrics • Spark 0.8 provides cluster metrics • Source and Sink topology

  42. Spark Metrics • Spark Metric Sources (metrics.properties):
      worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
      driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
      executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
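The property keys above appear to follow an `instance.kind.name.option` pattern, where the kind is `source` or `sink`. The small parser below is only an illustration of that reading of the format, not part of Spark; the `MetricsEntry` type and `parse_metrics_line` function are names made up for this sketch.

```python
from typing import NamedTuple


class MetricsEntry(NamedTuple):
    instance: str  # e.g. worker, driver, executor
    kind: str      # "source" or "sink"
    name: str      # short label for the source or sink, e.g. jvm
    option: str    # setting being configured, e.g. class
    value: str


def parse_metrics_line(line):
    """Split one metrics.properties line into its instance/kind/name/option parts."""
    key, value = line.strip().split("=", 1)
    instance, kind, name, option = key.split(".", 3)
    if kind not in ("source", "sink"):
        raise ValueError(f"expected 'source' or 'sink', got {kind!r}")
    return MetricsEntry(instance, kind, name, option, value)


entry = parse_metrics_line(
    "worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource"
)
print(entry.instance, entry.kind, entry.name, entry.option)
```

Read this way, the three lines on the slide enable the same JVM source on the worker, driver, and executor instances.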

  43. Spark Metrics • Spark Metric Sinks (metrics.properties): • Package: org.apache.spark.metrics.sink ConsoleSink JmxSink CsvSink GangliaSink

  44. Spark Metrics • Spark Metric Sinks (metrics.properties): CloudwatchSink
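The slides do not show how the CloudwatchSink is implemented. As a hedged illustration of what any such sink would ultimately need to send, the sketch below builds a payload in the shape of boto3's CloudWatch `put_metric_data` call. The namespace `Custom/Spark`, the `JobFlowId` dimension, and the helper name are assumptions; `JvmHeapUsed` is the metric name used in the alarm example later in the deck.

```python
def heap_metric_payload(cluster_id, heap_used_bytes):
    """Build a custom-metric payload reporting JVM heap usage for one cluster."""
    return {
        "Namespace": "Custom/Spark",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "JvmHeapUsed",
                "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
                "Value": float(heap_used_bytes),
                "Unit": "Bytes",
            }
        ],
    }


# Example reading: 12 GiB of heap in use on a hypothetical cluster.
payload = heap_metric_payload("j-EXAMPLE12345", 12 * 1024 ** 3)
print(payload["MetricData"][0]["Value"])
```

Once a metric like this is published, it can be alarmed on in the same way as the built-in EMR metrics.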

  45. Spark Metrics & Cloudwatch

  46. Spark Metrics & Cloudwatch • Monitor Spark metrics with Cloudwatch • Set up Cloudwatch alarms and get notified if any metric reaches your threshold • Example: if JvmHeapUsed > 20G • Receive notification and take manual or automated actions

  47. Spark Streaming and Amazon Kinesis

  48. Amazon Kinesis Kinesis

  49. Amazon Kinesis • CreateStream • Creates a new Data Stream within the Kinesis Service • PutRecord • Adds new records to a Kinesis Stream • DescribeStream • Provides metadata about the Stream, including name, status, Shards, etc. • GetNextRecord • Fetches next record for processing by user business logic • MergeShard / SplitShard • Scales Stream up/ down • DeleteStream • Deletes the Stream
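To make the link between PutRecord and shard scaling more concrete: Kinesis maps each record's partition key to a 128-bit integer using MD5, and each shard owns a range of that hash space. The sketch below assumes, for simplicity, that shards divide the space evenly; a real stream reports its actual hash key ranges through DescribeStream. The function names are made up for this example.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def hash_key(partition_key):
    """Map a partition key to its 128-bit integer hash."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def shard_for(partition_key, shard_count):
    """Pick a shard index, assuming shards split the hash space into equal ranges."""
    return hash_key(partition_key) * shard_count // HASH_SPACE


key = "sensor-42"  # hypothetical partition key
before_split = shard_for(key, 2)
after_split = shard_for(key, 4)  # every shard split in two
print(before_split, after_split)
```

Under this even-split assumption, a key that lived in shard `i` lands in child `2*i` or `2*i + 1` after every shard is split, which is why splitting scales throughput without reshuffling unrelated keys.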

  50. Amazon Kinesis Kinesis
