Apache Spark is a fast, general-purpose computing framework for large-scale data processing. It is flexible, easy to use, and adds SQL, streaming, graph processing (GraphX), and machine learning (MLlib) on top of its core engine. This guide by Nicolas A. Perez, a MapR Certified Spark Developer, covers Spark's features, components, and deployment platforms.
Introduction to Apache Spark
Nicolas A Perez
Software Engineer at IPC
MapR Certified Spark Developer
Organizer of the Miami Scala Meetup
https://twitter.com/@anicolaspp
https://medium.com/@anicolaspp
Spark • General-purpose computing framework, often dramatically faster than Hadoop MapReduce • Used for large-scale data processing • Runs everywhere • Flexible (SQL, Streaming, GraphX, MLlib) • Easy to use (Scala, Java, Python, and R APIs)
General Purpose

val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
The Spark Context (sc)

val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

The SparkContext class tells Spark how to access the cluster.
RDD • Resilient • Distributed • Datasets
Transformations on RDD • map • flatMap • filter • distinct • groupByKey • reduceByKey • aggregateByKey • sortByKey • union • intersection • join
Actions on RDD • reduce • collect • take • takeSample • first • countByKey • saveAsTextFile
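A minimal sketch of how these fit together, with made-up input values: transformations only describe the computation, while actions trigger it.

val numbers = sc.parallelize(1 to 100)      // RDD[Int]
val evens   = numbers.filter(_ % 2 == 0)    // transformation: lazy
val squares = evens.map(n => n * n)         // transformation: lazy

val total     = squares.reduce(_ + _)       // action: runs a job on the cluster
val firstFive = squares.take(5)             // action: runs another job

println(s"total = $total, first squares: ${firstFive.mkString(", ")}")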
RDD • RDDs can be created from many kinds of sources (text files, HDFS, raw sockets, AWS S3, Azure Blob Storage, Cassandra, etc.) • Calling transformations on an RDD is lazy; nothing runs until an action is invoked • Spark represents an RDD's lineage as a DAG, so lost partitions can be re-computed
Deployment Platforms • YARN • Mesos • AWS EMR • Azure HDInsight • Standalone
Word Counting

val rdd = sc.textFile("path to the file")

val counts: RDD[(String, Int)] =
  rdd.flatMap(line => line.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
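Since transformations are lazy, counts is only computed when an action is called; a small sketch, where the output path is just a placeholder:

counts.saveAsTextFile("path to the output directory")   // writes one part file per partition

// or bring a few results back to the driver
counts.take(10).foreach { case (word, n) => println(s"$word -> $n") }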
Spark SQL • Allows us to represent our data in a tabular format • We can run SQL queries on it • We can easily integrate different sources and run parallel queries on all of them at once • We can use standard tools that use SQL to query any kind of data at scale
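A minimal sketch of the idea, assuming a JSON file of people with name and age fields (the path and schema are made up for illustration):

val people = sqlContext.read.json("path to people.json")   // schema is inferred
people.registerTempTable("people")                         // expose it to SQL

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()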
Built-in Data Sources • JSON • Parquet • Text Files • ORC • Hive SerDes • Hive Tables • JDBC
Third-Party Data Sources • Cassandra • Impala • Drill • CSV Files • Other custom sources (read my blog to see how to implement your own)
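For example, CSV support in Spark 1.x comes from the third-party spark-csv package; a hedged sketch (the file path is a placeholder and the package has to be on the classpath):

val transactions = sqlContext.read
  .format("com.databricks.spark.csv")   // third-party data source
  .option("header", "true")             // first line holds column names
  .option("inferSchema", "true")        // guess the column types
  .load("path to transactions.csv")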
Spark SQL Important Abstractions • Data Frames • Data Sets • val people = sqlContext.read.json(path).as[Person]
Spark Data Set API • Strongly typed tabular representation • Uses a typed schema for data representation • Encoder optimizations for faster data access
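A minimal sketch of the Dataset API, assuming a Person case class that matches the JSON file used above (the field names are made up):

case class Person(name: String, age: Long)

import sqlContext.implicits._                         // brings the encoders into scope

val people = sqlContext.read.json("path to people.json").as[Person]   // Dataset[Person]
val adults = people.filter(p => p.age >= 18)          // typed: checked against Person at compile time
adults.collect().foreach(println)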
Spark Data Frames

val sc = new SparkContext(config)
val sqlContext = new HiveContext(sc)

val transactionsDF = sqlContext
  .read
  .format("com.nico.datasource.dat")
  .load("~/transactions/")

transactionsDF.registerTempTable("some_table")

More at: https://goo.gl/qKPJdi
Spark Streaming • StreamingContext • Built-in file streaming and raw socket streaming • Libraries for Twitter, Kafka, AWS Kinesis, Flume, etc. • Can be extended to stream from any source • Batch processing (micro-batches) • Streams can look back over recent data with windowed operations, e.g. stream.countByWindow(Seconds(20))
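A minimal sketch of a windowed count over a raw socket stream, assuming a text server on localhost:9999 (the host, port, and intervals are made up):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))             // 5-second micro-batches
ssc.checkpoint("path to a checkpoint directory")           // required by windowed operations

val lines  = ssc.socketTextStream("localhost", 9999)       // built-in raw socket stream
val counts = lines.countByWindow(Seconds(20), Seconds(5))  // events seen in the last 20s, updated every 5s

counts.print()
ssc.start()
ssc.awaitTermination()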
What can you get from Spark Streaming? • Millions of events per second (billions with the right deployment) • A concise API, consistent with the other Spark components • Fault tolerance • Exactly-once semantics out of the box (for DFS and Kafka sources) • Integration with Spark SQL, MLlib, GraphX
Be careful • Not everyone needs streaming • Processing time must stay below the batch interval, otherwise back pressure builds up • You might get out-of-order data • Applications need fine-tuning since they have to run all the time • You need to plan your deployment strategy carefully
Gotchas

// Wrong: the connection is created on the driver, so it would have to be
// serialized and shipped to the workers, which typically fails.
dstream.foreachRDD { rdd =>
  val connection = createNewConnection()
  rdd.foreach { record =>
    connection.send(record)
  }
}
Gotchas

// Better, but wasteful: a new connection is opened and closed for every record.
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}
Gotchas

// Recommended: create one connection per partition and reuse it for all of
// that partition's records.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
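A further refinement from the Spark Streaming guide is to reuse connections across batches through a pool; a hedged sketch, where ConnectionPool is a hypothetical helper you would implement yourself:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a hypothetical, lazily initialized, static pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)   // return it to the pool for reuse
  }
}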