Apache Spark is a fast, general-purpose computing framework for large-scale data processing. It is flexible, easy to use, and adds SQL, streaming, graph processing (GraphX), and machine learning (MLlib) on top of its core engine. This guide by Nicolas A. Perez, a MapR Certified Spark Developer, covers Spark's features, components, and deployment platforms.
Introduction to Apache Spark
Nicolas A Perez
Software Engineer at IPC
MapR Certified Spark Developer
Organizer of the Miami Scala Meetup
https://twitter.com/@anicolaspp
https://medium.com/@anicolaspp
Spark • General-purpose computing framework, often dramatically faster than Hadoop MapReduce • Used for large-scale data processing • Runs everywhere • Flexible (SQL, Streaming, GraphX, MLlib) • Easy to use (Scala, Java, Python, and R APIs)
General Purpose

val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
The Spark Context (sc)

val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

The SparkContext class tells Spark how to access the cluster.
RDD • Resilient • Distributed • Datasets
Transformations on RDD • map • flatMap • filter • distinct • groupByKey • reduceByKey • aggregateByKey • sortByKey • union • intersection • join
Actions on RDD • reduce • collect • take • takeSample • first • countByKey • saveAsTextFile
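A minimal sketch of how these fit together, with made-up input values: transformations only describe the computation, while actions trigger it.

val numbers = sc.parallelize(1 to 100)      // RDD[Int]
val evens   = numbers.filter(_ % 2 == 0)    // transformation: lazy
val squares = evens.map(n => n * n)         // transformation: lazy

val total     = squares.reduce(_ + _)       // action: runs a job on the cluster
val firstFive = squares.take(5)             // action: runs another job

println(s"total = $total, first squares: ${firstFive.mkString(", ")}")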
RDD • RDDs can be created from many kinds of sources (text files, HDFS, raw sockets, AWS S3, Azure Blob Storage, Cassandra, etc.) • Calling transformations on an RDD is lazy; nothing runs until an action is invoked • Spark represents an RDD's lineage as a DAG, so lost partitions can be re-computed
Deployment Platforms • YARN • Mesos • AWS EMR • Azure HDInsight • Standalone
Word Counting

val rdd = sc.textFile("path to the file")

val counts: RDD[(String, Int)] =
  rdd.flatMap(line => line.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
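Since transformations are lazy, counts is only computed when an action is called; a small sketch, where the output path is just a placeholder:

counts.saveAsTextFile("path to the output directory")   // writes one part file per partition

// or bring a few results back to the driver
counts.take(10).foreach { case (word, n) => println(s"$word -> $n") }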
Spark SQL • Allows us to represent our data in a tabular format • We can run SQL queries on it • We can easily integrate different sources and run parallel queries on all of them at once • We can use standard tools that use SQL to query any kind of data at scale
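A minimal sketch of the idea, assuming a JSON file of people with name and age fields (the path and schema are made up for illustration):

val people = sqlContext.read.json("path to people.json")   // schema is inferred
people.registerTempTable("people")                         // expose it to SQL

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()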
Built-in Data Sources • JSON • Parquet • Text Files • ORC • Hive SerDes • Hive Tables • JDBC
Third-Party Data Sources • Cassandra • Impala • Drill • CSV Files • Other custom sources (read my blog to see how to implement your own)
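For example, CSV support in Spark 1.x comes from the third-party spark-csv package; a hedged sketch (the file path is a placeholder and the package has to be on the classpath):

val transactions = sqlContext.read
  .format("com.databricks.spark.csv")   // third-party data source
  .option("header", "true")             // first line holds column names
  .option("inferSchema", "true")        // guess the column types
  .load("path to transactions.csv")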
Spark SQL Important Abstractions • Data Frames • Data Sets • val people = sqlContext.read.json(path).as[Person]
Spark Data Set API • Strongly typed tabular representation • Uses a typed schema for data representation • Encoder optimizations for faster data access
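A minimal sketch of the Dataset API, assuming a Person case class that matches the JSON file used above (the field names are made up):

case class Person(name: String, age: Long)

import sqlContext.implicits._                         // brings the encoders into scope

val people = sqlContext.read.json("path to people.json").as[Person]   // Dataset[Person]
val adults = people.filter(p => p.age >= 18)          // typed: checked against Person at compile time
adults.collect().foreach(println)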
Spark Data Frames

val sc = new SparkContext(config)
val sqlContext = new HiveContext(sc)

val transactionsDF = sqlContext
  .read
  .format("com.nico.datasource.dat")
  .load("~/transactions/")

transactionsDF.registerTempTable("some_table")

More at: https://goo.gl/qKPJdi
Spark Streaming • StreamingContext • Built-in file streaming and raw socket streaming • Libraries for Twitter, Kafka, AWS Kinesis, Flume, etc. • Can be extended to stream from any source • Batch processing (micro-batches) • Streams can look back over recent data with windowed operations, e.g. stream.countByWindow(Seconds(20))
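A minimal sketch of a windowed count over a raw socket stream, assuming a text server on localhost:9999 (the host, port, and intervals are made up):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))             // 5-second micro-batches
ssc.checkpoint("path to a checkpoint directory")           // required by windowed operations

val lines  = ssc.socketTextStream("localhost", 9999)       // built-in raw socket stream
val counts = lines.countByWindow(Seconds(20), Seconds(5))  // events seen in the last 20s, updated every 5s

counts.print()
ssc.start()
ssc.awaitTermination()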
What can you get from Spark Streaming? • Millions of events per second (billions with the right deployment) • A concise API, consistent with the other Spark components • Fault tolerance • Exactly-once semantics out of the box (for DFS and Kafka sources) • Integration with Spark SQL, MLlib, GraphX
Be careful • Not everyone needs streaming • Processing time must stay below the batch interval, otherwise back pressure builds up • You might get out-of-order data • Applications need fine-tuning since they have to run all the time • You need to plan your deployment strategy carefully
Gotchas

// Wrong: the connection is created on the driver, so it would have to be
// serialized and shipped to the workers, which typically fails.
dstream.foreachRDD { rdd =>
  val connection = createNewConnection()
  rdd.foreach { record =>
    connection.send(record)
  }
}
Gotchas

// Better, but wasteful: a new connection is opened and closed for every record.
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}
Gotchas

// Recommended: create one connection per partition and reuse it for all of
// that partition's records.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
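A further refinement from the Spark Streaming guide is to reuse connections across batches through a pool; a hedged sketch, where ConnectionPool is a hypothetical helper you would implement yourself:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a hypothetical, lazily initialized, static pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)   // return it to the pool for reuse
  }
}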