
Spark 1.0 and Beyond



Presentation Transcript


  1. Spark 1.0 and Beyond • Patrick Wendell, Databricks • Spark.incubator.apache.org

  2. About me • Committer and PMC member of Apache Spark • “Former” PhD student at Berkeley • Release manager for Spark 1.0 • Background in networking and distributed systems

  3. Today’s Talk • Spark background • About the Spark release process • The Spark 1.0 release • Looking forward to Spark 1.1

  4. What is Spark? • Fast and expressive cluster computing engine, compatible with Apache Hadoop • Efficient: general execution graphs, in-memory storage; up to 10× faster on disk, 100× in memory • Usable: rich APIs in Java, Scala, Python; interactive shell; 2-5× less code

  5. 30-Day Commit Activity

  6. Spark Philosophy • Make life easy and productive for data scientists • Well documented, expressive APIs • Powerful domain-specific libraries • Easy integration with storage systems • … and caching to avoid data movement • Predictable releases, stable APIs

  7. Spark Release Process • Quarterly release cycle (3 months) • 2 months of general development • 1 month of polishing, QA and fixes • Spark 1.0: Feb 1 → April 8th, April 8th+ • Spark 1.1: May 1 → July 8th, July 8th+
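
The quarterly cadence on this slide can be checked with simple date arithmetic. The following is a minimal sketch, not part of the talk: the class and method names are hypothetical, and the year 2014 is an assumption, since the slide gives only "Feb 1". It computes nominal boundaries only; the dates actually listed on the slide (for example April 8th) fall about a week after the nominal two-month mark.

```java
import java.time.LocalDate;

class ReleaseCycle {
    // Nominal end of general development: two months after the cycle starts.
    public static LocalDate developmentEnds(LocalDate cycleStart) {
        return cycleStart.plusMonths(2);
    }

    // Nominal start of the next cycle: one quarter (three months) after this one starts.
    public static LocalDate nextCycleStarts(LocalDate cycleStart) {
        return cycleStart.plusMonths(3);
    }

    public static void main(String[] args) {
        // Assumption: the year is not stated on the slide.
        LocalDate spark10Start = LocalDate.of(2014, 2, 1);
        System.out.println("Development ends (nominal): " + developmentEnds(spark10Start));
        System.out.println("Next cycle starts: " + nextCycleStarts(spark10Start));
    }
}
```

Starting from Feb 1, the next cycle begins on May 1, which matches the Spark 1.1 start date given on the slide.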

  8. Spark 1.0: By the numbers • 3 months of development • 639 patches • 200+ JIRA issues • 100+ contributors

  9. API Stability in 1.X • APIs are stable for all non-alpha projects • Spark 1.1, 1.2, … will be compatible • @DeveloperApi: internal API that is unstable • @Experimental: user-facing API that might stabilize later

  10. Today’s Talk • About the Spark release process • The Spark 1.0 release • Looking forward to Spark 1.1

  11. Spark 1.0 Features • Core engine improvements • Spark Streaming • MLlib • Spark SQL

  12. Spark Core • History server for Spark UI • Integration with YARN security model • Unified job submission tool • Java 8 support • Internal engine improvements

  13. History Server
     • Configure with:
         spark.eventLog.enabled=true
         spark.eventLog.dir=hdfs://XX
     • In Spark Standalone, the history server is embedded in the master.
     • In YARN/Mesos, run the history server as a daemon.

  14. Job Submission Tool
     • Apps don’t need to hard-code the master:
         val conf = new SparkConf().setAppName("My App")
         val sc = new SparkContext(conf)
     • The master is chosen at submission time:
         ./bin/spark-submit <app-jar> \
           --class my.main.Class \
           --name myAppName \
           --master local[4]          # or: --master spark://some-cluster

  15. Java 8 Support
     • RDD operations can use lambda syntax
     • Old:
         class Split extends FlatMapFunction<String, String> {
           public Iterable<String> call(String s) {
             return Arrays.asList(s.split(" "));
           }
         }
         JavaRDD<String> words = lines.flatMap(new Split());
     • New:
         JavaRDD<String> words = lines
           .flatMap(s -> Arrays.asList(s.split(" ")));

  16. Java 8 Support • NOTE: Minor API changes • (a) If you are extending Function classes, use implements rather than extends. • (b) Return-type sensitive functions: mapToPair, mapToDouble
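
The "implements rather than extends" note follows from how Java 8 lambdas work: a lambda can only stand in for an interface with a single abstract method, never for a class. The sketch below illustrates this without Spark on the classpath. `FlatMapFn` and `flatMap` are hypothetical stand-ins for Spark's `FlatMapFunction` and `JavaRDD.flatMap`, operating on a local list; they are assumptions for illustration and not the Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LambdaSketch {
    // Hypothetical stand-in for Spark's FlatMapFunction: one abstract method.
    interface FlatMapFn<T, R> {
        Iterable<R> call(T t);
    }

    // Hypothetical stand-in for JavaRDD.flatMap, applied to a local list.
    public static <T, R> List<R> flatMap(List<T> input, FlatMapFn<T, R> fn) {
        List<R> out = new ArrayList<>();
        for (T item : input) {
            for (R result : fn.call(item)) {
                out.add(result);
            }
        }
        return out;
    }

    // Old style: a named class that implements the interface.
    static class Split implements FlatMapFn<String, String> {
        @Override
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split(" "));
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or not");
        // Both calls split each line into words; only the syntax differs.
        List<String> oldStyle = flatMap(lines, new Split());
        List<String> newStyle = flatMap(lines, s -> Arrays.asList(s.split(" ")));
        System.out.println(oldStyle.equals(newStyle));
    }
}
```

If `FlatMapFn` were an abstract class instead of an interface, the `Split` class would still compile with `extends`, but the lambda call would not.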

  17. Python API Coverage • RDD operators: intersection(), take(), top(), takeOrdered() • Metadata: name(), id(), getStorageLevel() • Runtime configuration: setJobGroup(), setLocalProperty()

  18. Integration with YARN Security
     • Supports Kerberos authentication in YARN environments:
         spark.authenticate=true
     • ACL support for user interfaces:
         spark.ui.acls.enable=true
         spark.ui.view.acls=patrick,matei

  19. Engine Improvements • Job cancellation directly from UI • Garbage collection of shuffle and RDD data

  20. Documentation • Unified Scaladocs across modules • Expanded MLlib guide • Deployment and configuration specifics • Expanded API documentation

  21. Spark stack (diagram) • Spark: RDDs, transformations, and actions • Spark Streaming (real-time): DStreams, streams of RDDs • Spark SQL: SchemaRDDs • MLlib (machine learning): RDD-based matrices

  22. Spark SQL

  23. Turning an RDD into a Relation
         // Bring the implicit RDD-to-SchemaRDD conversion and sql() into scope.
         val sqlContext = new org.apache.spark.sql.SQLContext(sc)
         import sqlContext._
         // Define the schema using a case class.
         case class Person(name: String, age: Int)
         // Create an RDD of Person objects, register it as a table.
         val people =
           sc.textFile("examples/src/main/resources/people.txt")
             .map(_.split(","))
             .map(p => Person(p(0), p(1).trim.toInt))
         people.registerAsTable("people")

  24. Querying using SQL
     • SQL statements:
         // SQL statements can be run directly on RDDs.
         val teenagers =
           sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
         // The results of SQL queries are SchemaRDDs and support
         // normal RDD operations.
         val nameList = teenagers.map(t => "Name: " + t(0)).collect()
     • Language-integrated queries:
         // Language-integrated queries (à la LINQ)
         val teenagers =
           people.where('age >= 10).where('age <= 19).select('name)

  25. Import and Export
     • Export:
         // Save SchemaRDDs directly to Parquet.
         people.saveAsParquetFile("people.parquet")
     • Import:
         // Load data stored in Hive.
         val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
         import hiveContext._
         // Queries can be expressed in HiveQL.
         hql("FROM src SELECT key, value")

  26. In-Memory Columnar Storage • Spark SQL can cache tables using an in-memory columnar format: • Scan only required columns • Fewer allocated objects (less GC) • Automatically selects best compression
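
The first two benefits on this slide can be illustrated with a toy layout. This is a minimal sketch under stated assumptions, not Spark SQL's implementation: the class names and sample rows are made up for illustration, and compression is not modelled. It stores each column in its own array, so a query over ages never touches the names, and the cached table consists of two arrays rather than one object per row.

```java
class ColumnarSketch {
    // Row layout: one heap object per record.
    static class PersonRow {
        final String name;
        final int age;

        PersonRow(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Columnar layout: one array per column.
    static class PeopleColumns {
        final String[] names;
        final int[] ages;

        PeopleColumns(PersonRow[] rows) {
            names = new String[rows.length];
            ages = new int[rows.length];
            for (int i = 0; i < rows.length; i++) {
                names[i] = rows[i].name;
                ages[i] = rows[i].age;
            }
        }

        // Scans only the age column; the names array is never read.
        int countTeenagers() {
            int count = 0;
            for (int age : ages) {
                if (age >= 13 && age <= 19) {
                    count++;
                }
            }
            return count;
        }
    }

    public static void main(String[] args) {
        // Made-up sample data.
        PersonRow[] rows = {
            new PersonRow("A", 29), new PersonRow("B", 30), new PersonRow("C", 19)
        };
        System.out.println(new PeopleColumns(rows).countTeenagers());
    }
}
```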

  27. Spark Streaming • Web UI for streaming • Graceful shutdown • User-defined input streams • Support for creating in Java • Refactored API

  28. MLlib • Sparse vector support • Decision trees • Linear algebra • SVD and PCA • Evaluation support • 3 contributors in the last 6 months

  29. MLlib
     • Note: Minor API change
     • Before 1.0 (arrays of doubles):
         val data = sc.textFile("data/kmeans_data.txt")
         val parsedData = data.map(s => s.split('\t').map(_.toDouble).toArray)
         val clusters = KMeans.train(parsedData, 4, 100)
     • In 1.0 (Vectors):
         val data = sc.textFile("data/kmeans_data.txt")
         val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
         val clusters = KMeans.train(parsedData, 4, 100)

  30. 1.1 and Beyond • Data import/export leveraging Catalyst: HBase, Cassandra, etc. • Shark-on-Catalyst • Performance optimizations: external shuffle, pluggable storage strategies • Streaming: reliable input from Flume and Kafka

  31. Unifying Experience • SchemaRDD represents a consistent integration point for data sources • spark-submit abstracts the environmental details (YARN, hosted cluster, etc.) • API stability across versions of Spark

  32. Conclusion • Visit spark.apache.org for videos, tutorials, and hands-on exercises. • Help us test a release candidate! • Spark Summit on June 30th: spark-summit.org • Meetup group: meetup.com/spark-users
