881 likes | 2.57k Views
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial: <br><br>1) Big Data Introduction <br>2) Batch vs Real Time Analytics <br>3) Why Apache Spark? <br>4) What is Apache Spark? <br>5) Using Spark with Hadoop <br>6) Apache Spark Features <br>7) Apache Spark Ecosystem <br>8) Demo: Earthquake Detection Using Apache Spark
E N D
5 Best Practices in DevOps Culture EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
What to expect? 2 Spark Features 1 Why Apache Spark? 3 Spark Ecosystem 5 4 Use Case Hands-On Examples EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Big Data EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Data Generated Every Minute! EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
IOT: 50 Billion Devices By 2020 “~6 things online” per person Sensors, Smart, Objects, Device Clustered Systems 50 Rapid adoption rate of digital infrastructure 5x faster than electricity & telephony Billion SmartObjects Tablets, Laptops, Phones Inflection Point World Population 6.307 6.894 7.83 6.721 7.347 2003 2010 2020 2008 2015 EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Big Data Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Big Data Analytics ➢ Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. ➢ Big Data Analytics is of two types: 1. Batch Analytics 2. Real-Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Batch vs Real Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Batch vs Real Time Analytics Time ETL Analytics based on the data collected over a period of time is Batch Analytics Stored Stored Stored Client Client Client Client Time Analytics based on immediate data for instant result is Real- Time (Stream) Analytics ms ms ms Client Client Client EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Use Cases of Real Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Use Cases of Real Time Analytics Banking Government Healthcare Telecommunications Stock Market EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Why Spark When Hadoop Is Already There? EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Batch Processing In Hadoop Processing Data Using MapReduce Processing Data Processing Data Processing Data Processing Data Day 1 Day 2 Day 3 Day 4 Day N Hadoop processes the data stored over a period of time Time Lag Day 1 Day 2 Day 3 Day 4 Day N Input Data Input Data Input Data Input Data EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Real Time Processing In Spark Processing Data Processing Data Processing Data Processing Data Day 1 Day 2 Day 3 Day 4 Day N Spark overcomes the time lag issue No Time Lag Day 1 Day 2 Day 3 Day 4 Day N Input Data Input Data Input Data Input Data EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Vs Hadoop Hadoop implements Batch processing on Big Data. It thus cannot deliver to our Real-Time use case needs. Our Requirements: Process data in real-time Handle input from multiple sources Easy to use Faster processing EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Success Story EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Success Story Twitter Sentiment Analysis With Spark NYSE: Real Time Analysis of Stock Market Data Trending be campaigns and attract larger audience Topics to can Sentiment crisis service adjusting and target marketing helps in used create management, Banking: Credit Card Fraud Detection Genomic Sequencing EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Overview EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
What Is Spark? Apache Spark is an open-source cluster-computing framework for real time processing developed by the Apache Software Foundation Figure: Real Time Processing In Spark Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance Serial Parallel Reduction in time It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations Figure: Data Parallelism In Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Why Spark? Simple programming layer provides powerful caching and disk persistence capabilities 100x faster than for large scale data processing vs Powerful Caching Speed Can be deployed through Mesos, Hadoop via Yarn, or Spark’s own cluster manger Can be programmed in Scala, Java, Python and R Features Polyglot Deployment EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Using Hadoop Through Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark And Hadoop Spark applications can also be run on YARN (Hadoop NextGen) Spark can run on top of Hadoop’s distributed file system Hadoop Distributed File System (HDFS) to leverage the distributed replicated storage Spark can be used along with MapReduce in the same Hadoop cluster or can be used alone as a processing framework EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark And Hadoop Spark is not intended to replace Hadoop but it can regarded as an extension to it MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real- time processing EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Features Speed Polyglot Advanced Analytics In-Memory Computation Hadoop Integration Machine Learning EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Features 1 Spark runs upto 100x times faster than MapReduce vs 2 Polyglot: Programming in Scala, Python, Java and R EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Features 3 Lazy Evaluation: Delays evaluation till needed 4 Real time computation & low latency because of in-memory computation EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Features 5 Hadoop Integration 1 2 6 Machine Learning for iterative tasks EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Ecosystem Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Ecosystem Used for structured data. Can run unmodified hive queries on existing Hadoop deployment Graph Computation engine (Similar to Giraph). Combines data- parallel and graph- parallel concepts Enables analytical and interactive apps for live streaming data. Package for R language to enable R-users to leverage Spark power from R shell Machine learning libraries being built on top of Spark. GraphX (Graph Computation) Spark Streaming (Streaming) MLlib (Machine Learning) SparkR (R on Spark) Spark SQL (SQL) Spark Core Engine The core engine for entire Spark framework. Provides utilities and architecture for other components EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Ecosystem ML pipelines makes it easier to combine multiple algorithms or workflows Tabular data abstraction introduced by Spark SQL DataFrames ML Pipelines Spark Streaming (Streaming) MLlib (Machine learning) GraphX (Graph Computation) SparkR (R on Spark) Spark SQL (SQL) Spark Core Engine EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Core Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Core Spark Core is the base engine for large-scale parallel and distributed data processing Table Row It is responsible for: Row Memory management and fault recovery Scheduling, distributing and monitoring jobs on a cluster Interacting with storage systems Result Row Row Figure: Spark Core Job Cluster EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Architecture Worker Node Cache Executor Task Task Driver Program Cluster Manager Spark Context Worker Node Cache Executor Task Task Figure: Components of a Spark cluster EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Streaming Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Streaming Spark Streaming is used for processing real-time streaming data It is a useful addition to the core Spark API Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams The fundamental stream unit is DStream which is basically a series of RDDs to process the real-time data Figure: Streams In Spark Streaming EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Streaming MLlib Machine Learning Streaming Data Sources Data Storage Systems Spark Streaming Static Data Sources Spark SQL SQL + DataFrames Figure: Overview Of Spark Streaming EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark Streaming Kafka Flume Batches Of Processed Data Batches Of Input Data Input Data Stream HDFS Databases Dashboards HDFS/ S3 Kinesis Streaming Streaming Engine Twitter Figure: Incoming streams of data divided into batches Figure: Data from a variety of sources to various storage systems DStream Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 RDD @ Time 3 RDD @ Time 4 RDD @ Time 1 RDD @ Time 2 DStream Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 flatMap Operation Words DStream Words From Time 0 to 1 Words From Time 0 to 1 Words From Time 0 to 1 Words From Time 0 to 1 Figure: Input data stream divided into discrete chunks of data Figure: Extracting words from an InputStream EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark SQL Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark SQL Features 1 Spark SQL integrates relational processing with Spark’s functional programming 2 Spark SQL is used for the structured/semi structured data analysis in Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark SQL Features 3 Support for various data formats 4 SQL queries can be converted into RDDs for transformations RDD 2 RDD 1 Shuffle transform Drop split point Invoking RDD 2 computes all partitions of RDD 1 EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark SQL Overview 5 Performance And Scalability EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark SQL Features 6 Standard JDBC/ODBC Connectivity User Defined Functions lets users define new Column-based functions to extend the Spark vocabulary 7 User EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Spark SQL Flow Diagram Spark SQL has the following libraries: 1. Data Source API 2. DataFrame API 3. Interpreter & Optimizer 4. SQL Service Data Source API The flow diagram represents a Spark SQL process using all the four libraries in sequence DataFrame API Named Columns Interpreter & Optimizer Spark SQL Service Resilient Distributed Dataset EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
MLlib Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
MLlib Machine learning may be broken down into two classes of algorithms: Machine Learning 1. Supervised algorithms use labelled data in which both the input and output are provided to the algorithm Supervised Unsupervised Clustering - K Means • Classification - Naïve Bayes - SVM 2. Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without labels. • Dimensionality Reduction - Principal Component Analysis - SVD • Regression - Linear - Logistic • EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Mllib Techniques Techniques for Machine Learning. There are 3 common categories of techniques: 1. Classification 2. Clustering 3. Collaborative Filtering EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Mllib - Techniques 1. Classification: It is a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes Some common use cases for classification include: i) Credit card fraud detection ii) Email spam detection 2. Clustering: In clustering, an algorithm groups objects into categories by analyzing similarities between input examples EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
Mllib - Techniques Collaborative Filtering: Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part) 3. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
GraphX Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
GraphX Graph Concepts A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them. The vertices are the objects and the edges are the relationships between them. Relationship: Friends Carol Bob Edge Vertex A directed graph is a graph where the edges have a direction associated with them. E.g. User Bob follows Carol on Twitter. Relationship: Friends Carol Bob Follows EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training