Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka

5 Best Practices in DevOps Culture EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

What to expect? 2 Spark Features 1 Why Apache Spark? 3 Spark Ecosystem 5 4 Use Case Hands-On Examples EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Big Data EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Data Generated Every Minute! EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

IOT: 50 Billion Devices By 2020 “~6 things online” per person Sensors, Smart, Objects, Device Clustered Systems 50 Rapid adoption rate of digital infrastructure 5x faster than electricity & telephony Billion SmartObjects Tablets, Laptops, Phones Inflection Point World Population 6.307 6.894 7.83 6.721 7.347 2003 2010 2020 2008 2015 EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Big Data Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Big Data Analytics ➢ Big Data Analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. ➢ Big Data Analytics is of two types: 1. Batch Analytics 2. Real-Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Batch vs Real Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Batch vs Real Time Analytics Time ETL Analytics based on the data collected over a period of time is Batch Analytics Stored Stored Stored Client Client Client Client Time Analytics based on immediate data for instant result is Real- Time (Stream) Analytics ms ms ms Client Client Client EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Use Cases of Real Time Analytics EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Use Cases of Real Time Analytics Banking Government Healthcare Telecommunications Stock Market EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Why Spark When Hadoop Is Already There? EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Batch Processing In Hadoop Processing Data Using MapReduce Processing Data Processing Data Processing Data Processing Data Day 1 Day 2 Day 3 Day 4 Day N Hadoop processes the data stored over a period of time Time Lag Day 1 Day 2 Day 3 Day 4 Day N Input Data Input Data Input Data Input Data EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Real Time Processing In Spark Processing Data Processing Data Processing Data Processing Data Day 1 Day 2 Day 3 Day 4 Day N Spark overcomes the time lag issue No Time Lag Day 1 Day 2 Day 3 Day 4 Day N Input Data Input Data Input Data Input Data EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Vs Hadoop Hadoop implements Batch processing on Big Data. It thus cannot deliver to our Real-Time use case needs. Our Requirements: Process data in real-time Handle input from multiple sources Easy to use Faster processing EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Success Story EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Success Story Twitter Sentiment Analysis With Spark NYSE: Real Time Analysis of Stock Market Data Trending be campaigns and attract larger audience Topics to can Sentiment crisis service adjusting and target marketing helps in used create management, Banking: Credit Card Fraud Detection Genomic Sequencing EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Overview EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

What Is Spark?  Apache Spark is an open-source cluster-computing framework for real time processing developed by the Apache Software Foundation Figure: Real Time Processing In Spark  Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance Serial Parallel Reduction in time  It was built on top of Hadoop MapReduce and it extends the MapReduce model to efficiently use more types of computations Figure: Data Parallelism In Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Why Spark? Simple programming layer provides powerful caching and disk persistence capabilities 100x faster than for large scale data processing vs Powerful Caching Speed Can be deployed through Mesos, Hadoop via Yarn, or Spark’s own cluster manger Can be programmed in Scala, Java, Python and R Features Polyglot Deployment EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Using Hadoop Through Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark And Hadoop Spark applications can also be run on YARN (Hadoop NextGen) Spark can run on top of Hadoop’s distributed file system Hadoop Distributed File System (HDFS) to leverage the distributed replicated storage Spark can be used along with MapReduce in the same Hadoop cluster or can be used alone as a processing framework EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark And Hadoop Spark is not intended to replace Hadoop but it can regarded as an extension to it MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real- time processing EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Features Speed Polyglot Advanced Analytics In-Memory Computation Hadoop Integration Machine Learning EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Features 1 Spark runs upto 100x times faster than MapReduce vs 2 Polyglot: Programming in Scala, Python, Java and R EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Features 3 Lazy Evaluation: Delays evaluation till needed 4 Real time computation & low latency because of in-memory computation EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Features 5 Hadoop Integration 1 2 6 Machine Learning for iterative tasks EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Ecosystem Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Ecosystem Used for structured data. Can run unmodified hive queries on existing Hadoop deployment Graph Computation engine (Similar to Giraph). Combines data- parallel and graph- parallel concepts Enables analytical and interactive apps for live streaming data. Package for R language to enable R-users to leverage Spark power from R shell Machine learning libraries being built on top of Spark. GraphX (Graph Computation) Spark Streaming (Streaming) MLlib (Machine Learning) SparkR (R on Spark) Spark SQL (SQL) Spark Core Engine The core engine for entire Spark framework. Provides utilities and architecture for other components EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Ecosystem ML pipelines makes it easier to combine multiple algorithms or workflows Tabular data abstraction introduced by Spark SQL DataFrames ML Pipelines Spark Streaming (Streaming) MLlib (Machine learning) GraphX (Graph Computation) SparkR (R on Spark) Spark SQL (SQL) Spark Core Engine EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Core Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Core Spark Core is the base engine for large-scale parallel and distributed data processing Table Row It is responsible for: Row  Memory management and fault recovery  Scheduling, distributing and monitoring jobs on a cluster  Interacting with storage systems Result Row Row Figure: Spark Core Job Cluster EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Architecture Worker Node Cache Executor Task Task Driver Program Cluster Manager Spark Context Worker Node Cache Executor Task Task Figure: Components of a Spark cluster EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Streaming Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Streaming  Spark Streaming is used for processing real-time streaming data  It is a useful addition to the core Spark API  Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams  The fundamental stream unit is DStream which is basically a series of RDDs to process the real-time data Figure: Streams In Spark Streaming EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Streaming MLlib Machine Learning Streaming Data Sources Data Storage Systems Spark Streaming Static Data Sources Spark SQL SQL + DataFrames Figure: Overview Of Spark Streaming EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark Streaming Kafka Flume Batches Of Processed Data Batches Of Input Data Input Data Stream HDFS Databases Dashboards HDFS/ S3 Kinesis Streaming Streaming Engine Twitter Figure: Incoming streams of data divided into batches Figure: Data from a variety of sources to various storage systems DStream Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 RDD @ Time 3 RDD @ Time 4 RDD @ Time 1 RDD @ Time 2 DStream Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 Data From Time 0 to 1 flatMap Operation Words DStream Words From Time 0 to 1 Words From Time 0 to 1 Words From Time 0 to 1 Words From Time 0 to 1 Figure: Input data stream divided into discrete chunks of data Figure: Extracting words from an InputStream EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark SQL Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark SQL Features 1 Spark SQL integrates relational processing with Spark’s functional programming 2 Spark SQL is used for the structured/semi structured data analysis in Spark EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark SQL Features 3 Support for various data formats 4 SQL queries can be converted into RDDs for transformations RDD 2 RDD 1 Shuffle transform Drop split point Invoking RDD 2 computes all partitions of RDD 1 EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark SQL Overview 5 Performance And Scalability EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark SQL Features 6 Standard JDBC/ODBC Connectivity User Defined Functions lets users define new Column-based functions to extend the Spark vocabulary 7 User EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Spark SQL Flow Diagram  Spark SQL has the following libraries: 1. Data Source API 2. DataFrame API 3. Interpreter & Optimizer 4. SQL Service Data Source API  The flow diagram represents a Spark SQL process using all the four libraries in sequence DataFrame API Named Columns Interpreter & Optimizer Spark SQL Service Resilient Distributed Dataset EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

MLlib Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

MLlib Machine learning may be broken down into two classes of algorithms: Machine Learning 1. Supervised algorithms use labelled data in which both the input and output are provided to the algorithm Supervised Unsupervised Clustering - K Means • Classification - Naïve Bayes - SVM 2. Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without labels. • Dimensionality Reduction - Principal Component Analysis - SVD • Regression - Linear - Logistic • EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Mllib Techniques Techniques for Machine Learning. There are 3 common categories of techniques: 1. Classification 2. Clustering 3. Collaborative Filtering EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Mllib - Techniques 1. Classification: It is a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes Some common use cases for classification include: i) Credit card fraud detection ii) Email spam detection 2. Clustering: In clustering, an algorithm groups objects into categories by analyzing similarities between input examples EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Mllib - Techniques Collaborative Filtering: Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part) 3. EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

GraphX Spark Core Spark Streaming Spark SQL MLlib GraphX EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

GraphX Graph Concepts A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and edges that connect them. The vertices are the objects and the edges are the relationships between them. Relationship: Friends Carol Bob Edge Vertex A directed graph is a graph where the edges have a direction associated with them. E.g. User Bob follows Carol on Twitter. Relationship: Friends Carol Bob Follows EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training

Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka

Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training | Edureka

Presentation Transcript

An introduction to Apache Spark

Using Apache Spark

Introduction to Apache Spark

Parallel Programming With Apache Spark

An Overview of Apache Spark

Hadoop vs Apache Spark

Apache Spark Courses Online

Apache spark training institute

Apache Spark Training | Best Spark Online Training-GOT

Apache Spark

Apache Spark Training | Best Spark Online Training-GOT

Apache spark Interview Questions 2019.pdf

Introduction to Apache Spark

Apache Spark Scala Training

Apache Spark - Introduction

Introduction to Apache Spark

Apache Spark

Apache spark tutorial in Big data hadoop

PySpark Training | PySpark Tutorial For Beginners | Apache Spark With Python Tutorial | Simplilearn