This Edureka "Apache Spark Training" tutorial shows how Apache Spark works in practice, using a Movie Recommendation project as the demonstration. The topics covered in this tutorial are:

1) Use Cases Of Real Time Analytics
2) Movie Recommendation System Using Spark
3) What Is Spark?
4) Getting Movie Dataset
5) Spark Streaming
6) Collaborative Filtering
7) Spark MLlib
8) Fetching Results
9) Storing Results
EDUREKA SPARK CERTIFICATION TRAINING www.edureka.co/apache-spark-scala-training
What to expect?

1) Use Cases Of Real Time Analytics
2) Movie Recommendation System Using Spark
3) What Is Spark?
4) Getting Movie Dataset
5) Spark Streaming
6) Collaborative Filtering
7) Spark MLlib
8) Fetching Results
9) Storing Results
Use Cases of Real Time Analytics

Government: Government agencies perform real-time analysis mostly in the field of national security. Countries need to continuously keep track of updates from all military and police agencies regarding threats to security.

Healthcare: The healthcare domain uses real-time analysis to continuously check the medical status of critical patients. Hospitals on the lookout for blood and organ transplants need to stay in real-time contact with each other during emergencies. Getting medical attention on time is a matter of life and death for patients.

Telecommunications: Companies whose services revolve around calls, video chats and streaming use real-time analysis to reduce customer churn and stay ahead of the competition. They also extract measurements of jitter and delay in mobile networks to improve customer experience.

Banking: Banks handle almost all of the world's money, so it is very important to ensure fault-tolerant transactions across the whole system. Real-time analytics also makes fraud detection possible in banking.

Stock Market: Stock brokers use real-time analytics to predict the movement of stock portfolios. Companies rethink their business model after using real-time analytics to analyze the market demand for their brand.
Movie Recommendation System

Problem Statement: To build a Movie Recommendation System which recommends movies based on a user's preferences using Apache Spark.

Our Requirements: process huge amounts of data, accept input from multiple sources, be easy to use, and process data fast.

Apache Spark meets all of these requirements, which makes it a good fit for implementing our Movie Recommendation System.
Use Case – Flow Diagram
Figure: flow diagram with stages numbered 1 to 6, covering: huge amount of movie rating data; getting input using Spark Streaming (data from streaming / HDFS); machine learning using MLlib (train the data, evaluate ALS); generate recommendations; fetching results using Spark SQL; storing results in an RDBMS for websites.
What is Spark?

Apache Spark is an open-source cluster-computing framework for real-time processing. It was originally developed at UC Berkeley's AMPLab and is now maintained by the Apache Software Foundation. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It extends the MapReduce model to efficiently support more types of computations.

Figure: Real Time Processing In Spark
Figure: Support for multiple source formats
Figure: Lazy Evaluation
Figure: Data Parallelism In Spark (serial vs. parallel execution, showing the reduction in time)
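The lazy evaluation named in the figure caption can be illustrated without Spark. The following is a minimal pure-Python analogy, not Spark code: like Spark transformations, a generator pipeline does no work until a result is actually requested (in Spark, by an action). The names `double`, `lazy_pipeline` and `calls` are hypothetical and exist only for this illustration.

```python
# Pure-Python analogy for lazy evaluation (this is not the Spark API).
calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return 2 * x


def lazy_pipeline(data):
    """Describe a map + filter pipeline without running it yet."""
    mapped = (double(x) for x in data)      # comparable to a transformation
    return (y for y in mapped if y > 4)     # another transformation


pipeline = lazy_pipeline([1, 2, 3, 4])
work_before_action = len(calls)   # nothing has been computed so far

result = list(pipeline)           # comparable to an action: forces evaluation
work_after_action = len(calls)
```

Building the pipeline only records what should happen; the mapping function runs once per element when `list(...)` finally asks for the values.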
Spark Features

Speed: Up to 100x faster than Hadoop MapReduce for large-scale data processing.
Powerful Caching: A simple programming layer provides powerful caching and disk persistence capabilities.
Deployment: Can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager.
Polyglot: Can be programmed in Scala, Java, Python and R.
Movie Dataset

User Ratings from BookMyShow
Movie Ratings In Our Dataset

Getting Dataset

For our Movie Recommendation System, we can get user ratings from many popular websites such as IMDb, Rotten Tomatoes and Times Movie Ratings. These ratings are available in many formats, such as CSV files, text files and databases. We can either stream the data live from the websites or download it and store it in our local file system or HDFS.

Figure: Various File Formats
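As a minimal sketch of reading downloaded ratings from a CSV file, the snippet below parses rating rows with Python's standard `csv` module. The `userId,movieId,rating` column layout and the sample rows are assumptions made for illustration; a real export from a ratings site may look different.

```python
import csv
import io

# Hypothetical sample in an assumed "userId,movieId,rating" layout;
# a real file from a ratings site may use different columns.
SAMPLE_CSV = """userId,movieId,rating
1,10,4.0
1,20,5.0
2,10,3.0
"""


def load_ratings(text_stream):
    """Return a list of (user_id, movie_id, rating) tuples."""
    reader = csv.DictReader(text_stream)
    return [
        (int(row["userId"]), int(row["movieId"]), float(row["rating"]))
        for row in reader
    ]


ratings = load_ratings(io.StringIO(SAMPLE_CSV))
```

For a file on the local file system, the same function can be given an open file object instead of the in-memory string, for example `open(path, newline="")`.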
Spark Streaming

Spark Streaming is used for processing real-time streaming data. It enables high-throughput, fault-tolerant stream processing of live data streams.
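Spark Streaming processes a live stream as a sequence of small batches, one per fixed time interval. The sketch below shows only that grouping idea in plain Python; it is not the Spark Streaming API, and the event tuples, the `micro_batches` helper and the one-second interval are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive time windows.

    Simplification: intervals that received no events are skipped.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[key] for key in sorted(batches)]


# Hypothetical events: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = micro_batches(events, batch_interval=1.0)
```

Each inner list can then be handled by ordinary batch code, which is what lets the same processing logic serve both stored and live data.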
Collaborative Filtering

We will use Collaborative Filtering (CF) to predict a user's rating for a particular movie based on that user's ratings for other movies, combined with other users' ratings for that movie.

User     Shutter Island   Fight Club   Dark Knight   21   Home Alone
Alice    4                5            5             4    4
Bob      3                4            3             3    4
Carol    5                4            4             ?    5
Dave     1                2            ?             5    5

Figure: Predicting the rating of Dave for Dark Knight and Carol for 21 using Collaborative Filtering
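To make the idea concrete, here is a minimal sketch of one simple form of collaborative filtering: a similarity-weighted average of other users' ratings. This is an illustrative neighbourhood method, not the ALS approach the tutorial uses with MLlib. The ratings are read from the slide's table, with the placement of the last three 5s inferred from the flattened layout, and the helper names are hypothetical.

```python
import math

# Ratings as read from the slide's table; the placement of the last
# three 5s (Carol/Home Alone, Dave/21, Dave/Home Alone) is inferred.
ratings = {
    "Alice": {"Shutter Island": 4, "Fight Club": 5, "Dark Knight": 5,
              "21": 4, "Home Alone": 4},
    "Bob": {"Shutter Island": 3, "Fight Club": 4, "Dark Knight": 3,
            "21": 3, "Home Alone": 4},
    "Carol": {"Shutter Island": 5, "Fight Club": 4, "Dark Knight": 4,
              "Home Alone": 5},
    "Dave": {"Shutter Island": 1, "Fight Club": 2, "21": 5,
             "Home Alone": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity over the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[m] * b[m] for m in common)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in common))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in common))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[movie]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


dave_dark_knight = predict("Dave", "Dark Knight")
carol_21 = predict("Carol", "21")
```

Users whose past ratings look more like the target user's get more say in the prediction, so each estimate stays within the range of ratings the other users gave that movie.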
Spark MLlib

MLlib, short for Machine Learning Library, is Spark's library for machine learning.

Figure: Machine Learning Flow Diagram (stages shown: Train Data using Alternating Least Squares (ALS), Machine Learning Using Spark MLlib, Generate Recommendations using Collaborative Filtering)
Figure: Machine Learning Tools (ML Algorithms, Featurization, Pipelines, Persistence, Utilities)
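The alternating idea behind ALS can be sketched in a few lines of plain Python. This is a deliberately simplified rank-1 version, assuming each rating is approximated by the product of one user factor and one movie factor; MLlib's ALS works with higher-rank factors and runs distributed, and none of the names below belong to the MLlib API. The tiny ratings dictionary is made up for the example.

```python
def als_rank1(ratings, iterations=50, reg=0.01):
    """Fit ratings[user][movie] ~ u[user] * v[movie] by alternating
    regularized least squares (rank 1, for illustration only)."""
    users = sorted(ratings)
    movies = sorted({m for rated in ratings.values() for m in rated})
    u = {user: 1.0 for user in users}
    v = {movie: 1.0 for movie in movies}
    for _ in range(iterations):
        # Hold movie factors fixed; solve a 1-D ridge regression per user.
        for user in users:
            rated = ratings[user]
            num = sum(r * v[m] for m, r in rated.items())
            den = reg + sum(v[m] ** 2 for m in rated)
            u[user] = num / den
        # Hold user factors fixed; solve a 1-D ridge regression per movie.
        for movie in movies:
            raters = [(usr, ratings[usr][movie])
                      for usr in users if movie in ratings[usr]]
            num = sum(r * u[usr] for usr, r in raters)
            den = reg + sum(u[usr] ** 2 for usr, _ in raters)
            v[movie] = num / den
    return u, v


# Made-up ratings that follow an exact rank-1 pattern; user "c" has not
# rated movie "y", and the pattern suggests a value close to 6.
toy_ratings = {
    "a": {"x": 2.0, "y": 4.0},
    "b": {"x": 1.0, "y": 2.0},
    "c": {"x": 3.0},
}
u, v = als_rank1(toy_ratings)
predicted_c_y = u["c"] * v["y"]
```

Each half-step is an ordinary least-squares problem because one set of factors is held fixed, which is what makes the alternation cheap. The regularization term `reg` shrinks predictions slightly, so the result lands a little below the exact pattern.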
Fetching Results

Spark SQL for Fetching Results

To get the results from our machine learning output, we use Spark SQL's DataFrame, Dataset and SQL services. The machine learning results need to be stored in an RDBMS so that our web application can display the recommendations to a particular user.

Figure: Machine Learning Output, Spark SQL, Results

Ratings for Movies
Figure: User 77's ratings for different movies

Recommended Movies
Figure: Movies recommended for User 77 (total number of recommendations and top movie recommendations for User 77)
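Picking the top recommendations for one user comes down to sorting predicted ratings and skipping movies the user has already rated. The sketch below shows that step in plain Python rather than Spark SQL; the movie names, the scores and the `top_recommendations` helper are all hypothetical.

```python
def top_recommendations(predicted, already_rated, n=3):
    """Return the n highest-scoring movies the user has not rated yet."""
    candidates = [
        (movie, score)
        for movie, score in predicted.items()
        if movie not in already_rated
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]


# Hypothetical predicted scores and rating history for user 77.
predicted_for_77 = {"Movie A": 4.8, "Movie B": 3.1, "Movie C": 4.2, "Movie D": 4.9}
rated_by_77 = {"Movie D"}

top_for_77 = top_recommendations(predicted_for_77, rated_by_77, n=2)
```

"Movie D" has the highest score but is excluded because the user has already rated it.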
Storing Results

The results of our Movie Recommendation System can be stored either locally or in external storage systems. We can store the recommended movies along with their ratings in a text file or a CSV file. We should prefer storing the results in an RDBMS so that our web application can access them directly and display recommendations and top movies.
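As a small sketch of the RDBMS option, the snippet below writes recommendation rows to a table and reads them back the way a web application might. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for a production RDBMS; the table name, the columns and the sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical (user_id, movie, predicted_rating) rows.
recommendations = [
    (77, "Movie A", 4.8),
    (77, "Movie C", 4.2),
    (12, "Movie B", 3.9),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real RDBMS
conn.execute(
    "CREATE TABLE recommendations ("
    "user_id INTEGER, movie TEXT, predicted_rating REAL)"
)
conn.executemany(
    "INSERT INTO recommendations VALUES (?, ?, ?)", recommendations
)
conn.commit()

# A query a web application might run to show one user's recommendations.
rows_for_77 = conn.execute(
    "SELECT movie, predicted_rating FROM recommendations "
    "WHERE user_id = ? ORDER BY predicted_rating DESC",
    (77,),
).fetchall()
conn.close()
```

Keeping the results in a table lets the web application filter by user and sort by rating with a single query, which a flat text or CSV file does not offer directly.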
Spark Job Trends

The following is the job trend of Apache Spark across the world. Spark has almost three times the average number of jobs compared to its competitors and has been the market leader since 2014. Source: www.indeed.com
Summary

1) What is Spark?
2) Real Time Analytics
3) Movie Recommendation System
4) Spark Job Trends
5) Spark Streaming
6) Spark MLlib
Conclusion

Congrats! We have demonstrated the power of Apache Spark in real-time analysis. The hands-on examples should give you the confidence to work on future Apache Spark projects.

Thank You
Questions/Queries/Feedback