This Edureka k-means clustering algorithm tutorial takes you through an introduction to machine learning, cluster analysis, types of clustering algorithms, k-means clustering and how it works, along with an example/demo in R. This Data Science with R tutorial is ideal for beginners who want to learn how k-means clustering works. You can also read the blog here: https://goo.gl/3aseSs
k-means clustering www.edureka.co/data-science Edureka’s Data Science Certification Training
What Will You Learn Today?
1. Introduction to Machine Learning
2. Cluster analysis
3. Types of clustering
4. Introduction to k-means clustering
5. How does k-means clustering work?
6. Demo in R: Netflix use-case
What is Machine Learning?
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
[Diagram labels: Learn, Algorithm, Perform, Training Data, Build Model, Feedback]
ML Use Case - Google self-driving car
Google's self-driving car is a smart, driverless car. It collects data from its environment through sensors and makes decisions such as when to speed up, when to slow down, when to overtake and when to turn.
Types of Machine Learning: Supervised learning
Feed the classifier a training data set with predefined labels. It will learn to categorize particular data under a specific label.
Example: When and where should I buy a house?
House features: area crime rate, bedrooms, distance to HQ, area (in sq. ft), locality
Types of Machine Learning: Unsupervised learning
An image of fruits is first fed into the system. The system identifies different fruits using features like color and size, and it categorizes them. When a new fruit is shown, the system analyses its features and puts it into the category with similar items.
Unsupervised Learning: Cluster Analysis
What is Clustering?
Clustering means grouping objects based on the information found in the data describing the objects or their relationships. The goal is that objects in one group should be similar to each other but different from objects in another group. It deals with finding structure in a collection of unlabeled data.
Some examples of clustering methods are:
• K-means clustering
• Fuzzy C-means clustering
• Hierarchical clustering
Clustering Use Cases
• Marketing: Discovering distinct groups in customer databases, such as customers who make a lot of long-distance calls.
• Insurance: Identifying groups of crop insurance policy holders with a high average claim rate. Farmers crash crops when it is “profitable”.
• Land use: Identifying areas of similar land use in a GIS database.
• Seismic studies: Identifying probable areas for oil/gas exploration based on seismic data.
Types of clustering
Types of Clustering
Exclusive Clustering
• An item belongs exclusively to one cluster, not several.
• K-means does this sort of exclusive clustering.
Overlapping Clustering
• An item can belong to multiple clusters.
• Its degree of association with each cluster is known.
• Fuzzy C-means does this sort of overlapping clustering.
Hierarchical Clustering
• When two clusters have a parent-child relationship or a tree-like structure, it is hierarchical clustering.
K-means clustering
K-means clustering
k-means clustering is one of the simplest unsupervised learning algorithms for solving well-known clustering problems. It divides the entire dataset (the total population) into k clusters (groups).
k-means clustering requires the following two inputs:
1. K = number of clusters
2. Training set of m points = {x1, x2, x3, ..., xm}
Example - Google News
Various news URLs related to Trump and Modi are grouped under one section. K-means clustering automatically clusters new stories about the same topic into pre-defined clusters.
Example
Given a plot of where students live in an area, I need to find specific locations to build schools so that the students don't have to travel far.
Example - Solution
This looks good.
But how did he do that? I’ll show you how.
How does k-means work?
How does k-means work? (Choose number of clusters)
Steps: Choose number of clusters, Initialization, Cluster assignment, Move centroid, Optimization, Convergence.
The WSS (within-cluster sum of squares) is defined as the sum of the squared distances between each member of a cluster and its centroid. Mathematically:
$$\mathrm{WSS} = \sum_{i=1}^{m} \lVert p^{(i)} - q^{(i)} \rVert^{2}$$
where p(i) = data point, q(i) = closest centroid to the data point, and m = number of data points.
The idea of the elbow method is to choose the k after which adding more clusters reduces the WSS only marginally.
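As a rough illustration of the WSS definition, here is a minimal Python sketch (Python is used only for illustration; the deck's demo is in R). The function names `squared_distance` and `wss` and the four sample points are assumptions made for this example, not part of the original tutorial.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wss(points, centroids):
    """Sum, over all points, of the squared distance to the closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical toy data: two pairs of points far apart on the x-axis.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]

one_centroid = [(5.0, 2.0)]                # k = 1: the overall mean
two_centroids = [(1.0, 2.0), (9.0, 2.0)]   # k = 2: the mean of each pair

print(wss(points, one_centroid))   # 68.0
print(wss(points, two_centroids))  # 4.0
```

Going from one centroid to two drops the WSS from 68 to 4 on this toy data. Plotting WSS against k and looking for the point where such drops flatten out is the elbow method described above.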
How does k-means work? (Initialization)
Randomly initialize k points called the cluster centroids. Here, k = 2. The value of k (number of clusters) can be determined from the elbow curve.
[Plot: data points and cluster centroids on X-Y axes]
How does k-means work? (Cluster assignment)
Compute the distance between each data point and each initialized cluster centroid. Each data point is assigned to its nearest centroid, which divides the data points into two groups.
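The assignment step can be sketched in a few lines of Python (again for illustration only; the deck's demo uses R). The name `assign_clusters`, the sample points and the starting centroids are hypothetical, and squared Euclidean distance is assumed as the distance measure.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [squared_distance(p, c) for c in centroids]
        # On a tie, the centroid with the lower index wins.
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical data: two points near the origin, two near (9, 9).
points = [(1, 1), (2, 1), (8, 8), (9, 9)]
centroids = [(0, 0), (10, 10)]

print(assign_clusters(points, centroids))  # [0, 0, 1, 1]
```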
How does k-means work? (Move centroid)
Compute the mean of the blue dots and reposition the blue cluster centroid to this mean. Compute the mean of the orange dots and reposition the orange cluster centroid to this mean.
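A minimal Python sketch of the centroid update follows (illustrative only; the deck's demo is in R). The name `move_centroids` is hypothetical, and keeping the old centroid when a cluster ends up empty is an assumption of this sketch, since the slides do not cover that case.

```python
def move_centroids(points, labels, centroids):
    """Reposition each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            # Coordinate-wise mean of the cluster members.
            mean = tuple(sum(coord) / len(members) for coord in zip(*members))
            new_centroids.append(mean)
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
    return new_centroids


# Hypothetical data: labels 0 and 1 play the role of the blue and orange dots.
points = [(1, 1), (2, 1), (8, 8), (9, 9)]
labels = [0, 0, 1, 1]
centroids = [(0, 0), (10, 10)]

print(move_centroids(points, labels, centroids))  # [(1.5, 1.0), (8.5, 8.5)]
```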
How does k-means work? (Optimization)
Repeat the previous two steps iteratively until the cluster centroids stop changing their positions.
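Putting the steps together, the whole loop might look like the Python sketch below (illustrative only; the deck's demo relies on R instead). The function name `kmeans`, the `max_iter` and `seed` parameters, and the choice of k random data points as initial centroids are all assumptions of this sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid update until the centroids stop moving."""
    rng = random.Random(seed)
    # Assumption: initialize with k distinct data points picked at random.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Cluster assignment: index of the nearest centroid for each point.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Move centroid: mean of each cluster (an empty cluster keeps its centroid).
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])
        # Convergence: stop once no centroid changed position.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy data the two groups are far enough apart that the loop settles on one centroid per group. In general the result of k-means can depend on the random initialization, so it is common to run it several times and keep the run with the lowest WSS.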
How does k-means work? (Convergence)
Finally, the k-means clustering algorithm converges and divides the data points into two clusters, clearly visible in orange and blue.
Problem Statement
Challenge: Netflix wanted to increase its business by showing the most popular movies on its website.
Solution: Netflix decided to group the movies based on budget, gross and Facebook likes.
Approach: Netflix took an IMDb dataset of 5000 values and applied k-means clustering to group it.
But how would I know which movie set to show and which not to?
Demo
Solution - R Script
Output
We got three clusters based on budget and gross. Let's see how good these clusters are. Running the command cl gives the following output:
Within cluster sum of squares by cluster: (between_SS / total_SS = 72.4 %)
The higher this percentage, the better the model.
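To make the between_SS / total_SS figure concrete, here is a hedged Python sketch of how such a ratio can be computed (the deck itself gets the figure from R's output). It assumes total_SS is the sum of squared distances from every point to the overall mean and that between_SS equals total_SS minus the within-cluster sum of squares, which holds when each centroid is the mean of its cluster. The name `between_over_total` and the sample data are hypothetical and unrelated to the IMDb dataset.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def between_over_total(points, labels, centroids):
    """Share of the total sum of squares explained by the clustering."""
    n = len(points)
    grand_mean = tuple(sum(coord) / n for coord in zip(*points))
    total_ss = sum(squared_distance(p, grand_mean) for p in points)
    within_ss = sum(
        squared_distance(p, centroids[label]) for p, label in zip(points, labels)
    )
    # Assumes each centroid is the mean of its cluster.
    between_ss = total_ss - within_ss
    return between_ss / total_ss


# Hypothetical toy clustering: two pairs of points, one centroid per pair.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (9.0, 2.0)]

ratio = between_over_total(points, labels, centroids)
print(f"{100 * ratio:.1f} %")  # 94.1 %
```

A ratio close to 100 % means most of the spread in the data lies between clusters rather than inside them.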
Output
Further, let's relate the cluster assignment to individual characteristics like director Facebook likes (column 5) and movie Facebook likes (column 28). Cluster 2 has the maximum movie likes as well as director likes.
I want to know the profit values of the movies. Try this out.
Hmm… I will go with cluster 2. It is making the maximum profit and has the maximum Facebook likes.