Implementation of k-means Clustering - Edureka


In this blog, you will learn what k-means clustering is and how it can be applied to crime data collected across US states. The data records arrests per 100,000 residents for murder, assault, and rape in each of the 50 US states in 1973, along with the percentage of each state's population living in urban areas. Along with analyzing the data, you will also learn about:

- Finding the optimal number of clusters.
- Minimizing distortion.
- Creating and analyzing the elbow curve.
- Understanding the mechanism of the k-means algorithm.

Let us start with the analysis. The data looks like this:

[Image: preview of the dataset, which can be downloaded by clicking the image.] The same data also ships with R as the built-in USArrests data set, which is what the code below uses.

First, let’s prepare the data for the analysis. To do so, we remove any NA values that might be present and convert the data into a numeric matrix.

> crime0 <- na.omit(USArrests)
> crime <- data.matrix (crime0)
> str(crime)
 num [1:50, 1:4] 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 - attr(*, "dimnames")=List of 2
 ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"

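Let us take the number of clusters to be 5. The kmeans() function takes the input data and the number of clusters into which the data is to be grouped. The syntax is kmeans(data, k), where k is the number of cluster centers. Note that kmeans() chooses its starting centers at random, so the numbers you get may differ from run to run and from the output shown here.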

> cl <- kmeans(crime, 5)
> class(cl)
[1] "kmeans"
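
Because kmeans() picks its starting centers at random, two runs with the same k can return different clusters. The following is a minimal sketch of one way to make a run repeatable. The seed value 42 and nstart = 25 are illustrative assumptions, not settings used in the original analysis.

```r
# Prepare the data exactly as above
crime <- data.matrix(na.omit(USArrests))

# Fix the random seed so the random starting centers are reproducible
set.seed(42)  # arbitrary seed, chosen only for illustration

# nstart = 25 tries 25 random sets of starting centers and keeps the best fit
cl_a <- kmeans(crime, centers = 5, nstart = 25)

# Re-running with the same seed reproduces the same partition
set.seed(42)
cl_b <- kmeans(crime, centers = 5, nstart = 25)

identical(cl_a$cluster, cl_b$cluster)
cl_a$size
```

Without set.seed(), the cluster numbering and sizes can change between runs even when the underlying grouping is similar.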

Analyzing the clustering:

> str(cl)
List of 9
 $ cluster : Named int [1:50] 5 3 3 5 3 5 4 5 3 5 ...
 ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ centers : num [1:5, 1:4] 2.95 6.11 12.14 5.59 11.3 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : chr [1:5] "1" "2" "3" "4" ...
 .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ totss : num 355808
 $ withinss : num [1:5] 4548 2286 16272 1480 3653
 $ tot.withinss: num 28240
 $ betweenss : num 327568
 $ size : int [1:5] 10 9 14 10 7
 $ iter : int 3
 $ ifault : int 0
 - attr(*, "class")= chr "kmeans"

The str() function shows the structure of the kmeans object, which includes components such as withinss and betweenss. Analyzing these tells you how well the clustering performed.

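betweenss : Between-cluster sum of squares, i.e. how far apart the cluster centers are (inter-cluster separation)

withinss : Within-cluster sum of squares, one value per cluster, i.e. how tightly the points of a cluster sit around its center (intra-cluster scatter)

tot.withinss : Sum of the withinss values of all the clusters, i.e. the total intra-cluster scatter

totss : Total sum of squares, which equals tot.withinss + betweenss (in the output above, 28240 + 327568 = 355808)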

A good clustering will have a low value of withinss and a high value of betweenss. Both depend on the number of clusters ‘k’ chosen initially. Let us see how we can find the optimal value of ‘k’.
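
The relationship between these components can be checked directly on a fitted object. This is a small sketch under the assumption of an arbitrary seed; the variable name ratio is introduced here only for illustration.

```r
crime <- data.matrix(na.omit(USArrests))
set.seed(1)  # arbitrary seed, for illustration only
cl <- kmeans(crime, 5)

# The total sum of squares splits into a within-cluster and a between-cluster part
all.equal(cl$totss, cl$tot.withinss + cl$betweenss)

# tot.withinss is the sum of the per-cluster withinss values
all.equal(cl$tot.withinss, sum(cl$withinss))

# Share of the total variation that lies between clusters (closer to 1 is better)
ratio <- cl$betweenss / cl$totss
ratio
```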

Finding the optimal value of ‘k’

An optimal value of ‘k’ is one that gives a converged set of clusters with low distortion without using more clusters than necessary. The greater the distortion, the worse the clusters formed.

Distortion:

The distortion can be calculated in terms of ‘withinss’ from each of the clusters. The smaller the value of ‘withinss’ for a particular cluster, the more tightly its points are packed around the center, and thus the lower the distortion.
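
To make the definition concrete, the sketch below recomputes the per-cluster distortion by hand: for each cluster, it sums the squared Euclidean distances from the member points to the cluster center. The seed and the name wss_manual are assumptions made for this example.

```r
crime <- data.matrix(na.omit(USArrests))
set.seed(1)  # arbitrary seed, for illustration only
cl <- kmeans(crime, 5)

# For cluster j: subtract the center from every member, square, and sum
wss_manual <- sapply(1:5, function(j) {
  members <- crime[cl$cluster == j, , drop = FALSE]
  sum(sweep(members, 2, cl$centers[j, ])^2)
})

# The hand-computed values should agree with what kmeans() reports
all.equal(wss_manual, cl$withinss, check.attributes = FALSE)
```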

kmeans.wss.k <- function(crime, k){
 km = kmeans(crime, k)
 return (km$tot.withinss)
 }

This function takes the data and the value of k and returns ‘km$tot.withinss’ for it. ‘km$tot.withinss’ is the total within-cluster sum of squares, i.e. sum(km$withinss) over all k clusters created. The higher the value of ‘km$tot.withinss’, the greater the distortion.

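For k=5, tot.withinss is 24417.02 in this run. (It differs from the 28240 reported by str(cl) above because each call to kmeans() starts from different random centers.)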

> kmeans.wss.k(crime,5)
 [1] 24417.02

Let’s increase the value of k from 5 to 10, and observe the difference.

> kmeans.wss.k(crime,10)
 [1] 11083.04

It can be seen that as the value of k increases, distortion decreases.

We can collect the values of ‘km$tot.withinss’ for different values of k and plot them in a graph to find the relationship between distortion and the value of k. The following function does that for us:

> kmeans.dis <- function(crime, maxk){
+ # k = 1: the distortion is simply the total sum of squares
+ dis=(nrow(crime)-1)*sum(apply(crime,2,var))
+ # k = 2..maxk: total within-cluster sum of squares from kmeans()
+ dis[2:maxk]=sapply (2:maxk, kmeans.wss.k, crime=crime)
+ return(dis)
+ }
> maxk = 10
> dis = kmeans.dis(crime, maxk)
> plot(1:maxk, dis, type='b', xlab="Number of Clusters",
+ ylab="Distortion",
+ col="blue")

Ta Da!!! Thus we have the famous elbow curve with us.

Elbow Curve:

This is the plot of ‘k’, the number of clusters, against ‘tot.withinss’ (the distortion) for each value of k. You can see that when the number of clusters is small, distortion drops steeply with each added cluster, but as we keep on increasing the value of k, the reduction in distortion levels off.

The value of k beyond which distortion stops dropping appreciably is taken as the optimal value. Here k=4.
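
Reading the elbow off a plot is a judgment call, so it can help to look at the numbers behind the curve. The sketch below tabulates how much distortion is removed by each additional cluster. The seed, nstart = 25, and the names gain and pct_gain are assumptions made for this example, and the exact figures will vary with the random starts.

```r
crime <- data.matrix(na.omit(USArrests))
set.seed(1)  # arbitrary seed, for illustration only
maxk <- 10

# Distortion for k = 1 is the total sum of squares; for k >= 2 use kmeans()
dis <- numeric(maxk)
dis[1] <- (nrow(crime) - 1) * sum(apply(crime, 2, var))
dis[2:maxk] <- sapply(2:maxk, function(k) {
  kmeans(crime, k, nstart = 25)$tot.withinss
})

# Reduction in distortion gained by going from k - 1 to k clusters
gain <- -diff(dis)
pct_gain <- round(100 * gain / dis[-maxk], 1)

data.frame(k = 2:maxk, distortion = round(dis[-1]), pct_gain = pct_gain)
```

A large percentage gain followed by consistently small ones suggests the elbow; this is a heuristic, not a formal test.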

Let us apply some animation to understand how R gave us the clustered results.

> library(animation)
> cl<- kmeans.ani(crime, 4)

Kmeans clustering Algorithm:

Let us understand the algorithm on which k-means clustering works:

Step #1. If k=4, we select 4 random points and assume them to be cluster centers for the clusters to be created.

Step #2. We take each data point in turn and find its distance from all 4 cluster centers. If the data point is closest to the green cluster center, it is colored green; in the same way, all the points are assigned to one of the 4 clusters.

Step #3. Now we calculate the centroid of all the green points and assign that point as the cluster center for that cluster.

Similarly, we calculate the centroid of each of the 4 colored (clustered) groups of points and assign the new centroids as the cluster centers.

Step #4. Step 2 and step 3 are repeated until the cluster centers converge and no longer move.

Thus, we reach the converged cluster centers.
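
The four steps above can be sketched from scratch in a few lines of R. This is an illustrative implementation of the simple iterative scheme described here (often called Lloyd's algorithm); R's own kmeans() uses the Hartigan-Wong variant by default, so its results may differ. The function name simple_kmeans and its max_iter parameter are hypothetical, introduced only for this example.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  # Step 1: pick k distinct data points at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every point to every center,
    # then assign each point to its nearest center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Step 4: stop once no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: move each center to the centroid of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

crime <- data.matrix(na.omit(USArrests))
set.seed(1)  # arbitrary seed, for illustration only
fit <- simple_kmeans(crime, 4)
table(fit$cluster)
```

Because the initial centers are rows of the data, the rows of fit$centers keep the names of the states that were drawn in step 1, even though the values are cluster averages after the first update.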

It can be seen that the data is divided into 4 clusters. The cluster centers are:

> cl$centers
                  Murder  Assault UrbanPop  Rape
Texas           4.740741 104.8519 62.96296 16.10
Louisiana      10.907143 219.9286 71.71429 25.95
South Carolina 13.375000 284.5000 46.25000 25.05
New Mexico     11.040000 298.0000 77.60000 32.68

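Cluster-4, labelled ‘New Mexico’, has the highest assault and rape rates along with the highest urban population share. (The state names on the rows appear to be inherited from the data points that kmeans.ani() picked as its random starting centers; the values themselves are cluster averages, not the figures for those states.)

Cluster-3 and Cluster-2 follow.

Each state is assigned to a cluster, which places it in a group of states with similar crime levels. For a kmeans() result such as the one shown by str(cl) earlier, the assignment for each state is stored in the cluster component (cl$cluster).

[Image: cluster assigned to each state.]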
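
One way to list which state landed in which cluster, and to summarize each cluster, is sketched below. It uses kmeans() from base R rather than the animated version; the seed, nstart = 25, and the names assignments and profile are assumptions made for this example.

```r
crime <- data.matrix(na.omit(USArrests))
set.seed(1)  # arbitrary seed, for illustration only
cl <- kmeans(crime, 4, nstart = 25)

# Cluster label for each state, as a data frame
assignments <- data.frame(state = names(cl$cluster),
                          cluster = unname(cl$cluster))
head(assignments)

# States grouped by cluster
split(assignments$state, assignments$cluster)

# Mean of every variable within each cluster (the same values as cl$centers)
profile <- aggregate(crime, by = list(cluster = cl$cluster), FUN = mean)
profile
```

Comparing the rows of profile shows which variables are high or low in each cluster, which is a simple way to describe what characterizes a cluster.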

Got a question for us? Please mention it in the comments section and we will get back to you.


Comments

Mohamed Ibn Ousmane Bah says:
May 25, 2017 at 11:48 pm GMT
Hello!
how can I assign the clusters according to the results ?
goutham says:
Mar 3, 2017 at 6:06 am GMT
sir i cant find the code for animations in ur blog can u please naviagate me to the animation code.
- EdurekaSupport says:
  Mar 3, 2017 at 8:54 am GMT
  Hey Goutham, thanks for checking out our blog. For cluster animation, you need to install and load the package named “animation” in RStudio. The following is the code to do cluster animation.
  > install.packages("animation")
  > library(animation)
  > cl <- kmeans.ani(crime, 4)
  You will be able to see the animation running in the plot window of RStudio.
  Hope this helps. Cheers!
Malar Saravanan says:
Feb 13, 2017 at 3:23 am GMT
I’m getting the code works completely correct for my data set till distortion but I’m stuck to do assignment of cluster after that.. from k means algorithm can you please provide instructions how to implement it..
Satish says:
May 5, 2016 at 7:35 am GMT
So the provided dataset is a labeled one with cluster ID. Am i right? We can use it for calculating the accuracy of the clustering
- EdurekaSupport says:
  Feb 15, 2017 at 10:44 am GMT
  Hey Satish, thanks for checking out our blog.
  The clustering accuracy can be determined by the animation chart of K-means clustering using the “animation” package, which shows the clustering process. If the cluster groups or centres aren’t overlapping with each other, we can conclude the clustering accuracy.
  We also can measure the accuracy of the new labelling by comparing it with the original labeling (original labelling is the ground truth) i.e using a table function.
  Hope this helps. Cheers!
septianusa says:
Apr 26, 2016 at 4:21 am GMT
how we can know characteristics for each cluster?
- EdurekaSupport says:
  Feb 15, 2017 at 9:21 am GMT
  Hey Septianusa, thanks for checking out our tutorial! In order to determine the characteristics of each cluster, you will have to analyse the variables separately for each cluster.
  One of the things you can do is calculate the mean/median of the variables for every cluster and then determine which variable has a higher value in which cluster.
  Or you can build classification models to understand the characteristics of clusters.
  For cluster k, build a classification model with two classes: one is ‘k’; the other is ‘others’. The training data consists of data points in cluster k and data points randomly sampled from all other clusters. By examining the variable importance of the classification model, you can tell which features have a bigger impact on the classification performance, in other words, which features better distinguish class k from class ‘others’.
  Hope this helps. Cheers!
Antônio Vieira says:
May 2, 2015 at 4:26 pm GMT
Where we can download this dataset ?
- sai says:
  May 12, 2015 at 10:06 am GMT
  click on the image with sample data set (at the top).
Vishwa says:
Apr 8, 2015 at 7:33 am GMT
There is a error in the function (kmeans.dis) code in line 3,
it should be “sapply (2:maxk, kmeans.wss.k, train3=train3)” instead of “sapply (2:maxk, kmeans.dis.k, train3=train3)”
Function “kmeans.wss.k” is the one which is created earlier and same has to be used here.
Please correct me if i am wrong ?
- EdurekaSupport says:
  Apr 23, 2015 at 1:03 pm GMT
  Thanks for highlighting this Vishwa. We have made the necessary changes.
