Published on Jul 16,2018
24.4K Views
Email Post

In this blog, you will understand what is K-means clustering and how it can be implemented on the criminal data collected in various US states. The data contains crimes committed like: assault, murder, and rape in arrests per 100,000 residents in each of the 50 US states in 1973. Along with analyzing the data you will also learn about:

  • Finding the optimal number of clusters.
  • Minimizing distortion
  • Creating and analyzing the elbow curve.
  • Understanding the mechanism of k-means algorithm.

Let us start with the analysis. The data looks as:

dataset

Click on the image to download this dataset

Need this dataset? Click on the above image to download it. 

First let’s prepare the data for the analysis. In order to do so, we should remove any NA values that might be present in the data and convert the data into a matrix.

> crime0 <- na.omit(USArrests)
> crime <- data.matrix (crime0)
> str(crime)
 num [1:50, 1:4] 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 - attr(*, "dimnames")=List of 2
 ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"

Let us take the number of clusters to be 5. Kmeans() function takes the input data and the number of clusters in which the data is to be clustered. The syntax is : kmeans( data, k) where k is the number of cluster centers.

> cl <- kmeans(crime, 5)
> class(cl)
[1] "kmeans"

Analyzing the Clustering :

> str(cl)
List of 9
 $ cluster : Named int [1:50] 5 3 3 5 3 5 4 5 3 5 ...
 ..- attr(*, "names")= chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ centers : num [1:5, 1:4] 2.95 6.11 12.14 5.59 11.3 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : chr [1:5] "1" "2" "3" "4" ...
 .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
 $ totss : num 355808
 $ withinss : num [1:5] 4548 2286 16272 1480 3653
 $ tot.withinss: num 28240
 $ betweenss : num 327568
 $ size : int [1:5] 10 9 14 10 7
 $ iter : int 3
 $ ifault : int 0
 - attr(*, "class")= chr "kmeans"

The str() function gives the structure of the kmeans which includes various parameters like withinss, betweenss, etc, analyzing which you can find out the performance of kmeans.

betweenss : Between sum of squares i.e. Intracluster similarity

withinss : Within sum of square i.e. Intercluster similarity

totwithinss : Sum of all the withinss of all the clusters i.e.Total intra-cluster similarity

A good clustering, will have a lower value of withinss and higher value of betweenss which depends on the number of clusters ‘k’ chosen initially. Let us see how we can find the optimal value of ‘k’.

Finding the optimal value of ‘k’ 

An optimal value of ‘k’ is the value which gives us a converged set of clusters with minimum distortion. Greater the distortion, worse will be the clusters formed.

Distortion:

The distortion can be calculated in terms of  ‘withinss’ from each of the clusters. Lesser the value of ‘withinss’ of a particular cluster, more densely populated it will be, thus minimum distortion.

kmeans.wss.k <- function(crime, k){
 km = kmeans(crime, k)
 return (km$tot.withinss)
 }

This function takes up the data and the value of k and returns the ‘km$totwithinss’ for it. ‘km$totwithinss’ is the total within-cluster sum of squares, thus including withinss of all the 5 clusters created i.e. sum(withinss). Higher the value of  ‘km$totwithinss’, greater will be the distortion.

For k=5, withinss is 24417.02

> kmeans.wss.k(crime,5)
 [1] 24417.02

Let’s increase the value of k from 5 to 10, and observe the difference.

> kmeans.wss.k(crime,10)
 [1] 11083.04

It can be seen that as the value of K increases, distortion decreases.

We can take out the different values of ‘km$totwithinss’ and plot them in a graph to find the relationship between distortion and the value of k. The following function does that for us:

> kmeans.dis <- function(crime, maxk){
+ dis=(nrow(crime)-1)*sum(apply(crime,2,var))
+ dis[2:maxk]=sapply (2:maxk, kmeans.wss.k, crime=crime)
+ return(dis)
+ }
> maxk = 10
> dis = kmeans.dis(crime, maxk);
> plot(1:maxk, dis, type='b', xlab="Number of Clusters",
 + ylab="Distortion",
 + col="blue")

Capture1 (1)

Ta Da!!! Thus we have the famous elbow curve with us.

Elbow Curve:

This is the plot between ‘k’, the number of clusters and the ‘totwithinss’ (or distortion) for each value of k. You can see when the number of cluster is less, there is a gradual decrease in distortion but as we keep on increasing the value of k, the rate of reduction of distortion values becomes constant.

This value of k beyond which the distortion rate becomes constant is the optimal value. Here  k=4.

Let us apply some animation to understand how R gave us the clustered results.

> library(animation)
> cl<- kmeans.ani(crime, 4)

Kmeans clustering Algorithm:

Let us understand the algorithm on which k-means clustering works:

Step #1. If k=4, we select 4 random points and assume them to be cluster centers for the clusters to be created.

Step #2. We take up a random data point from the space and find out its distance from all the 4 clusters centers. If  the data point is closest to the green cluster center, it is colored green and similarly all the points are categorised among the 4 clusters.

Capture10

Step #3. Now we calculate the centroid of all the green points and assign that point as the cluster center for that cluster.

Similarly, we calculate centroids for all the 4 colored(clustered) points and assign the new centroids as the cluster centers.

Capture9

Step #4. Step-2 and step-3 are run iteratively, unless the cluster centers converge at a point and no longer move.

Capture8

Capture7

Capture6

Capture5

Capture4

Capture3

Thus, we reach the converged clusters centers.

It can be seen that the data is divided into 4 clusters. The cluster centers are :

> cl$centers
 Murder Assault UrbanPop Rape
Texas 4.740741 104.8519 62.96296 16.10
Louisiana 10.907143 219.9286 71.71429 25.95
South Carolina 13.375000 284.5000 46.25000 25.05
New Mexico 11.040000 298.0000 77.60000 32.68

Cluster-4 with ‘New Mexico’ as the cluster center has a huge crime rate with the highest population as well.

Cluster-3 and Cluster-2 follow up.

Each state is assigned a cluster, depending on which we can now predict its crime ranking. The output looks as :

cluster-output

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

3-D Graphs in R Commander

Get Started with Business Analytics with R

Get Started with Data Science

Share on

Browse Categories

Comments
12 Comments