Clustering is a fundamental modelling technique that groups similar observations together. The steps involved are the same whichever clustering technique is used.
Here are the steps for Cluster Analysis:
1. Choose the Right Variables – This step involves identifying which attributes matter and how much each one contributes. Select the variables that you believe are important for identifying and understanding differences among groups of observations within the data.
2. Scale the Data – Data samples from different sources may be measured on different scales. For example, personal data might include age ranging from 0 to 100, weight from 40 to 180 and height from 1 to 6 feet. When the variables in the analysis vary in range like this, the variable with the largest range will have the greatest impact on the results unless the data is rescaled.
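3. Calculate Distances – Clustering groups observations by how far apart they are, so a distance (for example, the Euclidean distance) is calculated between observations. If the variables in the analysis vary in range, the variable with the largest range will dominate these distances, which is why scaling comes before this step.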
A point to note is that each attribute may have a different scale. Before combining attributes in a single equation, normalization must be considered so that all attributes and variables are brought onto a common scale. For example, if we analyse weather data sampled from India and the US, the scales differ because one source uses the metric system and the other uses US customary units. Our objective is to bring them to the same standard, because the basic purpose of Cluster Analysis is to calculate distances, and distances are only meaningful when the variables are comparable.
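The effect of scaling on distances can be illustrated with a small sketch. The sample values, the choice of z-score standardization and the helper names `euclidean` and `standardize` are assumptions made for illustration; other rescaling schemes (such as min-max scaling) would serve the same purpose.

```python
import math
import statistics

# Hypothetical sample rows: (age, weight, height in feet)
people = [
    (25, 60.0, 5.5),
    (30, 110.0, 5.6),
    (62, 65.0, 5.4),
]

def euclidean(a, b):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

scaled = standardize(people)

# On the raw data the distance is driven almost entirely by weight,
# the column with the largest spread between these two rows.
raw_distance = euclidean(people[0], people[1])

# After standardizing, every variable contributes on a comparable scale.
scaled_distance = euclidean(scaled[0], scaled[1])

print("raw distance:   ", raw_distance)
print("scaled distance:", scaled_distance)
```

In the raw data the tiny difference in height is practically invisible next to the difference in weight; after standardization each variable is measured in units of its own spread.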
Calculation of Distance between Points in a Cluster
Here, the objective is to group similar points together into one cluster. To decide which clusters are similar, we need a way to measure the distance between two clusters:
1) One way is to take the center of one cluster and the center of the other cluster and calculate the distance between the centers.
2) Or take the closest pair of points, one from each cluster, and find the distance between them.
3) Or take the farthest pair of points, one from each cluster, and find the distance between them.
Single linkage – shortest distance between a point in one cluster and a point in the other cluster. It tends to produce elongated clusters.
Complete linkage – longest distance between a point in one cluster and a point in the other cluster
Average linkage – average distance between each point in one cluster and each point in the other cluster
Centroid – distance between the centroids (mean vector over the variables) of the two clusters
Ward – combines the pair of clusters whose merger leads to the smallest increase in the within-cluster sum of squares, summed over all variables
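The linkage definitions above can be sketched in a few lines of plain Python. The function names and the two example clusters are hypothetical, Euclidean distance is assumed throughout, and the Ward function computes the increase in within-cluster sum of squares that a merge would cause.

```python
import math
from itertools import product

def euclidean(a, b):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Mean vector over the variables of a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def single_linkage(c1, c2):
    """Shortest distance between a point in c1 and a point in c2."""
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Longest distance between a point in c1 and a point in c2."""
    return max(euclidean(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    """Average distance over all cross-cluster pairs of points."""
    total = sum(euclidean(a, b) for a, b in product(c1, c2))
    return total / (len(c1) * len(c2))

def centroid_linkage(c1, c2):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(c1), centroid(c2))

def sum_of_squares(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(euclidean(p, c) ** 2 for p in cluster)

def ward_increase(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    return sum_of_squares(c1 + c2) - sum_of_squares(c1) - sum_of_squares(c2)

# Two small hypothetical clusters lying on a line
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [
    ("single", single_linkage),
    ("complete", complete_linkage),
    ("average", average_linkage),
    ("centroid", centroid_linkage),
    ("ward increase", ward_increase),
]:
    print(name, fn(left, right))
```

The same pair of clusters gets a different "distance" under each definition, which is why the choice of linkage changes which clusters are merged first.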
Note: These concepts may be applied to multiple techniques, and every technique offers several options to choose from. In cluster analysis, the approach that merges clusters step by step using one of these distance methods is called hierarchical cluster analysis. Each method has its own advantages, disadvantages and properties.
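As a rough illustration of how a linkage method plugs into hierarchical clustering, here is a minimal sketch of the bottom-up (agglomerative) variant. The function `agglomerate`, its parameters and the sample points are hypothetical, and the brute-force search is written for readability rather than speed.

```python
import math
from itertools import combinations, product

def euclidean(a, b):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    """Shortest distance between a point in c1 and a point in c2."""
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Longest distance between a point in c1 and a point in c2."""
    return max(euclidean(a, b) for a, b in product(c1, c2))

def agglomerate(points, linkage, n_clusters):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters (as judged by `linkage`) until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of cluster indices with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical sample points: two tight pairs and one isolated point
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]

for cluster in agglomerate(points, complete_linkage, 3):
    print(sorted(cluster))
```

Passing `single_linkage` instead of `complete_linkage` changes only the rule for choosing which clusters to merge; the surrounding loop stays the same.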