Mahout primarily supports three use cases: recommendations, clustering, and classification. This post is about clustering. A cluster is a group of similar objects. Clustering means grouping data of any form into sets that share similar characteristics. In other words, clustering divides data points into homogeneous classes, or clusters, such that the points in the same group are as similar as possible, while those in different groups are as dissimilar as possible. Given a collection of objects, clustering divides them into groups based on similarity.
Mahout offers different types of clustering. The one discussed here is K-means.
K-means clustering, introduced by MacQueen in 1967, is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem.
K-means clustering is a method of vector quantization that originally comes from signal processing and is now a popular technique for cluster analysis in data mining.
Once k is chosen, the k-means algorithm proceeds through the following steps:
- Partition the objects into k non-empty subsets.
- Identify the cluster centroids (mean points) of the current partition.
- Find the distance of each point from every centroid.
- Assign each point to the cluster whose centroid is at the minimum distance.
- After the points are re-allocated, identify the centroid of each newly formed cluster, and repeat until the assignments stop changing.
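The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not Mahout's implementation: the function names `kmeans` and `mean_point`, the `max_iters` and `seed` parameters, and the choice of a shuffled round-robin initial partition are all made up for this example, and Euclidean distance is assumed as the distance measure.

```python
import math
import random


def mean_point(members):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


def kmeans(points, k, max_iters=100, seed=0):
    """Illustrative k-means on a list of coordinate tuples.

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to points[i].
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Step 1: partition the objects into k non-empty subsets
    # (assumption: shuffle the points, then deal them out round-robin).
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k

    centroids = [None] * k
    for _ in range(max_iters):
        # Step 2: compute the centroid of each cluster in the current partition.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean_point(members)

        # Steps 3-4: measure distances and assign each point to the
        # cluster whose centroid is nearest.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]

        # Step 5: stop once re-allocation no longer changes anything.
        if new_labels == labels:
            break
        labels = new_labels

    return centroids, labels


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    found_centroids, found_labels = kmeans(sample, k=2)
    print(found_centroids)
    print(found_labels)
```

On the sample data, the three points near the origin end up in one cluster and the three points near (10, 10) in the other. Which cluster receives index 0 depends on the random initial partition.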
K-Means: Pizza Hut Clustering Example:
Let’s consider an example based on Pizza Hut delivery points. We can solve this kind of problem with K-means clustering, which is one of the algorithms under the umbrella of clustering.
The algorithm places a centroid and calculates the distance between that centroid and each point. It then finds the minimal distance and groups together the points that lie closest to the same centroid. Given the delivery locations for pizza, the first task is to group them. If we need three delivery locations, that is, three clusters or groups of records in the data we acquire, we find the distance between each centroid and the delivery points.
If the grouping is not good enough or does not give the closest results, we re-position each centroid nearer to its points and group them again, so as to minimize the distance between the cluster centroids and the data points. Then we find the distances again. None of this has to be done manually, as the algorithm does everything. The only thing one has to do is study the inferential statistics, that is, the outcome of the Mahout algorithm, and infer from it whether the result is right or wrong.
Once we find this out, we group together the sets of data that lie a very small distance apart and share similar characteristics. In this way, clustering brings together similar kinds of data or common sets of information.
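One way to check whether re-positioning the centroids actually improved the grouping is to compare the total squared distance from every delivery point to its nearest centroid before and after the move. The sketch below does this in plain Python. The delivery coordinates, the two centroid placements, and the helper name `total_squared_distance` are all hypothetical and chosen only for illustration; Euclidean distance is assumed.

```python
import math


def total_squared_distance(points, centroids):
    """Sum, over all points, of the squared distance to the nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)


# Hypothetical delivery locations as (x, y) grid coordinates.
deliveries = [
    (1, 1), (1, 2), (2, 1),   # one neighbourhood
    (8, 8), (8, 9), (9, 8),   # a second neighbourhood
    (1, 8), (2, 9), (1, 9),   # a third neighbourhood
]

# Two made-up placements of three centroids: a rough first guess, and
# the same centroids moved to the mean of each neighbourhood.
initial = [(0, 0), (5, 5), (10, 10)]
repositioned = [(4 / 3, 4 / 3), (25 / 3, 25 / 3), (4 / 3, 26 / 3)]

before = total_squared_distance(deliveries, initial)
after = total_squared_distance(deliveries, repositioned)

print("before re-positioning:", before)
print("after re-positioning: ", after)
```

A smaller total means the centroids sit closer to the points they serve, which is the quantity the re-positioning step tries to reduce.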
One thing to make sure of here is that there is no past history record set that contains both input and output. Only in that case does one need to go for clustering.
Note: If there is data with a past history record set that contains both input and output, one can go directly for classification.