You can control how distance is measured by supplying your own implementation. You are not limited to the distance measures that ship with the Mahout library. If you know something about your data that the built-in measures do not capture, you can implement the DistanceMeasure interface yourself: supply weighting factors, or implement the distance method with whatever mathematical logic suits your vector points, and return a value that the clustering algorithm uses to decide whether a point falls within a particular centroid. Mahout provides several distance measures, and the choice of measure affects the quality of the resulting clusters.
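To make the idea of a custom, weighted distance concrete, here is a minimal sketch. It is written in plain Python rather than against Mahout's Java API, so the function names `weighted_euclidean_distance` and `nearest_centroid` and the weights used are illustrative assumptions, not part of Mahout.

```python
import math


def weighted_euclidean_distance(v1, v2, weights):
    """Euclidean distance in which each dimension is scaled by a weight.

    Hypothetical helper for illustration only; it is not a Mahout class.
    """
    if not (len(v1) == len(v2) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(v1, v2, weights)))


def nearest_centroid(point, centroids, weights):
    """Return the index of the centroid closest to the point."""
    distances = [weighted_euclidean_distance(point, c, weights) for c in centroids]
    return distances.index(min(distances))


point = [1, 10]
centroids = [[0, 0], [5, 10]]

# With equal weights the second centroid is closer.
print(nearest_centroid(point, centroids, [1.0, 1.0]))
# Ignoring the second dimension (weight 0) makes the first centroid closer.
print(nearest_centroid(point, centroids, [1.0, 0.0]))
```

The same point is assigned to a different centroid once the weights change, which is exactly the kind of control a custom distance measure gives you.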
Cosine Distance Measure
- The cosine distance measure is well suited to text, because it groups documents by the highest-weighted words they have in common.
- If the TF-IDF weighted vectors have higher weights for topic words, similar documents clustered using the cosine distance measure tend to share topic words.
- This causes the cluster centroid vector to have a higher average weight for topic words than for stop-words.
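The points above can be illustrated with a small sketch of cosine distance, computed as one minus the cosine of the angle between two vectors. The code is plain Python, not Mahout's implementation. The toy weights and the handling of zero vectors are assumptions made for this example.

```python
import math


def cosine_distance(v1, v2):
    """Return 1 - cos(angle between v1 and v2)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Convention assumed for this sketch: two zero vectors are identical,
        # a zero vector and a non-zero vector are maximally dissimilar.
        return 0.0 if norm1 == norm2 else 1.0
    return 1.0 - dot / (norm1 * norm2)


# Toy TF-IDF-style weights over the terms ["mahout", "cluster", "the"].
doc_a = [0.9, 0.8, 0.1]  # topic words weighted high, stop-word low
doc_b = [0.8, 0.9, 0.1]  # same topic as doc_a
doc_c = [0.0, 0.1, 0.1]  # shares almost no topic words with doc_a

print(cosine_distance(doc_a, doc_b))  # small: same topic words dominate
print(cosine_distance(doc_a, doc_c))  # larger: little overlap in topic words
```

Because the measure depends only on direction, a long document and a short one about the same topic still end up close together.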
Inter-cluster distance, the distance between cluster centroids, is a good indicator of clustering quality. Good clusters usually do not have centroids that are too close to each other, because that would indicate that the clustering process is creating groups with similar features, and perhaps drawing distinctions between cluster members that are hard to support.
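One simple way to check this is to compute the distance between every pair of centroids and look at the minimum and the average. The sketch below assumes Euclidean distance and uses made-up centroids. The function name `inter_cluster_distances` is hypothetical, not a Mahout call.

```python
import math
from itertools import combinations


def euclidean_distance(v1, v2):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def inter_cluster_distances(centroids, distance=euclidean_distance):
    """Return (minimum, mean) distance over all pairs of centroids."""
    pair_distances = [distance(a, b) for a, b in combinations(centroids, 2)]
    if not pair_distances:
        raise ValueError("need at least two centroids")
    return min(pair_distances), sum(pair_distances) / len(pair_distances)


centroids = [[0, 0], [3, 4], [6, 8]]
closest, average = inter_cluster_distances(centroids)
print(closest, average)
```

A very small minimum suggests that two clusters may really be one group that has been split.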
Intra-cluster distance is the counterpart of inter-cluster distance: it is the distance between members of the same cluster, and in a good clustering it is small compared to the inter-cluster distances. A good distance measure returns a small distance between objects that are similar and so produces clusters that are tighter, and that are therefore more reliably discriminated from one another.
The figure above illustrates the intra-cluster distances that might be obtained when clustering using two different distance measures.
Intra-cluster distance is a measure of how close the points in a cluster lie to each other. It depends on two things: the penalty a distance measure gives for farther objects, and the smaller value it gives for closer objects. The higher the ratio of these two values, the more separated the clusters are.
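As a rough illustration of intra-cluster distance, the sketch below averages the pairwise distances between the members of a cluster. Euclidean distance, the sample points, and the function name `mean_intra_cluster_distance` are assumptions made for this example. Any other distance function can be passed in instead.

```python
import math
from itertools import combinations


def euclidean_distance(v1, v2):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def mean_intra_cluster_distance(points, distance=euclidean_distance):
    """Average distance over all pairs of points in one cluster."""
    pair_distances = [distance(a, b) for a, b in combinations(points, 2)]
    if not pair_distances:
        return 0.0  # a cluster with fewer than two members has no spread
    return sum(pair_distances) / len(pair_distances)


tight_cluster = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.0]]
loose_cluster = [[1.0, 1.0], [4.0, 5.0], [-2.0, 3.0]]

print(mean_intra_cluster_distance(tight_cluster))  # small: members lie close together
print(mean_intra_cluster_distance(loose_cluster))  # larger: members are spread out
```

Comparing this value with the distances between centroids gives a quick feel for how well separated the clusters are.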