Using a distance formula other than Euclidean distance in k-means

0 votes
I am working with latitude-longitude data.

My objective is to make clusters based on the distance between two points.

The distance between two points (latitude and longitude in radians, result in km) is =ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371
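For reference, the formula above can be sketched as an R function. This is a minimal sketch: the name `great_circle_km` and its argument names are made up for illustration, and it assumes the inputs are in degrees (so it converts to radians first) and that the Earth is a sphere of radius 6371 km.

```r
# Great-circle distance via the spherical law of cosines.
# Assumption: inputs are in degrees; the result is in kilometres.
great_circle_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  phi1 <- lat1 * to_rad
  phi2 <- lat2 * to_rad
  dlon <- (lon2 - lon1) * to_rad
  cos_angle <- sin(phi1) * sin(phi2) + cos(phi1) * cos(phi2) * cos(dlon)
  # Clamp to [-1, 1] so rounding error cannot push acos() into NaN.
  acos(pmin(1, pmax(-1, cos_angle))) * radius_km
}

# Example: approximate Paris to London distance in km
great_circle_km(48.8566, 2.3522, 51.5074, -0.1278)
```

The function is vectorized, so it also accepts whole columns of coordinates.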

How can I use k-means in R with this distance? Is there any way to override the distance calculation in that process?
Jun 21, 2018 in Data Analytics by DataKing99

1 answer to this question.

0 votes

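K-means is based on variance minimization. The within-cluster variance it minimizes is the same as the sum of squared Euclidean distances from each point to its cluster mean, but this equivalence does not hold for other distances, so swapping in a different distance breaks the algorithm's guarantees.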

If you want a k-means-like algorithm for other distances (where the mean is not an appropriate estimator), use k-medoids (PAM). In contrast to k-means, k-medoids will converge with arbitrary distance functions!
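As a rough sketch of the k-medoids route in R: `pam()` from the `cluster` package accepts a precomputed dissimilarity object, so any distance can be plugged in. The helper `great_circle_km` and the six sample points below are made up for illustration, and the coordinates are assumed to be in degrees.

```r
library(cluster)  # provides pam(); assumed to be installed

# Hypothetical helper: spherical law of cosines, inputs in degrees, output in km.
great_circle_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  cos_angle <- sin(lat1 * to_rad) * sin(lat2 * to_rad) +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * cos((lon2 - lon1) * to_rad)
  acos(pmin(1, pmax(-1, cos_angle))) * radius_km
}

# Made-up sample: three points near Paris, three near New York.
pts <- data.frame(
  lat = c(48.85, 48.90, 48.80, 40.71, 40.75, 40.68),
  lon = c(2.35, 2.40, 2.30, -74.00, -73.95, -74.05)
)

# Full pairwise distance matrix built from the custom distance.
n <- nrow(pts)
d <- outer(seq_len(n), seq_len(n), function(i, j) {
  great_circle_km(pts$lat[i], pts$lon[i], pts$lat[j], pts$lon[j])
})

# as.dist() marks the input as dissimilarities rather than raw coordinates.
fit <- pam(as.dist(d), k = 2)

fit$clustering      # cluster label for each point
pts[fit$id.med, ]   # the medoids are actual observations from the data
```

Because the medoids are real data points, the cluster "centers" stay on the Earth's surface. The trade-off is the full pairwise distance matrix, which grows quadratically with the number of points.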

For Manhattan distance, you can also use k-medians. The median is an appropriate estimator for the L1 norm (the median minimizes the sum of absolute differences; the mean minimizes the sum of squared differences).
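The median/mean claim is easy to check numerically. The following is a small illustrative sketch on a made-up one-dimensional sample; it minimizes each cost with `optimize()` and compares the result with `median()` and `mean()`.

```r
x <- c(1, 2, 3, 4, 10)  # made-up 1-D sample with one outlier

sum_abs <- function(center) sum(abs(x - center))   # L1 cost
sum_sq  <- function(center) sum((x - center)^2)    # squared (L2) cost

# Numerically minimise each cost over the range of the data.
best_l1 <- optimize(sum_abs, range(x))$minimum
best_l2 <- optimize(sum_sq, range(x))$minimum

c(median = median(x), best_l1 = best_l1)  # these should agree closely
c(mean = mean(x), best_l2 = best_l2)      # and so should these
```

Note how the outlier pulls the mean but not the median, which is one reason k-medians is less sensitive to extreme points.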

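For your particular use case, you could also convert each latitude-longitude pair into 3D Cartesian coordinates on the sphere, then use (squared) Euclidean distance and thus k-means. But your cluster centers will be somewhere underground, because the mean of points on a sphere lies inside it!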
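A minimal sketch of the 3D approach with base R's `kmeans()` might look like this. The helper names `to_xyz` and `to_latlon` and the sample points are made up for illustration; coordinates are assumed to be in degrees and the Earth is treated as a unit sphere. Straight-line (chord) distance between points on a sphere grows with the great-circle distance, so nearby points stay nearby after the transform.

```r
# Convert degrees of latitude/longitude to points on the unit sphere.
to_xyz <- function(lat, lon) {
  phi <- lat * pi / 180
  lam <- lon * pi / 180
  cbind(x = cos(phi) * cos(lam), y = cos(phi) * sin(lam), z = sin(phi))
}

# Project 3D points back onto the sphere and return degrees.
to_latlon <- function(xyz) {
  xyz <- xyz / sqrt(rowSums(xyz^2))  # rescale each row to unit length
  cbind(lat = asin(xyz[, "z"]) * 180 / pi,
        lon = atan2(xyz[, "y"], xyz[, "x"]) * 180 / pi)
}

# Made-up sample: three points near Paris, three near New York.
pts <- data.frame(
  lat = c(48.85, 48.90, 48.80, 40.71, 40.75, 40.68),
  lon = c(2.35, 2.40, 2.30, -74.00, -73.95, -74.05)
)

set.seed(1)  # kmeans() uses random starts
fit <- kmeans(to_xyz(pts$lat, pts$lon), centers = 2, nstart = 10)

center_norms <- sqrt(rowSums(fit$centers^2))  # below 1: inside the sphere
centers_ll <- to_latlon(fit$centers)          # centers moved back to the surface
fit$cluster
centers_ll
```

Rescaling each center to unit length is one simple way to bring it back to the surface; whether that projected point is a good representative is a judgment call for the data at hand.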

answered Jun 21, 2018 by darklord
