How to cluster a very large dataset in R

Question

I have a very large dataset of 70K numeric values representing distances in the range 0-50. I want to cluster these numbers; however, when I try, I get a 70K x 70K distance matrix holding the distance between every pair of numbers in the dataset, which won't fit in memory.

Is there any way to solve this problem?

Sahiti · Answer 1 · Jun 19, 2018

You can first run k-means with a large number of centers, and then perform hierarchical clustering on the coordinates of those centers.

This way, the distance matrix is small: it covers only the centers rather than all 70K points.
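To make the size difference concrete, here is a rough back-of-the-envelope calculation. It assumes the distances are held in an R `dist` object, which stores the n(n-1)/2 lower-triangle entries as 8-byte doubles; the helper name `dist_bytes` is made up for this illustration.

```r
# Approximate memory needed by a dist object on n points:
# n * (n - 1) / 2 pairwise distances, 8 bytes each
dist_bytes <- function(n) n * (n - 1) / 2 * 8

dist_bytes(70000) / 1e9  # about 19.6 GB for all 70K points
dist_bytes(1000) / 1e6   # about 4 MB for 1000 k-means centers
```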

You can try out the code below:


# Data: two Gaussian blobs, 70,000 points in total (35,000 per blob)
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering (CAH) on the full data, without k-means:
# may fail or be very slow on this many rows
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)

# Hierarchical clustering on 1000 k-means centers: much faster
cl <- kmeans(x, centers = 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot(cah, choice = "tree")
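
The same two-stage idea can also be sketched with base R only, which additionally shows how to carry the final cluster labels back to all of the original points. This is a minimal sketch, not the code from the answer: the one-dimensional uniform data, the 1000 intermediate centers, the Ward linkage, and the cut into 5 clusters (`k_final`) are all assumptions chosen for illustration.

```r
set.seed(42)

# Assumed stand-in for the question's data: 70K values between 0 and 50
x <- runif(70000, min = 0, max = 50)

# Stage 1: compress the data into 1000 k-means centers
km <- kmeans(x, centers = 1000, iter.max = 50)

# Stage 2: hierarchical clustering on the centers only,
# so dist() handles 1000 points instead of 70K
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the tree into a hypothetical number of final clusters
k_final <- 5
center_group <- cutree(hc, k = k_final)

# Map every original point to the final cluster of its k-means center
point_cluster <- unname(center_group[km$cluster])

table(point_cluster)
```

In this sketch each center counts equally, no matter how many points it represents. If that matters for your data, `hclust` has a `members` argument that may be worth looking into.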