I have a large dataset of 70K numeric values representing various distances, ranging from 0 to 50. I want to cluster these numbers; however, when I try to do so, I get a 70K x 70K distance matrix holding the distance between every pair of numbers in the dataset, which won't fit in memory.

So, is there any way to solve this problem?

Jun 19, 2018

## 1 answer to this question.

You can first run kmeans to reduce the data to a manageable number of centers, and then perform hierarchical clustering on the coordinates of those centers.

This way, the distance matrix is much smaller: with the 1,000 centers used in the code below, it is 1000 x 1000 instead of 70K x 70K.

You can try out the code below:

```
# Data: two Gaussian blobs, 70,000 rows in total
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

library(FactoMineR)

# CAH (hierarchical clustering) without kmeans: may fail on data this size,
# because it needs the full pairwise distance matrix. Uncomment to try it.
# cah.test <- HCPC(as.data.frame(x), graph = FALSE, nb.clust = -1)

# CAH with kmeans: much faster, because only the 1000 centers are clustered
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(as.data.frame(cl$centers), graph = FALSE, nb.clust = -1)
plot.HCPC(cah, choice = "tree")
```

answered Jun 19, 2018
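The example above uses simulated two-column data. For the one-dimensional case in the question, the same two-stage idea could look roughly like the sketch below, using only base R. Everything in it is an assumption made for illustration: `d` is a random stand-in for the real 70K distances, and both the number of centers (500) and the number of final groups (4) are arbitrary choices you would tune for your data. `kmeans` may emit convergence warnings on data this size; they do not stop the script.

```r
# Hypothetical stand-in for the 70K distances in the question (range 0 to 50)
set.seed(42)
d <- runif(70000, min = 0, max = 50)
n <- length(d)

# Why the direct approach fails: dist() stores n * (n - 1) / 2 doubles
# (8 bytes each), which is roughly 19.6 GB for n = 70000
full_gb <- n * (n - 1) / 2 * 8 / 1e9

# Step 1: compress the data to a few hundred centers (500 is arbitrary)
km <- kmeans(d, centers = 500, iter.max = 50)

# Step 2: hierarchical clustering on the centers only (500 x 500 distances)
hc <- hclust(dist(km$centers), method = "ward.D2")

# Step 3: cut the tree into the desired number of groups (4 is arbitrary)
center_group <- cutree(hc, k = 4)

# Step 4: give every original value the group of its kmeans center
groups <- unname(center_group[km$cluster])
table(groups)
```

The last step is what carries the result back to the full dataset: `km$cluster` says which center each value belongs to, and `center_group` says which final group each center ended up in.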
