How to cluster a very large dataset in R?

0 votes

I have a very large dataset consisting of 70K numeric values representing various distances ranging from 0 to 50. I want to cluster these numbers; however, when I try a standard distance-based approach, I get a 70K × 70K distance matrix holding the distance between every pair of values in the dataset, which won't fit in memory.

So, is there any way to solve this problem?

Jun 19, 2018 in Data Analytics by CodingByHeart77
• 3,680 points
45 views

1 answer to this question.

0 votes

You can first run k-means with a fairly large number of centers, and then perform hierarchical clustering on the coordinates of those centers.

This way, the distance matrix stays small.
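To put rough numbers on the difference, here is a back-of-the-envelope sketch. It assumes distances are stored as 8-byte doubles in the lower-triangle form that `dist()` uses; `dist_bytes` is a hypothetical helper name, and 1,000 centers is simply the figure used in this answer's example.

```r
# Hypothetical helper: bytes needed for the n * (n - 1) / 2 pairwise
# distances kept by a "dist" object, assuming 8 bytes per double.
dist_bytes <- function(n) n * (n - 1) / 2 * 8

full    <- dist_bytes(70000)  # all 70K observations
reduced <- dist_bytes(1000)   # 1,000 k-means centers

cat(sprintf("70,000 points: %.1f GB\n", full / 1e9))     # 19.6 GB
cat(sprintf("1,000 centers: %.1f MB\n", reduced / 1e6))  # 4.0 MB
```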

You can try out the code below:


# Data: two 2-D Gaussian blobs, 35,000 rows each (70,000 rows in total)
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering (CAH) on all 70,000 rows:
# slow, and may fail for lack of memory
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)

# CAH on 1,000 k-means centers: runs much more quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot(cah, choice = "tree")
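
For one-dimensional data like that in the question, the same two-step idea can be sketched in base R, including the step of mapping every original value back to a final cluster. This is only an illustration under assumptions: the `runif()` data is a stand-in for the real distances, and the choices of 500 centers, Ward linkage, and 5 final groups are arbitrary and would need tuning.

```r
set.seed(42)                            # for a reproducible illustration
d <- runif(70000, min = 0, max = 50)    # stand-in for the 70K distances

# Step 1: compress the data into k-means centers (500 is an arbitrary choice).
km <- kmeans(d, centers = 500, iter.max = 50)

# Step 2: hierarchical clustering on the centers only,
# so the distance matrix is 500 x 500 instead of 70K x 70K.
hc <- hclust(dist(km$centers), method = "ward.D2")

# Step 3: cut the tree into groups (5 is again arbitrary) and give each
# original value the group of the k-means center it was assigned to.
center_group <- cutree(hc, k = 5)
point_group  <- unname(center_group[km$cluster])

table(point_group)                      # number of values per final cluster
```

`kmeans()` may emit convergence warnings on data of this size; they do not stop the run, but raising `iter.max` or trying a different `algorithm` is worth considering if they appear.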
answered Jun 19, 2018 by darklord
• 6,140 points

© 2018 Brain4ce Education Solutions Pvt. Ltd. All rights Reserved.