hclust size limit

I'm new to R. I'm trying to run hclust() on about 50K items: 50K rows of data, with 10 columns to compare. When I tried to compute the distance matrix, I got: "Cannot allocate vector of 5GB".

Is there a size limit to this? If so, how do I go about clustering a data set this large?
Jun 25, 2018 in Data Analytics by darklord

1 answer to this question.

Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory. So yes, they scale very badly to large data sets. Anything that requires materializing the full distance matrix needs at least O(n^2) memory and time.
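To make the O(n^2) memory cost concrete, here is a rough back-of-the-envelope sketch in R. It assumes the distances are stored the way dist() stores them, as n * (n - 1) / 2 double-precision values of 8 bytes each. The helper names dist_bytes and dist_gib are made up for this illustration, and the exact figure in an allocation error depends on the real row count.

```r
# Hypothetical helpers (not part of any package): estimate the memory
# needed for the lower-triangle distance vector produced by dist().
dist_bytes <- function(n) {
  n * (n - 1) / 2 * 8  # number of pairs times 8 bytes per double
}

dist_gib <- function(n) {
  dist_bytes(n) / 1024^3  # convert bytes to GiB
}

dist_gib(5000)   # roughly 0.09 GiB: fits easily
dist_gib(50000)  # roughly 9.3 GiB: 10x the rows, 100x the memory
```

Multiplying the number of rows by 10 multiplies the memory by about 100, which is why a data set that is only moderately larger can suddenly fail to allocate.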

Note that there are some specializations of hierarchical clustering, such as SLINK (single linkage) and CLINK (complete linkage), that run in O(n^2) time and, depending on the implementation, may need only O(n) memory.

You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons not to use hierarchical clustering: it is usually rather sensitive to noise (i.e. it doesn't really know what to do with outliers), and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
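As one possible workaround (an assumption on my part, not the only option and not necessarily the best one for your data), you can first compress the rows with k-means, which never builds the full distance matrix, and then run hclust() only on the resulting centroids. The sketch below uses only base R; the synthetic matrix x, the choice of 200 centroids, and the cut into 5 groups are all placeholder values for illustration.

```r
set.seed(42)

# Synthetic stand-in for the real data: 50K rows, 10 numeric columns.
x <- matrix(rnorm(50000 * 10), ncol = 10)

# Step 1: k-means reduces 50K rows to 200 centroids (placeholder value).
# Its cost grows linearly with the number of rows per iteration.
km <- kmeans(x, centers = 200, iter.max = 50, algorithm = "Lloyd")

# Step 2: hierarchical clustering on the centroids only.
# dist() now holds just 200 * 199 / 2 = 19900 distances.
hc <- hclust(dist(km$centers), method = "average")

# Step 3: cut the dendrogram into 5 groups (placeholder value) and
# map each original row to the group of its centroid.
centroid_groups <- cutree(hc, k = 5)
row_groups <- unname(centroid_groups[km$cluster])

table(row_groups)
```

The dendrogram then has 200 leaves instead of 50K, which also makes it readable. The trade-off is that the result depends on the k-means step, so it is an approximation of clustering the full data, not the same thing.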

answered Jun 25, 2018 by DataKing99
