hclust size limit

0 votes
'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I tried assigning the distance matrix, I get: "Cannot allocate vector of 5GB".

Is there a size limit to this? If so, how do I go about doing a cluster of something this large?
Jun 25, 2018 in Data Analytics by Sahiti
• 6,380 points
294 views

1 answer to this question.

0 votes

Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory complexity. So yes, they scale incredibly bad to large data sets. Obviously, anything that requires materialization of the distance matrix is in O(n^2) or worse.

Note that there are some specializations of hierarchical clustering such as SLINK and CLINK that run in O(n^2), and depending on the implementation may also only need O(n) memory.

You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are plenty of good reasons to not use hierarchical clustering: usually it is rather sensitive to noise (i.e. it doesn't really know what to do with outliers) and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).

answered Jun 25, 2018 by DataKing99
• 8,240 points

Related Questions In Data Analytics

0 votes
1 answer

How to limit output of a dataframe in R?

For randomly sampling a row/cell where a ...READ MORE

answered Apr 18, 2018 in Data Analytics by kappa3010
• 2,090 points
623 views
0 votes
4 answers

How to change font size of text and axes on R plots ?

To change the font size of text ...READ MORE

answered Dec 16, 2020 in Data Analytics by Gitika
• 65,870 points
14,797 views
0 votes
1 answer

What is the Difference in Size and Count in pandas (python)?

The major difference is "size" includes NaN values, ...READ MORE

answered Apr 30, 2018 in Data Analytics by DeepCoder786
• 1,720 points

edited Jun 8, 2020 by Gitika 1,470 views
0 votes
1 answer

How can I control the size of points in an R scatterplot?

plot(variable, type='o' , pch=5, cex=.3) The pch argument ...READ MORE

answered May 3, 2018 in Data Analytics by shams
• 3,660 points
443 views
0 votes
1 answer

How can I change font size and direction of axes text in ggplot2 ?

You can try theme(): Library(ggplot2) a <- data.frame(x=gl(10, 1, ...READ MORE

answered May 30, 2018 in Data Analytics by zombie
• 3,790 points
2,960 views
+1 vote
1 answer

Error saying "vector size cannot be NA" when using R with data mining

You can use the removesparseterm function.  Removes sparse ...READ MORE

answered Nov 15, 2018 in Data Analytics by Maverick
• 10,840 points
2,086 views
0 votes
1 answer

hclust size limit?

Classic hierarchical clustering approaches are O(n^3) in ...READ MORE

answered Jul 10, 2019 in Python by SDeb
• 13,270 points
108 views
0 votes
1 answer

Big Data transformations with R

Dear Koushik, Hope you are doing great. You can ...READ MORE

answered Dec 17, 2017 in Data Analytics by Sudhir
• 1,610 points
281 views
0 votes
2 answers

Transforming a key/value string into distinct rows in R

We would start off by loading the ...READ MORE

answered Mar 26, 2018 in Data Analytics by Bharani
• 4,620 points
342 views
0 votes
1 answer

Finding frequency of observations in R

You can use the "dplyr" package to ...READ MORE

answered Mar 26, 2018 in Data Analytics by Bharani
• 4,620 points
3,399 views