hclust size limit

I'm new to R. I'm trying to run hclust() on about 50K items: 50K rows of data with 10 columns to compare. When I try to build the distance matrix, I get: "Cannot allocate vector of 5GB".

Is there a size limit to this? If so, how do I go about doing a cluster of something this large?
Jun 26, 2018 in Data Analytics by Sahiti

1 answer to this question.

Classic hierarchical clustering approaches need O(n^3) time and O(n^2) memory. So yes, they scale very badly to large data sets. Anything that requires materializing the full distance matrix needs at least O(n^2) memory.
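To make the O(n^2) memory cost concrete, here is a rough back-of-the-envelope sketch in R. It assumes the usual layout of a dist object (the lower triangle only, stored as 8-byte doubles); dist_bytes is a hypothetical helper name, not part of R. The figure in the actual error message can differ, since it depends on the real row count and on which allocation failed.

```r
# Estimate the memory needed to hold a dist object for n observations.
# Assumption: n * (n - 1) / 2 pairwise distances, each an 8-byte double.
dist_bytes <- function(n) {
  n <- as.numeric(n)      # use doubles so large n does not overflow integers
  n * (n - 1) / 2 * 8
}

n <- 50000
gib <- dist_bytes(n) / 2^30
cat(sprintf("dist() for n = %.0f rows needs about %.1f GiB\n", n, gib))
```

For 50,000 rows this comes to roughly 9.3 GiB for the distances alone, before hclust() allocates any working storage of its own.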

Note that some specializations of hierarchical clustering, such as SLINK (single linkage) and CLINK (complete linkage), run in O(n^2) time and, depending on the implementation, may need only O(n) memory.
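As one possible illustration in R, the fastcluster package (a third-party CRAN package, not mentioned in the original answer) provides hclust.vector(), which clusters the raw data rows without building a distance matrix. This is a sketch under the assumption that fastcluster is installed; its single-linkage routine is a memory-saving implementation, not necessarily SLINK itself. The data below is synthetic and deliberately small.

```r
set.seed(1)
x <- matrix(rnorm(2000 * 10), ncol = 10)   # small synthetic stand-in for the real data

if (requireNamespace("fastcluster", quietly = TRUE)) {
  # hclust.vector() takes the data matrix directly; no dist object is created.
  hc <- fastcluster::hclust.vector(x, method = "single", metric = "euclidean")
  groups <- cutree(hc, k = 4)              # the result is a regular "hclust" object
  print(table(groups))
} else {
  message("fastcluster is not installed; try install.packages('fastcluster')")
}
```

Runtime is still quadratic in the number of rows, so 50K rows will take noticeably longer than this toy example even though memory stays manageable.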

You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are also plenty of good reasons not to use hierarchical clustering: it is usually rather sensitive to noise (i.e., it doesn't really know what to do with outliers), and the results are hard to interpret for large data sets (dendrograms are nice, but only for small data sets).
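One common workaround, sketched here with base R only, is to compress the rows with k-means first and then run hclust() on the cluster centers. This is a swapped-in technique for illustration, not something the answer above prescribes. The data is synthetic, and k_pre = 200 and k = 5 are arbitrary, hypothetical choices.

```r
set.seed(42)
x <- matrix(rnorm(50000 * 10), ncol = 10)   # synthetic stand-in for 50K rows x 10 columns
x <- scale(x)                               # put the columns on a comparable scale

# Step 1: compress the rows into k_pre prototypes with k-means.
# kmeans() may warn if it stops at iter.max before converging.
k_pre <- 200                                # hypothetical number of prototypes
km <- kmeans(x, centers = k_pre, iter.max = 50)

# Step 2: hierarchical clustering on the prototypes only (200 points, tiny dist object).
hc <- hclust(dist(km$centers), method = "average")

# Step 3: cut the tree, then give every original row the group of its prototype.
groups_of_centers <- cutree(hc, k = 5)
row_groups <- unname(groups_of_centers[km$cluster])
print(table(row_groups))
```

The dendrogram then describes 200 prototypes instead of 50,000 rows, which also makes it far easier to read.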

answered Jun 26, 2018 by DataKing99
