Will the random forest algorithm work if rows have a few duplicate values?

+10 votes
Will the random forest algorithm work if rows have a few duplicate values?
Oct 23, 2019 in Data Analytics by ch
• 3,450 points
2,623 views

1 answer to this question.

+4 votes

I'm not sure about randomForest specifically, but check out Gordon Linoff's answer to a similar question about decision trees:

I am answering this as a general question on decision trees, rather than on the R implementation.

The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, then you are implicitly putting weight on the values in those rows.
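To make the "implicit weight" point concrete, here is a minimal sketch in plain Python. It is not taken from any library or from the R implementation; the `gini` helper and the toy labels are assumptions made up for illustration. It computes Gini impurity for a node and shows that repeating a row three times gives the same impurity as giving that row a weight of 3.

```python
from collections import Counter
from fractions import Fraction


def gini(labels, weights=None):
    """Gini impurity of a node. Hypothetical helper; each row weighs 1 by default."""
    if weights is None:
        weights = [1] * len(labels)
    totals = Counter()
    for label, w in zip(labels, weights):
        totals[label] += w  # accumulate (integer) weight per class
    n = sum(totals.values())
    return 1 - sum(Fraction(c, n) ** 2 for c in totals.values())


# Toy node with three distinct rows.
labels = ["yes", "no", "no"]

# Same node, but the "yes" row has been duplicated so it appears three times.
duplicated = ["yes", "yes", "yes", "no", "no"]

# Same node again, with no duplication but a weight of 3 on the "yes" row.
weighted = gini(labels, weights=[3, 1, 1])

print(gini(labels))      # 4/9
print(gini(duplicated))  # 12/25
print(weighted)          # 12/25
```

The duplicated node and the weighted node have identical impurity, which differs from the impurity of the de-duplicated node. Any split criterion built on such counts will therefore be pulled toward the duplicated rows.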

This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.

In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.
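The oversampling problem can be sketched the same way. The snippet below is a simplified illustration in plain Python, not how any particular tree library checks leaves; the function names, the toy row, and the minimum leaf size of 5 are all assumptions. It shows a leaf that satisfies a row-count threshold while containing only one distinct original record.

```python
def leaf_passes_min_size(rows, min_leaf_size):
    """Hypothetical check: a leaf is allowed if it holds enough rows."""
    return len(rows) >= min_leaf_size


def distinct_rows(rows):
    """Number of distinct records in the leaf, ignoring repeats."""
    return len(set(rows))


# One original minority-class record (feature value, label); made-up data.
original_minority = [(5.1, "fraud")]

# Naive oversampling: repeat the same record ten times to balance the target.
oversampled_leaf = original_minority * 10

print(leaf_passes_min_size(oversampled_leaf, min_leaf_size=5))  # True
print(distinct_rows(oversampled_leaf))                          # 1
```

The row-count check passes, yet the leaf is really built on a single observation, which is the overfitting risk described above.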

Source: https://stackoverflow.com/questions/34532957/how-do-duplicated-rows-effect-a-decision-tree

answered Oct 29, 2019 by Cherukuri
• 33,030 points
