Will the random forest algorithm work if some rows contain duplicate values?

Oct 23 in Data Analytics by ch

1 answer to this question.


I have no idea about RandomForest, but check out Gordon Linoff's answer to a similar question:

I am answering this as a general question on decision trees, rather than on the R implementation.

The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, you are implicitly putting extra weight on the values in those rows.
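To make the "duplicates act like weights" point concrete, here is a minimal sketch in plain Python (not the R implementation, and not taken from the quoted answer). It assumes Gini impurity as the purity measure; the `gini` helper and the toy labels are made up for illustration.

```python
from collections import Counter

def gini(labels, weights=None):
    """Weighted Gini impurity of a node; each row counts as 1 by default."""
    if weights is None:
        weights = [1.0] * len(labels)
    totals = Counter()
    for label, w in zip(labels, weights):
        totals[label] += w
    n = sum(totals.values())
    return 1.0 - sum((t / n) ** 2 for t in totals.values())

# Hypothetical node: three distinct rows, where the "yes" row appears 3 times.
duplicated = ["yes", "yes", "yes", "no", "no"]
deduplicated = ["yes", "no", "no"]

gini_duplicated = gini(duplicated)                      # rows repeated
gini_weighted = gini(deduplicated, weights=[3, 1, 1])   # same rows, weighted
gini_deduplicated = gini(deduplicated)                  # duplicates dropped

print(gini_duplicated, gini_weighted, gini_deduplicated)
```

Repeating a row three times gives the same impurity as keeping it once with weight 3, and both differ from the deduplicated node, so whether to keep the duplicates is really a decision about how much weight those rows deserve.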

This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.

In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.
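The oversampling problem can be sketched the same way. This is an illustrative toy, not a real tree learner: the row ids, the labels, and the `MIN_LEAF_SIZE` threshold are all assumptions chosen to show how a record-count check can be satisfied by copies of a single original row.

```python
def distinct_originals(leaf_rows):
    """Count how many distinct original records a leaf actually contains."""
    return len({row_id for row_id, _ in leaf_rows})

# Hypothetical data as (original_row_id, label). The single minority row
# (id 0) is copied 5 times to balance it against the majority class.
minority = [(0, "fraud")]
majority = [(i, "ok") for i in range(1, 6)]
oversampled = minority * 5 + majority

MIN_LEAF_SIZE = 5  # assumed record-count threshold for a leaf

# Suppose some split isolates every "fraud" row into one leaf.
fraud_leaf = [row for row in oversampled if row[1] == "fraud"]

passes_size_check = len(fraud_leaf) >= MIN_LEAF_SIZE
n_distinct = distinct_originals(fraud_leaf)

print(passes_size_check, n_distinct)
```

The leaf passes the minimum-size check on row count, yet it rests on one original observation, which is the overfitting risk described above.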

Source: https://stackoverflow.com/questions/34532957/how-do-duplicated-rows-effect-a-decision-tree

answered Oct 28 by Cherukuri
