Will random forest algorithm work if rows have a few duplicate values

Question

Will the random forest algorithm work if rows have a few duplicate values?

Cherukuri · Answer 1 · Oct 29, 2019

I have no idea about RandomForest specifically, but check out Gordon Linoff's answer to a similar question, quoted below.

I am answering this as a general question on decision trees, rather than on the R implementation.

The parameters for decision trees are often based on record counts -- minimum leaf size and minimum split search size come to mind. In addition, purity measures are affected by the size of nodes as the tree is being built. When you have duplicated records, then you are implicitly putting weight on the values in those rows.
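To make the "duplicates act as weights" point concrete, here is a minimal sketch in plain Python. It assumes Gini impurity as the purity measure (the answer does not name one), and the `gini` helper and the toy labels are made up for illustration. It shows that repeating a row changes a node's impurity in exactly the way an integer weight on that row would.

```python
from collections import Counter
from fractions import Fraction


def gini(labels, weights=None):
    """Weighted Gini impurity of a node; each row weighs 1 unless told otherwise."""
    if weights is None:
        weights = [1] * len(labels)
    totals = Counter()
    for label, weight in zip(labels, weights):
        totals[label] += weight
    total = sum(totals.values())
    # Fractions keep the arithmetic exact, so the comparison below is not fuzzy.
    return 1 - sum(Fraction(count, total) ** 2 for count in totals.values())


original = ["a", "a", "b"]               # hypothetical node with three distinct rows
duplicated = ["a", "a", "b", "b", "b"]   # same node, but the "b" row appears three times

gini_original = gini(original)
gini_duplicated = gini(duplicated)
gini_weighted = gini(original, weights=[1, 1, 3])  # weight 3 on the "b" row instead

print(gini_original, gini_duplicated, gini_weighted)
```

The duplicated and weighted versions give the same impurity, and both differ from the original node, so any split chosen by this measure can shift when rows are repeated.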

This is neither good nor bad. You simply need to understand the data and the model that you want to build. If the duplicated values arise from different runs of an experiment, then they should be fine.

In some cases, duplicates (or equivalently weights) can be quite bad. For instance, if you are oversampling the data to get a balanced sample on the target, then the additional rows would be problematic. A single leaf might end up consisting of a single instance from the original data -- and overfitting would be a problem.
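The oversampling problem can be sketched the same way. Everything here is hypothetical: the row ids, the `MIN_LEAF_SIZE` value, and the leaf itself, which is simply assumed to isolate the copies of one minority row. The point is only that a count-based minimum leaf size is satisfied by copies, even though the leaf contains a single original instance.

```python
MIN_LEAF_SIZE = 5  # hypothetical count-based stopping rule

# Rows are (row_id, label). 20 majority rows, 2 minority rows.
majority = [(i, 0) for i in range(20)]
minority = [(100, 1), (101, 1)]

# Naive oversampling: repeat each minority row 10 times to balance the classes.
balanced = majority + minority * 10

# Assume the tree grows a leaf that isolates every copy of row 100.
leaf = [row for row in balanced if row[0] == 100]

leaf_size = len(leaf)
distinct_rows = len({row_id for row_id, _ in leaf})
passes_min_leaf = leaf_size >= MIN_LEAF_SIZE

print(leaf_size, distinct_rows, passes_min_leaf)
```

The leaf passes the size check with 10 records, but only one distinct original row backs it, which is the overfitting risk described above.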

Source: https://stackoverflow.com/questions/34532957/how-do-duplicated-rows-effect-a-decision-tree