Different results for Random Forest Regression in R and Python

0 votes

I am using the same data to do Random Forest Regression in R and Python, but I am getting very different R2 values. I understand that hyperparameters might be one reason for this, but I don't think that alone would nearly halve the R2 score. I am using the following code and getting the results shown below.

In Python -

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # rmse helper (not shown in the original post); assumed to be root mean squared error
    def rmse(y_true, y_pred):
        return mean_squared_error(y_true, y_pred) ** 0.5

    X = data.drop(['response'], axis=1)
    y = data['response']

    # hold out 5% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

    rdf = RandomForestRegressor(n_estimators=500, oob_score=True)
    rdf.fit(X_train, y_train)

    print("Random Forest Model Score (on Train)", ":", rdf.score(X_train, y_train)*100, ",",
          "Random Forest Model Score (on Test)", ":", rdf.score(X_test, y_test)*100)

    y_predicted = rdf.predict(X_train)
    y_test_predicted = rdf.predict(X_test)

    print("Training RMSE", ":", rmse(y_train, y_predicted),
          "Testing RMSE", ":", rmse(y_test, y_test_predicted))


>Random Forest Model Score (on Train) : 92.2312123 , Random Forest Model Score (on Test) : 78.1812321

>Training RMSE : 5.606443558164292e-06   Testing RMSE : 9.59221499904858e-06

In R -

> rows <- sample(0.95*nrow(data))
> train_random <- data[rows,]
> test_random <-  data[-rows,]

> rf_model <- randomForest(response ~ . ,
                         data = train_random,
                         keep.forest=TRUE,
                         importance=TRUE
                         )

> rf_model

Call:
 randomForest(formula = response ~ ., data = train_random, keep.forest = TRUE, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 1.437236e-06
                    % Var explained: 42.05
> pred_train <- predict(rf_model,train_random)
> pred_test <- predict(rf_model,test_random)
> R2_Score(pred_train, train_random$response)
[1] 0.9014311
> R2_Score(pred_test, test_random$response)
[1] 0.3616823

I understand that the train/test split will not produce identical splits, but why am I getting such distinctly different R2 values, and how can I carry out the same Random Forest in R? I have tried using the same hyperparameters as in Python, but it is not helping me get the same R2 values in R. Can someone please help me?

Apr 11, 2022 in Machine Learning by Nandini

1 answer to this question.

0 votes

Random Forests have a random component, which you probably already know about.

The algorithm uses bootstrapping, so it draws different random subsets of the data on every run, and the final results can differ noticeably between passes. I had the same issue with the randomForest function returning different numbers on successive runs. To get around this, I run set.seed(500) before each new pass to reset the seed to 500, and it gives me exactly the same results every time. I hope this helps.
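A minimal sketch of that in R, assuming your data frame is still called data with the target in the response column as in the question, and setting ntree explicitly to match the 500 trees on the Python side:

    library(randomForest)

    set.seed(500)   # fix the seed so the row sampling is repeatable
    rows <- sample(nrow(data), floor(0.95 * nrow(data)))   # a random 95% of the row indices
    train_random <- data[rows, ]
    test_random  <- data[-rows, ]

    set.seed(500)   # fix the seed again before growing the forest
    rf_model <- randomForest(response ~ .,
                             data = train_random,
                             ntree = 500,   # same as n_estimators=500 in the Python code
                             keep.forest = TRUE,
                             importance = TRUE)

Running this twice with the same seed now gives the same split and the same forest, which makes it easier to compare against the Python run when you match hyperparameters such as mtry (randomForest) and max_features (scikit-learn).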


answered Apr 12, 2022 by Dev
