Different results for Random Forest Regression in R and Python

0 votes

I am using the same data to do Random Forest Regression in R and Python, but I am getting very different R2 values. I understand that hyperparameters might be one reason for this, but I don't think they would account for the R2 score being almost halved. I am using the following code and getting the respective results.

In Python -

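    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    def rmse(y_true, y_pred):
        # Root mean squared error; assumed definition, the post does not show this helper
        return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

    # `data` is a pandas DataFrame with a 'response' column
    X = data.drop(['response'], axis=1)
    y = data['response']

    # Hold out 5% of the rows as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

    rdf = RandomForestRegressor(n_estimators=500, oob_score=True)
    rdf.fit(X_train, y_train)

    # score() returns R2; it is scaled by 100 here to print a percentage
    print("Random Forest Model Score (on Train)", ":", rdf.score(X_train, y_train) * 100, ",",
          "Random Forest Model Score (on Test)", ":", rdf.score(X_test, y_test) * 100)

    y_predicted = rdf.predict(X_train)
    y_test_predicted = rdf.predict(X_test)

    print("Training RMSE", ":", rmse(y_train, y_predicted),
          "Testing RMSE", ":", rmse(y_test, y_test_predicted))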


>Random Forest Model Score (on Train) : 92.2312123 , Random Forest Model Score (on Test) : 78.1812321

>Training RMSE : 5.606443558164292e-06   Testing RMSE : 9.59221499904858e-06

In R -

    > library(randomForest)
    > library(MLmetrics)  # assumed source of R2_Score(); the post does not say
    > rows <- sample(0.95*nrow(data))
    > train_random <- data[rows,]
    > test_random <-  data[-rows,]

    > rf_model <- randomForest(response ~ . ,
                               data = train_random,
                               keep.forest=TRUE,
                               importance=TRUE
                               )

    > rf_model

    Call:
     randomForest(formula = response ~ ., data = train_random, keep.forest = TRUE, importance = TRUE)
                   Type of random forest: regression
                         Number of trees: 500
    No. of variables tried at each split: 6

              Mean of squared residuals: 1.437236e-06
                        % Var explained: 42.05
    > pred_train <- predict(rf_model,train_random)
    > pred_test <- predict(rf_model,test_random)
    > R2_Score(pred_train, train_random$response)
    [1] 0.9014311
    > R2_Score(pred_test, test_random$response)
    [1] 0.3616823

I understand that the train/test split does not produce the same partitions in both languages, but why am I getting such distinctly different R2 values, and how can I fit an equivalent Random Forest in R? I have tried using the same hyperparameters as in Python, but that does not give me the same R2 values in R. Can someone please help me?
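A side note on the R split above, offered as a possibility rather than a diagnosis: in R, `sample(x)` with a single number `x >= 1` returns a random permutation of `1:x`, with a non-integer `x` truncated. So `sample(0.95*nrow(data))` only shuffles the first 95% of the row numbers, and `data[-rows,]` is always the last 5% of the rows in their stored order. If the rows are ordered in any way (by time, by response, by group), that test set is not a random sample, which could pull the test R2 down. The sketch below mimics this behaviour in plain Python; `r_style_sample` and `random_split` are hypothetical helpers written for this illustration, and the row count of 200 is made up.

```python
import random

def r_style_sample(x, rng):
    """Mimic R's sample(x) for a single number x >= 1: permute 1..floor(x)."""
    idx = list(range(1, int(x) + 1))
    rng.shuffle(idx)
    return idx

def random_split(n, train_frac, rng):
    """Draw a random train/test split of the row numbers 1..n."""
    n_train = int(train_frac * n)
    train = rng.sample(range(1, n + 1), n_train)
    test = sorted(set(range(1, n + 1)) - set(train))
    return train, test

n_rows = 200  # made-up row count, for illustration only
rng = random.Random(0)

# What sample(0.95 * nrow(data)) does: shuffle the first 190 row numbers.
rows = r_style_sample(0.95 * n_rows, rng)
tail_test = sorted(set(range(1, n_rows + 1)) - set(rows))  # like data[-rows, ]

# A random split of the same size, for comparison.
_, rand_test = random_split(n_rows, 0.95, rng)

print("R-style test rows:", tail_test)  # always rows 191..200
print("Random test rows: ", rand_test)  # spread over the whole range, seed-dependent
```

In R itself, a split that draws from all rows would look like `rows <- sample(nrow(data), floor(0.95*nrow(data)))`.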

Apr 11 in Machine Learning by Nandini
• 5,480 points
73 views

1 answer to this question.

0 votes
Random forests have a random component, which you probably already know about.

Random forest uses bootstrapping, so the outcome changes each time the model is fit. I had the same issue with the randomForest function returning different numbers on successive runs. The algorithm draws random subsets of the data, so the final results may differ noticeably between runs. To get around this, I ran set.seed(500) before each new run to reset the seed to 500, and it gave me exactly the same results every time. I hope this helps.
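The seeding idea can be illustrated outside of R as well. The sketch below uses only Python's standard library to draw a bootstrap sample, the resampling step each tree in a random forest starts from, and shows that the same seed reproduces the same draw. `bootstrap_indices` is a hypothetical helper written for this illustration, not part of any library. In scikit-learn the analogous knob is the `random_state` argument of `RandomForestRegressor`. Keep in mind that a fixed seed makes runs repeatable within one library; R and scikit-learn do not share a random number stream, so using the same seed value in both will not make their results identical.

```python
import random

def bootstrap_indices(n_rows, seed):
    """Draw n_rows row indices with replacement, as one tree's bootstrap sample."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]

# Same seed: the bootstrap sample is identical on every run.
first = bootstrap_indices(10, seed=500)
second = bootstrap_indices(10, seed=500)
print(first == second)

# A different seed gives an independent draw, which will usually differ.
other = bootstrap_indices(10, seed=501)
print(first == other)
```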
answered Apr 12 by Dev
• 6,000 points
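
On the hyperparameter side, the defaults of the two libraries differ, which may account for part of the gap. For regression, `randomForest` tries `max(floor(p/3), 1)` variables at each split (the R output in the question reports 6) and uses `nodesize = 5`, whereas recent scikit-learn versions consider all features at each split (`max_features=1.0`) and use `min_samples_leaf=1`. The sketch below builds a keyword dictionary that approximates R's defaults for `RandomForestRegressor`. The helper name `r_like_rf_params` is hypothetical, the value of 18 predictors is only an example consistent with 6 variables per split, and treating `nodesize` as `min_samples_leaf` is an approximation rather than an exact equivalence.

```python
def r_like_rf_params(n_features, n_trees=500):
    """Approximate R randomForest regression defaults as scikit-learn kwargs."""
    return {
        "n_estimators": n_trees,                  # ntree = 500 in R
        "max_features": max(n_features // 3, 1),  # mtry = max(floor(p/3), 1)
        "min_samples_leaf": 5,                    # rough stand-in for nodesize = 5
    }

params = r_like_rf_params(n_features=18)  # 18 is an assumed predictor count
print(params)

# Usage, assuming scikit-learn is installed:
# from sklearn.ensemble import RandomForestRegressor
# rdf = RandomForestRegressor(**params, oob_score=True, random_state=42)
```

Even with matched settings, the two implementations are not expected to agree digit for digit; comparing out-of-bag scores on both sides is a fairer check than comparing one small test split.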
