How to add outliers to a Linear Regression dataset

0 votes

I am experimenting to see how outliers in a dataset affect a Linear Regression model. The problem is that I don't know how to add outliers to a dataset; I have only found articles online about how to detect and remove them.

This is the code I have so far:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression


# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # Training the algorithm

y_pred = regressor.predict(X_test)

print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()

And this is the output:

[Plot: Regression Model of Clean Dataset]

My question is: how can I add outliers to this clean dataset to see the effect they have on the resulting model?

Any help would be appreciated, thanks!

Mar 26 in Machine Learning by Nandini
• 5,480 points
36 views

1 answer to this question.

0 votes

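You can append values directly to X and y. The code below draws 20 extra x values and 20 extra y values independently from the existing data, so each new (x, y) pair is mismatched and generally falls off the regression line. Because the slope is large enough, most of these points land far from the line and act as outliers. This is only one approach; you can use any method you like.
The code is: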

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)

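# Append 20 points whose x and y are drawn independently from the existing
# values, so the new pairs no longer lie on the regression line.
# Note: np.random is not seeded here, so the outliers differ between runs.
for j in range(20):
    X = np.append(X, np.random.choice(X.flatten()))
    y = np.append(y, np.random.choice(y.flatten()))

# np.append flattens X, so restore the (n_samples, 1) shape scikit-learn expects.
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)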

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # Training the algorithm

y_pred = regressor.predict(X_test)

print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()

[Plot: regression model of the dataset with outliers]

This is what the plot looks like with outliers.
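As a possible variation on the approach above, here is a minimal sketch that is reproducible and only contaminates the training targets, so the test metrics are still measured against clean data. The helper `add_outliers` and its parameters `fraction`, `scale` and `seed` are hypothetical names chosen for illustration; they are not part of scikit-learn or of the code above. It shifts a random subset of `y_train` by a few standard deviations rather than appending mismatched pairs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def add_outliers(y, fraction=0.05, scale=5.0, seed=0):
    """Return a copy of y with a random subset shifted away from its true value.

    Hypothetical helper: fraction, scale and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)  # seeded so the outliers are repeatable
    y_out = y.copy()
    n_outliers = max(1, int(fraction * len(y)))
    # Pick distinct positions to corrupt.
    idx = rng.choice(len(y), size=n_outliers, replace=False)
    # Shift each chosen target up by `scale` standard deviations of y.
    y_out[idx] += scale * y.std()
    return y_out, idx


# Same clean dataset and split as in the question.
X, y = make_regression(
    n_samples=1000, n_features=1, noise=0.0, bias=0.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Corrupt only the training targets; the test set stays clean.
y_train_noisy, outlier_idx = add_outliers(y_train)

clean = LinearRegression().fit(X_train, y_train)
noisy = LinearRegression().fit(X_train, y_train_noisy)

print("Clean slope/intercept:", clean.coef_[0], clean.intercept_)
print("Noisy slope/intercept:", noisy.coef_[0], noisy.intercept_)
print("Clean R2 on test set:", clean.score(X_test, y_test))
print("Noisy R2 on test set:", noisy.score(X_test, y_test))
```

Comparing the two fitted models on the same clean test set separates the effect of the outliers on the fit from their effect on the evaluation metrics. Increasing `fraction` or `scale` should make the difference more pronounced.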

answered Mar 30 by Dev
• 6,000 points
