I'm experimenting to see how outliers in a dataset affect a linear regression model. The problem is that I don't know how to add outliers to a dataset; I've only found articles about how to detect and remove them.

This is the code I have so far:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # Training the algorithm

y_pred = regressor.predict(X_test)

print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
```

And this is the output:

My question is: how can I add outliers to this clean dataset to see the effect they have on the resulting model?

Any help would be appreciated, thanks!

Mar 26, 2022 1,020 views

## 1 answer to this question.

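You can append extra points directly to X and y. The code below adds 20 points, each pairing an X value and a y value drawn independently at random from the existing data. Because the slope of the generated data is steep, most of these mismatched pairs land far from the regression line and act as outliers. In practice, you can use any method you like.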
The code is:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate regression dataset
X, y = make_regression(
    n_samples=1000,
    n_features=1,
    noise=0.0,
    bias=0.0,
    random_state=42,
)

# Append 20 points whose X and y values are drawn independently from the
# existing data, so most of them fall far from the true line
for j in range(20):
    X = np.append(X, np.random.choice(X.flatten()))
    y = np.append(y, np.random.choice(y.flatten()))

# np.append flattens its inputs, so reshape back into column vectors
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # Training the algorithm

y_pred = regressor.predict(X_test)

print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
```

This is what the plot looks like with outliers.
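If you want more control over how many outliers there are and how extreme they are, another option is to shift a random subset of the existing targets instead of appending new points. The following is only a rough sketch of that idea, not the method used above: `add_y_outliers` is a hypothetical helper, and its `fraction`, `scale`, and `seed` parameters are illustrative assumptions you would tune yourself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression


def add_y_outliers(y, fraction=0.05, scale=10.0, seed=0):
    """Return a copy of y with a random fraction of targets shifted far away.

    Hypothetical helper for illustration; the parameter names and defaults
    are assumptions, not part of scikit-learn or NumPy.
    """
    rng = np.random.default_rng(seed)
    y_out = y.astype(float).copy()
    n_outliers = max(1, int(round(fraction * len(y))))
    # Pick distinct positions to corrupt
    idx = rng.choice(len(y), size=n_outliers, replace=False)
    # Shift each chosen target by `scale` standard deviations, random sign
    signs = rng.choice([-1.0, 1.0], size=n_outliers)
    y_out[idx] += signs * scale * y.std()
    return y_out, idx


# Same clean dataset as in the question
X, y = make_regression(
    n_samples=1000, n_features=1, noise=0.0, bias=0.0, random_state=42
)
y_noisy, outlier_idx = add_y_outliers(y)

# Fit once on clean targets and once on contaminated targets
clean_model = LinearRegression().fit(X, y)
noisy_model = LinearRegression().fit(X, y_noisy)

print("Slope (clean):", clean_model.coef_[0])
print("Slope (with outliers):", noisy_model.coef_[0])
print("R2 (clean):", clean_model.score(X, y))
print("R2 (with outliers):", noisy_model.score(X, y_noisy))
```

Because the helper returns the indices it changed, you can highlight those points in a scatter plot or drop them again later to check that the original fit comes back.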

answered Mar 30, 2022
