I am experimenting to see how outliers in a dataset affect a linear regression model. The problem is that I don't know how to add outliers to a dataset; I've only found articles online about how to detect and remove them.

This is the code I have so far:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate regression dataset
X, y = make_regression(
n_samples=1000,
n_features=1,
noise=0.0,
bias=0.0,
random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # Training the algorithm

y_pred = regressor.predict(X_test)

print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
```

And this is the output: a scatter plot of the test data with the fitted regression line drawn in red.

My question is: how can I add outliers to this clean dataset to see the effect they have on the resulting model?

Any help would be appreciated, thanks!

Mar 26, 2022 412 views

## 1 answer to this question.

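You can append extra values directly to X and y. The code below draws each new X value and each new y value independently at random from the existing data, so the resulting (X, y) pairs generally do not lie on the regression line. Because the slope is large enough, many of them land far from the line and act as outliers. This is only one approach; any method that produces points far from the line will work.
The code is: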

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate regression dataset
X, y = make_regression(
n_samples=1000,
n_features=1,
noise=0.0,
bias=0.0,
random_state=42,
)

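# Append 20 points whose X and y are drawn independently of each other,
# so the new (X, y) pairs generally fall off the regression line
for _ in range(20):
    X = np.append(X, np.random.choice(X.flatten()))
    y = np.append(y, np.random.choice(y.flatten()))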

X = X.reshape(-1,1)
y = y.reshape(-1,1)

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # Training the algorithm

y_pred = regressor.predict(X_test)

print("R2 Score:", metrics.r2_score(y_test, y_pred))
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(X_test, y_test)
plt.plot(X_test, y_pred, color="red", linewidth=1)
plt.show()
```

This is what the plot looks like with outliers.
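If you want more control over how many outliers there are and how extreme they are, one option is to corrupt only the training targets and then compare the fitted model against one trained on the clean data. The following is a minimal sketch, not the only way to do it: the helper `add_y_outliers`, the count of 20, and the shift of 10 standard deviations are all illustrative assumptions. Both models are scored on the untouched test set so the metrics reflect the damage to the fit rather than noise in the evaluation data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def add_y_outliers(y, n_outliers, scale, rng):
    """Hypothetical helper: shift n_outliers randomly chosen targets
    up or down by scale * std(y). Returns the corrupted copy and the
    indices that were changed; the input array is left untouched."""
    y_out = y.copy()
    idx = rng.choice(len(y), size=n_outliers, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_outliers)
    y_out[idx] += signs * scale * y.std()
    return y_out, idx


rng = np.random.default_rng(0)  # seeded so the experiment is repeatable

X, y = make_regression(
    n_samples=1000, n_features=1, noise=0.0, bias=0.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Corrupt only the training targets (assumed: 20 points, 10 std shift)
y_train_dirty, outlier_idx = add_y_outliers(
    y_train, n_outliers=20, scale=10.0, rng=rng
)

clean_model = LinearRegression().fit(X_train, y_train)
dirty_model = LinearRegression().fit(X_train, y_train_dirty)

# Evaluate both models on the same clean test set
r2_clean = r2_score(y_test, clean_model.predict(X_test))
r2_dirty = r2_score(y_test, dirty_model.predict(X_test))

print("Clean slope:", clean_model.coef_[0], "intercept:", clean_model.intercept_)
print("Dirty slope:", dirty_model.coef_[0], "intercept:", dirty_model.intercept_)
print("R2 (trained on clean data):", r2_clean)
print("R2 (trained with outliers):", r2_dirty)
```

Varying `n_outliers` and `scale` lets you see how quickly the slope and intercept drift as the contamination grows. To plot the corrupted points, `X_train[outlier_idx]` and `y_train_dirty[outlier_idx]` give their coordinates.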
