Function definition.
This handles data when non-numeric attributes are also present:
from scipy import stats
def drop_numerical_outliers(df, z_thresh=3):
# Constrains will contain `True` or `False` depending on if it is a value below the threshold.
constrains = df.select_dtypes(include=[np.number]) \
.apply(lambda x: np.abs(stats.zscore(x)) < z_thresh, reduce=False) \
.all(axis=1)
# Drop (inplace) values set to be rejected
df.drop(df.index[~constrains], inplace=True)
Usage.
drop_numerical_outliers(df)
Example.
Think about a dataset df that contains values on houses: alley, land contour, sale price.
Scatter graph visualization
# Plot data before dropping those greater than z-score 3.
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)
# Drop the outliers on every attributes
drop_numerical_outliers(train_df)
# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers from the first plot, but the new computed outliers based on the new data-frame.
scatterAreaVsPrice(train_df)
I hope this help you.