Detect and exclude outliers in a pandas DataFrame

0 votes

I have a pandas DataFrame that contains some outliers. For instance:

the 'Vol' column has values that are all around 12xx, except for one value of 4000 (the outlier).

I want to exclude the rows whose Vol value is an outlier like this. Should I filter the DataFrame so that only rows whose value in a given column lies within, say, 3 standard deviations of the mean are kept? What is the best way to do this?

Apr 25 in Python by Kichu
• 16,850 points
67 views

1 answer to this question.

0 votes

Function definition.

This version also works when the DataFrame contains non-numeric columns, because only the numeric columns are checked:

import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # `constraints` is True for each row whose numeric values all have |z-score| below the threshold.
    constraints = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows that fail the constraint.
    df.drop(df.index[~constraints], inplace=True)

Usage.

drop_numerical_outliers(df)
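
If only one column matters, as with the 'Vol' column in the question, the same idea can be sketched with pandas alone. This is a minimal sketch, not the only way to do it: the helper name `filter_column_by_zscore`, the sample values, and the 'Label' column are made up for illustration, and the population standard deviation (ddof=0) is assumed so that the result matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

def filter_column_by_zscore(df, column, z_thresh=3.0):
    """Return a copy of df with rows whose `column` value is within z_thresh
    standard deviations of that column's mean (hypothetical helper)."""
    col = df[column]
    # Population standard deviation (ddof=0), the default used by scipy.stats.zscore.
    z = (col - col.mean()) / col.std(ddof=0)
    return df[z.abs() < z_thresh].copy()

# Made-up data: twenty values around 12xx and one value of 4000.
vol = [1200 + i for i in range(20)] + [4000]
df = pd.DataFrame({"Vol": vol, "Label": ["a"] * 21})

filtered = filter_column_by_zscore(df, "Vol")
print(len(df), len(filtered))  # 21 20
```

Unlike drop_numerical_outliers, this returns a new DataFrame and leaves the original untouched, which can be easier to reason about when the unfiltered data is still needed.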

Example.

Consider a dataset df that contains data on houses, such as alley, land contour, and sale price.

Scatter graph visualization

# Plot the data before dropping the rows with a z-score of 3 or more.
# The definition of scatterAreaVsPrice has been omitted for readability.
scatterAreaVsPrice(df)

Before - Gr Liv Area Versus SalePrice

# Drop the rows that are outliers in any numeric attribute
drop_numerical_outliers(df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers as in the first plot, but new outliers computed from the filtered DataFrame.
scatterAreaVsPrice(df)

After - Gr Liv Area Versus SalePrice
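
The remark about new outliers appearing after filtering can be reproduced on a small made-up series: once the most extreme value is removed, the mean and standard deviation shrink, so a point that looked ordinary before can cross the threshold. The sketch below assumes invented values and a hypothetical helper, `zscore_outlier_mask`; it is not the housing data from the plots.

```python
import pandas as pd

def zscore_outlier_mask(series, z_thresh=3.0):
    """Boolean mask that is True where |z-score| >= z_thresh (hypothetical helper)."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() >= z_thresh

# Made-up prices: twenty ordinary values, one moderate outlier, one extreme outlier.
prices = pd.Series([100 + i for i in range(20)] + [200, 10000], dtype=float)

# First pass: the extreme value inflates the standard deviation and hides 200.
first_pass = zscore_outlier_mask(prices)
print(prices[first_pass].tolist())      # [10000.0]

# Second pass on the filtered data: 200 now stands out.
remaining = prices[~first_pass]
second_pass = zscore_outlier_mask(remaining)
print(remaining[second_pass].tolist())  # [200.0]
```

Whether to repeat the filter until nothing more is flagged is a judgment call, since each pass discards more data.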

I hope this helps.

answered Apr 28 by narikkadan
• 7,860 points
