Detect and exclude outliers in a pandas DataFrame

0 votes

I have a pandas DataFrame that contains some outliers.
For instance:

The column 'Vol' has values around 12xx, except for one value of 4000 (the outlier).

I want to exclude the rows whose 'Vol' value is an outlier like this. Should I put a filter on the DataFrame that selects all rows where the values of a certain column are within, say, 3 standard deviations of the mean? What is the best way to do this?

Apr 25, 2022 in Python by Kichu
• 19,040 points
4,646 views

1 answer to this question.

0 votes

Function definition.

This also works when the DataFrame contains non-numeric attributes, because only the numeric columns are checked:

import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # `constraints` is a boolean Series: True for rows where every numeric
    # column has an absolute z-score below the threshold.
    # (The `reduce=False` argument to `apply` was removed in newer pandas versions.)
    constraints = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop the rejected rows in place.
    df.drop(df.index[~constraints], inplace=True)

Usage.

drop_numerical_outliers(df)
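
If modifying the caller's DataFrame in place is not wanted, the same idea can return a filtered copy instead. The following is only a minimal sketch and not part of the original answer: the name `without_numerical_outliers` and the sample data are made up for illustration, and it uses pandas' own `mean` and `std` (sample standard deviation) instead of `scipy.stats.zscore`, so it assumes that slightly different definition of the z-score.

```python
import numpy as np
import pandas as pd


def without_numerical_outliers(df, z_thresh=3):
    """Return a copy of df without rows that hold a numeric outlier.

    Hypothetical helper, written for illustration only.
    """
    numeric = df.select_dtypes(include=[np.number])
    # Column-wise z-scores; pandas skips NaN when computing mean and std.
    z_scores = (numeric - numeric.mean()) / numeric.std()
    # Rows with a missing z-score are kept here (an assumption of this sketch).
    keep = (z_scores.abs() < z_thresh) | z_scores.isna()
    return df[keep.all(axis=1)].copy()


# Made-up house data: one non-numeric column and one extreme sale price.
houses = pd.DataFrame({
    "Alley": ["Pave", "Grvl"] * 10 + ["Pave"],
    "SalePrice": [200_000 + 1_000 * i for i in range(20)] + [2_000_000],
})

cleaned = without_numerical_outliers(houses)
print(len(houses), len(cleaned))
```

Here `houses` is left unchanged, and `cleaned` no longer contains the row with the extreme sale price.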

Example.

Consider a dataset df that describes houses, with attributes such as alley, land contour, and sale price.

Scatter graph visualization

# Plot data before dropping those greater than z-score 3. 
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)

Before - Gr Liv Area Versus SalePrice

# Drop the outliers on every numeric attribute
drop_numerical_outliers(df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers as in the first plot, but newly computed outliers based on the new DataFrame.
scatterAreaVsPrice(df)

After - Gr Liv Area Versus SalePrice
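
For the single-column case described in the question ('Vol' within 3 standard deviations of the mean), a plain boolean mask may be enough. This is a sketch under stated assumptions rather than a definitive recipe: the data below is made up to resemble the question (values around 12xx plus one value of 4000), and the variable names are arbitrary.

```python
import pandas as pd

# Made-up data: twenty values around 12xx plus one outlier at 4000.
df = pd.DataFrame({"Vol": [1200 + (i % 10) for i in range(20)] + [4000]})

vol = df["Vol"]
# True for rows whose 'Vol' lies within 3 standard deviations of the column mean.
within_3_std = (vol - vol.mean()).abs() <= 3 * vol.std()
# Boolean indexing returns a new DataFrame; df itself is not modified.
filtered = df[within_3_std]

print(len(df), len(filtered))
```

Note that this rule needs enough rows to work. With the sample standard deviation, no value can lie more than (n - 1) / sqrt(n) standard deviations from the mean of n values, so with ten or fewer rows nothing can ever exceed a threshold of 3.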

I hope this helps.

answered Apr 28, 2022 by narikkadan
• 86,360 points
