How to filter out NA in R

Question

Below is the code:

library(tidyverse)
# tribble() (not tibble()) is the constructor that takes ~column headers and row-wise values
df <- tribble(
    ~col1, ~col2, ~col3,
    1, 2, 3,
    1, NA, 3,
    NA, 2, 3
)

I can remove all rows that contain an NA with drop_na():

df %>% drop_na()

Or remove only the rows that have an NA in a single column (col1 for example):

df %>% drop_na(col1)

Why can't I just use a regular != comparison inside filter()?

df %>% filter(col1 != NA)

Why do we have to use a special function from tidyr to remove NAs?

kappa3010 · Answer 1 · Apr 3, 2018

This has nothing to do specifically with dplyr::filter(). Any comparison with NA, including NA == NA, returns NA.

NA marks a missing value, and R has no way of knowing what that value would be in your analysis.

So comparison operators do not treat NA as a value that something can be equal or unequal to.
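
To see this outside of dplyr, here is a minimal base R sketch. The variable names (na_equal, cmp, missing) are only illustrative and do not come from the question:

```r
# Comparisons involving NA yield NA, not TRUE or FALSE
na_equal   <- NA == NA      # NA
na_differs <- 1 != NA       # NA

# The same happens element by element on a vector
col1 <- c(1, 1, NA)
cmp  <- col1 != NA          # NA NA NA

# is.na() is the test that returns a real TRUE/FALSE for each element
missing <- is.na(col1)      # FALSE FALSE TRUE

print(na_equal)
print(cmp)
print(missing)
```

Because is.na() always answers TRUE or FALSE, it is the check to use wherever a logical condition is expected.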


answered Apr 3, 2018 by kappa3010
• 2,090 points

Edureka · Answer 2 · Mar 26, 2019

Try this:

df %>% filter(!is.na(col1))

answered Mar 26, 2019 by anonymous

Thanks, that worked :)

commented Mar 26, 2019 by Shravan

This was simple, direct and perfect...thank you!

commented Feb 27, 2020 by anonymous


What if we have 2 columns that may contain NA values?

Oct 23, 2020 in Data Analytics by anonymous
• 120 points
edited Oct 23, 2020 by MD • 1,308 views
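
For the two-column case asked about above, a minimal sketch could look like the following. It rebuilds the data from the question and assumes the goal is to keep only rows where both col1 and col2 are present; the result names (both_a, both_b, both_c) are made up for illustration, and the if_all() variant assumes dplyr 1.0.4 or later:

```r
library(dplyr)
library(tidyr)

# Same values as the data frame in the question
df <- tibble(
  col1 = c(1, 1, NA),
  col2 = c(2, NA, 2),
  col3 = c(3, 3, 3)
)

# Option 1: one is.na() test per column; comma-separated conditions are combined with AND
both_a <- df %>% filter(!is.na(col1), !is.na(col2))

# Option 2: drop_na() accepts several columns at once
both_b <- df %>% drop_na(col1, col2)

# Option 3: if_all() applies the same test to every selected column
both_c <- df %>% filter(if_all(c(col1, col2), ~ !is.na(.x)))

print(both_a)
```

All three keep only the first row here, since it is the only one with values in both columns. To keep rows where at least one of the two columns is present instead, the conditions in option 1 could be joined with | rather than a comma.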

Thanks a lot!

commented Jul 24, 2021 by anonymous


Kalgi · Answer 3 · Apr 12, 2019

Missing values (NA) have no notion of equality in R. Therefore, NA == NA just returns NA. In fact, comparing NA with any value using == or != returns NA. filter() in dplyr keeps only the rows for which its condition is TRUE. With filter(col1 != NA), the expression col1 != NA evaluates to NA for every row of col1, never TRUE, so filter() drops every row and returns an empty result instead of removing only the missing values.
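
The behaviour described here can be checked with a short sketch on the data from the question. The names no_rows and kept are only illustrative:

```r
library(dplyr)

# Same values as the data frame in the question
df <- tibble(
  col1 = c(1, 1, NA),
  col2 = c(2, NA, 2),
  col3 = c(3, 3, 3)
)

# The condition is NA for every row, so filter() keeps nothing
no_rows <- df %>% filter(col1 != NA)
nrow(no_rows)   # 0

# is.na() returns TRUE/FALSE per row, so only the row with a missing col1 is dropped
kept <- df %>% filter(!is.na(col1))
nrow(kept)      # 2
```

Note that the first call does not raise an error or a warning; it silently returns zero rows, which is why the mistake is easy to miss.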

answered Apr 12, 2019 by Zane

Thanks Zane! That was very well explained.

commented Apr 12, 2019 by Kalgi
• 52,340 points

MD · Answer 4 · Dec 10, 2020

Hi,

dplyr provides the filter() function for this kind of filtering, and it goes further than that. Filtering that would be hard or complicated to express in SQL or traditional BI tools can often be written simply and intuitively with dplyr. For example, given a flight dataset, the rows with a missing ARR_DELAY can be removed with filter():

flight %>%
    select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR,
           DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%
    filter(!is.na(ARR_DELAY))