How to join columns that are non-NA

0 votes

Consider a dataset with NAs in few rows and for some columns:

data <- data.table(ID=c(1, 2, 1:3), X=c(NA, NA, 1, NA, 3), Y=c(4, 5, NA, 5, 6), Z=c(7, 8, NA, NA, 9))
data
#    ID  X  Y  Z
# 1:  1 NA  4  7
# 2:  2 NA  5  8
# 3:  1  1 NA NA
# 4:  2 NA  5 NA
# 5:  3  3  6  9

Reference table is as follows:

ref <- data.table(ID=c(1, 1:3), X=c(1, 1:3), Y=c(1, 4:6), Z=c(1, 7, NA, 9), VALUE1=c(111, 101:103), VALUE2=c(112, 104:106))
ref
#    ID X Y  Z VALUE1 VALUE2
# 1:  1 1 1  1 111  112
# 2:  1 1 4  7 101  104
# 3:  2 2 5 NA 102  105
# 4:  3 3 6  9 103  106

I want to perform left join data on "data" with "reference table" using only non-NA columns for each row:

My desired output is as follows:

   ID  X  Y  Z VALUE1 VALUE2
1:  1 NA  4  7 101  104

2:  2 NA  5  8  NA   NA

3:  1  1 NA NA 111  112
4:  1  1 NA NA 101  104

5:  2 NA  5 NA 102  105

6:  3  3  6  9 103  106

How do I achieve this?

Apr 18, 2018 in Data Analytics by DataKing99
• 8,240 points
579 views

1 answer to this question.

0 votes

You can match on something like X=X OR is.na(X).

However for complicated merge situations, you can use sqldf  

Below is my code:

library(sqldf)
sqldf("SELECT l.*, r.VALUE1, r.VALUE2
       FROM       data as l
       LEFT JOIN  ref as r
       ON         l.ID = r.ID AND (l.X = r.X OR l.X IS NULL)
                  AND (l.Y = r.Y OR l.Y IS NULL)
                  AND (l.Z = r.Z OR l.Z IS NULL)
                  AND (l.X IS NOT NULL OR l.Y IS NOT NULL OR l.Z IS NOT NULL)")

#  ID  X  Y  Z VALUE1 VALUE2
#1  1 NA  4  7 101  104
#2  2 NA  5  8  NA   NA
#3  1  1 NA NA 111  112
#4  1  1 NA NA 101  104
#5  2 NA  5 NA 102  105
#6  3  3  6  9 103  106

NOTE: The last condition insures that if all of your X, Y, Z are NA then it won't match any rows.

answered Apr 18, 2018 by Sahiti
• 6,370 points

Related Questions In Data Analytics

0 votes
1 answer

How to join two tables (tibbles) by *list* columns in R

You can use the hash from digest ...READ MORE

answered Apr 6, 2018 in Data Analytics by kappa3010
• 2,090 points
1,414 views
0 votes
5 answers

How to remove NA values with dplyr::filter()

Try this: df %>% filter(!is.na(col1)) READ MORE

answered Mar 26, 2019 in Data Analytics by anonymous
321,054 views
0 votes
1 answer
0 votes
1 answer
0 votes
1 answer

.SD in data.table in R

.SD stands for "Subset of Data.table". The ...READ MORE

answered Apr 13, 2018 in Data Analytics by nirvana
• 3,130 points
6,659 views
0 votes
1 answer

Big Data transformations with R

Dear Koushik, Hope you are doing great. You can ...READ MORE

answered Dec 18, 2017 in Data Analytics by Sudhir
• 1,610 points
762 views
0 votes
2 answers

Transforming a key/value string into distinct rows in R

We would start off by loading the ...READ MORE

answered Mar 26, 2018 in Data Analytics by Bharani
• 4,660 points
834 views
0 votes
1 answer

Finding frequency of observations in R

You can use the "dplyr" package to ...READ MORE

answered Mar 26, 2018 in Data Analytics by Bharani
• 4,660 points
5,542 views
0 votes
1 answer

How to subset data so that it contains only columns whose names match a condition

You can use grepl on the names ...READ MORE

answered Apr 27, 2018 in Data Analytics by Sahiti
• 6,370 points
6,041 views
0 votes
1 answer

How to replace NA with 0 using starts_with()

Well I could suggest various options such ...READ MORE

answered Apr 3, 2018 in Data Analytics by Sahiti
• 6,370 points
1,315 views
webinar REGISTER FOR FREE WEBINAR X
REGISTER NOW
webinar_success Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP