How to duplicate IDs with Hive QL Impala Python

0 votes

I need help deduplicating a list of users (20 million+) across a different set of IDs.

Here's how it looks like :
- We have 3 kinds of userIDs : ID1, ID2 and ID3. - At least 2 of them are always sent together : ID1 with ID2 or ID2 with ID3. ID3 is never sent with ID1.
- Users can have several ID1, ID2 or ID3.
- So sometimes, in my table, I will have several lines with lots of different IDs, but it's possible that all of those can describe one single user.

An example :
enter image description here

Problem is : I know how to do this on SQL Server through the CURSOR / OPEN / FETCH / NEXT commands, but I only have Hive QL, Impala and Python available on my environment.

Oct 1, 2018 in Python by bug_seeker
• 15,520 points
741 views

1 answer to this question.

0 votes

Based on your example, supposing id2 always exist, you can aggregate rows, group by id2:

select max(id1) id1,  id2, max(id3) id3 from
( --your dataset as in example
 select 'A'  as id1, 1 as id2,  null   as id3 union all
 select null as id1, 1 as id2, 'Alpha' as id3 union all
 select null as id1, 2 as id2, 'Beta'  as id3 union all
 select 'A'  as id1, 2 as id2,  null   as id3
 )s
 group by id2;

OK
A       1       Alpha
A       2       Beta
Time taken: 58.739 seconds, Fetched: 2 row(s)

And now I'm trying to implement your logic as you described:

select --pass2
 id1, id2, id3,
 case when lag(id2) over (order by id2, GroupId) = id2 then lag(GroupId) over (order by id2, GroupId) else GroupId end GroupId2
 from
 (
 select        --pass1
 id1, id2, id3,
 case when 
 lag(id1) over(order by id1, NVL(ID1,ID3)) =id1 then lag(NVL(ID1,ID3))  over(order by id1, NVL(ID1,ID3)) else NVL(ID1,ID3) end GroupId
 from
( --your dataset as in example
 select 'A'  as id1, 1 as id2,  null   as id3 union all
 select null as id1, 1 as id2, 'Alpha' as id3 union all
 select null as id1, 2 as id2, 'Beta'  as id3 union all
 select 'A'  as id1, 2 as id2,  null   as id3
 )s
 )s --pass1
;


OK
id1     id2     id3     groupid2
A       1       NULL    A
NULL    1       Alpha   A
A       2       NULL    A
NULL    2       Beta    A
Time taken: 106.944 seconds, Fetched: 4 row(s)
answered Oct 1, 2018 by Priyaj
• 58,090 points

Related Questions In Python

0 votes
1 answer

How to perform web scraping with python?

Hey, there are various libraries used in ...READ MORE

answered Apr 20, 2018 in Python by aayushi
• 750 points
1,577 views
+1 vote
1 answer

How to use GUI that comes with Python to test your code?

Hey @alex0809, When your testing a website ...READ MORE

answered Sep 24, 2018 in Python by Vardhan
• 13,190 points
699 views
+1 vote
10 answers

How to fix this? ValueError: invalid literal for int() with base 10 error in Python

The following are totally acceptable in python: passing ...READ MORE

answered Nov 16, 2018 in Python by Nymeria
• 3,560 points
406,521 views
0 votes
1 answer

How to parse date/time string with timezone abbreviated name in Python?

The parse() function in dateutil can't handle ...READ MORE

answered Nov 27, 2018 in Python by Nymeria
• 3,560 points
1,932 views
0 votes
2 answers
+1 vote
2 answers

how can i count the items in a list?

Syntax :            list. count(value) Code: colors = ['red', 'green', ...READ MORE

answered Jul 7, 2019 in Python by Neha
• 330 points

edited Jul 8, 2019 by Kalgi 4,058 views
0 votes
1 answer
0 votes
1 answer

How to replace values with None in Pandas data frame in Python?

Actually in later versions of pandas this ...READ MORE

answered Aug 30, 2018 in Python by Priyaj
• 58,090 points
11,180 views
+1 vote
1 answer

How to create plots using python matplotlib in IPython notebook?

I think you should try: I used %matplotlib inline in ...READ MORE

answered Aug 8, 2018 in Python by Priyaj
• 58,090 points
1,214 views
webinar REGISTER FOR FREE WEBINAR X
REGISTER NOW
webinar_success Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP