We have a table regarding identification where data is as such:
Customer Number | Identification |
---|---|
101 | passport-us |
101 | Licence |
102 | passport-uk |
102 | Licence |
102 | birth-cert |
103 | Licence-learner |
Thus one customer may have submitted multiple identification others in varying number/combination.
Result we are after is :
Customer Number | Passport | Licence | Birth Certificate |
---|---|---|---|
101 | passport-us | licence | |
102 | passport-uk | licence | birth-cert |
103 | licence-learner |
So basically taking the exact text and placing in the subsequent column.
What we know is:
- there are only three identification type we are interested in (if there are others we dont catch)
- they will contain within it the keyword (passport,birth, etc)
- there is no given order to how customer number is coming in the table, there is a separate primary key else where in the bigger table.
- we are aiming the query to run in hive
I was thinking along the lines of self join three times with REGEXP_EXTRACT perhaps.
Thoughts?
Advertisement
Answer
You can do it using pure sql like this:
select CustomerNumber, max(case when Identification like 'passport%' then Identification else null end) as Passport, max(case when Identification like 'Licence%' then Identification else null end) as Licence, max(case when Identification ='birth-cert' then Identification else null end) as BirthCertificate from identification group by CustomerNumber