We have a table regarding identification where data is as such:
| Customer Number | Identification |
|---|---|
| 101 | passport-us |
| 101 | Licence |
| 102 | passport-uk |
| 102 | Licence |
| 102 | birth-cert |
| 103 | Licence-learner |
Thus one customer may have submitted multiple identification others in varying number/combination.
Result we are after is :
| Customer Number | Passport | Licence | Birth Certificate |
|---|---|---|---|
| 101 | passport-us | licence | |
| 102 | passport-uk | licence | birth-cert |
| 103 | licence-learner |
So basically taking the exact text and placing in the subsequent column.
What we know is:
- there are only three identification type we are interested in (if there are others we dont catch)
- they will contain within it the keyword (passport,birth, etc)
- there is no given order to how customer number is coming in the table, there is a separate primary key else where in the bigger table.
- we are aiming the query to run in hive
I was thinking along the lines of self join three times with REGEXP_EXTRACT perhaps.
Thoughts?
Advertisement
Answer
You can do it using pure sql like this:
select CustomerNumber, max(case when Identification like 'passport%' then Identification else null end) as Passport, max(case when Identification like 'Licence%' then Identification else null end) as Licence, max(case when Identification ='birth-cert' then Identification else null end) as BirthCertificate from identification group by CustomerNumber