This is how my data looks,
emp_id,skills 1234,python|java|sql|R|javascript 5639,C|HTML|php|perl
This is how data need to be loaded into the table
emp_id skills python 1234 java sql R perl C 5639 HTML php
Actually I have replaced | with n but its not being loaded in the next line instead its just loading by adding spaces. And I will loading data into table using python etl, so even postprocessing i can add. Any suggestions??
Advertisement
Answer
It’s the behaviour of a relational database, that can’t just display emp_id
once per group and all its elements as one per line. Changing that way of displaying data is the front end’s privilege and responsibility, not the database’s. So do that in Python.
Having said that, Impala has the SPLIT_PART()
string function, which returns the n-th token of a string separated by a separator character that you pass as an argument.
So, cross joining with a series of consecutive integers, and then applying SPLIT_PART(skills,'|',i)
does what you need.
Actually, in my (never too) humble opinion, this is what you should do whenever someone throws a file with such an unhappy format at you to load into the database. Always verticalise comma/bar/semicolon/or_whatever separated lists of “values” using the technique below, and store the data verticalised.:
WITH -- your input input( emp_id,skills) AS ( SELECT 1234,'python|java|sql|R|javascript' UNION ALL SELECT 5639,'C|HTML|php|perl' ) , -- a big enough series of integers .. i(i) AS ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 ) SELECT emp_id , SPLIT_PART(skills,'|',i) AS skill FROM input CROSS JOIN i WHERE SPLIT_PART(skills,'|',i) <> '' ORDER BY emp_id , i ; -- out emp_id | skill -- out --------+------------ -- out 1234 | python -- out 1234 | java -- out 1234 | sql -- out 1234 | R -- out 1234 | javascript -- out 5639 | C -- out 5639 | HTML -- out 5639 | php -- out 5639 | perl
With two bar/comma separated columns, it may look like this:
WITH -- your input, enhanced input( emp_id,skills,pubs) AS ( SELECT 1234,'python|java|sql|R|javascript','ship inn,anchor,stag' UNION ALL SELECT 5639,'C|HTML|php|perl' ,'black horse,crown,mitre' ) , -- a big enough series of integers .. i(i) AS ( SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 ) , -- another big enough series of integers .. j(j) AS ( SELECT i AS j FROM i ) SELECT emp_id , i AS skill_sequence , SPLIT_PART(skills,'|',i) AS skill , j AS pub_sequence , SPLIT_PART(pubs,',',j) AS pub FROM input CROSS JOIN i CROSS JOIN j WHERE SPLIT_PART(skills,'|',i) <> '' AND SPLIT_PART(pubs, ',',j) <> '' ORDER BY emp_id , i , j ; -- out emp_id | skill_sequence | skill | pub_sequence | pub -- out --------+----------------+------------+--------------+------------- -- out 1234 | 1 | python | 1 | ship inn -- out 1234 | 1 | python | 2 | anchor -- out 1234 | 1 | python | 3 | stag -- out 1234 | 2 | java | 1 | ship inn -- out 1234 | 2 | java | 2 | anchor -- out 1234 | 2 | java | 3 | stag -- out 1234 | 3 | sql | 1 | ship inn -- out 1234 | 3 | sql | 2 | anchor -- out 1234 | 3 | sql | 3 | stag -- out 1234 | 4 | R | 1 | ship inn -- out 1234 | 4 | R | 2 | anchor -- out 1234 | 4 | R | 3 | stag -- out 1234 | 5 | javascript | 1 | ship inn -- out 1234 | 5 | javascript | 2 | anchor -- out 1234 | 5 | javascript | 3 | stag -- out 5639 | 1 | C | 1 | black horse -- out 5639 | 1 | C | 2 | crown -- out 5639 | 1 | C | 3 | mitre -- out 5639 | 2 | HTML | 1 | black horse -- out 5639 | 2 | HTML | 2 | crown -- out 5639 | 2 | HTML | 3 | mitre -- out 5639 | 3 | php | 1 | black horse -- out 5639 | 3 | php | 2 | crown -- out 5639 | 3 | php | 3 | mitre -- out 5639 | 4 | perl | 1 | black horse -- out 5639 | 4 | perl | 2 | crown -- out 5639 | 4 | perl | 3 | mitre