This is how my data looks,
emp_id,skills
1234,python|java|sql|R|javascript
5639,C|HTML|php|perl
This is how data need to be loaded into the table
emp_id skills
python
1234 java
sql
R
perl
C
5639 HTML
php
Actually I have replaced | with n but its not being loaded in the next line instead its just loading by adding spaces. And I will loading data into table using python etl, so even postprocessing i can add. Any suggestions??
Advertisement
Answer
It’s the behaviour of a relational database, that can’t just display emp_id
once per group and all its elements as one per line. Changing that way of displaying data is the front end’s privilege and responsibility, not the database’s. So do that in Python.
Having said that, Impala has the SPLIT_PART()
string function, which returns the n-th token of a string separated by a separator character that you pass as an argument.
So, cross joining with a series of consecutive integers, and then applying SPLIT_PART(skills,'|',i)
does what you need.
Actually, in my (never too) humble opinion, this is what you should do whenever someone throws a file with such an unhappy format at you to load into the database. Always verticalise comma/bar/semicolon/or_whatever separated lists of “values” using the technique below, and store the data verticalised.:
WITH
-- your input
input( emp_id,skills) AS (
SELECT 1234,'python|java|sql|R|javascript'
UNION ALL SELECT 5639,'C|HTML|php|perl'
)
,
-- a big enough series of integers ..
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
UNION ALL SELECT 7
UNION ALL SELECT 8
)
SELECT
emp_id
, SPLIT_PART(skills,'|',i) AS skill
FROM input
CROSS JOIN i
WHERE SPLIT_PART(skills,'|',i) <> ''
ORDER BY
emp_id
, i
;
-- out emp_id | skill
-- out --------+------------
-- out 1234 | python
-- out 1234 | java
-- out 1234 | sql
-- out 1234 | R
-- out 1234 | javascript
-- out 5639 | C
-- out 5639 | HTML
-- out 5639 | php
-- out 5639 | perl
With two bar/comma separated columns, it may look like this:
WITH
-- your input, enhanced
input( emp_id,skills,pubs) AS (
SELECT 1234,'python|java|sql|R|javascript','ship inn,anchor,stag'
UNION ALL SELECT 5639,'C|HTML|php|perl' ,'black horse,crown,mitre'
)
,
-- a big enough series of integers ..
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
UNION ALL SELECT 6
UNION ALL SELECT 7
UNION ALL SELECT 8
)
,
-- another big enough series of integers ..
j(j) AS (
SELECT
i AS j
FROM i
)
SELECT
emp_id
, i AS skill_sequence
, SPLIT_PART(skills,'|',i) AS skill
, j AS pub_sequence
, SPLIT_PART(pubs,',',j) AS pub
FROM input
CROSS JOIN i
CROSS JOIN j
WHERE SPLIT_PART(skills,'|',i) <> ''
AND SPLIT_PART(pubs, ',',j) <> ''
ORDER BY
emp_id
, i
, j
;
-- out emp_id | skill_sequence | skill | pub_sequence | pub
-- out --------+----------------+------------+--------------+-------------
-- out 1234 | 1 | python | 1 | ship inn
-- out 1234 | 1 | python | 2 | anchor
-- out 1234 | 1 | python | 3 | stag
-- out 1234 | 2 | java | 1 | ship inn
-- out 1234 | 2 | java | 2 | anchor
-- out 1234 | 2 | java | 3 | stag
-- out 1234 | 3 | sql | 1 | ship inn
-- out 1234 | 3 | sql | 2 | anchor
-- out 1234 | 3 | sql | 3 | stag
-- out 1234 | 4 | R | 1 | ship inn
-- out 1234 | 4 | R | 2 | anchor
-- out 1234 | 4 | R | 3 | stag
-- out 1234 | 5 | javascript | 1 | ship inn
-- out 1234 | 5 | javascript | 2 | anchor
-- out 1234 | 5 | javascript | 3 | stag
-- out 5639 | 1 | C | 1 | black horse
-- out 5639 | 1 | C | 2 | crown
-- out 5639 | 1 | C | 3 | mitre
-- out 5639 | 2 | HTML | 1 | black horse
-- out 5639 | 2 | HTML | 2 | crown
-- out 5639 | 2 | HTML | 3 | mitre
-- out 5639 | 3 | php | 1 | black horse
-- out 5639 | 3 | php | 2 | crown
-- out 5639 | 3 | php | 3 | mitre
-- out 5639 | 4 | perl | 1 | black horse
-- out 5639 | 4 | perl | 2 | crown
-- out 5639 | 4 | perl | 3 | mitre