Skip to content
Advertisement

parse string containing comma in csv using mysql? (or maybe pandas?)

(I’m using mysql v5.7)

I have a csv file that has something like

mytable1

 name   sys        id
name1 aaa,bbb,ccc 8,3,5 
name2 ccc,bbb     22,33
name3 ccc,ccc     11,56
name4 aaa,ccc,ccc,ccc 87,88,40,99

note that ‘aaa,bbb,ccc’ is one string and ‘8,3,5’ is one string.. I know it’s not a good practice putting data like this but this is what I have given from other department… so my first task was to extract number matched with ccc. like this:

mytable2

name  id
name1 5 
name2 22
name3 11
name4 88

for the name3, there were two ‘ccc’ but it is ok to pick the first one.

I could get the right result by doing

select sub.name,
case
when x = 0 then SUBSTRING_INDEX(SUBSTRING_INDEX(sub.id, ",", 1),",",-1)
when x= 1 then SUBSTRING_INDEX(SUBSTRING_INDEX(sub.id, ",", 2),",",-1)
when x=2 then SUBSTRING_INDEX(SUBSTRING_INDEX(sub.id, ",", 3),",",-1)
when x=3 then SUBSTRING_INDEX(SUBSTRING_INDEX(sub.id, ",", 4),",",-1)
when x=4 then SUBSTRING_INDEX(SUBSTRING_INDEX(sub.id, ",", 5),",",-1)
else SUBSTRING_INDEX(sub.id, ",", -1)
end as y
 from (
SELECT DISTINCT
       mytable1.*,
case 
when SUBSTRING_INDEX(SUBSTRING_INDEX(sys, ",", 1),",",-1) = 'ccc' then 0
when SUBSTRING_INDEX(SUBSTRING_INDEX(sys, ",", 2),",",-1) = 'ccc' then 1
when SUBSTRING_INDEX(SUBSTRING_INDEX(sys, ",", 3),",",-1) = 'ccc' then 2
when SUBSTRING_INDEX(SUBSTRING_INDEX(sys, ",", 4),",",-1) = 'ccc' then 3
when SUBSTRING_INDEX(SUBSTRING_INDEX(sys, ",", 5),",",-1) = 'ccc' then 4
else -1
end as x
FROM mytable1) sub        

so basically, I found the index of the ccc then use that index to find the matched id. (btw max number of items in sys column is 6 like ‘aaa, bbb, ccc, ccc, ccc, ccc’)

now I am wondering if there’s any way to find the ids that’s from ccc but not included in the mytable2. like this.

  56 40 99

any help would be appreciated!!

Advertisement

Answer

The id matched the first specified sys may be found by simple

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(id, ',', FIND_IN_SET(@parameter, sys)), ',', -1) id
FROM mytable1
HAVING id > '';

For to extract all another values use:

SELECT DISTINCT name, SUBSTRING_INDEX(SUBSTRING_INDEX(id, ',', num), ',', -1) id
FROM mytable1
JOIN (SELECT 1 num UNION SELECT 2 UNION SELECT 3 UNION SELECT 4) numbers
    ON SUBSTRING_INDEX(SUBSTRING_INDEX(sys, ',', num), ',', -1) = @parameter
HAVING id NOT IN (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(id, ',', FIND_IN_SET(@parameter, sys)), ',', -1)
                  FROM mytable1);

You may remove name from the output if you do not need in it.

Increase the numbers count in numbers subquery accordingly if the sys/id value may contain more than 4 separate values in CSV.

fiddle

User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement