I have a SQL table of the following format:
ID Cat 1 A 1 B 1 D 1 F 2 B 2 C 2 D 3 A 3 F
Now, I want to create a table with one ID per row, and multiple Cat’s in a row. My desired output looks as follows:
ID A B C D E F 1 1 1 0 1 0 1 2 0 1 1 1 0 0 3 1 0 0 0 0 1
I have found:
Transform table to one-hot-encoding of single column value
However, I have more than 1000 Cat’s, so I am looking for code to write this automatically, rather than manually. Who can help me with this?
Advertisement
Answer
First let me transform the data you pasted into an actual table:
WITH data AS ( SELECT REGEXP_EXTRACT(data2, '[0-9]') id, REGEXP_EXTRACT(data2, '[A-Z]') cat FROM ( SELECT SPLIT("""1 A 1 B 1 D 1 F 2 B 2 C 2 D 3 A 3 F""", 'n') AS data1 ), UNNEST(data1) data2 ) SELECT * FROM data
(try sharing a table next time)
Now we can do some manual 1-hot encoding:
SELECT id , MAX(IF(cat='A',1,0)) cat_A , MAX(IF(cat='B',1,0)) cat_B , MAX(IF(cat='C',1,0)) cat_C FROM data GROUP BY id
Now we want to write a script that will automatically create the columns we want:
SELECT STRING_AGG(FORMAT("MAX(IF(cat='%s',1,0))cat_%s", cat, cat), ', ') FROM ( SELECT DISTINCT cat FROM data ORDER BY 1 )
That generates a string that you can copy paste into a query, that 1-hot encodes your arrays/rows:
SELECT id , MAX(IF(cat='A',1,0))cat_A, MAX(IF(cat='B',1,0))cat_B, MAX(IF(cat='C',1,0))cat_C, MAX(IF(cat='D',1,0))cat_D, MAX(IF(cat='F',1,0))cat_F FROM data GROUP BY id
And that’s exactly what the question was asking for. You can generate SQL with SQL, but you’ll need to write a new query using that result.