So my current table has more than 100 fields and I am trying to calculate the correlations between the input variables and output variable, and then convert all those columns into rows. For example, my current table looks like this:
input_1 | input_2 | output |
---|---|---|
3 | 6 | 5 |
4 | 7 | 5 |
6 | 4 | 4 |
6 | 9 | 3 |
7 | 10 | 5 |
9 | 9 | 2 |
2 | 9 | 4 |
What I trying to achieve is:
categories | correlation |
---|---|
input_1 | -0.594 |
input_2 | -0.27 |
The query that I have written below gives me a correlation per each calculation. However, I was them converted into rows,
select (Avg(input1 * output) - (Avg(input1) * Avg(output))) / (stddev(input1) * stddev(output)) AS correlation from a
Can you help how this can be achieved? Thank you!
Advertisement
Answer
Consider below approach
execute immediate (select ''' select categories, correlation from ( select ''' || string_agg('corr(' || category || ', output) as ' || category , ', ') || ''' from `project.dataset.table` ) unpivot (correlation for categories in (''' || string_agg(category) || ''')) ''' from ( select category from ( select array( select category from unnest(regexp_extract_all(to_json_string(t), r'"(w+)":')) category where category != 'output' ) arr from `project.dataset.table` t limit 1 ), unnest(arr) category ))
if applied to sample data in your question – output is
As you can notice – I am not using your heavy formula – rather I am using CORR built-in function