So my current table has more than 100 fields and I am trying to calculate the correlations between the input variables and output variable, and then convert all those columns into rows. For example, my current table looks like this:
input_1 | input_2 | output |
---|---|---|
3 | 6 | 5 |
4 | 7 | 5 |
6 | 4 | 4 |
6 | 9 | 3 |
7 | 10 | 5 |
9 | 9 | 2 |
2 | 9 | 4 |
What I trying to achieve is:
categories | correlation |
---|---|
input_1 | -0.594 |
input_2 | -0.27 |
The query that I have written below gives me a correlation per each calculation. However, I was them converted into rows,
x
select (Avg(input1 * output) - (Avg(input1) * Avg(output))) / (stddev(input1) * stddev(output)) AS correlation
from a
Can you help how this can be achieved? Thank you!
Advertisement
Answer
Consider below approach
execute immediate (select '''
select categories, correlation
from (
select ''' ||
string_agg('corr(' || category || ', output) as ' || category , ', ')
|| ''' from `project.dataset.table`
)
unpivot (correlation for categories in (''' || string_agg(category) || '''))
'''
from (
select category from (
select
array(
select category
from unnest(regexp_extract_all(to_json_string(t), r'"(w+)":')) category
where category != 'output'
) arr
from `project.dataset.table` t
limit 1
), unnest(arr) category
))
if applied to sample data in your question – output is
As you can notice – I am not using your heavy formula – rather I am using CORR built-in function