
In SQL how do I group by every one of a long list of columns and get counts, assembled all into one table?

I have performed a stratified sample on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are:

I want to group by every label_* column once, and create a dictionary of the results with positive/negative counts. At the moment I am accomplishing this in PySpark SQL like this:

The output is thus:

This feels like it should be possible in one SQL statement, but I can’t figure out how to do it or find an existing solution. Obviously I don’t want to write out all the column names, and generating SQL seems worse than this solution.

Can SQL do this? Thanks!


Answer

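You can generate the SQL without using GROUP BY: a single SELECT can return the total row count together with the positive count for every label column.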

Something like

Then use the result to build your dict {k: [total - positive_k, positive_k]}, i.e. the negative and positive counts for each label k.
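As a rough illustration of the idea, here is a minimal sketch that generates such a query from a list of column names and turns the single result row into the dict. It is not the original answer's code: the helper names `build_counts_sql` and `label_balance`, the table name `samples`, and the columns `label_a` and `label_b` are made up for the example, it assumes the labels are stored as 0/1 integers, and it uses Python's built-in `sqlite3` as a stand-in for Spark so it can run anywhere.

```python
import sqlite3


def build_counts_sql(table, label_cols):
    """Build one SELECT that returns the row count and one SUM per label column.

    Assumes each label column holds 0/1 integers, so SUM gives the positive count.
    """
    sums = ", ".join(f"SUM({c}) AS positive_{c}" for c in label_cols)
    return f"SELECT COUNT(*) AS total, {sums} FROM {table}"


def label_balance(conn, table, label_cols):
    """Run the generated query and return {column: [negative_count, positive_count]}."""
    row = conn.execute(build_counts_sql(table, label_cols)).fetchone()
    total, positives = row[0], row[1:]
    return {c: [total - p, p] for c, p in zip(label_cols, positives)}


# Hypothetical toy data standing in for the sampled dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (label_a INTEGER, label_b INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(1, 0), (1, 1), (0, 0), (1, 0)],
)

balance = label_balance(conn, "samples", ["label_a", "label_b"])
print(balance)  # {'label_a': [1, 3], 'label_b': [3, 1]}
```

In PySpark the same generated string could presumably be passed to `spark.sql(...)` against a registered temp view, with the column list taken from something like `[c for c in df.columns if c.startswith("label_")]`, so no column names have to be typed by hand.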

User contributions licensed under: CC BY-SA