I have performed a stratified sample on a multi-label dataset before training a classifier and want to check how balanced it is now. The columns in the dataset are:
|_Body|label_0|label_1|label_10|label_100|label_101|label_102|label_103|label_104|label_11|label_12|label_13|label_14|label_15|label_16|label_17|label_18|label_19|label_2|label_20|label_21|label_22|label_23|label_24|label_25|label_26|label_27|label_28|label_29|label_3|label_30|label_31|label_32|label_33|label_34|label_35|label_36|label_37|label_38|label_39|label_4|label_40|label_41|label_42|label_43|label_44|label_45|label_46|label_47|label_48|label_49|label_5|label_50|label_51|label_52|label_53|label_54|label_55|label_56|label_57|label_58|label_59|label_6|label_60|label_61|label_62|label_63|label_64|label_65|label_66|label_67|label_68|label_69|label_7|label_70|label_71|label_72|label_73|label_74|label_75|label_76|label_77|label_78|label_79|label_8|label_80|label_81|label_82|label_83|label_84|label_85|label_86|label_87|label_88|label_89|label_9|label_90|label_91|label_92|label_93|label_94|label_95|label_96|label_97|label_98|label_99|
I want to group by every label_*
column once, and create a dictionary of the results with positive/negative counts. At the moment I am accomplishing this in PySpark SQL like this:
# Evaluate how skewed the sample is after balancing it by resampling stratified_sample = spark.read.json('s3://stackoverflow-events/1901/Sample.Stratified.{}.*.jsonl'.format(limit)) stratified_sample.registerTempTable('stratified_sample') label_counts = {} for i in range(0, 100): count_df = spark.sql('SELECT label_{}, COUNT(*) as total FROM stratified_sample GROUP BY label_{}'.format(i, i)) rows = count_df.rdd.take(2) neg_count = getattr(rows[0], 'total') pos_count = getattr(rows[1], 'total') label_counts[i] = [neg_count, pos_count]
The output is thus:
{0: [1034673, 14491], 1: [1023250, 25914], 2: [1030462, 18702], 3: [1035645, 13519], 4: [1037445, 11719], 5: [1010664, 38500], 6: [1031699, 17465], ...}
This feels like it should be possible in one SQL statement, but I can’t figure out how to do this or find an existing solution. Obviously I don’t want to write out all the column names and generating SQL seems worse than this solution.
Can SQL do this? Thanks!
Advertisement
Answer
You can generate sql without group by.
Something like
SELECT COUNT(*) AS total, SUM(label_k) as positive_k ,.. FROM table
And then use the result to produce your dict {k : [total-positive_k, positive_k]}