
How to count unique data occurring in multiple categorical columns from a PySpark dataframe

The Problem:

In an ML-based scenario, I am trying to compare how values from multiple columns of an inference file occur relative to the files that were provided to me for training. I need this only for the categorical variables, since the numerical attributes are scaled.

The Expectation:

I’ve had some success doing the following with a standard SQL query, but I want to move this into PySpark.

The Training file looks something like this:

A B C Class
10 0.2 RDK AAA
10 0.2 RDK AAA
10 0.2 RDK AAA
10 1.3 USW ACA
10 1.3 USW ACA
10 1.3 USW ACA
8 0.2 RDK BBV
8 0.1 RDJ BBR
10 0.2 RDK AAA
10 5.1 RDK AAA
8 0.1 RDJ BBR
8 0.1 RDJ BBR
10 5.1 RDK AAA
10 5.1 RDK AAA
8 0.1 RDJ BBR
8 0.1 RDJ BBR

and using SQL I am able to get the desired output. Note, however, that I join only on columns A and C, since those are the categorical ones:

select a.A, a.B, a.C, b.prediction, a.Class, count(1) as distr_count
from train_file a
left outer join test_file b
  on a.A = b.A
 and a.C = b.C
group by a.A, a.C

Desired Output:

A B C prediction Class distr_count
10 0.2 RDK AAA AAA 7
10 1.3 USW ACA AAA 3
8 0.2 RDK BBV BBR 1
8 0.1 RDJ BBR BBR 5
10 5.1 RDK AAA AAA 7
7 1.2 SSW BBR XXX 0
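
If you want to keep the query itself while moving to Spark, one minimal sketch (assuming train_file and test_file are already loaded as DataFrames in an active SparkSession named spark) is to register them as temporary views and run the SQL through spark.sql. The select list here is trimmed to the grouped columns so it is valid under strict group-by rules:

# Sketch: run the SQL above on Spark by registering the DataFrames as views.
# Assumes train_file and test_file are existing DataFrames and spark is the SparkSession.
train_file.createOrReplaceTempView("train_file")
test_file.createOrReplaceTempView("test_file")
result = spark.sql("""
    select a.A, a.C, count(1) as distr_count
    from train_file a
    left outer join test_file b
      on a.A = b.A and a.C = b.C
    group by a.A, a.C
""")
result.show()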


Answer

from pyspark.sql import functions as F

# Join on the categorical columns A and C, then count rows per group.
df = train_file.join(test_file, on=['A', 'C'], how='left_outer')
group_df = df.groupBy(['A', 'C']).agg(F.count('Class').alias('distr_count'))
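
For a self-contained run, here is a sketch using a few rows drawn from the question’s sample data (the SparkSession setup and the reduced row set are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's training and inference files.
train_file = spark.createDataFrame(
    [(10, 0.2, 'RDK', 'AAA'), (10, 1.3, 'USW', 'ACA'),
     (8, 0.2, 'RDK', 'BBV'), (8, 0.1, 'RDJ', 'BBR')],
    ['A', 'B', 'C', 'Class'])
test_file = spark.createDataFrame(
    [(10, 'RDK', 'AAA'), (8, 'RDJ', 'BBR')],
    ['A', 'C', 'prediction'])

# Left join on the categorical columns, then count rows per (A, C) pair.
df = train_file.join(test_file, on=['A', 'C'], how='left_outer')
group_df = df.groupBy(['A', 'C']).agg(F.count('Class').alias('distr_count'))
group_df.show()

Because Class comes from the train side of the left join it is never null, so F.count('Class') is effectively the row count per (A, C) pair.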