How to count unique data occurring in multiple categorical columns of a PySpark DataFrame

The Problem:

In an ML scenario, I am trying to see how often the data in multiple columns of an inference file occurs in the files that were provided to me for training. I need this only for the categorical variables, since the numerical attributes are scaled.

The Expectation:

I’ve had some success doing this with a Standard SQL query, but I want to move it into PySpark.

The Training file looks something like this:

A B C Class
10 0.2 RDK AAA
10 0.2 RDK AAA
10 0.2 RDK AAA
10 1.3 USW ACA
10 1.3 USW ACA
10 1.3 USW ACA
8 0.2 RDK BBV
8 0.1 RDJ BBR
10 0.2 RDK AAA
10 5.1 RDK AAA
8 0.1 RDJ BBR
8 0.1 RDJ BBR
10 5.1 RDK AAA
10 5.1 RDK AAA
8 0.1 RDJ BBR
8 0.1 RDJ BBR

Using SQL, I am able to get the desired output. Note that I match only on columns A and C, since they are the categorical ones (a sketch of such a query appears after the output table below):

Desired Output:

A B C prediction Class distr_count
10 0.2 RDK AAA AAA 7
10 1.3 USW ACA AAA 3
8 0.2 RDK BBV BBR 1
8 0.1 RDJ BBR BBR 5
10 5.1 RDK AAA AAA 7
7 1.2 SSW BBR XXX 0
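
As a rough sketch of the kind of query that produces this table (the names train, infer, train_df, and infer_df are chosen here for illustration; they assume the two files are already loaded as DataFrames and registered as temp views), the logic is a grouped count on the categorical columns joined back onto the inference rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical setup: train_df / infer_df are the already-loaded DataFrames.
train_df.createOrReplaceTempView("train")
infer_df.createOrReplaceTempView("infer")

result = spark.sql("""
    SELECT i.*,
           COALESCE(t.distr_count, 0) AS distr_count
    FROM infer AS i
    LEFT JOIN (
        -- count training rows per combination of the categorical columns
        SELECT A, C, COUNT(*) AS distr_count
        FROM train
        GROUP BY A, C
    ) AS t
    ON i.A = t.A AND i.C = t.C
""")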


Answer
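
In PySpark's DataFrame API, one way to get the distr_count column is to group the training data by the categorical columns A and C, then left-join those counts onto the inference rows, filling unseen combinations with 0. Below is a minimal, self-contained sketch built from the sample data in the question (the DataFrame names are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the real files, built from the sample rows above.
train_df = spark.createDataFrame(
    [(10, 0.2, "RDK", "AAA")] * 4
    + [(10, 1.3, "USW", "ACA")] * 3
    + [(8, 0.2, "RDK", "BBV")]
    + [(8, 0.1, "RDJ", "BBR")] * 5
    + [(10, 5.1, "RDK", "AAA")] * 3,
    ["A", "B", "C", "Class"],
)
infer_df = spark.createDataFrame(
    [
        (10, 0.2, "RDK", "AAA", "AAA"),
        (10, 1.3, "USW", "ACA", "AAA"),
        (8, 0.2, "RDK", "BBV", "BBR"),
        (8, 0.1, "RDJ", "BBR", "BBR"),
        (10, 5.1, "RDK", "AAA", "AAA"),
        (7, 1.2, "SSW", "BBR", "XXX"),
    ],
    ["A", "B", "C", "prediction", "Class"],
)

# Count training rows per combination of the categorical columns A and C.
counts = train_df.groupBy("A", "C").agg(F.count("*").alias("distr_count"))

# Left-join the counts onto the inference rows; combinations never seen
# in training (e.g. A=7, C=SSW) get a distr_count of 0.
result = (
    infer_df.join(counts, on=["A", "C"], how="left")
    .fillna({"distr_count": 0})
    .select("A", "B", "C", "prediction", "Class", "distr_count")
)

result.show()

Running this against the sample data reproduces the desired output: the (10, RDK) rows get 7, (10, USW) gets 3, (8, RDK) gets 1, (8, RDJ) gets 5, and the unseen (7, SSW) row gets 0.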
