How to count unique data occurring in multiple categorical columns of a PySpark DataFrame

The Problem:

In an ML scenario, I am trying to see how often the data in multiple columns of an inference file occurs in the files that were provided to me for training. I need this only for the categorical variables, since the numerical attributes are scaled.

The Expectation:

I’ve had some success doing this with a Standard SQL query, but I want to move it into PySpark.

The Training file looks something like this:

A B C Class
10 0.2 RDK AAA
10 0.2 RDK AAA
10 0.2 RDK AAA
10 1.3 USW ACA
10 1.3 USW ACA
10 1.3 USW ACA
8 0.2 RDK BBV
8 0.1 RDJ BBR
10 0.2 RDK AAA
10 5.1 RDK AAA
8 0.1 RDJ BBR
8 0.1 RDJ BBR
10 5.1 RDK AAA
10 5.1 RDK AAA
8 0.1 RDJ BBR
8 0.1 RDJ BBR

Using SQL, I am able to get the desired output. Note that I match only on columns A and C, since they are the categorical ones (a sketch of such a query appears after the output table below):

Desired Output:

A B C prediction Class distr_count
10 0.2 RDK AAA AAA 7
10 1.3 USW ACA AAA 3
8 0.2 RDK BBV BBR 1
8 0.1 RDJ BBR BBR 5
10 5.1 RDK AAA AAA 7
7 1.2 SSW BBR XXX 0
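
As a rough sketch of the kind of query that produces this table (the names train, infer, train_df, and infer_df are chosen here for illustration; they assume the two files are already loaded as DataFrames and registered as temp views), the logic is a grouped count on the categorical columns joined back onto the inference rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical setup: train_df / infer_df are the already-loaded DataFrames.
train_df.createOrReplaceTempView("train")
infer_df.createOrReplaceTempView("infer")

result = spark.sql("""
    SELECT i.*,
           COALESCE(t.distr_count, 0) AS distr_count
    FROM infer AS i
    LEFT JOIN (
        -- count training rows per combination of the categorical columns
        SELECT A, C, COUNT(*) AS distr_count
        FROM train
        GROUP BY A, C
    ) AS t
    ON i.A = t.A AND i.C = t.C
""")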


Answer
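
In PySpark's DataFrame API, one way to get the distr_count column is to group the training data by the categorical columns A and C, then left-join those counts onto the inference rows, filling unseen combinations with 0. Below is a minimal, self-contained sketch built from the sample data in the question (the DataFrame names are illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the real files, built from the sample rows above.
train_df = spark.createDataFrame(
    [(10, 0.2, "RDK", "AAA")] * 4
    + [(10, 1.3, "USW", "ACA")] * 3
    + [(8, 0.2, "RDK", "BBV")]
    + [(8, 0.1, "RDJ", "BBR")] * 5
    + [(10, 5.1, "RDK", "AAA")] * 3,
    ["A", "B", "C", "Class"],
)
infer_df = spark.createDataFrame(
    [
        (10, 0.2, "RDK", "AAA", "AAA"),
        (10, 1.3, "USW", "ACA", "AAA"),
        (8, 0.2, "RDK", "BBV", "BBR"),
        (8, 0.1, "RDJ", "BBR", "BBR"),
        (10, 5.1, "RDK", "AAA", "AAA"),
        (7, 1.2, "SSW", "BBR", "XXX"),
    ],
    ["A", "B", "C", "prediction", "Class"],
)

# Count training rows per combination of the categorical columns A and C.
counts = train_df.groupBy("A", "C").agg(F.count("*").alias("distr_count"))

# Left-join the counts onto the inference rows; combinations never seen
# in training (e.g. A=7, C=SSW) get a distr_count of 0.
result = (
    infer_df.join(counts, on=["A", "C"], how="left")
    .fillna({"distr_count": 0})
    .select("A", "B", "C", "prediction", "Class", "distr_count")
)

result.show()

Running this against the sample data reproduces the desired output: the (10, RDK) rows get 7, (10, USW) gets 3, (8, RDK) gets 1, (8, RDJ) gets 5, and the unseen (7, SSW) row gets 0.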
