The Problem:
In an ML-based scenario, I want to count how often combinations of values from an inference file occur in the files that were provided to me for training. I need this only for the categorical variables, since the numerical attributes are scaled.
The Expectation:
I've had some success doing the following with a standard SQL query, but I want to move it into PySpark.
The Training file looks something like this:
A | B | C | Class |
---|---|---|---|
10 | 0.2 | RDK | AAA |
10 | 0.2 | RDK | AAA |
10 | 0.2 | RDK | AAA |
10 | 1.3 | USW | ACA |
10 | 1.3 | USW | ACA |
10 | 1.3 | USW | ACA |
8 | 0.2 | RDK | BBV |
8 | 0.1 | RDJ | BBR |
10 | 0.2 | RDK | AAA |
10 | 5.1 | RDK | AAA |
8 | 0.1 | RDJ | BBR |
8 | 0.1 | RDJ | BBR |
10 | 5.1 | RDK | AAA |
10 | 5.1 | RDK | AAA |
8 | 0.1 | RDJ | BBR |
8 | 0.1 | RDJ | BBR |
Using SQL I am able to get the desired output. Note that I am matching only on columns A and C, since they are the categorical ones:
select a.A, a.B, a.C, b.prediction, a.Class, count(1) as distr_count
from train_file a
left outer join test_file b
  on a.A = b.A and a.C = b.C
group by a.A, a.B, a.C, b.prediction, a.Class
Desired Output:
A | B | C | prediction | Class | distr_count |
---|---|---|---|---|---|
10 | 0.2 | RDK | AAA | AAA | 7 |
10 | 1.3 | USW | ACA | AAA | 3 |
8 | 0.2 | RDK | BBV | BBR | 1 |
8 | 0.1 | RDJ | BBR | BBR | 5 |
10 | 5.1 | RDK | AAA | AAA | 7 |
7 | 1.2 | SSW | BBR | XXX | 0 |
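The counting logic behind the desired output can be sketched in plain Python before touching Spark: tally the training occurrences of each categorical (A, C) pair, then look each test row up in that tally, defaulting to 0 for combinations never seen in training. The `test_rows` below are an assumption reconstructed from the desired-output table, not the actual inference file.

```python
from collections import Counter

# Training rows reduced to the categorical columns (A, C); column B is
# numerical/scaled, so it is deliberately excluded from matching.
train_rows = [
    (10, "RDK"), (10, "RDK"), (10, "RDK"),
    (10, "USW"), (10, "USW"), (10, "USW"),
    (8, "RDK"), (8, "RDJ"),
    (10, "RDK"), (10, "RDK"),
    (8, "RDJ"), (8, "RDJ"),
    (10, "RDK"), (10, "RDK"),
    (8, "RDJ"), (8, "RDJ"),
]

# Hypothetical inference rows: (A, C, prediction).
test_rows = [
    (10, "RDK", "AAA"),
    (10, "USW", "ACA"),
    (8, "RDK", "BBV"),
    (8, "RDJ", "BBR"),
    (7, "SSW", "BBR"),   # combination never seen in training -> count 0
]

# Tally occurrences of each categorical pair in training.
train_counts = Counter((a, c) for a, c in train_rows)

# Attach the training count to every test row, 0 when unseen.
distr = [(a, c, pred, train_counts.get((a, c), 0)) for a, c, pred in test_rows]
for row in distr:
    print(row)
```

This reproduces the distr_count column above: 7, 3, 1, 5 and 0 for the unseen (7, SSW) pair.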
Answer
import pyspark.sql.functions as f

df = train_file.join(test_file, on=['A', 'C'], how='left_outer')
group_df = df.groupBy(['A', 'C']).agg(f.count('Class').alias('distr_count'))