I am trying to convert the following SQL query into PySpark:
SELECT COUNT(CASE WHEN COALESCE(data.pred, 0) != 0
                   AND COALESCE(data.val, 0) != 0
                   AND (ABS(COALESCE(data.pred, 0) - COALESCE(data.val, 0)) / COALESCE(data.val, 0)) > 0.1
             THEN data.pred END) / COUNT(*) AS Result
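(For reference, the original query can be run unchanged through spark.sql once the DataFrame is registered as a temporary view; this sketch assumes an active SparkSession named spark and that the query selects FROM data. This is the result I am trying to reproduce with the DataFrame API.)

data.createOrReplaceTempView("data")
expected = spark.sql("""
    SELECT COUNT(CASE WHEN COALESCE(data.pred, 0) != 0
                       AND COALESCE(data.val, 0) != 0
                       AND (ABS(COALESCE(data.pred, 0) - COALESCE(data.val, 0)) / COALESCE(data.val, 0)) > 0.1
                 THEN data.pred END) / COUNT(*) AS Result
    FROM data
""").head()[0]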
The code I have in PySpark right now is this:
from pyspark.sql.functions import abs, coalesce, count, lit

aux_1 = data.select(
    count(
        (coalesce(data["pred"], lit(0)) != 0)
        & (coalesce(data["val"], lit(0)) != 0)
        & (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0)))
           / coalesce(data["val"], lit(0)) > 0.1)
    )
)
aux_2 = aux_1.select(aux_1.column_name.cast("float"))
aux_3 = aux_2.head()[0]
Deviation = (aux_3 / data.count()) * 100
However, this simply returns the number of rows in the "data" DataFrame, and I know that isn't correct. I am very new to PySpark; can anyone help me solve this?
Answer
Your count() is counting the boolean expression itself: count() counts non-null values, and a comparison is false rather than null when the condition fails, so every row is counted. Filter the matching rows first, collect each count into a plain Python integer, and then do the division in Python:
from pyspark.sql.functions import abs, coalesce, lit

Result = data.filter(
    (coalesce(data["pred"], lit(0)) != 0)
    & (coalesce(data["val"], lit(0)) != 0)
    & (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0)))
       / coalesce(data["val"], lit(0)) > 0.1)
).count() / data.count()
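If you would rather keep it as a single aggregation that mirrors the SQL more literally, count(when(cond, ...)) reproduces COUNT(CASE WHEN ... END): when() yields null where the condition fails, and count() skips nulls. A sketch under the same column names:

from pyspark.sql.functions import abs, coalesce, count, lit, when

cond = (
    (coalesce(data["pred"], lit(0)) != 0)
    & (coalesce(data["val"], lit(0)) != 0)
    & (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0)))
       / coalesce(data["val"], lit(0)) > 0.1)
)

# when(cond, ...) is null where cond is false, and count() ignores nulls,
# so this counts only the matching rows, exactly like the SQL CASE WHEN.
Result = data.select(
    (count(when(cond, data["pred"])) / count(lit(1))).alias("Result")
).head()[0]

This version also scans the data once, instead of running two separate counting jobs.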