Skip to content
Advertisement

Converting query from SQL to pyspark

I am trying to convert the following SQL query into pyspark:

SELECT COUNT( CASE WHEN COALESCE(data.pred,0) != 0 AND COALESCE(data.val,0) != 0 
    AND (ABS(COALESCE(data.pred,0) - COALESCE(data.val,0)) / COALESCE(data.val,0)) > 0.1
    THEN data.pred END) / COUNT(*) AS Result

The code I have in PySpark right now is this:

Result = data.select(
    count(
        (coalesce(data["pred"], lit(0)) != 0) & 
        (coalesce(data["val"], lit(0)) != 0) & 
        (abs(
             coalesce(data["pred"], lit(0)) - 
             coalesce(data["val"], lit(0))
            ) / coalesce(data["val"], lit(0)) > 0.1
        )
    )
)
aux_2 = aux_1.select(aux_1.column_name.cast("float"))

aux_3 = aux_2.head()[0]

Deviation = (aux_3 / data.count())*100

However, this is simply returning the number of rows in the “data” dataframe, and I know this isn’t correct. I am very new at PySpark, can anyone help me solve this?

Advertisement

Answer

You need to collect the result into an integer, and then divide the numbers in Python:

Result = data.filter(
    (coalesce(data["pred"], lit(0)) != 0) & 
    (coalesce(data["val"], lit(0)) != 0) & 
    (abs(
         coalesce(data["pred"], lit(0)) - 
         coalesce(data["val"], lit(0))
        ) / coalesce(data["val"], lit(0)) > 0.1
    )
).count() / data.count()
Advertisement