
Translating PySpark into SQL

I’m having trouble with the following function. I’m trying to translate it into a SQL statement so I can see exactly what it’s doing and work on my actual issue more effectively.

I know that this contains a join between valid_data and ri_data, a filter, and a select, but I’m mainly having trouble understanding how to write the join piece.

    from pyspark.sql import functions as F

    result = (
        valid_data.join(
            ri_data,
            F.col(table_name + "." + column_name) == ri_data.ri_column,
            "left_outer",
        )
        # keep only the valid_data rows with no match in ri_data
        .filter("ri_column IS NULL")
        .selectExpr(
            "etl_row_id AS row_id", f"{table_name}.{column_name} AS error_value"
        )
        .distinct()
    )
Any help is appreciated.


Answer

You still have some substitutions to make (the actual table_name and column_name values for the join keys, etc.), but the general structure looks like this in SQL:

SELECT DISTINCT
    A.etl_row_id AS row_id,
    A.column_name AS error_value
FROM valid_data A
LEFT OUTER JOIN ri_data B
    ON A.column_name = B.ri_column
WHERE B.ri_column IS NULL
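
To sanity-check the translation, you can register both DataFrames as temporary views and run that SQL directly. A minimal sketch, assuming an active SparkSession named spark, and using valid_data/ri_data and column_name as placeholders for your real table_name/column_name values:

    # Placeholder names: substitute your actual table_name and column_name.
    valid_data.createOrReplaceTempView("valid_data")
    ri_data.createOrReplaceTempView("ri_data")

    result_sql = spark.sql("""
        SELECT DISTINCT
            A.etl_row_id AS row_id,
            A.column_name AS error_value
        FROM valid_data A
        LEFT OUTER JOIN ri_data B
            ON A.column_name = B.ri_column
        WHERE B.ri_column IS NULL
    """)

As an aside, a left outer join followed by a filter for NULL on the right-hand key is the classic anti-join pattern, and Spark supports it directly via the "left_anti" join type, which removes the need for the filter:

    # Equivalent anti-join: keeps only valid_data rows with no match in ri_data,
    # so the explicit IS NULL filter is unnecessary.
    result_anti = (
        valid_data.join(
            ri_data,
            F.col(table_name + "." + column_name) == ri_data.ri_column,
            "left_anti",
        )
        .selectExpr("etl_row_id AS row_id", f"{table_name}.{column_name} AS error_value")
        .distinct()
    )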