I’m experiencing an issue with the following function. I’m trying to translate this to a SQL statement so I can have a better idea of exactly what’s happening, so I can more effectively work on my actual issue.
I know that this contains a join between valid_data to ri_data, a filter, and a select statement. I’m primarily having issue understanding how to write the join piece.
result = (
valid_data.join(
ri_data,
F.col(table_name + "." + column_name) == ri_data.ri_column,
"left_outer",
)
.filter(f"ri_column IS NULL")
.selectExpr(
"etl_row_id AS row_id", f"{table_name}.{column_name} AS error_value"
)
.distinct()
Any help is appreciated.
Advertisement
Answer
You have some substitutions to do, like the column_name for the join keys, etc. But the general structure looks like this in SQL:
SELECT DISTINCT A.*, A.etl_row_id AS row_id, A.column_name AS error_value FROM valid_data A LEFT OUTER JOIN ri_data B ON A.column_name = B.ri_column WHERE B.ri_column IS NULL