Skip to content
Advertisement

Change null to empty array in databricks SQL?

I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to see if the key is possible to use, json_get() gets a key from the column or returns a default) and want to do the following:

When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.

Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?

Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.

Example table that causes error:

Example table that does not cause error:

This question seems like it might hold the answer, but I was not able to get anything working from it.

Advertisement

Answer

You can filter the table before calling json_get and explode, so that you only explode when json_get returns a non-null value:

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement