Skip to content
Advertisement

Change null to empty array in databricks SQL?

I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to see if the key is possible to use, json_get() gets a key from the column or returns a default) and want to do the following:

SELECT
  ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE 
FROM 
  JSON_TABLE 
WHERE 
  JSON_COL IS NOT NULL AND 
  json_exists(JSON_COL, 'ARRAY')==1

When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.

Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?

Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.

Example table that causes error:

| ID | JSON_COL                                                          |
| 1  | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]}  |
| 2  | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]}  |

Example table that does not cause error:

| ID | JSON_COL                                                          |
| 1  | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]}      |
| 2  | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]}  |

This question seems like it might hold the answer, but I was not able to get anything working from it.

Advertisement

Answer

You can filter the table before calling json_get and explode, so that you only explode when json_get returns a non-null value:

SELECT
  ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE 
FROM (
  SELECT *
  FROM JSON_TABLE 
  WHERE 
    JSON_COL IS NOT NULL AND 
    json_exists(JSON_COL, 'ARRAY')==1
)
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement