I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to see if the key is possible to use, json_get() gets a key from the column or returns a default) and want to do the following:
SELECT ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE FROM JSON_TABLE WHERE JSON_COL IS NOT NULL AND json_exists(JSON_COL, 'ARRAY')==1
When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.
Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?
Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.
Example table that causes error:
| ID | JSON_COL | | 1 | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]} | | 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
Example table that does not cause error:
| ID | JSON_COL | | 1 | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]} | | 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
This question seems like it might hold the answer, but I was not able to get anything working from it.
Advertisement
Answer
You can filter the table before calling json_get
and explode
, so that you only explode when json_get
returns a non-null value:
SELECT ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE FROM ( SELECT * FROM JSON_TABLE WHERE JSON_COL IS NOT NULL AND json_exists(JSON_COL, 'ARRAY')==1 )