Skip to content
Advertisement

Pyspark, iteratively get values from column containing json string

I wonder how you would iteratively get the values from a json string in pyspark. I have the following format of my data and would like to create the “value” column:

id_1 id_2 json_string value
1 1001 {“1001”:106, “2200”:101} 106
1 2200 {“1001”:106, “2200”:101} 101
df_2 = df.withColumn
(
'value', get_json_object(col('json_string'), concat(lit('$.'), col('id_2')))
)

Which gives the error Column is not iterable

However, just inserting the key manually works, i.e:

df_2 = df.withColumn
(
'value', get_json_object(col('json_string'), '$.1001')))
)

Any tips on solving this problem? It is not possible to manually insert the “id_2” values since there are many thousands of keys within the dataset and the json_string is in reality much longer with many more key-value pairs.

Super thankful for any suggestions!
Regards

Advertisement

Answer

You can use it within expr() which would allow you to concat the string and id_2.

data_ls = [
    ("1", "1001", '''{"1001":106, "2200":101}'''), 
    ("1", "2200", '''{"1001":106, "2200":101}''')
]

data_sdf = spark.createDataFrame(data_ls, ("id1", "id2", "jstr"))

# +---+----+--------------------+
# |id1| id2|                jstr|
# +---+----+--------------------+
# |  1|1001|{"1001":106, "220...|
# |  1|2200|{"1001":106, "220...|
# +---+----+--------------------+

data_sdf. 
    withColumn('val', func.expr('get_json_object(jstr, concat("$.", id2))')). 
    show(truncate=False)

# +---+----+------------------------+---+
# |id1|id2 |jstr                    |val|
# +---+----+------------------------+---+
# |1  |1001|{"1001":106, "2200":101}|106|
# |1  |2200|{"1001":106, "2200":101}|101|
# +---+----+------------------------+---+
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement