I have a web scraper dumping data into a variant column in a Snowflake database. This is acraping page data as then creates json arrays for various tables found within the page.
Here is an example of the type of json i would find using a Soccer analogy:
{ "dom_url": "https://www.soccertables.com/european_tables", "event_id": "01b2722a-d8e6-4f67-95d0-8dd7ba088a4a", "event_utc_time": "2020-05-11 09:01:14.821", "ip_address": "125.238.134.96", "table_1": [ { "position": "1", "team_name": "Liverpool", "games_played": "29", "games_won": "26", "games_drawn": "2", "games_lost": "1", "goals_for": "75", "goals_against": "35" "points": "80" }, { "position": "2", "team_name": "Man. City", "games_played": "29", "games_won": "20", "games_drawn": "5", "games_lost": "4", "goals_for": "60", "goals_against": "45" "points": "65" }, { "position": "...", "team_name": "...", "games_played": "...", "games_won": "...", "games_drawn": "...", "games_lost": "...", "goals_for": "...", "goals_against": "..." "points": "..." } ], "table_2": [ { "position": "1", "team_name": "Bayern Munich", "games_played": "29", "games_won": "26", "games_drawn": "2", "games_lost": "1", "goals_for": "75", "goals_against": "35" "points": "80" }, { "position": "2", "team_name": "Bayer Leverkussen", "games_played": "29", "games_won": "20", "games_drawn": "5", "games_lost": "4", "goals_for": "60", "goals_against": "45" "points": "65" }, { "position": "...", "team_name": "...", "games_played": "...", "games_won": "...", "games_drawn": "...", "games_lost": "...", "goals_for": "...", "goals_against": "..." "points": "..." } ], "referrer_url": "https://www.soccertables.com", }
Ideally, i’d like the output of this to be a flat, relational table:
table_name position team_name games_played etc… table_1 1 Liverpool 29 … table_1 2 Man. City 29 … table_2 1 Bayern Munich 29 … ….
I know that if i were only interested in table_1 i could do this:
SELECT v.value:position::NUMBER POSITION , v.value:team_name::STRING TEAM_NAME , v.value:games_played::NUMBER GAMES_PLAYED , ... FROM JSON_TABLE a1, LATERAL FLATTEN(JSON_DATA:table_1) v
and that i could do the same for table_2 and union them, but there can be N possibilities with regards to the table_N placeholder.
I’ve looked at doing LATERAL FLATTEN multiple times:
SELECT v.value:position::NUMBER POSITION , v.value:team_name::STRING TEAM_NAME , v.value:games_played::NUMBER GAMES_PLAYED , ... FROM JSON_TABLE a1, LATERAL FLATTEN(JSON_DATA:table_1) v, LATERAL FLATTEN(JSON_DATA:table_2) v2
But this results in duplication of data, and does not allow me to put each tables columns all in a single relational structure.
I’m sure there is something simple that i am missing here, but i’ve reached a point where i think i’ve been staring at this too long, and just can’;t see it.
Thanks in advance, S
Advertisement
Answer
If you are trying to create a single, flattened view of the table_n data, as well as the attributes of at the first level, then something like this would work.
WITH x AS ( SELECT '{ "dom_url": "https://www.soccertables.com/european_tables", "event_id": "01b2722a-d8e6-4f67-95d0-8dd7ba088a4a", "event_utc_time": "2020-05-11 09:01:14.821", "ip_address": "125.238.134.96", "table_1": [ { "position": "1", "team_name": "Liverpool", "games_played": "29", "games_won": "26", "games_drawn": "2", "games_lost": "1", "goals_for": "75", "goals_against": "35", "points": "80" }, { "position": "2", "team_name": "Man. City", "games_played": "29", "games_won": "20", "games_drawn": "5", "games_lost": "4", "goals_for": "60", "goals_against": "45", "points": "65" }, { "position": "...", "team_name": "...", "games_played": "...", "games_won": "...", "games_drawn": "...", "games_lost": "...", "goals_for": "...", "goals_against": "...", "points": "..." } ], "table_2": [ { "position": "1", "team_name": "Bayern Munich", "games_played": "29", "games_won": "26", "games_drawn": "2", "games_lost": "1", "goals_for": "75", "goals_against": "35", "points": "80" }, { "position": "2", "team_name": "Bayer Leverkussen", "games_played": "29", "games_won": "20", "games_drawn": "5", "games_lost": "4", "goals_for": "60", "goals_against": "45", "points": "65" }, { "position": "...", "team_name": "...", "games_played": "...", "games_won": "...", "games_drawn": "...", "games_lost": "...", "goals_for": "...", "goals_against": "...", "points": "..." } ], "referrer_url": "https://www.soccertables.com", }' as var) SELECT parse_json(x.var):dom_url::string, parse_json(x.var):event_id::string, parse_json(x.var):event_utc_time::string, parse_json(x.var):ip_address::string, x3.value:games_drawn::string, x3.value:games_lost::string, x3.value:games_played::string, x3.value:games_won::string, x3.value:goals_against::string, x3.value:goals_for::string, x3.value:points::string, x3.value:position::string, x3.value:team_name::string FROM x ,LATERAL FLATTEN(parse_json(x.var)) x2 ,LATERAL FLATTEN(X2.VALUE) x3;
The CTE is obviously just to show the example with the sample JSON you provided. If you care about which records came from which table, you can also include x2.key
as an element in your SELECT
.