I have a big nested struct in AWS Athena. Here’s one column named “petowners” in the table:
{_id=5e6b531a412345e0e86aeae0, status=NotAnalyzed, animalcategories=[{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}], ...many-other-values}
I’m looking for:
- The equivalent of the python function
value_counts
in the column. Meaning I’m looking for the SQL Athena command that will output for this row:[mammals:1, birds:2, UnknownField:4]
- A way to query aggregation – to create a histogram of total number
of pets per owner for this
row = 7
- How many pet owners have an
UnknownField
in ‘animalycategories’ - How many animal types are there in the entire table?
Advertisement
Answer
Here’s the beginning of the solution: Let’s call the table “entire_table”
SELECT t.entire_table._id, t.petowners.animalcategories, ac.categoryname, ac.matches FROM entire_table t, UNNEST(t.petowners.animalcategories) AS t(ac)
This query will output a table with columns named “categoryname” and “matches”, where each row is duplicated for as many category names as there are for each user_id:
| _id | animalcategories | categoryname | matches | |--------------------------|---------------------------------------------------------------------------------------------------------------|--------------|---------| | 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | mammals | 1 | | 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | birds | 2 | | 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | UnknownField | 4 |
And here are the most relevant links by order of importance that enabled the solution:
- A similar question in stackoverflow
- Presto documentation showing Lambda Expressions and Functions which are another way to work with nested structs
- AWS explaining about “Querying Arrays with Complex Types and Nested Structures”
- A good blog read from Joe Celko about “Nesting levels in SQL”
- SQL original paper from 1970 IBM research by E.F.CODD added for the sake of “being pretty” and as a token of respect
- SQL pdf HUGE manual – a bit of an overkill but under “Query expressions” at page 323 I look for the answers I can’t seem to find anywhere else
Going down the rabbit hole I encountered some less helpful links that I find worth mentioning, for the sake of this thorough review I’ll add them here:
- AWS Athena forum – many good questions, yet sadly few answers
- Presto google group – focused on the engineering part, not many answers as well
I hope someone someday will find this post useful and get themselves a shortcut from a few hours of browsing the web for answers I had to go through. Good luck.