Skip to content
Advertisement

Understanding the tables referred to in a BigQuery array cross join

I know the namespace isn’t the correct term for this but it conveys what I’m trying to understand. Take this query:

WITH Movies AS (
  SELECT 'Titanic' AS title,
  1997 AS year, 
  ['Drama',' Romance'] AS Genres
)
SELECT title, year, Movies.Genres FROM Movies, Movies.Genres where Genres='Drama'

enter image description here

But then using Movies.Genres instead of Genres gives a different result:

WITH Movies AS (
  SELECT 'Titanic' AS title,
  1997 AS year, 
  ['Drama',' Romance'] AS Genres
)
SELECT title, year,Genres FROM Movies, Movies.Genres where Genres='Drama'

enter image description here

Why does that occur? Additionally, is this standard (ISO) SQL, or is this just how BigQuery performs array-unnesting and joins?

Advertisement

Answer

When you need to use array in cross (or other types of) join – you have below options

  1. from tableA cross join unnest(array) as element
  2. from tableA, unnest(array) as element — here comma is actually shortcut for cross join
  3. if array is a column of tableA, you can use yet another shortcut (now for unnest)
    from tableA, tableA.array as element

the benefit of using UNNEST is that you can define OFFSET – like in below example

from tableA, unnest(array) as element with offset      

Having offset is extremely important in many use cases
In my practice – i am using all of above options depends on specific case

Now, as of difference in two queries:

So, the first query

SELECT title, year, Movies.Genres 
FROM Movies, Movies.Genres 
where Genres='Drama'    

is equivalent to below (notice aliase)

SELECT title, year, Movies.Genres 
FROM Movies, Movies.Genres as Genres
where Genres='Drama' 

Genres in where Genres='Drama' refers to element from implicitly unnested Movies.Genres – so one of element is Drama which returns 1 element out of total two
And, now trick is that in select statement you explicitly calling out Movies.Genres which is the original array in that row – so that explains the output

You can try below to kind of confirm above explanation

SELECT title, year, Movies.Genres
FROM Movies, Movies.Genres as Genres
where Genres in ('Drama', ' Romance')

for above output will be

enter image description here

While above explained the output for first query – I hope it is now clear why second query returns actual genre (Drama) instead of array

Hope this helped in understanding differences :o)

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement