Skip to content
Advertisement

Big Query Standard SQL using Partition By with the ARRAY_AGG() function

I am trying to use the PARTITION BY clause with the ARRAY_AGG() function to collapse a column into an array.

My Standard SQL in Big Query is as follows:

        WITH initial_30days
           AS (
          SELECT 
            date,
            fullvisitorId AS user_id,
            visitNumber, 
            CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
          FROM
            `my-data.XXXXXXX.ga_sessions_*`
            WHERE _TABLE_SUFFIX BETWEEN '20181004' AND  '20181103'
            GROUP BY 1,2,3,4
            )

          SELECT
            date,
            ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING 
            AND CURRENT ROW) AS agg_array
          FROM(

          SELECT
            date,
            user_id,
            COUNT(DISTINCT( session_id))  AS sessions
            FROM initial_30days
            GROUP BY date,user_id) 
            GROUP BY date,sessions

My expected output is:

+----------+--------------------------+
|   date   |        agg_array         |
+----------+--------------------------+
| 20181004 | [34,21,34,21,6,7,4,43]   |
| 20181005 | [1,5,56,76,23,1,3,54,45] |
| 20181006 | [22,67,43,1,2,67,3,24]   |
| 20181007 | [34,21,34,21,6,7,4,43]   |
+----------+--------------------------+

My current output looks something like this taking one date value for example:

+----------+------------------------+
|   date   |       agg_array        |
+----------+------------------------+
| 20181004 | [34]                   |
| 20181004 | [34,21]                |
| 20181004 | [34,21,34]             |
| 20181004 | [34,21,34,21]          |
| 20181004 | [34,21,34,21,6]        |
| 20181004 | [34,21,34,21,6,7]      |
| 20181004 | [34,21,34,21,6,7,4]    |
| 20181004 | [34,21,34,21,6,7,4,43] |
+----------+------------------------+

You can see the array partitioned by date creates an incremental row for each value of that array.

The dataset that the ARRAY_AGG() function is applied over looks like:

+----------+------------------+----------+
|   date   |     user_id      | sessions |
+----------+------------------+----------+
| 20181004 | 2526262363754747 |       34 |
| 20181004 | 2525626325173256 |       21 |
| 20181004 | 7436783255747736 |       34 |
| 20181004 | 6526241526363536 |       21 |
| 20181004 | 4252636353637423 |        6 |
| 20181004 | 3636325636673563 |        7 |
+----------+------------------+----------+

I get the feeling its because I’m grouping by sessions above but that is because I get a validation error like so if I don’t:

    SELECT list expression references column sessions which is 
neither grouped nor aggregated at 

Advertisement

Answer

Below is for BigQuery Standard SQL

Just add below around your original query

SELECT date, 
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
  ...   
  ...   
)
GROUP BY date   

So, the whole stuff will look like below (and will produce desired result – while preserving your idea of using windowed functions)

#standardSQL
WITH initial_30days AS (
  SELECT 
    date,
    fullvisitorId AS user_id,
    visitNumber, 
    CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
  FROM `my-data.XXXXXXX.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20181004' AND  '20181103'
  GROUP BY 1,2,3,4
)
SELECT date, 
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT
    date, 
    ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
  FROM(
    SELECT
      date,
      user_id,
      COUNT(DISTINCT( session_id))  AS sessions
    FROM initial_30days
    GROUP BY date,user_id
  )
  GROUP BY date,sessions
)
GROUP BY date   
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement