I am trying to use the PARTITION BY
clause with the ARRAY_AGG()
function to collapse a column into an array.
My Standard SQL in Big Query is as follows:
WITH initial_30days
AS (
SELECT
date,
fullvisitorId AS user_id,
visitNumber,
CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
FROM
`my-data.XXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103'
GROUP BY 1,2,3,4
)
SELECT
date,
ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING
AND CURRENT ROW) AS agg_array
FROM(
SELECT
date,
user_id,
COUNT(DISTINCT( session_id)) AS sessions
FROM initial_30days
GROUP BY date,user_id)
GROUP BY date,sessions
My expected output is:
+----------+--------------------------+
| date | agg_array |
+----------+--------------------------+
| 20181004 | [34,21,34,21,6,7,4,43] |
| 20181005 | [1,5,56,76,23,1,3,54,45] |
| 20181006 | [22,67,43,1,2,67,3,24] |
| 20181007 | [34,21,34,21,6,7,4,43] |
+----------+--------------------------+
My current output looks something like this taking one date value for example:
+----------+------------------------+
| date | agg_array |
+----------+------------------------+
| 20181004 | [34] |
| 20181004 | [34,21] |
| 20181004 | [34,21,34] |
| 20181004 | [34,21,34,21] |
| 20181004 | [34,21,34,21,6] |
| 20181004 | [34,21,34,21,6,7] |
| 20181004 | [34,21,34,21,6,7,4] |
| 20181004 | [34,21,34,21,6,7,4,43] |
+----------+------------------------+
You can see the array partitioned by date creates an incremental row for each value of that array.
The dataset that the ARRAY_AGG()
function is applied over looks like:
+----------+------------------+----------+
| date | user_id | sessions |
+----------+------------------+----------+
| 20181004 | 2526262363754747 | 34 |
| 20181004 | 2525626325173256 | 21 |
| 20181004 | 7436783255747736 | 34 |
| 20181004 | 6526241526363536 | 21 |
| 20181004 | 4252636353637423 | 6 |
| 20181004 | 3636325636673563 | 7 |
+----------+------------------+----------+
I get the feeling its because I’m grouping by sessions
above but that is because I get a validation error like so if I don’t:
SELECT list expression references column sessions which is
neither grouped nor aggregated at
Advertisement
Answer
Below is for BigQuery Standard SQL
Just add below around your original query
SELECT date,
ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
)
GROUP BY date
So, the whole stuff will look like below (and will produce desired result – while preserving your idea of using windowed functions)
#standardSQL
WITH initial_30days AS (
SELECT
date,
fullvisitorId AS user_id,
visitNumber,
CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
FROM `my-data.XXXXXXX.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103'
GROUP BY 1,2,3,4
)
SELECT date,
ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT
date,
ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
FROM(
SELECT
date,
user_id,
COUNT(DISTINCT( session_id)) AS sessions
FROM initial_30days
GROUP BY date,user_id
)
GROUP BY date,sessions
)
GROUP BY date