I am trying to use the PARTITION BY
clause with the ARRAY_AGG()
function to collapse a column into an array.
My Standard SQL in Big Query is as follows:
WITH initial_30days AS ( SELECT date, fullvisitorId AS user_id, visitNumber, CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id FROM `my-data.XXXXXXX.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103' GROUP BY 1,2,3,4 ) SELECT date, ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array FROM( SELECT date, user_id, COUNT(DISTINCT( session_id)) AS sessions FROM initial_30days GROUP BY date,user_id) GROUP BY date,sessions
My expected output is:
+----------+--------------------------+ | date | agg_array | +----------+--------------------------+ | 20181004 | [34,21,34,21,6,7,4,43] | | 20181005 | [1,5,56,76,23,1,3,54,45] | | 20181006 | [22,67,43,1,2,67,3,24] | | 20181007 | [34,21,34,21,6,7,4,43] | +----------+--------------------------+
My current output looks something like this taking one date value for example:
+----------+------------------------+ | date | agg_array | +----------+------------------------+ | 20181004 | [34] | | 20181004 | [34,21] | | 20181004 | [34,21,34] | | 20181004 | [34,21,34,21] | | 20181004 | [34,21,34,21,6] | | 20181004 | [34,21,34,21,6,7] | | 20181004 | [34,21,34,21,6,7,4] | | 20181004 | [34,21,34,21,6,7,4,43] | +----------+------------------------+
You can see the array partitioned by date creates an incremental row for each value of that array.
The dataset that the ARRAY_AGG()
function is applied over looks like:
+----------+------------------+----------+ | date | user_id | sessions | +----------+------------------+----------+ | 20181004 | 2526262363754747 | 34 | | 20181004 | 2525626325173256 | 21 | | 20181004 | 7436783255747736 | 34 | | 20181004 | 6526241526363536 | 21 | | 20181004 | 4252636353637423 | 6 | | 20181004 | 3636325636673563 | 7 | +----------+------------------+----------+
I get the feeling its because I’m grouping by sessions
above but that is because I get a validation error like so if I don’t:
SELECT list expression references column sessions which is neither grouped nor aggregated at
Advertisement
Answer
Below is for BigQuery Standard SQL
Just add below around your original query
SELECT date, ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].* FROM ( ... ... ) GROUP BY date
So, the whole stuff will look like below (and will produce desired result – while preserving your idea of using windowed functions)
#standardSQL WITH initial_30days AS ( SELECT date, fullvisitorId AS user_id, visitNumber, CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id FROM `my-data.XXXXXXX.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103' GROUP BY 1,2,3,4 ) SELECT date, ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].* FROM ( SELECT date, ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array FROM( SELECT date, user_id, COUNT(DISTINCT( session_id)) AS sessions FROM initial_30days GROUP BY date,user_id ) GROUP BY date,sessions ) GROUP BY date