I have some data that I am trying to aggregate (greatly simplified here). The raw data uses a schema similar to the following:
UserID - STRING A - RECORD REPEATED A.Action - STRING A.Visit - INTEGER A.Order - INTEGER MISC - RECORD REPEATED ( other columns omitted here )
There are many actual records due to the “MISC” column, but I’m only trying to focus on the first 5 columns shown above. A sample of the raw data is shown below (note that the values shown are a sample only, many other values exist so these cannot be hard coded into the query) :
Table 0: (Raw data sample)
(empty values under UserID are as shown in BiqQuery – “A” fields are part of a nested record)
My query produces the data shown in Table 1 below. I am trying to use ARRAY_AGG with ORDINAL to select only the first two “Action”s for each user and restructure as shown in TABLE 2.
SELECT UserId, ARRAY_AGG( STRUCT(A.Action, A.Visit, A.Order) ORDER BY A.Visit, A.Order, A.Action ) FROM `table` LEFT JOIN UNNEST(A) AS A GROUP BY UserId
Table 1: (Sample output of above query )
Table 2: (The format needed)
So I need to:
- Get distinct “Action” values for each user
- Preserve the order ( UserID, Visit, Order )
- Show only the 1st and 2nd actions in one row
My attempted query strategy was to ORDER BY UserID, Visit, Order and get DISTINCT values of Action using something like:
UserId, ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) FirstAction, ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) SecondAction
However, that approach produces the following error:
Error: An aggregate function that has both DISTINCT and ORDER BY arguments can only ORDER BY columns that are arguments to the function
Any thoughts on how to correct this error (or an alternative approach?)
Advertisement
Answer
Not sure why the original query has DISTINCT
, if the results shown in table 2 don’t need de-duplication.
With that said:
#standardSQL WITH sample AS ( SELECT actor.login userid, type action , EXTRACT(HOUR FROM created_at) visit , EXTRACT(MINUTE FROM created_at) `order` FROM `githubarchive.day.20171005` ) SELECT userid, actions[OFFSET(0)] firstaction, actions[SAFE_OFFSET(1)] secondaction FROM ( SELECT userid, ARRAY_AGG(action ORDER BY visit, `order` LIMIT 2) actions FROM sample GROUP BY 1 ORDER BY 1 LIMIT 100 )