Skip to content
Advertisement

Using ARRAY_AGG() with DISTINCT and ORDER BY with ORDINAL

I have some data that I am trying to aggregate (greatly simplified here). The raw data uses a schema similar to the following:

UserID - STRING
A - RECORD REPEATED
A.Action - STRING
A.Visit - INTEGER
A.Order - INTEGER
MISC - RECORD REPEATED
( other columns omitted here )

There are many actual records due to the “MISC” column, but I’m only trying to focus on the first 5 columns shown above. A sample of the raw data is shown below (note that the values shown are a sample only, many other values exist so these cannot be hard coded into the query) :

Table 0: (Raw data sample)

(empty values under UserID are as shown in BiqQuery – “A” fields are part of a nested record)

Table0

My query produces the data shown in Table 1 below. I am trying to use ARRAY_AGG with ORDINAL to select only the first two “Action”s for each user and restructure as shown in TABLE 2.

SELECT
  UserId, ARRAY_AGG( STRUCT(A.Action, A.Visit, A.Order)
          ORDER BY A.Visit, A.Order, A.Action ) 
  FROM
    `table` 
  LEFT JOIN UNNEST(A) AS A
GROUP BY
    UserId

Table 1: (Sample output of above query )

Table1

Table 2: (The format needed)

Table2

So I need to:

  1. Get distinct “Action” values for each user
  2. Preserve the order ( UserID, Visit, Order )
  3. Show only the 1st and 2nd actions in one row

My attempted query strategy was to ORDER BY UserID, Visit, Order and get DISTINCT values of Action using something like:

UserId,
ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) FirstAction,
ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) SecondAction

However, that approach produces the following error:

Error: An aggregate function that has both DISTINCT and ORDER BY arguments can only ORDER BY columns that are arguments to the function

Any thoughts on how to correct this error (or an alternative approach?)

Advertisement

Answer

Not sure why the original query has DISTINCT, if the results shown in table 2 don’t need de-duplication.

With that said:

#standardSQL
WITH sample AS (
  SELECT actor.login userid, type action
    , EXTRACT(HOUR FROM created_at) visit
    , EXTRACT(MINUTE FROM created_at) `order`
  FROM `githubarchive.day.20171005` 
)

SELECT userid, actions[OFFSET(0)] firstaction, actions[SAFE_OFFSET(1)] secondaction
FROM (
  SELECT userid, ARRAY_AGG(action ORDER BY visit, `order` LIMIT 2) actions
  FROM sample
  GROUP BY 1
  ORDER BY 1
  LIMIT 100
)

enter image description here

User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement