Standard SQL – How to count frequency of values in array

I get the following Table with the query underneath:

result of SQL query - screenshot

SELECT 
  fullVisitorId,
  COUNT(fullVisitorId) as id_count,
  ARRAY_AGG(trafficSource.medium) AS trafic_medium
FROM 
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`
GROUP BY
  fullVisitorId
ORDER BY
  id_count DESC

​x
 
SELECT   fullVisitorId,  COUNT(fullVisitorId) as id_count,  ARRAY_AGG(trafficSource.medium) AS trafic_mediumFROM   `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`GROUP BY  fullVisitorIdORDER BY  id_count DESC​

For each of the values in the trafic_medium column (e.g.: cpc, referral, organic, etc.) I am trying to figure out how often each value occurred in the array, so preferably add a new column ‘count’ that shows ho often that value occurred?

+-----------+---------+------+
| array_agg | medium  | count|
+-----------+---------+------+
| 123       | cpc     |   2  |
+-----------+---------+------+
|           | organic |   1  |
+-----------+---------+------+
|           | cpc     |   2  |
+-----------+---------+------+
| 456       | organic |   2  |
+-----------+---------+------+
|           | organic |   2  |
+-----------+---------+------+
|           | cpc     |   1  |
+-----------+---------+------+

 
+-----------+---------+------+| array_agg | medium  | count|+-----------+---------+------+| 123       | cpc     |   2  |+-----------+---------+------+|           | organic |   1  |+-----------+---------+------+|           | cpc     |   2  |+-----------+---------+------+| 456       | organic |   2  |+-----------+---------+------+|           | organic |   2  |+-----------+---------+------+|           | cpc     |   1  |+-----------+---------+------+​

I’m new to SQL so I’m quite stuck.

I tried this so far:

WITH medium AS
(
    SELECT 
        fullVisitorId,
        COUNT(fullVisitorId) as id_count,
        ARRAY_AGG(trafficSource.medium) AS trafic_medium
    FROM 
        `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`
    GROUP BY
        fullVisitorId
    ORDER BY
        id_count DESC
) 
SELECT
    fullVisitorId,
    trafic_medium,
    (SELECT AS STRUCT Any_Value(trafic_medium) AS name, COUNT(*) AS count
FROM 
    UNNEST(trafic_medium) AS trafic_medium) AS trafic_medium_2,
FROM 
    medium

 
WITH medium AS(    SELECT         fullVisitorId,        COUNT(fullVisitorId) as id_count,        ARRAY_AGG(trafficSource.medium) AS trafic_medium    FROM         `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`    GROUP BY        fullVisitorId    ORDER BY        id_count DESC) SELECT    fullVisitorId,    trafic_medium,    (SELECT AS STRUCT Any_Value(trafic_medium) AS name, COUNT(*) AS countFROM     UNNEST(trafic_medium) AS trafic_medium) AS trafic_medium_2,FROM     medium​

Based on this thread: How to count frequency of elements in a bigquery array field

However this only shows the number of ‘Any_Value not for all distinct.

I would appreciate some help!

p.s. I am doing this in BigQuery on the ‘bigquery-public-dataset.google_analytics_sample’

Answer

Below is for BigQuery Standard SQL to help you get started

#standardSQL
SELECT id, trafic_medium,
  ARRAY(
    SELECT AS STRUCT medium, COUNT(1) `count`
    FROM t.trafic_medium medium
    GROUP BY medium
  ) stats
FROM `project.dataset.table` t

 
#standardSQLSELECT id, trafic_medium,  ARRAY(    SELECT AS STRUCT medium, COUNT(1) `count`    FROM t.trafic_medium medium    GROUP BY medium  ) statsFROM `project.dataset.table` t​

if to apply to sample/dummy data from you question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 123 id, ['cpc', 'organic', 'cpc'] trafic_medium UNION ALL
  SELECT 456, ['organic', 'organic', 'cpc']
)
SELECT id, trafic_medium,
  ARRAY(
    SELECT AS STRUCT medium, COUNT(1) `count`
    FROM t.trafic_medium medium
    GROUP BY medium
  ) stats
FROM `project.dataset.table` t
-- ORDER BY id

 
#standardSQLWITH `project.dataset.table` AS (  SELECT 123 id, ['cpc', 'organic', 'cpc'] trafic_medium UNION ALL  SELECT 456, ['organic', 'organic', 'cpc'])SELECT id, trafic_medium,  ARRAY(    SELECT AS STRUCT medium, COUNT(1) `count`    FROM t.trafic_medium medium    GROUP BY medium  ) statsFROM `project.dataset.table` t-- ORDER BY id   ​

result will be

As an option – you can use below version

#standardSQL
SELECT id, 
  ARRAY(
    SELECT AS STRUCT medium, `count`
    FROM t.trafic_medium medium
    LEFT JOIN (
      SELECT AS STRUCT medium, COUNT(1) `count`
      FROM t.trafic_medium medium
      GROUP BY medium
    ) stats
    USING(medium) 
  ) trafic_medium  
FROM `project.dataset.table` t
-- ORDER BY id

 
#standardSQLSELECT id,   ARRAY(    SELECT AS STRUCT medium, `count`    FROM t.trafic_medium medium    LEFT JOIN (      SELECT AS STRUCT medium, COUNT(1) `count`      FROM t.trafic_medium medium      GROUP BY medium    ) stats    USING(medium)   ) trafic_medium  FROM `project.dataset.table` t-- ORDER BY id   ​

which (if to apply to the same dummy data) will output below

This version looks more aligned with your expected result

Advertisement

Answer