Skip to content
Advertisement

Bigquery: Select top 3 with Group By condition

I have a table like this

type    Total
 A       100
 A       123
 A       154
 A       50
 A       54
 B       200
 B       166
 B       423
 B       342
 B       213
 C       520
 C       130
 C       234
 C       512

I want to select the top 3 total by groups. How can i do it?

Advertisement

Answer

In most [big data] use cases using ROW_NUMBER() is not fine as it ends up with resource exceeded error. This is because it requires all point of same group be present in same/one node which in case of data skew leads to above mentioned error in BigQuery

Option 1

One of the usual ways to address this issue is using ARRAY_AGG() function as it is in below below example

#standardSQL
SELECT type, total FROM (
  SELECT type, ARRAY_AGG(total ORDER BY total DESC LIMIT 3) arr
  FROM `project.dataset.table` GROUP BY type
), UNNEST(arr) total

If to run above against data example from your question

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'A' type, 100 total UNION ALL
  SELECT 'A', 123 UNION ALL
  SELECT 'A', 154 UNION ALL
  SELECT 'A', 50 UNION ALL
  SELECT 'A', 54 UNION ALL
  SELECT 'B', 200 UNION ALL
  SELECT 'B', 166 UNION ALL
  SELECT 'B', 423 UNION ALL
  SELECT 'B', 342 UNION ALL
  SELECT 'B', 213 UNION ALL
  SELECT 'C', 520 UNION ALL
  SELECT 'C', 130 UNION ALL
  SELECT 'C', 234 UNION ALL
  SELECT 'C', 512 
)
SELECT type, total FROM (
  SELECT type, ARRAY_AGG(total ORDER BY total DESC LIMIT 3) arr
  FROM `project.dataset.table` GROUP BY type
), UNNEST(arr) total
-- ORDER BY type   

you will get expected result as

Row type    total    
1   A       154  
2   A       123  
3   A       100  
4   B       423  
5   B       342  
6   B       213  
7   C       520  
8   C       512  
9   C       234    

Option 2

But there is yet another interesting option to consider for really big data – to use APPROX_TOP_SUM() function as in below example

#standardSQL
SELECT type, value AS total FROM (
  SELECT type, APPROX_TOP_SUM(total, total, 3) arr
  FROM `project.dataset.table` GROUP BY type
), UNNEST(arr)  

obviously, with the same output as above for sample data

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement