Skip to content
Advertisement

calculating percentiles in aws athena

result of my query is being used in aws quicksight. even though quicksight offers percentileCont() which does the job for us I want to use it in the query instead of using calculated field.

eventually what I want to do is create a point column where

under 25percentile -> 0
under 50 percentile -> 1 
under 75 percentile -> 2
rest -> 3

depending on a column that ranges from [a, b].

Right now I find out value at each percentile and manually create a range

With table as (
    SELECT *
         , cast(date_diff('day', last_transaction, current_date) as double) as col
)
SELECT *
     , case 
         when col between 0 and 25 then 0
         when col between 26 and 66 then 1
         when col between 67 and 193 then 2
         when col >= 194 then 3
       end as point
  FROM table;

however I want to make it dynamic so instead of [0,25] would be something like [min(col), 25percentile(col)].

above query outputs

col   point
333     3
166     2
 96     1
 .

With NTILE() added Thanks to @Gordon Linoff

With table as (
    SELECT *
         , cast(date_diff('day', last_transaction, current_date) as double) as col
)
SELECT *
     , case 
         when col between 0 and 25 then 0
         when col between 26 and 66 then 1
         when col between 67 and 193 then 2
         when col >= 194 then 3
       end as point
      , NTILE(4) over(order by col) as pt
  FROM table;

outputs

col   point
0     1
0     1
0     1
 .

seems to mess up col calculation

Advertisement

Answer

You are pretty much describing the ntile() function:

SELECT t.*,,
       NTILE(4) OVER (ORDER BY col) - 1 as point
FROM table;

Two caveats:

  • NTILE(<n>) returns values between 1 and n
  • NTILE() makes sure the resulting tiles are equal. That means that values on the boundaries could end up in different bins.

An alternative that puts values in separate bins (but the bins might have different sizes) is percent_rank(). In your case:

SELECT t.*,,
       CEILING(PRECENT_RANK() OVER (ORDER BY col) * 4) - 1 as point
FROM table;
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement