calculating percentiles in aws athena

result of my query is being used in aws quicksight. even though quicksight offers percentileCont() which does the job for us I want to use it in the query instead of using calculated field.

eventually what I want to do is create a point column where

under 25percentile -> 0
under 50 percentile -> 1 
under 75 percentile -> 2
rest -> 3

​x
 
under 25percentile -> 0under 50 percentile -> 1 under 75 percentile -> 2rest -> 3​

depending on a column that ranges from [a, b].

Right now I find out value at each percentile and manually create a range

With table as (
    SELECT *
         , cast(date_diff('day', last_transaction, current_date) as double) as col
)
SELECT *
     , case 
         when col between 0 and 25 then 0
         when col between 26 and 66 then 1
         when col between 67 and 193 then 2
         when col >= 194 then 3
       end as point
  FROM table;

 
With table as (    SELECT *         , cast(date_diff('day', last_transaction, current_date) as double) as col)SELECT *     , case          when col between 0 and 25 then 0         when col between 26 and 66 then 1         when col between 67 and 193 then 2         when col >= 194 then 3       end as point  FROM table;​

however I want to make it dynamic so instead of [0,25] would be something like [min(col), 25percentile(col)].

above query outputs

col   point
333     3
166     2
 96     1
 .

 
col   point333     3166     2 96     1 .​

With NTILE() added Thanks to @Gordon Linoff

With table as (
    SELECT *
         , cast(date_diff('day', last_transaction, current_date) as double) as col
)
SELECT *
     , case 
         when col between 0 and 25 then 0
         when col between 26 and 66 then 1
         when col between 67 and 193 then 2
         when col >= 194 then 3
       end as point
      , NTILE(4) over(order by col) as pt
  FROM table;

 
With table as (    SELECT *         , cast(date_diff('day', last_transaction, current_date) as double) as col)SELECT *     , case          when col between 0 and 25 then 0         when col between 26 and 66 then 1         when col between 67 and 193 then 2         when col >= 194 then 3       end as point      , NTILE(4) over(order by col) as pt  FROM table;​

outputs

col   point
0     1
0     1
0     1
 .

 
col   point0     10     10     1 .​

seems to mess up col calculation

Answer

You are pretty much describing the ntile() function:

SELECT t.*,,
       NTILE(4) OVER (ORDER BY col) - 1 as point
FROM table;

 
SELECT t.*,,       NTILE(4) OVER (ORDER BY col) - 1 as pointFROM table;​

Two caveats:

NTILE(<n>) returns values between 1 and n
NTILE() makes sure the resulting tiles are equal. That means that values on the boundaries could end up in different bins.

An alternative that puts values in separate bins (but the bins might have different sizes) is percent_rank(). In your case:

SELECT t.*,,
       CEILING(PRECENT_RANK() OVER (ORDER BY col) * 4) - 1 as point
FROM table;

 
SELECT t.*,,       CEILING(PRECENT_RANK() OVER (ORDER BY col) * 4) - 1 as pointFROM table;​

Advertisement

Answer