Skip to content
Advertisement

Calculate mode() on an array column without skewing averages in PostgreSQL

I’ve designed a table to keep track of running processes:

                                Table "public.processes"
         Column          |           Type           | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+---------
 id                      | uuid                     |           | not null |
 duration                | bigint                   |           |          |
 pauses                  | bigint                   |           |          |
 start_date              | timestamp with time zone |           |          |
 end_date                | timestamp with time zone |           |          |
 power_levels            | integer[]                |           |          |

Where pauses is the number of times the process has haulted and power_levels is an array of integers from 0 to 4, with repetitions allowed, representing power levels that that process has been consuming.

Using a single query I would like to select all processes executed within a span of a certain week and display:

  1. A breakdown of the average duration, per day of week
  2. The maximum number of pauses within a single process, throughout the week
  3. The most commonly used power level used throughout the week

So given example data:

       start_date       |        end_date        | power_levels | duration | pauses
------------------------+------------------------+--------------+----------+-------
 2020-06-06 10:00:00+00 | 2020-06-06 10:10:00+00 | {3}          |     1000 |     3
 2020-06-07 10:00:00+00 | 2020-06-07 10:10:00+00 | {2}          |     2000 |    10
 2020-06-07 12:00:00+00 | 2020-06-07 12:10:00+00 | {4,1}        |     3000 |    60
 2020-06-08 10:00:00+00 | 2020-06-08 10:10:00+00 | {4,2}        |     4000 |    10
 2020-06-08 12:00:00+00 | 2020-06-08 12:10:00+00 | {4,4,3}      |     1337 |     2

I would like a result similar to:

 most_used_power_level |   avg  | max | dow
-----------------------+--------+-----+-----
                     4 |   2500 |  60 |   0
                     4 | 2668,5 |  60 |   1
                     4 |   1000 |  60 |   6

So far I have:

select 
    mode() within group (order by most_used_power), 
    avg(duration),
    max(pauses), 
    extract (dow from start_date) as dow 
from (
    select 
        unnest(power_levels) as most_used_power,
        duration,
        pauses,
        start_date
    from processes 
    where start_date >= '2020-06-01' and start_date < '2020-06-09'
) as foo
group by dow;

But there are two issues with this query:

  1. Unnesting the power levels in the sub-query skew the calculation of the averages
  2. The final result groups the most used power level, and number of pauses per day of week, rather than the whole duration

Apart from splitting the query in 2, I’m not sure how to proceed from here. Is there a way to achieve this in a single query?

Advertisement

Answer

If you want the most used power level independent of the day of the week, then calculate it separately:

select avg(p.duration),
       max(p.pauses), 
       extract(dow from start_date) as dow,
       pl.most_Used_power_level
from processes p cross join
     (select mode() within group (order by power_level) as most_Used_power_level
      from processes p cross join lateral
           unnest(power_levels) power_level
      where p.start_date >= '2020-06-01' and p.start_date < '2020-06-09'
     ) pl
where p.start_date >= '2020-06-01' and p.start_date < '2020-06-09'
group by dow, pl.most_Used_power_level;
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement