PostgreSQL – Query Optimization

I have this below query which takes about 15-20 secs to run.

with cte0 as (
    SELECT
        label,
        date,
        CASE
            WHEN
                Lead(label || date || "number") OVER (PARTITION BY label || date || "number" ORDER BY "label", "date", "number", "time") IS NULL
            THEN
                '1'::numeric
            ELSE
                '0'::numeric
        END As "unique"
    FROM table_data
    LEFT JOIN table_mapper ON
        table_mapper."type" = table_data."type"
    WHERE Date BETWEEN date_trunc('month', current_date - 1) and current_date - 1
)
SELECT 'MTD' as "label", round(sum("unique") / count("unique") *100,1) as "value" FROM cte0 WHERE "date" BETWEEN date_trunc('month', current_date - 1) AND current_date -1
UNION ALL
SELECT 'Week' as "label", round(sum("unique") / count("unique") *100,1) as "value" FROM cte0 WHERE "date" BETWEEN date_trunc('week', current_date - 1) AND current_date -1
UNION ALL
SELECT 'FTD' as "label", round(sum("unique") / count("unique") *100,1) as "value" FROM cte0 WHERE "date" = current_date -1

In the table table_data I have a index on date column.

CREATE INDEX ix_cli_date
  ON table_data
  USING btree
  (date);

Table Definition (`d table_data`)

Table "public.table_data"
      Column      |          Type          | Modifiers
------------------+------------------------+-----------
 date             | date                   | not null
 number           | bigint                 | not null
 time             | time without time zone | not null
 end time         | time without time zone | not null
 duration         | integer                | not null
 time1            | integer                | not null
 time2            | integer                | not null
 time3            | integer                | not null
 time4            | integer                | not null
 time5            | integer                | not null
 time6            | integer                | not null
 time7            | integer                | not null
 type             | text                   | not null
 name             | text                   | not null
 id1              | integer                | not null
 id2              | integer                | not null
 key              | integer                | not null
 status           | text                   | not null
Indexes:
    "ix_cli_date" btree (date)

Table Definition (d table_mapper)

 Table "public.table_mapper"
   Column   | Type | Modifiers
------------+------+-----------
 type       | text | not null
 label     | text | not null
 label2     | text | not null
 label3     | text | not null
 label4     | text | not null
 label5     | text | not null

EXPLAIN ANALYZE of the query

Result  (cost=184342.66..230332.86 rows=3 width=64) (actual time=23377.923..25695.478 rows=3 loops=1)"
  CTE cte0"
    ->  WindowAgg  (cost=121516.06..156751.65 rows=612793 width=23) (actual time=14578.000..18985.958 rows=696157 loops=1)"
          ->  Sort  (cost=121516.06..123048.04 rows=612793 width=23) (actual time=14577.975..17084.405 rows=696157 loops=1)"
                Sort Key: (((table_mapper.label || (table_data.date)::text) || (table_data."number")::text)), table_mapper.label, table_data.date, table_data."number", table_data."time""
                Sort Method: external merge  Disk: 39480kB"
                ->  Hash Left Join  (cost=11.96..37474.21 rows=612793 width=23) (actual time=1.449..3308.718 rows=696157 loops=1)"
                      Hash Cond: (table_data."type" = table_mapper."type")"
                      ->  Index Scan using ix_cli_date on table_data  (cost=0.02..29036.36 rows=612793 width=38) (actual time=0.141..946.648 rows=696157 loops=1)"
                            Index Cond: ((date >= date_trunc('month'::text, ((('now'::text)::date - 1))::timestamp with time zone)) AND (date   Hash  (cost=7.53..7.53 rows=353 width=25) (actual time=1.275..1.275 rows=336 loops=1)"
                            Buckets: 1024  Batches: 1  Memory Usage: 15kB"
                            ->  Seq Scan on table_mapper  (cost=0.00..7.53 rows=353 width=25) (actual time=0.020..0.589 rows=336 loops=1)"
  ->  Append  (cost=27591.00..73581.21 rows=3 width=64) (actual time=23377.920..25695.467 rows=3 loops=1)"
        ->  Aggregate  (cost=27591.00..27591.02 rows=1 width=32) (actual time=23377.917..23377.918 rows=1 loops=1)"
              ->  CTE Scan on cte0  (cost=0.00..27575.68 rows=3064 width=32) (actual time=14578.052..22335.236 rows=696157 loops=1)"
                    Filter: ((date = date_trunc('month'::text, ((('now'::text)::date - 1))::timestamp with time zone)))"
        ->  Aggregate  (cost=27591.00..27591.02 rows=1 width=32) (actual time=1741.509..1741.510 rows=1 loops=1)"
              ->  CTE Scan on cte0  (cost=0.00..27575.68 rows=3064 width=32) (actual time=20.009..1522.352 rows=168261 loops=1)"
                    Filter: ((date = date_trunc('week'::text, ((('now'::text)::date - 1))::timestamp with time zone)))"
        ->  Aggregate  (cost=18399.11..18399.13 rows=1 width=32) (actual time=576.029..576.030 rows=1 loops=1)"
              ->  CTE Scan on cte0  (cost=0.00..18383.79 rows=3064 width=32) (actual time=9.308..546.735 rows=23486 loops=1)"
                    Filter: (date = (('now'::text)::date - 1))"
Total runtime: 25710.506 ms"

Description :

I’m taking the unique count and repeated count from the table_data and this where LEAD helped me out where I give the value 0 for the last repeated value of a column.

Suppose I have 3 x in a column. I give 1 value to the first 2 x and the 3rd x is given 0.

Actually through a cte I’m taking the entire rows from the table table_data and doing some calculation using the lead and concatinating the strings for a defined date range where each row 1 and 0 value is defined as per the criteria.

If the lead is null it’ll be counted as 1 and if it is not null then 0.

And the I return 3 rows MTD, Current Week and FTD respectively with a calculation on taking the sum() I got from the lead and the count(*) entire rows.

For MTD I have the sum and count for the current month.

For Week – It’s the current week and FTD is for yesterday.

Answer

WITH cte AS (
   SELECT d.thedate
        , lead(m.label) OVER (PARTITION BY m.label, d.thedate, d.number
                              ORDER BY d.thetime) AS leader
   FROM   table_data d
   LEFT   JOIN table_mapper m USING (type)
   WHERE  thedate BETWEEN date_trunc('month', current_date - 1)
                  AND current_date - 1
   )

SELECT 'MTD' AS label, round(count(leader)::numeric / count(*) * 100, 1) AS val
FROM   cte

UNION ALL
SELECT 'Week', round(count(leader)::numeric / count(*) * 100, 1)
FROM   cte
WHERE  thedate BETWEEN date_trunc('week', current_date - 1) AND current_date - 1

UNION ALL
SELECT 'FTD', round(count(leader)::numeric / count(*) * 100, 1)
FROM   cte
WHERE  thedate = current_date - 1;

​x
 
WITH cte AS (   SELECT d.thedate        , lead(m.label) OVER (PARTITION BY m.label, d.thedate, d.number                              ORDER BY d.thetime) AS leader   FROM   table_data d   LEFT   JOIN table_mapper m USING (type)   WHERE  thedate BETWEEN date_trunc('month', current_date - 1)                  AND current_date - 1   )​SELECT 'MTD' AS label, round(count(leader)::numeric / count(*) * 100, 1) AS valFROM   cte​UNION ALLSELECT 'Week', round(count(leader)::numeric / count(*) * 100, 1)FROM   cteWHERE  thedate BETWEEN date_trunc('week', current_date - 1) AND current_date - 1​UNION ALLSELECT 'FTD', round(count(leader)::numeric / count(*) * 100, 1)FROM   cteWHERE  thedate = current_date - 1;​

The CTE makes sense for big tables, so you only scan it once. For smaller tables it may be faster without …

Using thedate instead of reserved word date (in standard SQL). thetime, uni instead of time, unique. Etc.

Simplified the lead() call. You get a value or NULL for the leading row. That seems the be the only relevant information.
It’s a pointless waste to repeat columns from the PARTITION clause in the ORDER BY clause of a window function.

Building on that, count(leader) / count(*) instead of sum(uni) / count(uni) is a bit faster. count(column) only counts non-null values, while count(*) counts all rows.

The condition for the first term of the UNION query was redundant.

More advice and links about data definition in the comments to the question.

Table design / Indexes

You should have primary keys. I suggest serial or IDENTITY column as surrogate PK for table_data:

ALTER TABLE table_data ADD COLUMN table_data_id serial PRIMARY KEY;

 
ALTER TABLE table_data ADD COLUMN table_data_id serial PRIMARY KEY;​

See:

Auto increment table column

Make type the primary key of table_mapper (also needed for the following FK constraint):

ALTER TABLE table_mapper ADD CONSTRAINT table_mapper_pkey (type);

 
ALTER TABLE table_mapper ADD CONSTRAINT table_mapper_pkey (type);​

Add a foreign key constraint for type to enforce referential integrity. Something like:

ALTER TABLE table_data ADD CONSTRAINT table_data_type_fkey
  FOREIGN KEY (type) REFERENCES table_mapper (type)
  ON UPDATE CASCADE ON DELETE NO ACTION;

 
ALTER TABLE table_data ADD CONSTRAINT table_data_type_fkey  FOREIGN KEY (type) REFERENCES table_mapper (type)  ON UPDATE CASCADE ON DELETE NO ACTION;​

For ultimate read performance (at some cost for writes), add a multi-column index to possibly allow index-only scans for above query:

CREATE INDEX table_data_foo_idx ON table_data (thedate, number, thetime);

 
CREATE INDEX table_data_foo_idx ON table_data (thedate, number, thetime);​

Table Definition (d table_data)

EXPLAIN ANALYZE of the query

Advertisement

Answer

Table design / Indexes

Table Definition (`d table_data`)