Why does the following join increase the query time significantly?

I have a star schema here and I am querying the fact table and would like to join one very small dimension table. I can’t really explain the following:

EXPLAIN ANALYZE SELECT 
  COUNT(impression_id), imp.os_id 
  FROM bi.impressions imp 
  GROUP BY imp.os_id;

                                                                  QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------------------
     HashAggregate  (cost=868719.08..868719.24 rows=16 width=10) (actual time=12559.462..12559.466 rows=26 loops=1)
       ->  Seq Scan on impressions imp  (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.009..3030.093 rows=35682474 loops=1)
     Total runtime: 12559.523 ms
    (3 rows)

​x
 
EXPLAIN ANALYZE SELECT   COUNT(impression_id), imp.os_id   FROM bi.impressions imp   GROUP BY imp.os_id;​                                                                  QUERY PLAN    --------------------------------------------------------------------------------------------------------------------------------------     HashAggregate  (cost=868719.08..868719.24 rows=16 width=10) (actual time=12559.462..12559.466 rows=26 loops=1)       ->  Seq Scan on impressions imp  (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.009..3030.093 rows=35682474 loops=1)     Total runtime: 12559.523 ms    (3 rows)​

This takes ~12600ms, but of course there is no joined data, so I can’t “resolve” the imp.os_id to something meaningful, so I add a join:

EXPLAIN ANALYZE SELECT 
  COUNT(impression_id), imp.os_id, os.os_desc 
  FROM  bi.impressions imp, bi.os_desc os 
  WHERE imp.os_id=os.os_id 
  GROUP BY imp.os_id, os.os_desc;
                                                                     QUERY PLAN
    --------------------------------------------------------------------------------------------------------------------------------------------
     HashAggregate  (cost=1448560.83..1448564.99 rows=416 width=22) (actual time=25565.124..25565.127 rows=26 loops=1)
       ->  Hash Join  (cost=1.58..1180942.29 rows=35682472 width=22) (actual time=0.046..15157.684 rows=35682474 loops=1)
             Hash Cond: (imp.os_id = os.os_id)
             ->  Seq Scan on impressions imp  (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.007..3705.647 rows=35682474 loops=1)
             ->  Hash  (cost=1.26..1.26 rows=26 width=14) (actual time=0.028..0.028 rows=26 loops=1)
                   Buckets: 1024  Batches: 1  Memory Usage: 2kB
                   ->  Seq Scan on os_desc os  (cost=0.00..1.26 rows=26 width=14) (actual time=0.003..0.010 rows=26 loops=1)
     Total runtime: 25565.199 ms
    (8 rows)

 
EXPLAIN ANALYZE SELECT   COUNT(impression_id), imp.os_id, os.os_desc   FROM  bi.impressions imp, bi.os_desc os   WHERE imp.os_id=os.os_id   GROUP BY imp.os_id, os.os_desc;                                                                     QUERY PLAN    --------------------------------------------------------------------------------------------------------------------------------------------     HashAggregate  (cost=1448560.83..1448564.99 rows=416 width=22) (actual time=25565.124..25565.127 rows=26 loops=1)       ->  Hash Join  (cost=1.58..1180942.29 rows=35682472 width=22) (actual time=0.046..15157.684 rows=35682474 loops=1)             Hash Cond: (imp.os_id = os.os_id)             ->  Seq Scan on impressions imp  (cost=0.00..690306.72 rows=35682472 width=10) (actual time=0.007..3705.647 rows=35682474 loops=1)             ->  Hash  (cost=1.26..1.26 rows=26 width=14) (actual time=0.028..0.028 rows=26 loops=1)                   Buckets: 1024  Batches: 1  Memory Usage: 2kB                   ->  Seq Scan on os_desc os  (cost=0.00..1.26 rows=26 width=14) (actual time=0.003..0.010 rows=26 loops=1)     Total runtime: 25565.199 ms    (8 rows)​

This effectively doubles the execution time of my query. My question is, what did I leave out from the picture? I would think such a small lookup was not causing huge difference in query execution time.

Answer

Rewritten with (recommended) explicit ANSI JOIN syntax:

SELECT COUNT(impression_id), imp.os_id, os.os_desc 
FROM   bi.impressions imp
JOIN   bi.os_desc os ON os.os_id = imp.os_id
GROUP  BY imp.os_id, os.os_desc;

 
SELECT COUNT(impression_id), imp.os_id, os.os_desc FROM   bi.impressions impJOIN   bi.os_desc os ON os.os_id = imp.os_idGROUP  BY imp.os_id, os.os_desc;​

First of all, your second query might be wrong, if more or less than exactly one match are found in os_desc for every row in impressions.
This can be ruled out if you have a foreign key constraint on os_id in place, that guarantees referential integrity, plus a NOT NULL constraint on bi.impressions.os_id. If so, in a first step, simplify to:

SELECT COUNT(*) AS ct, imp.os_id, os.os_desc 
FROM   bi.impressions imp
JOIN   bi.os_desc     os USING (os_id)
GROUP  BY imp.os_id, os.os_desc;

 
SELECT COUNT(*) AS ct, imp.os_id, os.os_desc FROM   bi.impressions impJOIN   bi.os_desc     os USING (os_id)GROUP  BY imp.os_id, os.os_desc;​

count(*) is faster than count(column) and equivalent here if the column is NOT NULL. And add a column alias for the count.

Faster, yet:

SELECT os_id, os.os_desc, sub.ct
FROM  (
   SELECT os_id, COUNT(*) AS ct
   FROM   bi.impressions
   GROUP  BY 1
   ) sub
JOIN   bi.os_desc os USING (os_id)

 
SELECT os_id, os.os_desc, sub.ctFROM  (   SELECT os_id, COUNT(*) AS ct   FROM   bi.impressions   GROUP  BY 1   ) subJOIN   bi.os_desc os USING (os_id)​

Aggregate first, join later. More here:

Advertisement

Answer