Postgres not using index with two comparison operators

Here is the query:

explain analyze
SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE phone_number < '989898'
   AND date_of_birth < '2020-01-01';

I have these indexes:

Indexes:
    "employees_pk" PRIMARY KEY, btree (employee_id)
    "dob_pn_on_employess" btree (date_of_birth, phone_number)
    "ln_dob_employees" btree (upper(last_name::text), date_of_birth)

and here is the EXPLAIN ANALYZE output:

"Seq Scan on employees  (cost=0.00..301.00 rows=1000 width=14) (actual time=0.110..8.644 rows=1000 loops=1)"
"  Filter: (((phone_number)::text < 'we'::text) AND (date_of_birth < '2020-01-01'::date))"
"Planning Time: 0.127 ms"
"Execution Time: 15.740 ms"

Why is Postgres not using the compound index?


Answer

There is not enough info in the question to know for sure, but here are some tips:

The filters you have in the query are not very selective: date_of_birth < '2020-01-01' will most likely match all the rows, as there will be only a few five-month-old babies that own a phone. phone_number < '989898' will also match most of the rows.
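
If you want to check how selective each filter really is, a quick sanity check could look like this (just a sketch against the table from the question; count(*) FILTER is available since PostgreSQL 9.4):

SELECT count(*) AS total_rows,
       count(*) FILTER (WHERE date_of_birth < '2020-01-01') AS dob_matches,
       count(*) FILTER (WHERE phone_number < '989898') AS phone_matches
  FROM employees;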

Postgres knows that you are asking for (almost) the full table, and in this case a seq scan is faster. An index is helpful for picking which pages to read from disk, but using an index has a cost of its own, so there is no point in using one if you already know you will be reading all the pages anyway.
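
You can see that cost for yourself by temporarily forbidding seq scans in your session and comparing the plans. This is an experiment only, not something for production (SET enable_seqscan is a real planner setting; the query assumes the table from the question):

-- Force the planner to avoid seq scans, then compare the cost and timing:
SET enable_seqscan = off;
explain analyze
SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE phone_number < '989898'
   AND date_of_birth < '2020-01-01';
SET enable_seqscan = on;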

And indeed, here Postgres estimates that you are reading the full table: (cost=0.00..301.00 rows=1000 width=14), and that is why it chooses the seq scan, as it will be faster. If you use a more selective filter like phone_number < '11' (depending on your data distribution, of course!) you should see an index scan.
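
For example, re-running the query with the tighter bound (whether this actually flips the plan depends on your data distribution):

explain analyze
SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE phone_number < '11'
   AND date_of_birth < '2020-01-01';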

Postgres keeps internal statistics about each column, and when creating an execution plan it estimates the number of rows the query will return. The statistics are not perfect, and Postgres assumes that columns are independent; this is by design, to provide the best mix of planning time vs. power. So if it estimates that filter1 matches a fraction 0.1 of the rows and filter2 matches 0.01, it will assume that the number of rows returned is 0.1 * 0.01 * number_of_rows. There are also a number of other statistics available and used. Based on this, Postgres decides whether it is more beneficial to do a seq scan or to use an index (and which index).
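
You can look at these per-column statistics yourself through the pg_stats view, for example (assuming the employees table from the question):

SELECT attname, null_frac, n_distinct, correlation
  FROM pg_stats
 WHERE tablename = 'employees'
   AND attname IN ('date_of_birth', 'phone_number');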

In this case Postgres needs to visit the table anyway, because it has to fetch the first_name and last_name columns from disk, and those are not included in any of the indexes.

A way to get a faster query (depending on your usage pattern!) is to create a covering index. You have four columns involved in the query: first_name, last_name, date_of_birth, phone_number. If you create an index like btree (date_of_birth, phone_number, first_name, last_name), Postgres will always be able to run an index-only scan for this query and never touch the table itself. But mind that this index can get large, and it only pays off if you can fit it in memory. So be careful with that.
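
For example (the index name here is just an illustration):

CREATE INDEX dob_pn_fn_ln_on_employees  -- hypothetical name
    ON employees (date_of_birth, phone_number, first_name, last_name);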

You did not say which Postgres version you are using, but starting with version 11 you are able to INCLUDE columns in an index. This is a very cool feature. If you always filter only on phone number and date of birth, you could for example create btree (date_of_birth, phone_number) INCLUDE (first_name, last_name) and get index-only scans here with a smaller index.
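
That would look like this (hypothetical index name; requires PostgreSQL 11 or newer):

CREATE INDEX dob_pn_incl_on_employees  -- hypothetical name
    ON employees (date_of_birth, phone_number)
    INCLUDE (first_name, last_name);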

If this filter on phone_number and date_of_birth is a very common one, you can consider creating extended (multi-column) statistics on both columns. That should allow Postgres to create better query plans. It will not change anything in this case, as the seq scan plan is already optimal, but it may help with different filter values.
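
A sketch of that (the statistics object name is just an example; CREATE STATISTICS requires PostgreSQL 10 or newer, and the dependencies kind is the one that helps with correlated filters):

CREATE STATISTICS dob_pn_stats (dependencies)  -- hypothetical name
    ON date_of_birth, phone_number FROM employees;
ANALYZE employees;  -- extended statistics are only populated by ANALYZE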

These two tips depend on the types of the columns, which were not included in the question:

For a column like date_of_birth it may be beneficial to look into a BRIN index.
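
A sketch (hypothetical index name; BRIN pays off mostly when the column values correlate with the physical row order, so test it on your data):

CREATE INDEX dob_brin_on_employees  -- hypothetical name
    ON employees USING brin (date_of_birth);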

Also mind that with date columns, asking date_of_birth < '2020-01-01' means you are asking for everyone born from the beginning of time up to 2020 🙂 Depending on the column type it MAY be beneficial to provide a lower bound, e.g. date_of_birth < '2020-01-01' AND date_of_birth > '1900-01-01'. But you will need to test this on a large dataset to see if you actually get a difference.
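
The bounded version would look like this ('1900-01-01' is just an example lower bound; measure before and after):

SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE phone_number < '989898'
   AND date_of_birth < '2020-01-01'
   AND date_of_birth > '1900-01-01';  -- example lower bound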
