
Pagination and filtering on a very large table in PostgreSQL (keyset pagination?)

I have a scientific database that currently holds 4,300,000 records, and an API is feeding it. By June 2020, it will probably contain about 100,000,000 records.

This is the layout of the table ‘output’:

ID   | sensor_ID    | speed    | velocity | direction
-----------------------------------------------------
1    | 1            | 10       | 1        | up
2    | 2            | 12       | 2        | up
3    | 2            | 11.5     | 1.5      | down
4    | 1            | 9.5      | 0.8      | down
5    | 3            | 11       | 0.75     | up
...

BTW, this is dummy data, but output really is a table with 5 columns: ID, sensor_ID, speed, velocity and direction.

What I want to achieve is a decent pagination and filtering method. I want to create a website (in node.js) where these 4,000,000+ records (for now) will be displayed, 10,000 records per page. I also want to be able to filter on sensor_ID, speed, velocity or direction.

For now, I have this query for selecting specific rows:

SELECT * FROM output ORDER BY ID DESC OFFSET 0 LIMIT 10000 -- first 10,000 rows

SELECT * FROM output ORDER BY ID DESC OFFSET 10000 LIMIT 10000 -- next 10,000 rows

...

I’m searching for some information/tips about creating a decent pagination method. For now, it’s still quite fast the way I do it, but I think it will get a lot slower once we hit 50,000,000+ records.

First of all, I found this page: https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/. I’m interested in keyset pagination, but to be honest, I have no clue how to start.

What I think I must do:

Create an index on the ID field:

CREATE UNIQUE INDEX index_id ON output USING btree (ID)

I also found this page: https://leopard.in.ua/2014/10/11/postgresql-paginattion. When you scroll down to “Improvement #2: The Seek Method”, you can see that they dropped the OFFSET clause and are using a WHERE clause instead. I also see that they use the last insert ID in their query:

SELECT * FROM output WHERE ID < <last_insert_id_here> ORDER BY ID DESC LIMIT 10000

I do not fully understand this. For the first page, I need the very last insert ID; then I fetch the 10,000 newest records. But after that, to get the second page, I don’t need the very last insert ID; I need the 10,000th-last insert ID (I guess).

Can someone give me a good explanation of how to do pagination and filtering in a fast way?

The stuff I’m using:

– PostgreSQL
– pgAdmin (for database management)
– node.js (latest version)

Thanks everyone! And have a nice 2020!

EDIT 1: I have no clue, but could massiveJS (https://massivejs.org/) be a good fit here? And should I use it for ALL queries, or only for the pagination queries?

EDIT 2: I THINK I got it figured out a little bit (correct me if I’m wrong).

Let’s say I have 100,000 records:

1) Get the last inserted ID

2) Use this last inserted ID to fetch the last 10,000 records

SELECT * FROM output WHERE ID < 100000 ORDER BY ID DESC LIMIT 10000 -- the last insert ID here is 100,000 because I have 100,000 records

3) Show the 10,000 records, but also save the insert ID of the 10,000th record to use in the next query

4) Get the next 10,000 records with the new last insert id

SELECT * FROM output WHERE ID < 90000 ORDER BY ID DESC LIMIT 10000 -- 90,000 is the very last insert ID minus 10,000

5) …

Is this correct?


Answer

Here’s how I handle this. For the first page I fetch, I use

SELECT id, col, col, col 
  FROM output 
 ORDER BY id DESC
 LIMIT 10000

Then, in my client program (node.js) I capture the id value from the last row of the result set. When I need the next page, I do this.

SELECT id, col, col, col
  FROM output
 WHERE id < my_captured_id_value
 ORDER BY id DESC
 LIMIT 10000
This exploits the index: instead of scanning and throwing away OFFSET rows, PostgreSQL descends the btree straight to the boundary id. And it works correctly even if you have deleted some rows from the table.
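
To make the boundary concrete (this is the question raised in EDIT 2): suppose, as in that edit, the table holds 100,000 rows with ids 1 through 100,000 and no gaps. The first page needs no id at all, and each later page is bounded by the id captured from the page before. The numbers below just work that example through:

SELECT id, sensor_ID, speed, velocity, direction
  FROM output
 ORDER BY id DESC
 LIMIT 10000
-- page 1: returns ids 100,000 down to 90,001; capture 90,001

SELECT id, sensor_ID, speed, velocity, direction
  FROM output
 WHERE id < 90001
 ORDER BY id DESC
 LIMIT 10000
-- page 2: returns ids 90,000 down to 80,001; capture 80,001

Note that the WHERE ID < 100000 query in EDIT 2 would silently skip row 100,000 itself; fetching the first page without a WHERE clause avoids that off-by-one, and capturing the id from the last displayed row keeps working even when ids have gaps.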

By the way, you probably want a descending index if your first pagination page has the largest ids:

CREATE UNIQUE INDEX index_id ON output USING btree (ID DESC)
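
The question also asks about filtering on sensor_ID, speed, velocity or direction. The seek method combines with filtering: put the filter next to the id boundary in the WHERE clause, and back the common filter columns with a composite index whose leading column is the filter. A minimal sketch, assuming a filter on sensor_ID (the index name is illustrative):

CREATE INDEX index_sensor_id ON output USING btree (sensor_ID, ID DESC)

SELECT id, sensor_ID, speed, velocity, direction
  FROM output
 WHERE sensor_ID = 2
   AND id < my_captured_id_value
 ORDER BY id DESC
 LIMIT 10000

The captured id must come from the last row of the previous page of the same filtered result set, since each filter combination paginates independently.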

Pro tip: SELECT * is harmful to performance on large databases. Always list the columns you actually need.
