Skip to content
Advertisement

Suitable Index for improving Top N recent queries with tags

I am designing the schema of a database in PostgreSQL which would contain posts like Stack Overflow with tags. Users would search recent posts by different tags. From my research, I found that searching by tags would benefit from having generalized inverted index on tags column. These are the relevant columns of the table:

+--------+---------------------------+----------+
|  p_id  |   time                    | tags     |
+--------+---------------------------+------- --+
|    001 | 2020-04-30T23:35:50+00:00 | [CA,CD]  |                                          
|    109 | 2020-03-30T23:34:50+00:00 | [AB]     |   
|    321 | 2019-10-30T23:34:50+00:00 | [CD,AB]  |                                
+--------+---------------------------+----------+

However, to show 50 most recent posts, the database would need to scan all rows which match the given tags. This could be problematic because in real world scenarios only a few tags end up becoming popular and a large portion of documents usually contain a select few tags, So majority of users would end up querying most of the data and therefore rendering the GIN on tags ineffective. I also think indexing time would not be useful because the rows selected by GIN on tags would return an unsorted list on time.

Is there a way to construct an index which could take advantage of both time as well tags or is it a trade off which cannot be improved upon?

Edit:

I am not benchmarking it because I am doing this exercise to learn sql. However this is the intended query I would like to run.

   Select p_id 
   From Table
   Where tags @> ?
    And
   time between ? and ? 

Advertisement

Answer

Benchmarking is not about learning SQL, it is about learning about performance. And your proposed query doesn’t match you description, as “50 most recent” should have an ORDER BY…DESC and a LIMIT 50, rather than having a BETWEEN conditions on time in the WHERE clause.

If you have a GIN index on tags and a regular index on time, then it will have two choices. If your tags are very selective, it pulls out all of the tagged entries with the GIN index, and then sorts the small number of rows. Sorting a small number of rows is fast. If the tags are very popular, it will walk the time index and stop once it catches its LIMIT. It should make reasonably intelligent decisions about which of those will be faster if it has reasonable statistics.

User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement