Implementation of QueryCache

Aside from doing a direct match on something like a whitespace-normalized hash of a query, what might be a useful (but not necessarily perfect) way to handle a query cache in a partial manner? For example, let’s take the following basic case:
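For instance, assuming a hypothetical sales table sitting in a slow, non-indexed source (all names here are made up for illustration), the base query might be something like:

    # Hypothetical base query over a slow, non-indexed source.
    base_query = """
        SELECT customer_id, region, SUM(amount) AS total_amount
        FROM sales
        WHERE order_date >= '2020-01-01'
        GROUP BY customer_id, region
    """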

This could potentially be used as a ‘base-cache’ upon which a further query could be executed to improve performance:
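A refinement that could, at least in principle, be answered from that cached result (again hypothetical) might be:

    # Hypothetical refinement: same FROM and grouping, fewer output fields,
    # one extra filter and a limit -- all of which could be applied to the
    # cached rows of base_query instead of re-reading the slow source.
    refined_query = """
        SELECT customer_id, SUM(amount) AS total_amount
        FROM sales
        WHERE order_date >= '2020-01-01'
          AND region = 'EU'
        GROUP BY customer_id, region
        LIMIT 100
    """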

So, assuming the data in the from table(s) doesn’t change, the following might serve as a starting point for determining a cache match:

  • fields: [field:type, …] // can be less but not more
  • from: hash of table(s)+joins
  • filters: [filter1, filter2, …] // can be less but not more
  • aggregations: [agg1, agg2, …] // can be less but not more
  • having: [having1, having2, …] // can be less but not more
  • order+limit+offset if limited result-set // can be less but not more
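As a rough illustration of what such a cache entry could carry (every name and value below is a hypothetical example, not a proposed format):

    # Illustrative shape of a cache entry built from the components above.
    base_cache_key = {
        "fields": {"customer_id": "int", "region": "text", "total_amount": "numeric"},
        "from_hash": "<hash of table(s) + joins>",
        "filters": ["order_date >= '2020-01-01'"],  # can be less but not more
        "aggregations": ["SUM(amount)"],            # can be less but not more
        "having": [],                               # can be less but not more
        "order_limit_offset": None,                 # only relevant for a limited result-set
    }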

However, this becomes quite tricky when we think about something like the following case:
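For example (a purely hypothetical case; the awkward variants can take many forms), a query that switches to an OR filter and a different grouping is much harder to serve from the cached rows:

    # Hypothetical "tricky" case: the OR references a row-level column that is
    # not in the cached output, and the WHERE clause is no longer a simple list
    # of AND-ed predicates that can be applied on top of the cache.
    tricky_query = """
        SELECT region, COUNT(DISTINCT customer_id) AS customers
        FROM sales
        WHERE region = 'EU' OR amount > 1000
        GROUP BY region
    """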

What might be a realistic starting point for how a partial query cache could be implemented?


Use case: writing SQL to query data in a non-DBMS-managed source, such as a CSV file, where any query takes ~20 s and no indexes can be created on the file (e.g. https://en.wikipedia.org/wiki/SQL/MED or a Spark-like engine).

Answer

I think the following might be a good starting point for a basic cache implementation that allows a cached result to be further queried for refinements:

  1. Start by substituting in any UDFs or CTEs so that the query itself is self-contained.
  2. Normalize whitespace and capitalization.
  3. Hash the entire query. This will be our starting point.
  4. Remove the select fields and hash the rest of the query, then store a hash of each individual item in the select list.
  5. For the partial cache, generate a hash of the query minus the select fields, WHERE clause, ORDER BY, and LIMIT+OFFSET. Also hash each item in the WHERE list (split on AND) so you can verify that the cache contains no filter that is not also in the current query, hash the ORDER BY to see whether the data needs to be re-sorted, and store the LIMIT+OFFSET number so you can verify that the limit+offset in the initial query is null or greater than that of the current query.
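A minimal Python sketch of steps 2-5 might look like the following. It assumes the query is already self-contained (step 1 done) and uses a deliberately naive regex in place of a real SQL parser, so it only handles a single simple SELECT:

    import hashlib
    import re

    def md5(text: str) -> str:
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def normalize(query: str) -> str:
        # Step 2: collapse whitespace and lower-case everything.
        return re.sub(r"\s+", " ", query).strip().lower()

    # Very naive single-SELECT grammar; a real implementation needs a SQL parser.
    PARTS = re.compile(
        r"select (?P<fields>.+?)"
        r" from (?P<from>.+?)"
        r"(?: where (?P<where>.+?))?"
        r"(?: group by (?P<group>.+?))?"
        r"(?: having (?P<having>.+?))?"
        r"(?: order by (?P<order>.+?))?"
        r"(?: limit (?P<limit>\d+)(?: offset (?P<offset>\d+))?)?$"
    )

    def cache_entry(query: str) -> dict:
        q = normalize(query)
        m = PARTS.match(q)
        if m is None:
            raise ValueError("query is too complex for this naive sketch")
        p = m.groupdict()

        fields = [f.strip() for f in p["fields"].split(",")]
        wheres = [w.strip() for w in (p["where"] or "").split(" and ") if w.strip()]

        # Step 5: the query minus select fields, WHERE, ORDER BY and LIMIT+OFFSET.
        core = "from " + p["from"]
        if p["group"]:
            core += " group by " + p["group"]
        if p["having"]:
            core += " having " + p["having"]

        return {
            "Hash": md5(q),                               # step 3
            "HashMinusFields": md5(q[m.end("fields"):]),  # step 4
            "HashOfFields": [md5(f) for f in fields],     # step 4
            "HashMinusFieldsWhereOrderLimit": md5(core),  # step 5
            "HashOfWheres": [md5(w) for w in wheres],     # step 5
            "HashOfOrder": md5(p["order"]) if p["order"] else None,
            "LimitOffset": (int(p["limit"]), int(p["offset"] or 0)) if p["limit"] else None,
        }

The returned dictionary mirrors the saved fields in the listing below; comparing a new query to a stored entry then comes down to comparing these pieces.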

Here is an example of how the saved data might look:

Hash 673c0185c6a580d51266e78608e8e9b2
HashMinusFields 41257d239fb19ec0ccf34c36eba1948e
HashOfFields [dc99e4006c8a77025c0407c1fdebeed3, …]
HashMinusFieldsWhereOrderLimit d50961b6ca0afe05120a0196a93726f5
HashOfWheres [0519669bae709d2efdc4dc8db2d171aa, …]
HashOfOrder 81961d1ff6063ed9d7515a3cefb0c2a5
LimitOffset null

Now let’s try a few examples; I will use human-readable hashes to make them easier to follow:
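One such example, sketched in code with the hypothetical cache_entry function from above (so with real hashes rather than human-readable ones), might be:

    # Two hypothetical queries; cache_entry() is the sketch from earlier.
    cached = cache_entry("SELECT id, name, total FROM orders WHERE year = 2020")
    current = cache_entry("SELECT id, total FROM orders "
                          "WHERE year = 2020 AND region = 'EU' LIMIT 50")

    reusable = (
        # same FROM / GROUP BY / HAVING core
        cached["HashMinusFieldsWhereOrderLimit"] == current["HashMinusFieldsWhereOrderLimit"]
        # fields: can be less but not more
        and set(current["HashOfFields"]) <= set(cached["HashOfFields"])
        # no filter in the cache that is not in the current query
        and set(cached["HashOfWheres"]) <= set(current["HashOfWheres"])
        # cached limit+offset must be null or greater than the current one
        and (cached["LimitOffset"] is None
             or (current["LimitOffset"] is not None
                 and sum(cached["LimitOffset"]) >= sum(current["LimitOffset"])))
    )
    needs_resort = cached["HashOfOrder"] != current["HashOfOrder"]

    print(reusable, needs_resort)  # True False: apply the extra region filter to the cached rows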

Does the above seem like a valid initial implementation?
