Query Cache #34011

@alexey-milovidov

Description

Use case

Provide a quick answer and avoid excessive computation when the same query is requested more than once in a short period of time.

(Un)marked checkboxes refer to #43797

Requirements

  • A setting (all settings are on the query level) for tunable cache usage with three options:

    • don't use the cache;
    • use the cache if data exists, but don't put new items into it (passive cache usage);
    • use the cache and put new items into it.

  • A setting for the maximum item size to be put in the cache, e.g. put only results no larger than 1 MB and no more than 100,000 records. It should not prevent streaming queries: we can stream data into the cache and discard the stored copy if it turns out to be too large. The limit is expressed:

    • in number of records
    • in number of bytes
  • A setting to control the maximum lifetime of cached data, if it is not invalidated earlier.

  • A setting to control what cached data will be accepted: not older than a specified time; not older than a specified time after the related tables were last modified. Note that we will not track every possible modification of the related tables.

  • A setting to enable or disable cache sharing between users as an extra security measure. If sharing is disabled, the user name is included in the cache key.

  • The cache can be stored both in memory and in the filesystem. There should be separate knobs to control the maximum size of each part.

  • Data in memory is stored as a sequence of Blocks with optional compression (as in Memory tables). Data on disk is stored in files in a dedicated directory (e.g. query_cache), one file per item, named with the hash of the cache key (UInt128), with control over the maximum number of files. Files are placed into subdirectories by the first 12 bits of the hash. The files are compressed (and checksummed) as usual, with lz4 by default and a setting to change the codec.

  • Instead of being evicted, data is moved from memory to the filesystem and put back on first access.

  • A setting to control whether we put into and read from the cache the results of queries that contain non-deterministic functions such as rand or now.

  • A setting for the minimum number of query runs before the result is put in the cache, e.g. cache the result only on the 2nd run of the same query.

  • A setting to put the result in the cache only if the query was expensive to run (time to first result block?).

  • The cache key should use the hashed query AST or query pipeline, plus every setting that may affect query behavior. As this can be difficult to determine, it is simpler to hash all the settings.

  • Table and column access control is always performed before going to the cache.

  • A setting to enable or disable caching for secondary queries of distributed queries and for dictionaries.

  • The cache eviction strategy can be a simple LRU, or it can include a priority setting to roughly balance cache usage in some proportion based on priorities.

  • Per-user limits on total usable cache size.

  • Allow enabling the cache for subqueries. Allow caching data before LIMIT by speculatively fetching more records: e.g. if a user runs a query with LIMIT 10, we can fetch 1000 records and then serve a subsequent query with LIMIT 10, 10 from the cache.

  • Synchronize concurrent queries that can reuse each other's results by stealing blocks of data that are about to be put in the cache.

  • System commands to drop query cache.

  • Invalidation of the cache in the filesystem during a server upgrade is not necessary.
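
To make the cache-key requirement concrete, here is a minimal sketch in Python (illustrative only, not ClickHouse's implementation) of hashing the query together with all settings into a UInt128, including the user name when cache sharing between users is disabled. The function name `cache_key` and its parameters are hypothetical:

```python
import hashlib

def cache_key(query_ast_text, settings, user=None):
    """Illustrative 128-bit cache key: hash the normalized query together
    with every setting (simpler than deciding which settings affect query
    behavior) and, if per-user isolation is enabled, the user name."""
    h = hashlib.blake2b(digest_size=16)  # 16 bytes = UInt128
    h.update(query_ast_text.encode())
    for name in sorted(settings):        # stable order: equal settings hash equally
        h.update(f"{name}={settings[name]}".encode())
    if user is not None:                 # sharing disabled: user is part of the key
        h.update(user.encode())
    return int.from_bytes(h.digest(), "big")
```

Hashing all settings over-invalidates (changing an irrelevant setting misses the cache) but never serves a wrong result, which matches the "better to simply hash all the settings" trade-off above.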
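
The on-disk layout described above (one file per item, subdirectories by the first 12 bits of the UInt128 hash) can be sketched as follows; the helper name and `.bin` extension are assumptions for illustration:

```python
def cache_file_path(root, key):
    """Place each cached item under a subdirectory derived from the top
    12 bits of its 128-bit key, giving up to 4096 subdirectories so no
    single directory accumulates too many files."""
    subdir = format(key >> 116, "03x")   # top 12 of 128 bits -> 3 hex digits
    return f"{root}/{subdir}/{key:032x}.bin"
```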
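
The "stream into cache, discard if too large" behavior from the max-item-size requirement could look like this sketch (class and method names are hypothetical): result blocks always flow to the client, and only the cached copy is dropped once either cap is exceeded:

```python
class StreamingCacheWriter:
    """Sketch: accumulate result blocks for the cache while streaming them
    to the client; abandon the cached copy once a size cap is hit."""
    def __init__(self, max_bytes=1_000_000, max_rows=100_000):
        self.max_bytes, self.max_rows = max_bytes, max_rows
        self.blocks, self.bytes, self.rows = [], 0, 0
        self.discarded = False

    def on_block(self, block, row_count):
        if not self.discarded:
            self.bytes += len(block)
            self.rows += row_count
            if self.bytes > self.max_bytes or self.rows > self.max_rows:
                self.blocks, self.discarded = [], True  # too big: stop caching
            else:
                self.blocks.append(block)
        return block  # the result still streams to the client unchanged

    def result_to_cache(self):
        """Blocks to store, or None if the result exceeded a cap."""
        return None if self.discarded else self.blocks
```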
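
For the eviction requirement, the "simple LRU" baseline (without the optional priority weighting) can be sketched with an ordered map; this is a generic LRU, not ClickHouse's actual eviction code:

```python
from collections import OrderedDict

class LRUQueryCache:
    """Minimal LRU sketch: the least recently used entries are dropped
    first once the byte budget is exceeded."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.entries = OrderedDict()  # key -> cached result bytes
        self.size = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        if key in self.entries:
            self.size -= len(self.entries.pop(key))
        self.entries[key] = value
        self.size += len(value)
        while self.size > self.max_bytes:  # evict least recently used
            _, old = self.entries.popitem(last=False)
            self.size -= len(old)
```

A priority-aware variant could keep one such structure per priority class and divide the byte budget between them in the configured proportion.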

Additional context

MySQL used to have a query cache, but it was so terrible that it was deprecated and finally removed.

PostgreSQL does not have a query cache, although there are third-party tools for that.

A ClickHouse query cache is already available in chproxy.
