Cache For Filters #67768
Description
Basic Idea
Let's suppose a query has a highly selective WHERE condition that does not benefit from the index, and subsequent queries use the same condition.
Let's remember the ranges in data parts where that condition was and was not satisfied, in the form of an ephemeral index in memory. Subsequent queries will use this index to skip ranges known to contain no matching rows.
Implementation Proposal
Add ChunkInfo with the information about the table, data part, and marks range, where this chunk came from.
This is applicable only for MergeTree tables but will also naturally work for Merge tables. It only works for Atomic and Replicated databases (databases that have table UUIDs).
Maintain an index data structure in memory in the form of:
table -> data part -> marks range -> condition -> 0 or 1
When calculating a WHERE or PREWHERE condition, check if it is deterministic, look at the chunk info, and update the cache.
When running a query, use the cache around MergeTreeDataSelectExecutor.
Add settings to control cache usage on query analysis and cache update on processing.
Add a SYSTEM command to flush this cache.
Add a server setting to control the maximum size of this cache in the number of cells.
The cache is shared between users, and we don't mind side-channel information leakage (about whether another user has run a similar query recently).
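The steps above can be illustrated with a minimal sketch in Python (not the ClickHouse implementation). It models the `table -> data part -> marks range -> condition -> 0 or 1` mapping as a flat dictionary, bounds it by a maximum number of cells, and supports a flush. The names `FilterCache`, `prune_ranges`, and `condition_hash` are hypothetical, and LRU eviction is an assumption: the proposal only says the size is limited in cells.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple

MarkRange = Tuple[int, int]                 # (begin_mark, end_mark)
Key = Tuple[str, str, MarkRange, str]       # (table_uuid, part_name, range, condition_hash)


class FilterCache:
    """Maps (table, data part, marks range, condition) to False (0) or True (1)."""

    def __init__(self, max_cells: int) -> None:
        self.max_cells = max_cells
        self._cells: "OrderedDict[Key, bool]" = OrderedDict()

    def update(self, table_uuid: str, part_name: str, mark_range: MarkRange,
               condition_hash: str, matched: bool) -> None:
        # Called after a deterministic WHERE/PREWHERE was evaluated on a chunk.
        key = (table_uuid, part_name, mark_range, condition_hash)
        self._cells[key] = matched
        self._cells.move_to_end(key)
        # Assumption: evict least recently used cells once over the limit.
        while len(self._cells) > self.max_cells:
            self._cells.popitem(last=False)

    def lookup(self, table_uuid: str, part_name: str, mark_range: MarkRange,
               condition_hash: str) -> Optional[bool]:
        key = (table_uuid, part_name, mark_range, condition_hash)
        if key not in self._cells:
            return None  # nothing known about this range yet
        self._cells.move_to_end(key)
        return self._cells[key]

    def flush(self) -> None:
        # Counterpart of the proposed SYSTEM command.
        self._cells.clear()


def prune_ranges(cache: FilterCache, table_uuid: str, part_name: str,
                 ranges: List[MarkRange], condition_hash: str) -> List[MarkRange]:
    """Drop ranges known not to match; keep matching and unknown ranges."""
    return [r for r in ranges
            if cache.lookup(table_uuid, part_name, r, condition_hash) is not False]


cache = FilterCache(max_cells=1000)
cache.update("uuid-1", "all_1_1_0", (0, 8), "cond-hash", False)
cache.update("uuid-1", "all_1_1_0", (8, 16), "cond-hash", True)
remaining = prune_ranges(cache, "uuid-1", "all_1_1_0",
                         [(0, 8), (8, 16), (16, 24)], "cond-hash")
# remaining == [(8, 16), (16, 24)]
```

Only ranges with a cached `False` are skipped. A range with no entry must still be read, so a cache miss never changes query results.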
Additional Context
It could work for external data, like S3 table functions, if we learn to use etag.
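One way to picture the etag idea, as a sketch under assumptions rather than a design: the pair (object URL, etag) could play the role that table and data part play for MergeTree. When the object is overwritten its etag changes, so stale entries simply stop matching. The function name, URL, and etag values below are hypothetical.

```python
from typing import Dict, Tuple

ExternalKey = Tuple[str, str, str]  # (object_url, etag, condition_hash)


def external_source_key(object_url: str, etag: str, condition_hash: str) -> ExternalKey:
    # The etag identifies one version of the object's content.
    return (object_url, etag, condition_hash)


filter_results: Dict[ExternalKey, bool] = {}

# The condition matched nothing in the first version of the object.
filter_results[external_source_key("s3://bucket/data.parquet", "etag-v1", "cond-hash")] = False

# After the object is rewritten, the new etag yields a different key: a cache miss.
hit_same_version = external_source_key("s3://bucket/data.parquet", "etag-v1", "cond-hash") in filter_results
hit_new_version = external_source_key("s3://bucket/data.parquet", "etag-v2", "cond-hash") in filter_results
```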