-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Filtering by secondary indices should be performed on data reading #75774
Description
Company or project name
ClickHouse
Describe the situation
Currently, index analysis is done before the query starts reading the data.
It determines which granules to read by filtering out unnecessary granules.
This has the following disadvantages:
- Even if a query will finish earlier due to LIMIT, a larger amount of the index has to be read.
- Analysing index produces a visible delay before query processing: the progress bar does not show instantly.
- Sometimes, reading indices can be heavier than reading the data, but we cannot make the decision of not reading an index dynamically.
- The order in which different indices are processed cannot be adapted dynamically.
It also has some advantages, but all of them do not really matter:
- It better estimates the number of records to read, and the pipeline (such as the number of processing lanes) can be adapted from this information.
- EXPLAIN ESTIMATE shows more precise results but takes more time.
Solution
When we are going to read a granule, check if there are index granules covering it. Read index granules first, and skip this granule if necessary.
We can collect statistics about how much time it takes to read index granules, and how large portions of the data every index filtered. These statistics can be used to avoid reading certain indices at all, and reading other indices in the optimal order.
These statistics can live only during the query run, there is no need to share them.
Additional context
Probably index analysis also affects the choice of projections, but we don't want it to be too heavy. If it affects join order - we should better estimate it from runtime statistics rather than requiring index analysis in advance.