-
Notifications
You must be signed in to change notification settings - Fork 8k
Open
Labels
Description
Now we have two similar concepts: secondary skipping indexes and statistics (here). They both represent some aggregated data that can be used to optimize queries.
The differences are:
- Secondary indexes are calculated per each granule and statistics are calculated per each part.
- Secondary indexes must be read from disk for each query and statistics are stored in memory. (note: the former is mitigated by Skipping index cache #70102)
- Secondary indexes are written into separate files and statistics are written into one file (not now but will be implemented).
- Secondary indexes store metadata per granule. As a result, their precision is more "predictable" (usually better) whereas the precision of statistics degrades as the part becomes bigger.
The idea is to calculate aggregated skipping indexes for whole part and use them as statistics which later can be used for various optimizations:
- Filter parts according per-part aggregated index/statistic at the first stage to avoid reading of index from disk and analysis.
- Push
minmaxindex/statistic as a block-level hint to optimizemin/maxfunction or aggregation if range of[min; max]is small. - Use
setindex/statistic to estimate cardinality of column. Can be used for aggregation, joins, parallel replicas, etc. - Use
hypothesisindex to optimize calculation of predicate and filtering. For instance, if we have table with definition:
(s String, INDEX ind pred(s) TYPE hypothesis), we can optimize queries withWHERE NOT pred(s)by skipping granules wherepred(s) == 1. However with statistic of typehypothesiswe can push the result of predicate if it is equal to 1 and optimize queries withWHERE pred(s)as well. - ... This is incomplete list, there are more possible cases.
Advanced task: we can unify interfaces of statistics and skipping indexes because we can represent any statistic as a skipping index with infinite granularity.
alexey-milovidov, EmeraldShift and nickitat