opt: collect stats on inverted indexes #48219
Description
We currently do not collect statistics on inverted indexes, which results in poor cardinality and cost estimates for inverted index scans and inverted zigzag joins. This issue covers the work needed to collect the statistics, but not the work to use them in the optimizer; that will be covered by a separate issue.
For inverted index statistics, we can probably reuse much of the machinery we've already built for collecting statistics on the primary index. As with normal column statistics, we'll want to store the number of distinct keys in the index, as well as the total number of values in the index. Since inverted indexes can contain duplicate values (a single row can produce several index entries), this total will be greater than or equal to the number of rows in the table. Additionally, we'll want to collect histograms on inverted indexes. Each bucket will represent a range of keys in the index, and it will store the number of values indexed by that range of keys.
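To make the three proposed statistics concrete, here is a minimal Python sketch that computes them for an in-memory table with one array column. It is only an illustration, not the planned implementation: the names `Bucket`, `inverted_keys`, and `collect_inverted_stats` are hypothetical, the key encoding (one key per distinct array element) is an assumption, and the histogram is a simple roughly equi-depth one built from exact counts rather than from a sample.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Bucket:
    upper_bound: str  # largest index key covered by this bucket (inclusive)
    num_values: int   # number of index entries whose key falls in this bucket


def inverted_keys(value):
    """Hypothetical key encoding: one index key per distinct array element."""
    return {str(elem) for elem in value}


def collect_inverted_stats(rows, max_buckets):
    """Return (distinct_count, total_values, histogram) for an inverted index.

    rows is a list of array values, one per table row (an assumption made
    for this sketch; real collection would scan or sample the table).
    """
    # Count how many index entries exist for each key.
    counts = Counter()
    for value in rows:
        counts.update(inverted_keys(value))

    keys = sorted(counts)
    distinct_count = len(keys)           # number of distinct keys in the index
    total_values = sum(counts.values())  # total number of index entries

    # Build a roughly equi-depth histogram: close a bucket once it holds
    # at least ceil(total_values / max_buckets) entries, or at the last key.
    buckets = []
    if keys:
        target = max(1, -(-total_values // max_buckets))
        running = 0
        for i, key in enumerate(keys):
            running += counts[key]
            if running >= target or i == len(keys) - 1:
                buckets.append(Bucket(upper_bound=key, num_values=running))
                running = 0
    return distinct_count, total_values, buckets


rows = [["a", "b"], ["b", "c"], ["b"]]
distinct_count, total_values, histogram = collect_inverted_stats(rows, max_buckets=2)
print(distinct_count, total_values, histogram)
```

In this example the three rows produce five index entries over three distinct keys, so the total number of values exceeds the row count, and the bucket counts sum to that total.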
As described in the Geospatial RFC, improving statistics on inverted indexes will benefit the work we're doing for geospatial support, as well as for JSON and Array inverted index support.