-
Notifications
You must be signed in to change notification settings - Fork 4.1k
sql, stats: collect small histograms on all columns #49374
Copy link
Copy link
Closed
Labels
A-sql-optimizerSQL logical planning and optimizations.SQL logical planning and optimizations.C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)C-performancePerf of queries or internals. Solution not expected to change functional behavior.Perf of queries or internals. Solution not expected to change functional behavior.
Metadata
Metadata
Assignees
Labels
A-sql-optimizerSQL logical planning and optimizations.SQL logical planning and optimizations.C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)C-performancePerf of queries or internals. Solution not expected to change functional behavior.Perf of queries or internals. Solution not expected to change functional behavior.
Currently, we only collect histograms on the first column of each index in a table. The reason for this is that the default size of histograms is 200 buckets, so they use a non-trivial amount of space and require a non-trivial amount of computation in order to collect them and use them in the optimizer. Still, it's generally worth the extra space and computation for index columns; index columns are likely to be used in predicates, so histograms are likely to be useful for cardinality estimation.
We could still get some benefit from collecting smaller histograms on other columns, however. Collecting a histogram with 2 buckets would be equivalent to storing the maximum and minimum value for a column, which could significantly improve our selectivity estimation for some queries. Instead of assuming all inequalities (e.g.,
x < 10) have selectivity equal to 1/3, we can make an educated guess based on the domain of the variable.