-
Notifications
You must be signed in to change notification settings - Fork 4.1k
opt: predict future statistics based on historical stats #79872
Description
Is your feature request related to a problem? Please describe.
Currently, we only support collecting table statistics on the entire table, and statistics are only automatically refreshed when ~20% of rows have changed. This is problematic for very large tables where only a portion of the table is regularly updated or queried. As stats become more stale, there is greater likelihood that the optimizer will not choose optimal query plans.
For example, if only the rows most recently inserted into a table are regularly queried, stats on these rows will often be stale. This also gets worse over time: As the table increases in size, the 20% trigger for automatic refreshes will happen less and less frequently, and therefore stats on the recent rows will become more and more stale.
Describe the solution you'd like
We could take advantage of the fact that we store 4-5 historical stats for every column. We could use these historical stats to build a simple regression model (or a more complex model) to predict how the stats have changed since they were last collected. Predictions are only possible for column types where a rate of change can be determined between two values, such as DATE, TIME[TZ], TIMESTAMP[TZ], INT[2|4|8] , FLOAT[4|8], and DECIMAL. This prediction would not be stored on disk, but instead would be calculated on the fly, either inside the stats cache or the statisticsBuilder.
Describe alternatives you've considered
There are a number of alternatives described in #75625 (this proposal is one of them). That RFC proposes a more comprehensive solution, but it is heavier weight.
Additional context
This solution is already in progress in #77070.
Epic CRDB-13963
Jira issue: CRDB-15884
Metadata
Metadata
Assignees
Labels
Type
Projects
Status