sql: always include rows with minimum/maximum histogram values in statistics samples

To collect histograms for columns in a table, we sample up to 10k rows at random. For large tables, this means that there is a significant chance that histograms will not cover a range of minimum or maximum values of a column. Queries for values outside of the minimum and maximum histogram bounds often have poor query plans as a result (see #64570 and #83431).

Ideally, we could retain any rows in the sample that contain a minimum or maximum value for histogram. This would require the sample to keep track of the minimum and maximum values seen of each column and assign a low rank when added to the sampler so that the row is sampled and not dropped. However, the ranking mechanism doesn't seem appropriate because we'd likely end up with a sample containing only rows with values near the minimum and maximum. So, when finding a row with a new minimum or maximum value, we need to evict the previous row with a minimum or maximum from the sample (or assign it a random rank so it can get randomly evicted). Rows containing minimums and maximums of multiple columns complicate this further.

Epic: CRDB-16930

Jira issue: CRDB-17227

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: always include rows with minimum/maximum histogram values in statistics samples #83730

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

sql: always include rows with minimum/maximum histogram values in statistics samples #83730

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions