Skip to content

sql: always include rows with minimum/maximum histogram values in statistics samples #83730

@mgartner

Description

@mgartner

To collect histograms for columns in a table, we sample up to 10k rows at random. For large tables, this means that there is a significant chance that histograms will not cover a range of minimum or maximum values of a column. Queries for values outside of the minimum and maximum histogram bounds often have poor query plans as a result (see #64570 and #83431).

Ideally, we could retain any rows in the sample that contain a minimum or maximum value for histogram. This would require the sample to keep track of the minimum and maximum values seen of each column and assign a low rank when added to the sampler so that the row is sampled and not dropped. However, the ranking mechanism doesn't seem appropriate because we'd likely end up with a sample containing only rows with values near the minimum and maximum. So, when finding a row with a new minimum or maximum value, we need to evict the previous row with a minimum or maximum from the sample (or assign it a random rank so it can get randomly evicted). Rows containing minimums and maximums of multiple columns complicate this further.

Epic: CRDB-16930

Jira issue: CRDB-17227

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-sql-table-statsTable statistics (and their automatic refresh).C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)T-sql-queriesSQL Queries Team

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions