sql: fix bugs and improve merging partial and full statistics by Uzair5162 · Pull Request #126830 · cockroachdb/cockroach

Uzair5162 · 2024-07-08T16:35:01Z

sql: ignore outer buckets when merging partial and full statistics

We merge partial stats with full stats by prepending partial histogram buckets with UpperBound less than the first full histogram bucket and appending partial histogram buckets with UpperBounds greater than the last full histogram bucket. This doesn't work when outer buckets exist in either histogram since they have max or min valued UpperBounds, which overlap with the other statistic's buckets.

This commit fixes this by removing outer buckets from both histograms before merging them, and then adjusting counts at the end to add outer buckets back if necessary.

sql: fix and improve the computation of merged statistic counts

Previously, we were iterating over the buckets of the merged histogram to compute its RowCount and DistinctCount. We were casting each bucket's DistinctRange count from a float to an int in this loop, which resulted in a final distinct count that was less than expected for the merged stat for larger tables.

This commit fixes this and simplifies the computation of the merged stat counts by using the partial and full stat counts instead of iterating over the merged histogram.

Fixes: #126822
Fixes: #126730

See also: #125950

Release note (bug fix): Fixed a bug when creating partial statistics using extremes (which is disabled by default) where it would occasionally fail to merge the new partial stat with existing full stats. This occurs when outer buckets were added to either statistic to account for extra distinct count. Also fixed a bug where creating partial statistics would result in a merged statistic with inaccurate distinct counts, and made the merged statistic computation more efficient.

We merge partial stats with full stats by prepending partial histogram buckets with UpperBound less than the first full histogram bucket and appending partial histogram buckets with UpperBounds greater than the last full histogram bucket. This doesn't work when outer buckets exist in either histogram since they have max or min valued UpperBounds, which overlap with the other statistic's buckets. This commit fixes this by removing outer buckets from both histograms before merging them, and then adjusting counts at the end to add outer buckets back if necessary. Fixes: cockroachdb#126822 See also: cockroachdb#125950 Release note (bug fix): Fixed a bug when creating partial statistics using extremes (which is disabled by default) where it would occasionally fail to merge the new partial stat with existing full stats. This occurs when outer buckets were added to either statistic to account for extra distinct count.

cockroach-teamcity · 2024-07-08T16:35:11Z

This change is

DrewKimball

Just a few minor suggestions.

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Uzair5162)

pkg/sql/stats/merge.go line 140 at r1 (raw file):

	ctx context.Context, fullStat *TableStatistic, partialStat *TableStatistic, st *cluster.Settings,
) (*TableStatistic, error) {
	var compareCtx *eval.Context

[nit] There's a similar cmpCtx field defined below - let's de-duplicate.

More importantly, I don't think we should be using a nil eval.Context for these comparisons. We should fix that before we flip partial stats on. Mind opening an issue?

pkg/sql/opt/exec/execbuilder/testdata/partial_stats line 1316 at r1 (raw file):

----
statistics_name  column_names  row_count  distinct_count  null_count
__merged__       {a}           10002      9803            1

Think it's worth showing what's inside the merged stat? Or at least, the first and last couple buckets?

Previously, we were iterating over the buckets of the merged histogram to compute its RowCount and DistinctCount. We were casting each bucket's DistinctRange count from a float to an int in this loop, which resulted in a final distinct count that was less than expected for the merged stat for larger tables. This commit fixes this and simplifies the computation of the merged stat counts by using the partial and full stat counts instead of iterating over the merged histogram. Fixes: cockroachdb#126730 See also: cockroachdb#125950 Release note (bug fix): Fixed a bug when creating partial statistics using extremes (which is disabled by default) where it would result in a merged statistic with inaccurate distinct counts. Also made the merged statistic computation more efficient.

Uzair5162

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)

pkg/sql/stats/merge.go line 140 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] There's a similar cmpCtx field defined below - let's de-duplicate.

More importantly, I don't think we should be using a nil eval.Context for these comparisons. We should fix that before we flip partial stats on. Mind opening an issue?

Done, #126856

pkg/sql/opt/exec/execbuilder/testdata/partial_stats line 1316 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Think it's worth showing what's inside the merged stat? Or at least, the first and last couple buckets?

Done, I've added the last few merged stat buckets to the test to verify that the partial stat buckets have been appended as expected.

Uzair5162 · 2024-07-09T18:05:23Z

bors r+

craig · 2024-07-09T18:30:41Z

Build succeeded:

Uzair5162 requested a review from DrewKimball July 8, 2024 16:35

Uzair5162 requested a review from a team as a code owner July 8, 2024 16:35

DrewKimball approved these changes Jul 8, 2024

View reviewed changes

Uzair5162 force-pushed the fix-partial-stats-merging-outer-buckets-and-counts-gh-126822-126730 branch from 23d3e34 to 87559c3 Compare July 8, 2024 20:15

Uzair5162 commented Jul 8, 2024

View reviewed changes

Uzair5162 requested a review from DrewKimball July 8, 2024 20:26

craig bot merged commit 21cddcf into cockroachdb:master Jul 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: fix bugs and improve merging partial and full statistics#126830

sql: fix bugs and improve merging partial and full statistics#126830
craig[bot] merged 2 commits intocockroachdb:masterfrom
Uzair5162:fix-partial-stats-merging-outer-buckets-and-counts-gh-126822-126730

Uzair5162 commented Jul 8, 2024

Uh oh!

cockroach-teamcity commented Jul 8, 2024

Uh oh!

DrewKimball left a comment

Uh oh!

Uzair5162 left a comment

Uh oh!

Uzair5162 commented Jul 9, 2024

Uh oh!

craig bot commented Jul 9, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Uzair5162 commented Jul 8, 2024

sql: ignore outer buckets when merging partial and full statistics

sql: fix and improve the computation of merged statistic counts

Uh oh!

cockroach-teamcity commented Jul 8, 2024

Uh oh!

DrewKimball left a comment

Choose a reason for hiding this comment

Uh oh!

Uzair5162 left a comment

Choose a reason for hiding this comment

Uh oh!

Uzair5162 commented Jul 9, 2024

Uh oh!

craig bot commented Jul 9, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants