Skip to content

sql: fix bugs and improve merging partial and full statistics#126830

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
Uzair5162:fix-partial-stats-merging-outer-buckets-and-counts-gh-126822-126730
Jul 9, 2024
Merged

sql: fix bugs and improve merging partial and full statistics#126830
craig[bot] merged 2 commits intocockroachdb:masterfrom
Uzair5162:fix-partial-stats-merging-outer-buckets-and-counts-gh-126822-126730

Conversation

@Uzair5162
Copy link
Copy Markdown
Contributor

sql: ignore outer buckets when merging partial and full statistics

We merge partial stats with full stats by prepending partial histogram buckets with UpperBound less than the first full histogram bucket and appending partial histogram buckets with UpperBounds greater than the last full histogram bucket. This doesn't work when outer buckets exist in either histogram since they have max or min valued UpperBounds, which overlap with the other statistic's buckets.

This commit fixes this by removing outer buckets from both histograms before merging them, and then adjusting counts at the end to add outer buckets back if necessary.

sql: fix and improve the computation of merged statistic counts

Previously, we were iterating over the buckets of the merged histogram to compute its RowCount and DistinctCount. We were casting each bucket's DistinctRange count from a float to an int in this loop, which resulted in a final distinct count that was less than expected for the merged stat for larger tables.

This commit fixes this and simplifies the computation of the merged stat counts by using the partial and full stat counts instead of iterating over the merged histogram.

Fixes: #126822
Fixes: #126730

See also: #125950

Release note (bug fix): Fixed a bug when creating partial statistics using extremes (which is disabled by default) where it would occasionally fail to merge the new partial stat with existing full stats. This occurs when outer buckets were added to either statistic to account for extra distinct count. Also fixed a bug where creating partial statistics would result in a merged statistic with inaccurate distinct counts, and made the merged statistic computation more efficient.

We merge partial stats with full stats by prepending partial
histogram buckets with UpperBound less than the first full histogram
bucket and appending partial histogram buckets with UpperBounds greater
than the last full histogram bucket. This doesn't work when outer
buckets exist in either histogram since they have max or min valued
UpperBounds, which overlap with the other statistic's buckets.

This commit fixes this by removing outer buckets from both histograms
before merging them, and then adjusting counts at the end to add outer
buckets back if necessary.

Fixes: cockroachdb#126822

See also: cockroachdb#125950

Release note (bug fix): Fixed a bug when creating partial statistics
using extremes (which is disabled by default) where it would
occasionally fail to merge the new partial stat with existing full
stats. This occurs when outer buckets were added to either
statistic to account for extra distinct count.
@Uzair5162 Uzair5162 requested a review from DrewKimball July 8, 2024 16:35
@Uzair5162 Uzair5162 requested a review from a team as a code owner July 8, 2024 16:35
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Just a few minor suggestions.

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @Uzair5162)


pkg/sql/stats/merge.go line 140 at r1 (raw file):

	ctx context.Context, fullStat *TableStatistic, partialStat *TableStatistic, st *cluster.Settings,
) (*TableStatistic, error) {
	var compareCtx *eval.Context

[nit] There's a similar cmpCtx field defined below - let's de-duplicate.

More importantly, I don't think we should be using a nil eval.Context for these comparisons. We should fix that before we flip partial stats on. Mind opening an issue?


pkg/sql/opt/exec/execbuilder/testdata/partial_stats line 1316 at r1 (raw file):

----
statistics_name  column_names  row_count  distinct_count  null_count
__merged__       {a}           10002      9803            1

Think it's worth showing what's inside the merged stat? Or at least, the first and last couple buckets?

Previously, we were iterating over the buckets of the merged
histogram to compute its RowCount and DistinctCount. We were casting
each bucket's DistinctRange count from a float to an int in this loop,
which resulted in a final distinct count that was less than expected for
the merged stat for larger tables.

This commit fixes this and simplifies the computation of the merged stat
counts by using the partial and full stat counts instead of iterating
over the merged histogram.

Fixes: cockroachdb#126730

See also: cockroachdb#125950

Release note (bug fix): Fixed a bug when creating partial statistics
using extremes (which is disabled by default) where it would
result in a merged statistic with inaccurate distinct counts. Also made
the merged statistic computation more efficient.
@Uzair5162 Uzair5162 force-pushed the fix-partial-stats-merging-outer-buckets-and-counts-gh-126822-126730 branch from 23d3e34 to 87559c3 Compare July 8, 2024 20:15
Copy link
Copy Markdown
Contributor Author

@Uzair5162 Uzair5162 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)


pkg/sql/stats/merge.go line 140 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] There's a similar cmpCtx field defined below - let's de-duplicate.

More importantly, I don't think we should be using a nil eval.Context for these comparisons. We should fix that before we flip partial stats on. Mind opening an issue?

Done, #126856


pkg/sql/opt/exec/execbuilder/testdata/partial_stats line 1316 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Think it's worth showing what's inside the merged stat? Or at least, the first and last couple buckets?

Done, I've added the last few merged stat buckets to the test to verify that the partial stat buckets have been appended as expected.

@Uzair5162 Uzair5162 requested a review from DrewKimball July 8, 2024 20:26
@Uzair5162
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 9, 2024

@craig craig bot merged commit 21cddcf into cockroachdb:master Jul 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql/stats: can't merge partial and full statistics with outer buckets sql/stats: merged partial statistics have inaccurate distinct counts

3 participants