sql: fix bugs and improve merging partial and full statistics#126830
Conversation
We merge partial stats with full stats by prepending partial histogram buckets with UpperBound less than the first full histogram bucket and appending partial histogram buckets with UpperBounds greater than the last full histogram bucket. This doesn't work when outer buckets exist in either histogram since they have max or min valued UpperBounds, which overlap with the other statistic's buckets. This commit fixes this by removing outer buckets from both histograms before merging them, and then adjusting counts at the end to add outer buckets back if necessary. Fixes: cockroachdb#126822 See also: cockroachdb#125950 Release note (bug fix): Fixed a bug when creating partial statistics using extremes (which is disabled by default) where it would occasionally fail to merge the new partial stat with existing full stats. This occurs when outer buckets were added to either statistic to account for extra distinct count.
DrewKimball
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @Uzair5162)
pkg/sql/stats/merge.go line 140 at r1 (raw file):
ctx context.Context, fullStat *TableStatistic, partialStat *TableStatistic, st *cluster.Settings, ) (*TableStatistic, error) { var compareCtx *eval.Context
[nit] There's a similar cmpCtx field defined below - let's de-duplicate.
More importantly, I don't think we should be using a nil eval.Context for these comparisons. We should fix that before we flip partial stats on. Mind opening an issue?
pkg/sql/opt/exec/execbuilder/testdata/partial_stats line 1316 at r1 (raw file):
---- statistics_name column_names row_count distinct_count null_count __merged__ {a} 10002 9803 1
Think it's worth showing what's inside the merged stat? Or at least, the first and last couple buckets?
Previously, we were iterating over the buckets of the merged histogram to compute its RowCount and DistinctCount. We were casting each bucket's DistinctRange count from a float to an int in this loop, which resulted in a final distinct count that was less than expected for the merged stat for larger tables. This commit fixes this and simplifies the computation of the merged stat counts by using the partial and full stat counts instead of iterating over the merged histogram. Fixes: cockroachdb#126730 See also: cockroachdb#125950 Release note (bug fix): Fixed a bug when creating partial statistics using extremes (which is disabled by default) where it would result in a merged statistic with inaccurate distinct counts. Also made the merged statistic computation more efficient.
23d3e34 to
87559c3
Compare
Uzair5162
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball)
pkg/sql/stats/merge.go line 140 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit] There's a similar
cmpCtxfield defined below - let's de-duplicate.More importantly, I don't think we should be using a
nileval.Contextfor these comparisons. We should fix that before we flip partial stats on. Mind opening an issue?
Done, #126856
pkg/sql/opt/exec/execbuilder/testdata/partial_stats line 1316 at r1 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Think it's worth showing what's inside the merged stat? Or at least, the first and last couple buckets?
Done, I've added the last few merged stat buckets to the test to verify that the partial stat buckets have been appended as expected.
|
bors r+ |
sql: ignore outer buckets when merging partial and full statistics
We merge partial stats with full stats by prepending partial histogram buckets with UpperBound less than the first full histogram bucket and appending partial histogram buckets with UpperBounds greater than the last full histogram bucket. This doesn't work when outer buckets exist in either histogram since they have max or min valued UpperBounds, which overlap with the other statistic's buckets.
This commit fixes this by removing outer buckets from both histograms before merging them, and then adjusting counts at the end to add outer buckets back if necessary.
sql: fix and improve the computation of merged statistic counts
Previously, we were iterating over the buckets of the merged histogram to compute its RowCount and DistinctCount. We were casting each bucket's DistinctRange count from a float to an int in this loop, which resulted in a final distinct count that was less than expected for the merged stat for larger tables.
This commit fixes this and simplifies the computation of the merged stat counts by using the partial and full stat counts instead of iterating over the merged histogram.
Fixes: #126822
Fixes: #126730
See also: #125950
Release note (bug fix): Fixed a bug when creating partial statistics using extremes (which is disabled by default) where it would occasionally fail to merge the new partial stat with existing full stats. This occurs when outer buckets were added to either statistic to account for extra distinct count. Also fixed a bug where creating partial statistics would result in a merged statistic with inaccurate distinct counts, and made the merged statistic computation more efficient.