sql/stats: convert between histograms and quantile functions#83796
sql/stats: convert between histograms and quantile functions#83796craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
This PR is based on #83123. The first commit is from that PR and can be ignored. |
rytaft
left a comment
There was a problem hiding this comment.
Very nice!
Reviewed 5 of 5 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @yuzefovich)
pkg/sql/stats/quantile.go line 82 at r2 (raw file):
// 190 + - - - - - - - - - - // 0 .2 .4 .6 .8 1 //
Love this comment! Super helpful.
pkg/sql/stats/quantile.go line 118 at r2 (raw file):
types.TimestampTZFamily: // TODO(michae2): Check that there are no constraints making this a de facto // ENUM. (Could also check histogram for sum(NumRange) > 0.)
How is this an enum? I'm a bit confused by this comment.... (I know it's a note to yourself, but wouldn't hurt to make it understandable by others too)
pkg/sql/stats/quantile.go line 275 at r2 (raw file):
lowerBound := getNextLowerBound(nilEvalContext, currentUpperBound) distinctRange := estimatedDistinctValuesInRange( nilEvalContext, currentBucket.NumRange, lowerBound, currentUpperBound,
seems like lowerBound and currentUpperBound are going to be basically the same here. Maybe you need to get the value of lowerBound earlier (i.e. from the previous bucket)?
pkg/sql/stats/quantile.go line 337 at r2 (raw file):
} func isPositive(x float64) bool {
nit: should this be isNonNegative?
pkg/sql/stats/quantile_test.go line 27 at r2 (raw file):
) type testHistogram []testBucket
Do you think you could add a test where you create a random histogram or quantile function and test that it round-trips?
yuzefovich
left a comment
There was a problem hiding this comment.
Great work - it was an interesting read!
Reviewed 1 of 2 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)
pkg/sql/stats/quantile.go line 82 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Love this comment! Super helpful.
+1
pkg/sql/stats/quantile.go line 102 at r2 (raw file):
// makeQuantile and quantile.toHistogram might need to change if we introduce a // new histogram version. const _ uint = 1 - uint(histVersion)
nit: is this supposed to fail the compilation once histVersion is increased? If so, why that would be the case? Or this just a usage of histVersion value so that in the future this comment would be found via "Find usages"?
pkg/sql/stats/quantile.go line 141 at r2 (raw file):
// To produce a quantile with first point at p=0 and at least two points, we // need the first bucket to have NumRange == 0. if hist.buckets[0].NumRange != 0 {
Just a question: is this a limiting assumption or can we always have such histograms that satisfy the assumption?
pkg/sql/stats/quantile.go line 151 at r2 (raw file):
// zero-row buckets from the beginning and end of the histogram. qfTrimLo, qfTrimHi quantileIndex qf quantile
How large can qf get? Will it be limited by the number of histogram buckets times two? Should we do memory accounting somewhere? (It doesn't seem like we're using the quantiles yet, and maybe the responsibility will be pushed to the caller - might be worth mentioning it in the comment to makeQuantile.)
pkg/sql/stats/quantile.go line 214 at r2 (raw file):
// Fix any floating point errors or histogram errors (e.g. sum of bucket row // counts < total row count) causing p to be below 1 at the end. qf[len(qf)-1].p = 1
nit: shouldn't this already be fixed in addPoint?
pkg/sql/stats/quantile.go line 238 at r2 (raw file):
var i int // Skip any leading p=0 points instead of emitting zero-row buckets.
IIUC we must have already trimmed all but one zero-row buckets from the start in makeQuantile - do we do this out of caution? Maybe we should have an assertion check that qf[0].p == 0 && qf[1].p > 0?
mgartner
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)
pkg/sql/stats/quantile.go line 82 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
+1
+2
pkg/sql/stats/quantile.go line 102 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: is this supposed to fail the compilation once
histVersionis increased? If so, why that would be the case? Or this just a usage ofhistVersionvalue so that in the future this comment would be found via "Find usages"?
Compilation will fail when histVersion increases because a negative value overflows uint.
pkg/sql/stats/quantile.go line 156 at r2 (raw file):
) addPoint := func(num, v float64) error {
nit: might be helpful to add a comment explaining this function and its arguments. I'm assuming num is the number times value v exists in the histogram?
pkg/sql/stats/quantile.go line 214 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: shouldn't this already be fixed in
addPoint?
Looks like that only fixes a p if p > 1. The final p could be slightly less than 1, so this is fixing that - 100% of the values should be less than or equal to the maximum value in the histogram.
pkg/sql/stats/quantile.go line 258 at r2 (raw file):
var pEq float64 closeCurrentBucket := func() error {
nit: describe briefly what this function does
michae2
left a comment
There was a problem hiding this comment.
Thanks for the feedback so far!
- Added a test that tries to round-trip a random histogram.
- Added a
histogram.Stringmethod for easier debugging. - Now preallocating slices in
makeQuantileandquantile.toHistogram.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rytaft, and @yuzefovich)
pkg/sql/stats/quantile.go line 102 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Compilation will fail when
histVersionincreases because a negative value overflowsuint.
Yep, it's a compile-time check to make sure someone looks at this file when changing histogram formats. I tried making the comment clearer.
pkg/sql/stats/quantile.go line 118 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
How is this an enum? I'm a bit confused by this comment.... (I know it's a note to yourself, but wouldn't hurt to make it understandable by others too)
This is about columns like the hidden shard column of a hash-sharded index, which have the correct type but behave more like enums than integers. I tried making the comment a little better, lmk if it needs more work.
pkg/sql/stats/quantile.go line 141 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Just a question: is this a limiting assumption or can we always have such histograms that satisfy the assumption?
We currently always produce histograms that satisfy this assumption. I don't think NumRange != 0 in the first bucket would ever really make sense (there is only one endpoint for the range counted by NumRange of the first bucket) but I wanted to be explicit.
pkg/sql/stats/quantile.go line 151 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
How large can
qfget? Will it be limited by the number of histogram buckets times two? Should we do memory accounting somewhere? (It doesn't seem like we're using the quantiles yet, and maybe the responsibility will be pushed to the caller - might be worth mentioning it in the comment tomakeQuantile.)
Good questions. Yes, qf is limited to len(hist.buckets) * 2 so I went ahead and pre-allocated it (and also pre-allocated the histogram slice in toHistogram). The quantiles will only exist briefly whenever calculating a statistics forecast, after which they will be thrown away. The end result will be an in-memory histogram in the stats cache. I'm not sure the stats cache actually has memory accounting, I should probably add that in a future PR.
pkg/sql/stats/quantile.go line 156 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: might be helpful to add a comment explaining this function and its arguments. I'm assuming
numis the number times valuevexists in the histogram?
Done. (num could also be counting values < v if we're adding a point for NumRange.)
pkg/sql/stats/quantile.go line 214 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Looks like that only fixes a
pifp > 1. The finalpcould be slightly less than1, so this is fixing that - 100% of the values should be less than or equal to the maximum value in the histogram.
Yes, the check in addPoint handles p > 1, this handles p < 1.
pkg/sql/stats/quantile.go line 238 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
IIUC we must have already trimmed all but one zero-row buckets from the start in
makeQuantile- do we do this out of caution? Maybe we should have an assertion check thatqf[0].p == 0 && qf[1].p > 0?
It's a little silly in this PR, but it will make more sense when I open the next PR adding operations on quantiles for performing linear regression. Those operations could create a quantile with multiple p=0 points.
pkg/sql/stats/quantile.go line 258 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
nit: describe briefly what this function does
Done.
pkg/sql/stats/quantile.go line 275 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
seems like
lowerBoundandcurrentUpperBoundare going to be basically the same here. Maybe you need to get the value oflowerBoundearlier (i.e. from the previous bucket)?
Gah, good catch! Thank you. Done.
pkg/sql/stats/quantile.go line 337 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: should this be
isNonNegative?
Yes, it should be. I wanted to avoid the double negative (if !isNonNegative) (or is that a triple negative? 😁) so I went with isInvalidCount instead, lmk what you think.
pkg/sql/stats/quantile_test.go line 27 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Do you think you could add a test where you create a random histogram or quantile function and test that it round-trips?
Yes, definitely. It's not particularly succinct, but here's my attempt at a random histogram generator. This caught one bug, which was that I needed a real non-nil eval.Context in quantile.toHistogram. After fixing that, it seems to be round-tripping successfully so far.
mgartner
left a comment
There was a problem hiding this comment.
Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @rytaft, and @yuzefovich)
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/stats/quantile_test.go line 37 at r4 (raw file):
// changing. func TestRandomQuantileRoundTrip(t *testing.T) { colTypes := []*types.T{
nit: so that in the future we don't forget to include newly-supported types, it might be better to use CanMakeQuantile and types.Scalar here to populate colTypes (we'll then need to add ints and floats of smaller widths separately).
pkg/sql/stats/quantile_test.go line 53 at r4 (raw file):
t.Run(fmt.Sprintf("%v/%v", colType.Name(), i), func(t *testing.T) { hist, rowCount := randHist(evalCtx, colType, rng) //t.Log(hist)
nit: seems like a leftover.
rytaft
left a comment
There was a problem hiding this comment.
Reviewed 2 of 6 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)
pkg/sql/stats/quantile.go line 337 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Yes, it should be. I wanted to avoid the double negative (
if !isNonNegative) (or is that a triple negative? 😁) so I went withisInvalidCountinstead, lmk what you think.
How about leaving the logic as it was before and changing the name to isValidCount?
pkg/sql/stats/quantile_test.go line 27 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Yes, definitely. It's not particularly succinct, but here's my attempt at a random histogram generator. This caught one bug, which was that I needed a real non-nil
eval.Contextinquantile.toHistogram. After fixing that, it seems to be round-tripping successfully so far.
Nice!
To predict histograms in statistics forecasts, we will use linear regression over quantile functions. (Quantile functions are another representation of histogram data, in a form more amenable to statistical manipulation.) This commit defines quantile functions and adds methods to convert between histograms and quantile functions. This code was originally part of cockroachdb#77070 but has been pulled out to simplify that PR. A few changes have been made: - Common code has been factored into closures. - More checks have been added for positive values. - In `makeQuantile` we now trim leading empty buckets as well as trailing empty buckets. - The logic in `quantile.toHistogram` to steal from `NumRange` if `NumEq` is zero now checks that `NumRange` will still be >= 1. - More tests have been added. Assists: cockroachdb#79872 Release note: None
michae2
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner, @michae2, @rytaft, and @yuzefovich)
pkg/sql/stats/quantile.go line 337 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
How about leaving the logic as it was before and changing the name to
isValidCount?
Done.
pkg/sql/stats/quantile_test.go line 37 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: so that in the future we don't forget to include newly-supported types, it might be better to use
CanMakeQuantileandtypes.Scalarhere to populatecolTypes(we'll then need to add ints and floats of smaller widths separately).
Ah, great idea! Done.
pkg/sql/stats/quantile_test.go line 53 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: seems like a leftover.
Done.
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)
rytaft
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)
mgartner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft)
|
TFTRs! bors r=mgartner,rytaft,yuzefovich |
|
Build succeeded: |
To predict histograms in statistics forecasts, we will use linear
regression over quantile functions. (Quantile functions are another
representation of histogram data, in a form more amenable to statistical
manipulation.)
This commit defines quantile functions and adds methods to convert
between histograms and quantile functions.
This code was originally part of #77070 but has been pulled out to
simplify that PR. A few changes have been made:
makeQuantilewe now trim leading empty buckets as well astrailing empty buckets.
quantile.toHistogramto steal fromNumRangeifNumEqis zero now checks thatNumRangewill still be >= 1.Assists: #79872
Release note: None