Skip to content

opt: support BYTES for histogram range calculations#68740

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
mgartner:fix-bytes-histogram-filtering
Sep 24, 2021
Merged

opt: support BYTES for histogram range calculations#68740
craig[bot] merged 2 commits intocockroachdb:masterfrom
mgartner:fix-bytes-histogram-filtering

Conversation

@mgartner
Copy link
Copy Markdown
Contributor

Fixes #68346

Release note (performance improvement): The accuracy of histogram
calculations for BYTES types has been improved. As a result, the
optimizer should generate more efficient query plans in some cases.

@mgartner mgartner requested review from a team, michae2 and rytaft August 11, 2021 23:12
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking at this!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @rytaft)


pkg/sql/opt/props/histogram.go, line 864 at r1 (raw file):

Quoted 5 lines of code…
	switch typ.Family() {
	case types.StringFamily, types.BytesFamily, types.UuidFamily, types.INetFamily:
		return true
	}
	return false

To my untrained eye it looks like this list is missing a few more non-numeric type families, such as CollatedStringFamily and JsonFamily. Plus, I'm wondering if ArrayFamily and TupleFamily belong here as well? Hmm. I am new to this code, but it almost seems like getRangeNonNumeric should be the default range-calculation function, and getRange should only be called for special cases, since presumably every type's key encoding is bytewise ordered / comparable and at least somewhat bytewise "dense".

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner and @michae2)


pkg/sql/opt/props/histogram.go, line 864 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…
	switch typ.Family() {
	case types.StringFamily, types.BytesFamily, types.UuidFamily, types.INetFamily:
		return true
	}
	return false

To my untrained eye it looks like this list is missing a few more non-numeric type families, such as CollatedStringFamily and JsonFamily. Plus, I'm wondering if ArrayFamily and TupleFamily belong here as well? Hmm. I am new to this code, but it almost seems like getRangeNonNumeric should be the default range-calculation function, and getRange should only be called for special cases, since presumably every type's key encoding is bytewise ordered / comparable and at least somewhat bytewise "dense".

I like the idea of swapping the default, and checking instead for the types that we explicitly support in getRange. Can these other non-numeric types be easily added in this PR? Otherwise, maybe open an issue for the other types that we're missing?

On a somewhat related note, I think we can also add Interval, Enum, and Oid as numeric types (in a separate PR...).


pkg/sql/opt/props/histogram_test.go, line 710 at r1 (raw file):

	})

	t.Run("bytes", func(t *testing.T) {

is all of this basically directly copied from above? if so, can you just have one copy that's used for both string and bytes?

@mgartner mgartner force-pushed the fix-bytes-histogram-filtering branch from e2ae5b2 to 28adb74 Compare September 16, 2021 16:31
@mgartner mgartner requested a review from a team as a code owner September 16, 2021 16:31
Copy link
Copy Markdown
Contributor Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in getting back to these comments. PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)


pkg/sql/opt/props/histogram.go, line 864 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I like the idea of swapping the default, and checking instead for the types that we explicitly support in getRange. Can these other non-numeric types be easily added in this PR? Otherwise, maybe open an issue for the other types that we're missing?

On a somewhat related note, I think we can also add Interval, Enum, and Oid as numeric types (in a separate PR...).

I've combined the three functions getRange, getRangeNonNumeric and isNonNumeric range into a single switch statement. I think this is more clear and avoids having to keep three functions in-sync.


pkg/sql/opt/props/histogram_test.go, line 710 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

is all of this basically directly copied from above? if so, can you just have one copy that's used for both string and bytes?

Done.

Copy link
Copy Markdown
Contributor Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #70320 to track calculating ranges for additional types.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

@mgartner mgartner force-pushed the fix-bytes-histogram-filtering branch from 28adb74 to 73f3a67 Compare September 16, 2021 20:16
Copy link
Copy Markdown
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 3 files at r1, 1 of 1 files at r2, 4 of 4 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @michae2, and @rytaft)


pkg/sql/opt/props/histogram.go, line 864 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I've combined the three functions getRange, getRangeNonNumeric and isNonNumeric range into a single switch statement. I think this is more clear and avoids having to keep three functions in-sync.

OK, thanks for the refactor and the filed issue.


pkg/sql/opt/props/histogram_test.go, line 647 at r4 (raw file):

	})

	t.Run("string/bytes", func(t *testing.T) {

nit: "/" is the subtest separator for go test, so this name creates a TestFilterBucket/string/bytes sub-sub-test instead of a TestFilterBucket/<x> sub-test. Maybe use "string,bytes" or some other symbol instead?

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r2, 4 of 4 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/props/histogram.go, line 716 at r5 (raw file):

//
func getRangesBeforeAndAfter(
	bucketLowerBound, bucketUpperBound, spanLowerBound, spanUpperBound tree.Datum, swap bool,

nit: I think we can get rid of the bucket v. span distinction to further clarify things. i.e. bucket -> before, span -> after


pkg/sql/opt/props/histogram.go, line 820 at r5 (raw file):

		// For non-numeric types, convert the datums to encoded keys to
		// approximate the range. We utilize an array to reduce repetitive code
		// number of repetitive calls.

nit: "...to reduce repetitive code number of repetitive calls." is repetitive! 😉

@mgartner mgartner force-pushed the fix-bytes-histogram-filtering branch from 73f3a67 to a041913 Compare September 23, 2021 16:34
Copy link
Copy Markdown
Contributor Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @rytaft)


pkg/sql/opt/props/histogram.go, line 716 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: I think we can get rid of the bucket v. span distinction to further clarify things. i.e. bucket -> before, span -> after

Done.


pkg/sql/opt/props/histogram.go, line 820 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: "...to reduce repetitive code number of repetitive calls." is repetitive! 😉

🙃


pkg/sql/opt/props/histogram_test.go, line 647 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: "/" is the subtest separator for go test, so this name creates a TestFilterBucket/string/bytes sub-sub-test instead of a TestFilterBucket/<x> sub-test. Maybe use "string,bytes" or some other symbol instead?

Good catch. Fixed.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @mgartner)


pkg/sql/opt/props/histogram.go, line 666 at r6 (raw file):

}

// getRangesBeforeAndAfter returns the size of the ranges before and after the

nit: update comment to remove references to bucket/span


pkg/sql/opt/props/histogram.go, line 736 at r6 (raw file):

	// TODO(rytaft): handle more types here.
	// Note: the calculations below assume that bucketLowerBound is inclusive and
	// Span.PreferInclusive() has been called on the span.

nit: update comment (just say all bounds are inclusive)

Fixes cockroachdb#68346

Release note (performance improvement): The accuracy of histogram
calculations for BYTES types has been improved. As a result, the
optimizer should generate more efficient query plans in some cases.
Histogram range calculation has been refactored into a single switch
statement. This prohibits inconsistencies between the three previous
functions `getRange`, `getRangeNonNumeric` and `isNonNumeric`. It also
highlights the preference for calculating numeric ranges when possible.

Release note: None
@mgartner mgartner force-pushed the fix-bytes-histogram-filtering branch from a041913 to 03c45a0 Compare September 23, 2021 20:16
Copy link
Copy Markdown
Contributor Author

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @rytaft)


pkg/sql/opt/props/histogram.go, line 666 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: update comment to remove references to bucket/span

Done.


pkg/sql/opt/props/histogram.go, line 736 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: update comment (just say all bounds are inclusive)

Done.

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r7, 2 of 2 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner)

@mgartner
Copy link
Copy Markdown
Contributor Author

TFTRs!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 24, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

opt: row estimates for bytes columns inaccurate

4 participants