Skip to content

sql: introduce special type for inverted index keys#77501

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
RaduBerinde:inv-key-type
Mar 16, 2022
Merged

sql: introduce special type for inverted index keys#77501
craig[bot] merged 1 commit intocockroachdb:masterfrom
RaduBerinde:inv-key-type

Conversation

@RaduBerinde
Copy link
Copy Markdown
Member

Note: this is intended to merge to master after the 22.1 branch is cut.

Currently we pass around inverted index keys as having Bytes type,
but these datums behave in subtly different ways. The
encoding/decoding is special-cased to be a no-op. This leads to sharp
edges; for example, we put these keys in EncodedDatums but if we
ever try to decode them (via EnsureDecodedd), we get obscure
decoding errors.

This commit cleans up this situation by adding a special EncodedKey
type and corresponding DEncodedKey datum. It behaves like
Bytes/DBytes in most cases, except key encoding and decoding.

Release note: None

@RaduBerinde RaduBerinde requested a review from a team as a code owner March 8, 2022 19:10
@RaduBerinde RaduBerinde requested review from a team March 8, 2022 19:10
@RaduBerinde RaduBerinde requested a review from a team as a code owner March 8, 2022 19:10
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 27 of 27 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @RaduBerinde, and @rytaft)


-- commits, line 8 at r1:
nit: s/Decodedd/Decoded/.


pkg/sql/sem/tree/datum.go, line 1626 at r1 (raw file):

// Format implements the NodeFormatter interface.
func (d *DEncodedKey) Format(ctx *FmtCtx) {
	f := ctx.flags

nit: it seems like it'd be good to extract a helper to be shared by two Bytes datums.


pkg/sql/sem/tree/datum_alloc.go, line 151 at r1 (raw file):

}

// NewDEncodedKey allocates a DBytes.

nit: s/DBytes/DEncodedKey/.

Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea!

To prevent poor row count estimations I think we'll need to include the new type you've created here:

case types.StringFamily, types.BytesFamily, types.UuidFamily, types.INetFamily:

Otherwise, if a query filters by a value in the middle of a histogram bucket, the estimated row count will be the bucket's num_range divided by two.

Does the histo_col_type field need to change in table statistics? Or is it fine to leave that as BYTES for inverted index columns.

Reviewed 27 of 27 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @RaduBerinde and @rytaft)


pkg/sql/types/types.go, line 468 at r1 (raw file):

	// EncodedKey is a special type used internally for inverted index keys, which
	// do not fully encode an object.
	EncodedKey = &T{

nit: Unless you think it's likely this will be used for other purposes in the future, it might make sense to make the name of the type and family more specific to inverted index keys.

@mgartner
Copy link
Copy Markdown
Contributor

mgartner commented Mar 8, 2022

To prevent poor row count estimations...

See #68740 for some more context on this.

@mgartner
Copy link
Copy Markdown
Contributor

mgartner commented Mar 8, 2022

To prevent poor row count estimations...

Hmm, I guess these datums are the upper and lower bound of the histogram buckets, so if the type remains as BYTES, perhaps no change is needed here.

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statistics creation and props code still uses DBytes (props.makeSpanFromInvertedSpan), I don't think it's worth changing that. I will add some more comments around the code though.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner, @RaduBerinde, and @rytaft)


pkg/sql/types/types.go, line 468 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: Unless you think it's likely this will be used for other purposes in the future, it might make sense to make the name of the type and family more specific to inverted index keys.

I think I'd like to keep it, it's a sound concept of its own, maybe there will be some use case for passing around encoded keys. I could imagine potential usecases around composite types like collated strings.

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rytaft, and @yuzefovich)


pkg/sql/sem/tree/datum.go, line 1626 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it seems like it'd be good to extract a helper to be shared by two Bytes datums.

I just "rerouted" to the DBytes implementation.

Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The statistics creation and props code still uses DBytes (props.makeSpanFromInvertedSpan), I don't think it's worth changing that. I will add some more comments around the code though.

SGTM. Thanks!

Reviewed 4 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @yuzefovich)


pkg/sql/sem/tree/datum.go, line 1564 at r2 (raw file):

}

// DEncodedKey is the bytes Datum. The underlying type is a string because we want

nit: DEncodedKey is not the bytes datum. It'd be helpful to explain the contrast between the type and DBytes here.


pkg/sql/types/types.go, line 468 at r1 (raw file):

Previously, RaduBerinde wrote…

I think I'd like to keep it, it's a sound concept of its own, maybe there will be some use case for passing around encoded keys. I could imagine potential usecases around composite types like collated strings.

Sounds good to me. In that case, I think the comment should be more general and explain the behavior of the type (e.g. "EncodedKey is similar to DBytes but has a different behavior for encoding and decoding ...") before mentioning the inverted index key use-case.

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @RaduBerinde, @rytaft, and @yuzefovich)


pkg/sql/types/types.go, line 468 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Sounds good to me. In that case, I think the comment should be more general and explain the behavior of the type (e.g. "EncodedKey is similar to DBytes but has a different behavior for encoding and decoding ...") before mentioning the inverted index key use-case.

Done, PTAL

Copy link
Copy Markdown
Contributor

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 5 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rytaft and @yuzefovich)


pkg/sql/types/types.go, line 468 at r1 (raw file):

Previously, RaduBerinde wrote…

Done, PTAL

Looks great thanks!

@RaduBerinde RaduBerinde force-pushed the inv-key-type branch 3 times, most recently from 07f4098 to d17cef9 Compare March 13, 2022 05:32
Currently we pass around inverted index keys as having `Bytes` type,
but these datums behave in subtly different ways. The
encoding/decoding is special-cased to be a no-op. This leads to sharp
edges; for example, we put these keys in `EncodedDatum`s but if we
ever try to decode them (via `EnsureDecoded`), we get obscure
decoding errors.

This commit cleans up this situation by adding a special `EncodedKey`
type and corresponding `DEncodedKey` datum. It behaves like
`Bytes/DBytes` in most cases, except key encoding and decoding.

Release note: None
@RaduBerinde
Copy link
Copy Markdown
Member Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 16, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants