Skip to content

sql: add tsvector and tsquery types to SQL#90842

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
jordanlewis:tsvector-types
Dec 2, 2022
Merged

sql: add tsvector and tsquery types to SQL#90842
craig[bot] merged 3 commits intocockroachdb:masterfrom
jordanlewis:tsvector-types

Conversation

@jordanlewis
Copy link
Copy Markdown
Member

@jordanlewis jordanlewis commented Oct 28, 2022

Updates #41288.

Broken off from #83736

This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of @@ (the tsquery match operator) by delegating to the tsearch.eval package.

Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details.

Epic: None

@jordanlewis jordanlewis requested review from a team October 28, 2022 15:44
@jordanlewis jordanlewis requested a review from a team as a code owner October 28, 2022 15:44
@jordanlewis jordanlewis requested a review from a team October 28, 2022 15:44
@jordanlewis jordanlewis requested a review from a team as a code owner October 28, 2022 15:44
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@jordanlewis jordanlewis requested a review from rafiss October 28, 2022 15:46
@jordanlewis jordanlewis marked this pull request as draft October 28, 2022 16:39
@jordanlewis jordanlewis force-pushed the tsvector-types branch 7 times, most recently from d0836be to fb8a80d Compare October 29, 2022 19:00
@jordanlewis jordanlewis marked this pull request as ready for review October 29, 2022 19:01
@jordanlewis jordanlewis requested review from a team as code owners October 29, 2022 19:01
@jordanlewis jordanlewis requested review from ajwerner, renatolabs and srosenberg and removed request for a team October 29, 2022 19:01
@postamar postamar removed the request for review from a team November 9, 2022 15:23
Copy link
Copy Markdown
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking good! is there already a cluster version gate on being able to use these new types?

Reviewed 16 of 63 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @renatolabs, and @srosenberg)


pkg/sql/pgwire/types.go line 244 at r2 (raw file):

	case *tree.DTSQuery:
		b.writeLengthPrefixedString(v.String())

does this take care of "tree written in infix notation. Example: ( 'abc':*B | 'def' ) & !'ghi'".


pkg/sql/pgwire/types.go line 247 at r2 (raw file):

	case *tree.DTSVector:
		b.writeLengthPrefixedString(v.String())

does this take care of " As seen above. ' is escaped with '' and \ is escaped with \." consider adding a separate function for the pgwire text encodings to make this more explicit.


pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):

		return encoding.EncodeUntaggedBytesValue(b, encoded), nil
	case *tree.DTSVector:
		encoded := tsearch.EncodeTSVector(nil, t.TSVector)

why do you use the PGBinary version for tsquery, but not for tsvector?


pkg/sql/rowenc/valueside/encode.go line 88 at r2 (raw file):

		return encoding.EncodeTSQueryValue(appendTo, uint32(colID), encoded), nil
	case *tree.DTSVector:
		encoded := tsearch.EncodeTSVector(scratch, t.TSVector)

ditto question about PGBinary


pkg/sql/sem/cast/cast_map_gen.sh line 12 at r2 (raw file):

DATABASE="${1-postgres}"

psql $DATABASE -Xqtc "SELECT postgis_full_version()" &> /dev/null

why was this removed?


pkg/util/tsearch/encoding.go line 22 at r1 (raw file):

)

// EncodeTSVector encodes a tsvector into a serialized representation.

can you say when you would use this, versus EncodeTSVectorPGBinary?


pkg/util/tsearch/encoding.go line 87 at r1 (raw file):

//	The words are sorted when parsed, and only written once. Positions are also sorted and only written once.
//	For some reason, the unique check does not seem to be made for binary input, only text input...
//	text: As seen above. ' is escaped with '' and \ is escaped with \\.

what is "as seen above" referencing? if this is copied from https://www.npgsql.org/dev/types.html we should link to it

also, this is the "text" encoding, which i don't think we need in this comment


pkg/util/tsearch/encoding.go line 101 at r1 (raw file):

		appendTo = append(appendTo, []byte(l)...)
		appendTo = append(appendTo, byte(0))
		appendTo = encoding.EncodeUint16Ascending(appendTo, uint16(len(term.positions)))

should it error if len(positions) is greater than MaxUint16?


pkg/util/tsearch/encoding.go line 108 at r1 (raw file):

				return nil, err
			}
			// TODO(jordan): do we need to clear the top 2 bits first?

how can we find out how to answer this?


pkg/util/tsearch/encoding.go line 160 at r1 (raw file):

// A tree with operands and operators (&, |, !). Operands are strings,
// with optional weight (bitmask of ABCD) and prefix search (yes/no, written with *).
// text: the tree written in infix notation. Example: ( 'abc':*B | 'def' ) & !'ghi'

nit: don't need the text encoding in this comment


pkg/util/tsearch/encoding.go line 164 at r1 (raw file):

// First the number of tokens (a token is an operand or an operator).
// For each token:
// UInt8 type (1 = val, 2 = oper) followed by

https://www.npgsql.org/dev/types.html had some indentation here that made it easier to read


pkg/util/tsearch/random.go line 21 at r1 (raw file):

var alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

// RandomTSVector returns a random TSVector for testing.

can these go into a test file?

Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Added gates for the casts, which should be enough since we don't have disk-storage yet in this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rafiss, @renatolabs, and @srosenberg)


pkg/sql/pgwire/types.go line 244 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

does this take care of "tree written in infix notation. Example: ( 'abc':*B | 'def' ) & !'ghi'".

Yes, this is how the String method works.


pkg/sql/pgwire/types.go line 247 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

does this take care of " As seen above. ' is escaped with '' and \ is escaped with \." consider adding a separate function for the pgwire text encodings to make this more explicit.

Yes, also just how the string function works. I guess I could make it more explicit, but I think it's typical that we have the .String() method produce the same output that you'd see on the commandline, which is the same as the text serialization format usually, right?

I'm just looking at how the JSON case above works as prior art here.


pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

why do you use the PGBinary version for tsquery, but not for tsvector?

I created a hand-rolled TSVector storage format while implementing TSVector, but re-used the PGBinary one for TSQuery. There's no real rhyme or reason to this, just how the implementation started.

Thinking it over, I think the main reason to have different functions is if the PG wire protocol evolves in a new version - in that case, we would need to copy the function and only change the wire protocol one.

I went ahead and made a new version for TSQuery. It's identical to the PGBinary one, except it uses the Cockroach-style length-prefixed strings instead of the Postgres-style 0-byte terminated strings.


pkg/sql/rowenc/valueside/encode.go line 88 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

ditto question about PGBinary

Done.


pkg/sql/sem/cast/cast_map_gen.sh line 12 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

why was this removed?

Oops, thanks for catching this - I deleted it temporarily because I couldn't figure out how to install postgis locally to run this script.


pkg/util/tsearch/encoding.go line 22 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

can you say when you would use this, versus EncodeTSVectorPGBinary?

This one is for on-disk storage, and the other one is for pg binary wire protocol.


pkg/util/tsearch/encoding.go line 87 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

what is "as seen above" referencing? if this is copied from https://www.npgsql.org/dev/types.html we should link to it

also, this is the "text" encoding, which i don't think we need in this comment

Done.


pkg/util/tsearch/encoding.go line 101 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

should it error if len(positions) is greater than MaxUint16?

Yes, it should. I am missing a bunch of max length checks, I've added them and tests.


pkg/util/tsearch/encoding.go line 108 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

how can we find out how to answer this?

It was leftover when writing this code, it's not necessary.


pkg/util/tsearch/encoding.go line 164 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

https://www.npgsql.org/dev/types.html had some indentation here that made it easier to read

Done.


pkg/util/tsearch/random.go line 21 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

can these go into a test file?

They cannot, because of the way randgen is set up (it's a non-test package).

@jordanlewis jordanlewis force-pushed the tsvector-types branch 2 times, most recently from ef06db0 to 985307a Compare November 18, 2022 20:51
Copy link
Copy Markdown
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more request for tests: can you add encoding tests for the new types to pgwire.TestEncodings?

  1. Add examples to pkg/cmd/generate-binary/main.go
  2. Start PG server
  3. cd pkg/cmd/generate-binary; go run main.go > ../../sql/pgwire/testdata/encodings.json
  4. I think there might be some tech debt so that some of the float encodings are incorrectly adjusted. undo that part of the diff

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @renatolabs, and @srosenberg)


pkg/sql/pgwire/types.go line 247 at r2 (raw file):

.String() method produce the same output that you'd see on the commandline, which is the same as the text serialization format usually

i don't think there is a standard one way or another about String. the reliable way to get the pgwire text format is to use the FmtPgwireText flag, but as you say, not all types need that. for example, arrays and tuples do need it. might as well use it here too


pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I created a hand-rolled TSVector storage format while implementing TSVector, but re-used the PGBinary one for TSQuery. There's no real rhyme or reason to this, just how the implementation started.

Thinking it over, I think the main reason to have different functions is if the PG wire protocol evolves in a new version - in that case, we would need to copy the function and only change the wire protocol one.

I went ahead and made a new version for TSQuery. It's identical to the PGBinary one, except it uses the Cockroach-style length-prefixed strings instead of the Postgres-style 0-byte terminated strings.

sgtm

Teach tsearch to respect the Postgres limits on various pieces of the
text search implementations.

- at most 256 positions, silently truncated
- positions are at largest 16383, silently truncated
- at most 16384 inside of a phrase search operator
- lexemes are at most length 2046

Release note: None
This commit adds value and pgwire encodings for tsvector and tsquery.

Release note: None
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot about this test - good suggestion, it turned up 2 bugs in the pgwire encoding. It would be nice if we didn't have to hand-generate these but I guess I don't see a better way given that you need a running pg server to make it happen.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rafiss, @renatolabs, and @srosenberg)


pkg/sql/pgwire/types.go line 247 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

.String() method produce the same output that you'd see on the commandline, which is the same as the text serialization format usually

i don't think there is a standard one way or another about String. the reliable way to get the pgwire text format is to use the FmtPgwireText flag, but as you say, not all types need that. for example, arrays and tuples do need it. might as well use it here too

Done.


pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

sgtm

Done.

This commit adds the tsvector and tsquery types to sql, and makes them
available for use using the string::tsvector and string::tsquery casts.
It also implements the in-memory evaluation of @@ (the tsquery match
operator) by delegating to the tsearch.eval package.

Release note (sql change): add in-memory-only evaluation of tsvector and
tsquery datatypes and the @@ matches operator.
Copy link
Copy Markdown
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @renatolabs, and @srosenberg)

@jordanlewis
Copy link
Copy Markdown
Member Author

TFTR!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 2, 2022

Build succeeded:

@craig craig bot merged commit f7f24fb into cockroachdb:master Dec 2, 2022
@jordanlewis jordanlewis deleted the tsvector-types branch December 2, 2022 20:32
@rafiss
Copy link
Copy Markdown
Collaborator

rafiss commented Dec 5, 2022

@jordanlewis somehow this passed CI even though benchmarks fail with

10:47:17       --- FAIL: BenchmarkAsJSON/tsquery
10:47:17           json_test.go:56: unexpected type *tree.DTSQuery for AsJSON
10:47:17       --- FAIL: BenchmarkAsJSON/tsvector
10:47:17           json_test.go:56: unexpected type *tree.DTSVector for AsJSON

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/7844044?

@srosenberg
Copy link
Copy Markdown
Member

It failed on Bazel Extended CI [1] during the original run. Note that microbenches are not executed by bors. The latter runs only Bazel Essential CI. @herkolategan FYI as we were discussing something related earlier today.

[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/7708653

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants