sql: add tsvector and tsquery types to SQL#90842
sql: add tsvector and tsquery types to SQL#90842craig[bot] merged 3 commits intocockroachdb:masterfrom
Conversation
d0836be to
fb8a80d
Compare
rafiss
left a comment
There was a problem hiding this comment.
this is looking good! is there already a cluster version gate on being able to use these new types?
Reviewed 16 of 63 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @renatolabs, and @srosenberg)
pkg/sql/pgwire/types.go line 244 at r2 (raw file):
case *tree.DTSQuery: b.writeLengthPrefixedString(v.String())
does this take care of "tree written in infix notation. Example: ( 'abc':*B | 'def' ) & !'ghi'".
pkg/sql/pgwire/types.go line 247 at r2 (raw file):
case *tree.DTSVector: b.writeLengthPrefixedString(v.String())
does this take care of " As seen above. ' is escaped with '' and \ is escaped with \." consider adding a separate function for the pgwire text encodings to make this more explicit.
pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):
return encoding.EncodeUntaggedBytesValue(b, encoded), nil case *tree.DTSVector: encoded := tsearch.EncodeTSVector(nil, t.TSVector)
why do you use the PGBinary version for tsquery, but not for tsvector?
pkg/sql/rowenc/valueside/encode.go line 88 at r2 (raw file):
return encoding.EncodeTSQueryValue(appendTo, uint32(colID), encoded), nil case *tree.DTSVector: encoded := tsearch.EncodeTSVector(scratch, t.TSVector)
ditto question about PGBinary
pkg/sql/sem/cast/cast_map_gen.sh line 12 at r2 (raw file):
DATABASE="${1-postgres}" psql $DATABASE -Xqtc "SELECT postgis_full_version()" &> /dev/null
why was this removed?
pkg/util/tsearch/encoding.go line 22 at r1 (raw file):
) // EncodeTSVector encodes a tsvector into a serialized representation.
can you say when you would use this, versus EncodeTSVectorPGBinary?
pkg/util/tsearch/encoding.go line 87 at r1 (raw file):
// The words are sorted when parsed, and only written once. Positions are also sorted and only written once. // For some reason, the unique check does not seem to be made for binary input, only text input... // text: As seen above. ' is escaped with '' and \ is escaped with \\.
what is "as seen above" referencing? if this is copied from https://www.npgsql.org/dev/types.html we should link to it
also, this is the "text" encoding, which i don't think we need in this comment
pkg/util/tsearch/encoding.go line 101 at r1 (raw file):
appendTo = append(appendTo, []byte(l)...) appendTo = append(appendTo, byte(0)) appendTo = encoding.EncodeUint16Ascending(appendTo, uint16(len(term.positions)))
should it error if len(positions) is greater than MaxUint16?
pkg/util/tsearch/encoding.go line 108 at r1 (raw file):
return nil, err } // TODO(jordan): do we need to clear the top 2 bits first?
how can we find out how to answer this?
pkg/util/tsearch/encoding.go line 160 at r1 (raw file):
// A tree with operands and operators (&, |, !). Operands are strings, // with optional weight (bitmask of ABCD) and prefix search (yes/no, written with *). // text: the tree written in infix notation. Example: ( 'abc':*B | 'def' ) & !'ghi'
nit: don't need the text encoding in this comment
pkg/util/tsearch/encoding.go line 164 at r1 (raw file):
// First the number of tokens (a token is an operand or an operator). // For each token: // UInt8 type (1 = val, 2 = oper) followed by
https://www.npgsql.org/dev/types.html had some indentation here that made it easier to read
pkg/util/tsearch/random.go line 21 at r1 (raw file):
var alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" // RandomTSVector returns a random TSVector for testing.
can these go into a test file?
fb8a80d to
0005974
Compare
jordanlewis
left a comment
There was a problem hiding this comment.
TFTR!
Added gates for the casts, which should be enough since we don't have disk-storage yet in this PR.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rafiss, @renatolabs, and @srosenberg)
pkg/sql/pgwire/types.go line 244 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
does this take care of "tree written in infix notation. Example: ( 'abc':*B | 'def' ) & !'ghi'".
Yes, this is how the String method works.
pkg/sql/pgwire/types.go line 247 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
does this take care of " As seen above. ' is escaped with '' and \ is escaped with \." consider adding a separate function for the pgwire text encodings to make this more explicit.
Yes, also just how the string function works. I guess I could make it more explicit, but I think it's typical that we have the .String() method produce the same output that you'd see on the commandline, which is the same as the text serialization format usually, right?
I'm just looking at how the JSON case above works as prior art here.
pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
why do you use the
PGBinaryversion for tsquery, but not for tsvector?
I created a hand-rolled TSVector storage format while implementing TSVector, but re-used the PGBinary one for TSQuery. There's no real rhyme or reason to this, just how the implementation started.
Thinking it over, I think the main reason to have different functions is if the PG wire protocol evolves in a new version - in that case, we would need to copy the function and only change the wire protocol one.
I went ahead and made a new version for TSQuery. It's identical to the PGBinary one, except it uses the Cockroach-style length-prefixed strings instead of the Postgres-style 0-byte terminated strings.
pkg/sql/rowenc/valueside/encode.go line 88 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
ditto question about
PGBinary
Done.
pkg/sql/sem/cast/cast_map_gen.sh line 12 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
why was this removed?
Oops, thanks for catching this - I deleted it temporarily because I couldn't figure out how to install postgis locally to run this script.
pkg/util/tsearch/encoding.go line 22 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
can you say when you would use this, versus
EncodeTSVectorPGBinary?
This one is for on-disk storage, and the other one is for pg binary wire protocol.
pkg/util/tsearch/encoding.go line 87 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
what is "as seen above" referencing? if this is copied from https://www.npgsql.org/dev/types.html we should link to it
also, this is the "text" encoding, which i don't think we need in this comment
Done.
pkg/util/tsearch/encoding.go line 101 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
should it error if
len(positions)is greater thanMaxUint16?
Yes, it should. I am missing a bunch of max length checks, I've added them and tests.
pkg/util/tsearch/encoding.go line 108 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
how can we find out how to answer this?
It was leftover when writing this code, it's not necessary.
pkg/util/tsearch/encoding.go line 164 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
https://www.npgsql.org/dev/types.html had some indentation here that made it easier to read
Done.
pkg/util/tsearch/random.go line 21 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
can these go into a test file?
They cannot, because of the way randgen is set up (it's a non-test package).
ef06db0 to
985307a
Compare
985307a to
d72ccff
Compare
rafiss
left a comment
There was a problem hiding this comment.
one more request for tests: can you add encoding tests for the new types to pgwire.TestEncodings?
- Add examples to pkg/cmd/generate-binary/main.go
- Start PG server
cd pkg/cmd/generate-binary; go run main.go > ../../sql/pgwire/testdata/encodings.json- I think there might be some tech debt so that some of the float encodings are incorrectly adjusted. undo that part of the diff
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @renatolabs, and @srosenberg)
pkg/sql/pgwire/types.go line 247 at r2 (raw file):
.String() method produce the same output that you'd see on the commandline, which is the same as the text serialization format usually
i don't think there is a standard one way or another about String. the reliable way to get the pgwire text format is to use the FmtPgwireText flag, but as you say, not all types need that. for example, arrays and tuples do need it. might as well use it here too
pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I created a hand-rolled TSVector storage format while implementing TSVector, but re-used the PGBinary one for TSQuery. There's no real rhyme or reason to this, just how the implementation started.
Thinking it over, I think the main reason to have different functions is if the PG wire protocol evolves in a new version - in that case, we would need to copy the function and only change the wire protocol one.
I went ahead and made a new version for TSQuery. It's identical to the PGBinary one, except it uses the Cockroach-style length-prefixed strings instead of the Postgres-style 0-byte terminated strings.
sgtm
Teach tsearch to respect the Postgres limits on various pieces of the text search implementations. - at most 256 positions, silently truncated - positions are at largest 16383, silently truncated - at most 16384 inside of a phrase search operator - lexemes are at most length 2046 Release note: None
This commit adds value and pgwire encodings for tsvector and tsquery. Release note: None
d72ccff to
4bcc649
Compare
jordanlewis
left a comment
There was a problem hiding this comment.
I forgot about this test - good suggestion, it turned up 2 bugs in the pgwire encoding. It would be nice if we didn't have to hand-generate these but I guess I don't see a better way given that you need a running pg server to make it happen.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @rafiss, @renatolabs, and @srosenberg)
pkg/sql/pgwire/types.go line 247 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
.String() method produce the same output that you'd see on the commandline, which is the same as the text serialization format usually
i don't think there is a standard one way or another about
String. the reliable way to get the pgwire text format is to use theFmtPgwireTextflag, but as you say, not all types need that. for example, arrays and tuples do need it. might as well use it here too
Done.
pkg/sql/rowenc/valueside/array.go line 322 at r2 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
sgtm
Done.
This commit adds the tsvector and tsquery types to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of @@ (the tsquery match operator) by delegating to the tsearch.eval package. Release note (sql change): add in-memory-only evaluation of tsvector and tsquery datatypes and the @@ matches operator.
4bcc649 to
1e8af4e
Compare
rafiss
left a comment
There was a problem hiding this comment.
lgtm!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @renatolabs, and @srosenberg)
|
TFTR! bors r+ |
|
Build succeeded: |
|
@jordanlewis somehow this passed CI even though benchmarks fail with https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelEssentialCi/7844044? |
|
It failed on Bazel Extended CI [1] during the original run. Note that microbenches are not executed by [1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/7708653 |
Updates #41288.
Broken off from #83736
This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of @@ (the tsquery match operator) by delegating to the tsearch.eval package.
Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details.
Epic: None