util: add tsearch package#85185
Conversation
964d3ab to
02276a0
Compare
|
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
I have added a few people who may be able to assist in reviewing: 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
|
Lol, @jordanlewis how come you're a community member? Did they take away your engineering permissions? :D |
588ea67 to
a19197b
Compare
rytaft
left a comment
There was a problem hiding this comment.
Reviewed 3 of 5 files at r1, 6 of 6 files at r4, 3 of 3 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rafiss)
pkg/util/tsearch/eval.go line 89 at r6 (raw file):
positions []tsposition width int invert bool
These could use comments
pkg/util/tsearch/eval.go line 97 at r6 (raw file):
emitMatches emitMode = 1 << iota emitLeftUnmatched emitRightUnmatched
These could use comments
pkg/util/tsearch/lex.go line 80 at r4 (raw file):
// If a word is not single-quote wrapped, the next term will begin if there is // whitespace after the word. Whitespace and colons may be entered by escaping // them with backslashes. All other uses of backslashes are no-ops.
When you say "no-op", do you mean the backslash is ignored, or just treated like a normal character? If ignored, can a backslash escape another backslash to treat it as a normal character?
pkg/util/tsearch/lex.go line 288 at r4 (raw file):
} else { p.state = expectingPosDelimiter }
nit: add a comment to explain why this doesn't use the same format as the ones above
pkg/util/tsearch/tsquery.go line 23 at r5 (raw file):
// tsNode represents a single AST node within the tree of a TSQuery. type tsNode struct { // Only one of term, op, or paren will be set.
paren doesn't exist
pkg/util/tsearch/tsvector.go line 46 at r6 (raw file):
} if w&weightD != 0 { ret.WriteByte('D')
Should this be printed if it's the default?
pkg/util/tsearch/tsvector.go line 83 at r6 (raw file):
return 1 } panic(fmt.Sprintf("no precdence for operator %d", o))
AssertionFailedf?
pkg/util/tsearch/tsvector.go line 197 at r6 (raw file):
// Make sure to sort and uniq the position list even if there's only 1 // entry. latsIdx := len(ret) - 1
nit: latsIdx -> lastIdx
pkg/util/tsearch/eval_test.go line 80 at r6 (raw file):
{`!a <-> !b`, `a:3`, true}, //{`!a <-> !b`, ``, true},
leftover?
pkg/util/tsearch/eval_test.go line 124 at r6 (raw file):
t.Run("ComparePG", func(t *testing.T) { skip.IgnoreLint(t, "need to manually enable") conn, err := pgx.Connect(context.Background(), "postgresql://jordan@localhost:5432")
Same question I had before -- how would someone else run this?
pkg/util/tsearch/tsquery_test.go line 199 at r5 (raw file):
t.Run("ComparePG", func(t *testing.T) { skip.IgnoreLint(t, "need to manually enable") conn, err := pgx.Connect(context.Background(), "postgresql://jordan@localhost:5432")
How will other people run this? Can you give some instructions on how to run this test?
pkg/util/tsearch/tsvector_test.go line 39 at r4 (raw file):
{`foo:3A`, `'foo':3A`}, {`foo:3D`, `'foo':3`}, {`foo:3D`, `'foo':3`},
duplicate test case
This commit adds a tsearch package which will contain text search algorithms for the tsvector/tsquery full text search capability. It also adds parsing of the tsvector "sub language". Release note: None
jordanlewis
left a comment
There was a problem hiding this comment.
Thanks much for the review! Reading the code again I can see that there was not much provided context, so I've gone through and added a lot of explanatory comments that hopefully make it a bit easier to read this code. I've also renamed a few structs to be more canonical (like tsOperator instead of tsoperator).
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rafiss, and @rytaft)
pkg/util/tsearch/eval.go line 89 at r6 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
These could use comments
Done.
pkg/util/tsearch/eval.go line 97 at r6 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
These could use comments
Done.
pkg/util/tsearch/lex.go line 80 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
When you say "no-op", do you mean the backslash is ignored, or just treated like a normal character? If ignored, can a backslash escape another backslash to treat it as a normal character?
Confusingly they're just ignored. A doubled backslash is treated as a backslash literal. Clarified the comment (and added the missing test cases for \ in tsvector and tsquery).
Extra confusingly, Postgres outputs single backslash literals as double backslashes (WTF), this behavior can be confirmed by converted the string to bytes and noticing that there's just a single backslash byte... so I changed the test that runs against Postgres to deal with this. Truly confusing and awful!
pkg/util/tsearch/tsquery.go line 23 at r5 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
paren doesn't exist
Done.
pkg/util/tsearch/tsvector.go line 46 at r6 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Should this be printed if it's the default?
It should - the reason we need to do this is that tsWeight is reused for both TSQuery and TSVector. In TSQuery, the default is not present, so D needs to be printed. In TSVector, the default is D, and it doesn't need to be printed - but we just store absence for the weight if D is seen in TSVector.
It's a little confusing but I think easier to keep it like this than introduce 2 separate tsWeight enums. I've added clarifying comments.
pkg/util/tsearch/tsvector.go line 83 at r6 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
AssertionFailedf?
Done.
pkg/util/tsearch/eval_test.go line 80 at r6 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
leftover?
Done.
pkg/util/tsearch/tsquery_test.go line 199 at r5 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
How will other people run this? Can you give some instructions on how to run this test?
Done.
pkg/util/tsearch/tsvector_test.go line 39 at r4 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
duplicate test case
Done.
jordanlewis
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rafiss, and @rytaft)
pkg/util/tsearch/eval_test.go line 124 at r6 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Same question I had before -- how would someone else run this?
Done.
This commit adds lexing and parsing for the tsquery sublanguage, which is similar to tsvector but with a few different bells and whistles. The tsquery is lexed and then parsed into a simple AST. Release note: None
The eval package contains a function to evaluate a tsquery against a tsvector, in memory. Release note: None
rytaft
left a comment
There was a problem hiding this comment.
Thanks!
Reviewed 6 of 6 files at r7, 8 of 8 files at r10, 3 of 3 files at r11, 8 of 8 files at r12, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rafiss)
pkg/util/tsearch/eval.go line 106 at r12 (raw file):
width int // invert, if true, indicates that this match should be inverted. It's used // to handle followed by matches within not operators.
Do you have tests that cover this case? (more generally, have you checked for code coverage?)
pkg/util/tsearch/lex.go line 85 at r12 (raw file):
// whitespace after the word. Whitespace and colons may be entered by escaping // them with backslashes. All other uses of backslashes are skipped, allowing // the following character to be included as a literal.
might help to explicitly mention that / itself could be the following character
pkg/util/tsearch/lex.go line 99 at r12 (raw file):
// operator. For example, foo:3AC* // // See examples in tsvector_test.go and tsquery_test.go.
might also help to include a reference to the postgres docs (or your comment in tsvector.go which includes that ref)
pkg/util/tsearch/tsquery.go line 24 at r12 (raw file):
// tsoperator is an enum that represents the different operators within a // TSQuery. type tsoperator int
did you want to make this tsOperator?
pkg/util/tsearch/tsquery.go line 42 at r12 (raw file):
not // followedby is the <-> operator. It can also be specified with a number like // <1> or <2> or <3>. It requires that the left operand is next to the right
nit: next to -> followed by
pkg/util/tsearch/tsvector.go line 20 at r12 (raw file):
) // This file defines the tsvector data structure, which is used to implement
nit: TSVector
pkg/util/tsearch/tsvector.go line 28 at r12 (raw file):
// list of lexemes in the doc (typically without stop words like common or short // words, and typically stemmed using a stemming algorithm like snowball // stemming), along with an associated set of positions that those lexemes occur
nit: maybe add a link to explain snowball stemming (like this one? https://en.wikipedia.org/wiki/Snowball_(programming_language) )
pkg/util/tsearch/tsvector.go line 44 at r12 (raw file):
// - TSVector is a list of tsTerms, ordered by their lexeme. // tsWeight is an bitfield that represents the weight of a given term. When
nit: an -> a
... and rename a few structs to be canonical. Release note: None
jordanlewis
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rafiss, and @rytaft)
pkg/util/tsearch/eval.go line 106 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Do you have tests that cover this case? (more generally, have you checked for code coverage?)
Good question which prompted me to run code coverage. Almost the whole file was covered, except for the cases where rPositions.width > lPositions.width in the followed by case. Also, we were missing cases where prefix match fails. I've added tests for those cases.
pkg/util/tsearch/lex.go line 85 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
might help to explicitly mention that / itself could be the following character
Done.
pkg/util/tsearch/lex.go line 99 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
might also help to include a reference to the postgres docs (or your comment in tsvector.go which includes that ref)
Done.
pkg/util/tsearch/tsquery.go line 24 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
did you want to make this tsOperator?
Whoops, yes. Done.
pkg/util/tsearch/tsquery.go line 42 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: next to -> followed by
Done.
pkg/util/tsearch/tsvector.go line 20 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: TSVector
Done.
pkg/util/tsearch/tsvector.go line 28 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: maybe add a link to explain snowball stemming (like this one? https://en.wikipedia.org/wiki/Snowball_(programming_language) )
Good point, thanks! It's this: https://snowballstem.org/
pkg/util/tsearch/tsvector.go line 44 at r12 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: an -> a
Done.
|
TFTR! bors r+ |
|
This PR was included in a batch that was canceled, it will be automatically retried |
|
Build succeeded: |
Updates #41288.
Broken off from #83736
This PR adds a tsearch package which contains text search algorithms
for the tsvector/tsquery full text search capability.
See https://www.postgresql.org/docs/current/datatype-textsearch.html for details.
The package can:
!,&,|, and<n>wherenis a number or-which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The<n>operator returns true if the left argument isntokens to the left of the argument on the right in the document expressed by the vector.So far, this package is standalone and not hooked up to SQL at all.
Release note: None
Epic: None