Skip to content

util: add tsearch package#85185

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
jordanlewis:tsearch
Oct 28, 2022
Merged

util: add tsearch package#85185
craig[bot] merged 4 commits intocockroachdb:masterfrom
jordanlewis:tsearch

Conversation

@jordanlewis
Copy link
Copy Markdown
Member

@jordanlewis jordanlewis commented Jul 28, 2022

Updates #41288.

Broken off from #83736

This PR adds a tsearch package which contains text search algorithms
for the tsvector/tsquery full text search capability.

See https://www.postgresql.org/docs/current/datatype-textsearch.html for details.

The package can:

  1. parse the tsvector language, which consists of a list of terms surrounded by single quotes, optionally suffixed with a list of positions corresponding to the term's original ordinal position within a document. Each position may optionally come with a "weight", which is a letter A-D (defaulting to D, the lowest weight) that can be used to customize how important a word is in a document.
  2. parse the tsquery language, which consists of a simple expression language with terms separating by operators !, &, |, and <n> where n is a number or - which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The <n> operator returns true if the left argument is n tokens to the left of the argument on the right in the document expressed by the vector.
  3. evaluate a tsquery against a tsvector, which returns a boolean indicating whether the vector satisfied the query.

So far, this package is standalone and not hooked up to SQL at all.

Release note: None
Epic: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@jordanlewis jordanlewis force-pushed the tsearch branch 2 times, most recently from 964d3ab to 02276a0 Compare July 29, 2022 14:26
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jul 29, 2022

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • When CI has completed, please ensure no errors have appeared.

I have added a few people who may be able to assist in reviewing:

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Jul 29, 2022
@jordanlewis jordanlewis marked this pull request as ready for review July 29, 2022 14:26
@mgartner mgartner self-requested a review July 29, 2022 15:00
@yuzefovich
Copy link
Copy Markdown
Member

Lol, @jordanlewis how come you're a community member? Did they take away your engineering permissions? :D

@jordanlewis jordanlewis force-pushed the tsearch branch 2 times, most recently from 588ea67 to a19197b Compare July 31, 2022 21:29
@rafiss rafiss removed the O-community Originated from the community label Oct 3, 2022
@jordanlewis jordanlewis requested a review from rytaft October 3, 2022 21:42
@jordanlewis jordanlewis requested a review from rafiss October 12, 2022 14:43
Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Impressive work!

Reviewed 3 of 5 files at r1, 6 of 6 files at r4, 3 of 3 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @mgartner, and @rafiss)


pkg/util/tsearch/eval.go line 89 at r6 (raw file):

	positions []tsposition
	width     int
	invert    bool

These could use comments


pkg/util/tsearch/eval.go line 97 at r6 (raw file):

	emitMatches emitMode = 1 << iota
	emitLeftUnmatched
	emitRightUnmatched

These could use comments


pkg/util/tsearch/lex.go line 80 at r4 (raw file):

// If a word is not single-quote wrapped, the next term will begin if there is
// whitespace after the word. Whitespace and colons may be entered by escaping
// them with backslashes. All other uses of backslashes are no-ops.

When you say "no-op", do you mean the backslash is ignored, or just treated like a normal character? If ignored, can a backslash escape another backslash to treat it as a normal character?


pkg/util/tsearch/lex.go line 288 at r4 (raw file):

				} else {
					p.state = expectingPosDelimiter
				}

nit: add a comment to explain why this doesn't use the same format as the ones above


pkg/util/tsearch/tsquery.go line 23 at r5 (raw file):

// tsNode represents a single AST node within the tree of a TSQuery.
type tsNode struct {
	// Only one of term, op, or paren will be set.

paren doesn't exist


pkg/util/tsearch/tsvector.go line 46 at r6 (raw file):

	}
	if w&weightD != 0 {
		ret.WriteByte('D')

Should this be printed if it's the default?


pkg/util/tsearch/tsvector.go line 83 at r6 (raw file):

		return 1
	}
	panic(fmt.Sprintf("no precdence for operator %d", o))

AssertionFailedf?


pkg/util/tsearch/tsvector.go line 197 at r6 (raw file):

		// Make sure to sort and uniq the position list even if there's only 1
		// entry.
		latsIdx := len(ret) - 1

nit: latsIdx -> lastIdx


pkg/util/tsearch/eval_test.go line 80 at r6 (raw file):

		{`!a <-> !b`, `a:3`, true},

		//{`!a <-> !b`, ``, true},

leftover?


pkg/util/tsearch/eval_test.go line 124 at r6 (raw file):

	t.Run("ComparePG", func(t *testing.T) {
		skip.IgnoreLint(t, "need to manually enable")
		conn, err := pgx.Connect(context.Background(), "postgresql://jordan@localhost:5432")

Same question I had before -- how would someone else run this?


pkg/util/tsearch/tsquery_test.go line 199 at r5 (raw file):

	t.Run("ComparePG", func(t *testing.T) {
		skip.IgnoreLint(t, "need to manually enable")
		conn, err := pgx.Connect(context.Background(), "postgresql://jordan@localhost:5432")

How will other people run this? Can you give some instructions on how to run this test?


pkg/util/tsearch/tsvector_test.go line 39 at r4 (raw file):

		{`foo:3A`, `'foo':3A`},
		{`foo:3D`, `'foo':3`},
		{`foo:3D`, `'foo':3`},

duplicate test case

This commit adds a tsearch package which will contain text search algorithms
for the tsvector/tsquery full text search capability.

It also adds parsing of the tsvector "sub language".

Release note: None
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks much for the review! Reading the code again I can see that there was not much provided context, so I've gone through and added a lot of explanatory comments that hopefully make it a bit easier to read this code. I've also renamed a few structs to be more canonical (like tsOperator instead of tsoperator).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rafiss, and @rytaft)


pkg/util/tsearch/eval.go line 89 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

These could use comments

Done.


pkg/util/tsearch/eval.go line 97 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

These could use comments

Done.


pkg/util/tsearch/lex.go line 80 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

When you say "no-op", do you mean the backslash is ignored, or just treated like a normal character? If ignored, can a backslash escape another backslash to treat it as a normal character?

Confusingly they're just ignored. A doubled backslash is treated as a backslash literal. Clarified the comment (and added the missing test cases for \ in tsvector and tsquery).

Extra confusingly, Postgres outputs single backslash literals as double backslashes (WTF), this behavior can be confirmed by converted the string to bytes and noticing that there's just a single backslash byte... so I changed the test that runs against Postgres to deal with this. Truly confusing and awful!


pkg/util/tsearch/tsquery.go line 23 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

paren doesn't exist

Done.


pkg/util/tsearch/tsvector.go line 46 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Should this be printed if it's the default?

It should - the reason we need to do this is that tsWeight is reused for both TSQuery and TSVector. In TSQuery, the default is not present, so D needs to be printed. In TSVector, the default is D, and it doesn't need to be printed - but we just store absence for the weight if D is seen in TSVector.

It's a little confusing but I think easier to keep it like this than introduce 2 separate tsWeight enums. I've added clarifying comments.


pkg/util/tsearch/tsvector.go line 83 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

AssertionFailedf?

Done.


pkg/util/tsearch/eval_test.go line 80 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

leftover?

Done.


pkg/util/tsearch/tsquery_test.go line 199 at r5 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

How will other people run this? Can you give some instructions on how to run this test?

Done.


pkg/util/tsearch/tsvector_test.go line 39 at r4 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

duplicate test case

Done.

Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rafiss, and @rytaft)


pkg/util/tsearch/eval_test.go line 124 at r6 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Same question I had before -- how would someone else run this?

Done.

This commit adds lexing and parsing for the tsquery sublanguage, which
is similar to tsvector but with a few different bells and whistles. The
tsquery is lexed and then parsed into a simple AST.

Release note: None
The eval package contains a function to evaluate a tsquery against a
tsvector, in memory.

Release note: None
Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Reviewed 6 of 6 files at r7, 8 of 8 files at r10, 3 of 3 files at r11, 8 of 8 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @mgartner, and @rafiss)


pkg/util/tsearch/eval.go line 106 at r12 (raw file):

	width int
	// invert, if true, indicates that this match should be inverted. It's used
	// to handle followed by matches within not operators.

Do you have tests that cover this case? (more generally, have you checked for code coverage?)


pkg/util/tsearch/lex.go line 85 at r12 (raw file):

// whitespace after the word. Whitespace and colons may be entered by escaping
// them with backslashes. All other uses of backslashes are skipped, allowing
// the following character to be included as a literal.

might help to explicitly mention that / itself could be the following character


pkg/util/tsearch/lex.go line 99 at r12 (raw file):

//     operator. For example, foo:3AC*
//
// See examples in tsvector_test.go and tsquery_test.go.

might also help to include a reference to the postgres docs (or your comment in tsvector.go which includes that ref)


pkg/util/tsearch/tsquery.go line 24 at r12 (raw file):

// tsoperator is an enum that represents the different operators within a
// TSQuery.
type tsoperator int

did you want to make this tsOperator?


pkg/util/tsearch/tsquery.go line 42 at r12 (raw file):

	not
	// followedby is the <-> operator. It can also be specified with a number like
	// <1> or <2> or <3>. It requires that the left operand is next to the right

nit: next to -> followed by


pkg/util/tsearch/tsvector.go line 20 at r12 (raw file):

)

// This file defines the tsvector data structure, which is used to implement

nit: TSVector


pkg/util/tsearch/tsvector.go line 28 at r12 (raw file):

// list of lexemes in the doc (typically without stop words like common or short
// words, and typically stemmed using a stemming algorithm like snowball
// stemming), along with an associated set of positions that those lexemes occur

nit: maybe add a link to explain snowball stemming (like this one? https://en.wikipedia.org/wiki/Snowball_(programming_language) )


pkg/util/tsearch/tsvector.go line 44 at r12 (raw file):

// - TSVector is a list of tsTerms, ordered by their lexeme.

// tsWeight is an bitfield that represents the weight of a given term. When

nit: an -> a

... and rename a few structs to be canonical.

Release note: None
Copy link
Copy Markdown
Member Author

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @mgartner, @rafiss, and @rytaft)


pkg/util/tsearch/eval.go line 106 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Do you have tests that cover this case? (more generally, have you checked for code coverage?)

Good question which prompted me to run code coverage. Almost the whole file was covered, except for the cases where rPositions.width > lPositions.width in the followed by case. Also, we were missing cases where prefix match fails. I've added tests for those cases.


pkg/util/tsearch/lex.go line 85 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

might help to explicitly mention that / itself could be the following character

Done.


pkg/util/tsearch/lex.go line 99 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

might also help to include a reference to the postgres docs (or your comment in tsvector.go which includes that ref)

Done.


pkg/util/tsearch/tsquery.go line 24 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

did you want to make this tsOperator?

Whoops, yes. Done.


pkg/util/tsearch/tsquery.go line 42 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: next to -> followed by

Done.


pkg/util/tsearch/tsvector.go line 20 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: TSVector

Done.


pkg/util/tsearch/tsvector.go line 28 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: maybe add a link to explain snowball stemming (like this one? https://en.wikipedia.org/wiki/Snowball_(programming_language) )

Good point, thanks! It's this: https://snowballstem.org/


pkg/util/tsearch/tsvector.go line 44 at r12 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: an -> a

Done.

@jordanlewis
Copy link
Copy Markdown
Member Author

TFTR!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 28, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 28, 2022

Build succeeded:

@craig craig bot merged commit b76537e into cockroachdb:master Oct 28, 2022
@jordanlewis jordanlewis deleted the tsearch branch October 28, 2022 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

X-blathers-triaged blathers was able to find an owner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants