Skip to content

util,sql: add tsvector and tsquery package and sql handling#83736

Closed
jordanlewis wants to merge 12 commits intocockroachdb:masterfrom
jordanlewis:tsvector
Closed

util,sql: add tsvector and tsquery package and sql handling#83736
jordanlewis wants to merge 12 commits intocockroachdb:masterfrom
jordanlewis:tsvector

Conversation

@jordanlewis
Copy link
Copy Markdown
Member

@jordanlewis jordanlewis commented Jul 2, 2022

Updates #41288.

This PR adds a subset of Postgres's full text search implementation, which uses the tsvector and tsquery datatypes. It aims to be as compatible as possible with Postgres. The documentation on Postgres's full text search lives here: https://www.postgresql.org/docs/current/textsearch.html

Most of the implementation lives in a new util/tsearch package.

Features (all pg-compatible, including the stop words and stemming for the supported languages):

  • tsquery datatype and query parsing (supports &, |, !, and <-> (followed by) operators)
  • tsvector datatype and vector parsing (supports position lists and priorities)
  • to_tsvector builtin
  • to_tsquery builtin
  • phraseto_tsquery builtin
  • plainto_tsquery builtin
  • tsquery @@ tsvector evaluation
  • the default_text_search_config session setting that defaults to pg_catalog.english
  • stopwords and snowball stemming for english, french, spanish, russian, norwegian, and swedish.
  • a "simple" text search configuration that simply normalizes words by lowercasing them
  • inverted indexing on tsvector
  • inverted index acceleration for tsvector @@ tsquery evaluation, which works both on stored computed tsvector columns and computed tsvector indexes

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@jordanlewis jordanlewis force-pushed the tsvector branch 2 times, most recently from bb3bcc9 to c5c50eb Compare July 16, 2022 00:09
@jordanlewis jordanlewis changed the title util: add ts package; tsvector parsing util,sql: add tsvector and tsquery package and sql handling Jul 22, 2022
This commit adds a tsearch package which will contain text search algorithms
for the tsvector/tsquery full text search capability.

It also adds parsing of the tsvector "sub language".

Release note: None
This commit adds lexing and parsing for the tsquery sublanguage, which
is similar to tsvector but with a few different bells and whistles. The
tsquery is lexed and then parsed into a simple AST.

Release note: None
The eval package contains a function to evaluate a tsquery against a
tsvector, in memory.

Release note: None
This commit adds value and pgwire encodings for tsvector and tsquery.

Release note: None
This commit adds the tsvector and tsquery types to sql, and makes them
available for use using the string::tsvector and string::tsquery casts.
It also implements the in-memory evaluation of @@ (the tsquery match
operator) by delegating to the tsearch.eval package.

Release note (sql change): add in-memory-only evaluation of tsvector and
tsquery datatypes and the @@ matches operator.
Release note (sql change): add the to_tsvector, to_tsquery,
phraseto_tsquery, and plainto_tsquery builtins which parse input
documents into tsvectors and tsqueries respectively.
craig bot pushed a commit that referenced this pull request Oct 28, 2022
85185: util: add tsearch package r=jordanlewis a=jordanlewis

Updates #41288.

Broken off from #83736

This PR adds a tsearch package which contains text search algorithms
for the tsvector/tsquery full text search capability.

See https://www.postgresql.org/docs/current/datatype-textsearch.html for details.

The package can:

1. parse the tsvector language, which consists of a list of terms surrounded by single quotes, optionally suffixed with a list of positions corresponding to the term's original ordinal position within a document. Each position may optionally come with a "weight", which is a letter A-D (defaulting to D, the lowest weight) that can be used to customize how important a word is in a document.
2. parse the tsquery language, which consists of a simple expression language with terms separating by operators `!`, `&`, `|`, and `<n>` where `n` is a number or `-` which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The `<n>` operator returns true if the left argument is `n` tokens to the left of the argument on the right in the document expressed by the vector.
3. evaluate a tsquery against a tsvector, which returns a boolean indicating whether the vector satisfied the query.

So far, this package is standalone and not hooked up to SQL at all.

Release note: None
Epic: None

89650: allocatorimpl: Prioritize non-voters in voter additions r=kvoli a=KaiSun314

Fix #63810

Previously, when adding a voter via the allocator, we would not prioritize stores with non-voters when selecting candidate stores.

This was inadequate because selecting a store without a non-voter would require sending a snapshot over the WAN in order to add a voter onto that store. On the other hand, selecting a store with a non-voter would only require promoting that non-voter to a voter, no snapshot needs to be sent over the WAN.

To address this, this patch determines if candidate stores have a non-voter replica and prioritize this status when sorting candidate stores (priority lower than balance score, but higher than range count). By prioritizing selecting a store with a non-voter, we can reduce the number of snapshots that need to be sent over the WAN.

Release note (ops change): We prioritized non-voters in voter additions, meaning that when selecting a store to add a voter on (in the allocator), we would prioritize candidate stores that contain a non-voter replica higher. This allows us to reduce the number of snapshots that need to be sent over the WAN.

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Kai Sun <kai.sun@cockroachlabs.com>
@chrisseto
Copy link
Copy Markdown
Contributor

I realize this is just a draft PR but I'm poking around at tsvector for funsies. I build this on an m1 by running ./dev build cockroach-short

You can trigger a panic on this branch by executing SELECT * FROM pg_catalog.pg_type;

* ERROR: a panic has occurred!
* oid 0 couldn't be mapped to array oid
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   |   GOROOT/src/runtime/panic.go:1038
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/sql/types.CalcArrayOid
*   |   github.com/cockroachdb/cockroach/pkg/sql/types/pkg/sql/types/oid.go:246
*   | github.com/cockroachdb/cockroach/pkg/sql.addPGTypeRow
*   |   github.com/cockroachdb/cockroach/pkg/sql/pg_catalog.go:2902
*   | github.com/cockroachdb/cockroach/pkg/sql.glob..func311.1
*   |   github.com/cockroachdb/cockroach/pkg/sql/pg_catalog.go:2987
*   | github.com/cockroachdb/cockroach/pkg/sql.forEachDatabaseDesc
*   |   github.com/cockroachdb/cockroach/pkg/sql/information_schema.go:2212
*   | github.com/cockroachdb/cockroach/pkg/sql.glob..func311
*   |   github.com/cockroachdb/cockroach/pkg/sql/pg_catalog.go:2981
*   | github.com/cockroachdb/cockroach/pkg/sql.(*virtualDefEntry).getPlanInfo.func1.1
*   |   github.com/cockroachdb/cockroach/pkg/sql/virtual_schema.go:573
*   | github.com/cockroachdb/cockroach/pkg/sql.setupGenerator.func3
*   |   github.com/cockroachdb/cockroach/pkg/sql/virtual_table.go:127
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489
*   | runtime.goexit
*   |   GOROOT/src/runtime/asm_arm64.s:1133
* Wraps: (4) oid 0 couldn't be mapped to array oid
* Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
*

Atlas, the schema migration tool used by ent, runs some discovery queries to determine if any migrations are needed and triggers this panic.

craig bot pushed a commit that referenced this pull request Dec 2, 2022
90842: sql: add tsvector and tsquery types to SQL r=jordanlewis a=jordanlewis

Updates #41288.

Broken off from #83736

This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of `@@` (the tsquery match operator) by delegating to the tsearch.eval package.

Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details.

Epic: None

92611: upgrades: make a couple of permanent upgrades idempotent r=andreimatei a=andreimatei

Some of the permanent upgrades (opting into telemetry and initializing the cluster secret) were not idempotent; this patch makes them be. In particular, this is important since these upgrades will run on 23.1 cluster that are being upgraded from 22.2 as they've been turned from startupmigrations to upgrades in #91627. For the telemetry one, there's no "natural" way to make it idempotent, so I've added a check for the old startupmigration key.

Release note: None
Epic: None

92620: upgrademanager: clarify job running code r=andreimatei a=andreimatei

This patch makes the code clearer around running upgrade in jobs. Before this patch, the upgrade manager was looking for existing jobs corresponding to upgrades and, if it found them, it called jobsRegistry.Run() on them - just like it did for newly-created jobs. The behavior of Run() for jobs that are not already claimed by the registry is a bit under-specified. I believe it ended up printing an error [1] and then correctly waiting for the job to finish.

This patch no longer calls Run() for existing jobs; instead, it calls WaitForJobs(), which is more suggestive.

[1] https://github.com/cockroachdb/cockroach/blob/b2a6b80920324bd6b31cba9a6f622961979de600/pkg/jobs/adopt.go#L255

Release note: None
Epic: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@jordanlewis
Copy link
Copy Markdown
Member Author

Superseded by a ton of individual PRs.

@jordanlewis jordanlewis deleted the tsvector branch March 22, 2023 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants