util,sql: add tsvector and tsquery package and sql handling by jordanlewis · Pull Request #83736 · cockroachdb/cockroach

jordanlewis · 2022-07-02T00:41:55Z

Updates #41288.

This PR adds a subset of Postgres's full text search implementation, which uses the tsvector and tsquery datatypes. It aims to be as compatible as possible with Postgres. The documentation on Postgres's full text search lives here: https://www.postgresql.org/docs/current/textsearch.html

Most of the implementation lives in a new util/tsearch package.

Features (all pg-compatible, including the stop words and stemming for the supported languages):

tsquery datatype and query parsing (supports &, |, !, and <-> (followed by) operators)
tsvector datatype and vector parsing (supports position lists and priorities)
to_tsvector builtin
to_tsquery builtin
phraseto_tsquery builtin
plainto_tsquery builtin
tsquery @@ tsvector evaluation
the default_text_search_config session setting that defaults to pg_catalog.english
stopwords and snowball stemming for english, french, spanish, russian, norwegian, and swedish.
a "simple" text search configuration that simply normalizes words by lowercasing them
inverted indexing on tsvector
inverted index acceleration for tsvector @@ tsquery evaluation, which works both on stored computed tsvector columns and computed tsvector indexes

Release note: None

cockroach-teamcity · 2022-07-02T00:42:03Z

This change is

This commit adds a tsearch package which will contain text search algorithms for the tsvector/tsquery full text search capability. It also adds parsing of the tsvector "sub language". Release note: None

This commit adds lexing and parsing for the tsquery sublanguage, which is similar to tsvector but with a few different bells and whistles. The tsquery is lexed and then parsed into a simple AST. Release note: None

The eval package contains a function to evaluate a tsquery against a tsvector, in memory. Release note: None

This commit adds value and pgwire encodings for tsvector and tsquery. Release note: None

This commit adds the tsvector and tsquery types to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of @@ (the tsquery match operator) by delegating to the tsearch.eval package. Release note (sql change): add in-memory-only evaluation of tsvector and tsquery datatypes and the @@ matches operator.

Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively.

Release note: None

85185: util: add tsearch package r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds a tsearch package which contains text search algorithms for the tsvector/tsquery full text search capability. See https://www.postgresql.org/docs/current/datatype-textsearch.html for details. The package can: 1. parse the tsvector language, which consists of a list of terms surrounded by single quotes, optionally suffixed with a list of positions corresponding to the term's original ordinal position within a document. Each position may optionally come with a "weight", which is a letter A-D (defaulting to D, the lowest weight) that can be used to customize how important a word is in a document. 2. parse the tsquery language, which consists of a simple expression language with terms separating by operators `!`, `&`, `|`, and `<n>` where `n` is a number or `-` which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The `<n>` operator returns true if the left argument is `n` tokens to the left of the argument on the right in the document expressed by the vector. 3. evaluate a tsquery against a tsvector, which returns a boolean indicating whether the vector satisfied the query. So far, this package is standalone and not hooked up to SQL at all. Release note: None Epic: None 89650: allocatorimpl: Prioritize non-voters in voter additions r=kvoli a=KaiSun314 Fix #63810 Previously, when adding a voter via the allocator, we would not prioritize stores with non-voters when selecting candidate stores. This was inadequate because selecting a store without a non-voter would require sending a snapshot over the WAN in order to add a voter onto that store. On the other hand, selecting a store with a non-voter would only require promoting that non-voter to a voter, no snapshot needs to be sent over the WAN. To address this, this patch determines if candidate stores have a non-voter replica and prioritize this status when sorting candidate stores (priority lower than balance score, but higher than range count). By prioritizing selecting a store with a non-voter, we can reduce the number of snapshots that need to be sent over the WAN. Release note (ops change): We prioritized non-voters in voter additions, meaning that when selecting a store to add a voter on (in the allocator), we would prioritize candidate stores that contain a non-voter replica higher. This allows us to reduce the number of snapshots that need to be sent over the WAN. Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com> Co-authored-by: Kai Sun <kai.sun@cockroachlabs.com>

chrisseto · 2022-12-02T02:51:49Z

I realize this is just a draft PR but I'm poking around at tsvector for funsies. I build this on an m1 by running ./dev build cockroach-short

You can trigger a panic on this branch by executing SELECT * FROM pg_catalog.pg_type;

* ERROR: a panic has occurred!
* oid 0 couldn't be mapped to array oid
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   |   GOROOT/src/runtime/panic.go:1038
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/sql/types.CalcArrayOid
*   |   github.com/cockroachdb/cockroach/pkg/sql/types/pkg/sql/types/oid.go:246
*   | github.com/cockroachdb/cockroach/pkg/sql.addPGTypeRow
*   |   github.com/cockroachdb/cockroach/pkg/sql/pg_catalog.go:2902
*   | github.com/cockroachdb/cockroach/pkg/sql.glob..func311.1
*   |   github.com/cockroachdb/cockroach/pkg/sql/pg_catalog.go:2987
*   | github.com/cockroachdb/cockroach/pkg/sql.forEachDatabaseDesc
*   |   github.com/cockroachdb/cockroach/pkg/sql/information_schema.go:2212
*   | github.com/cockroachdb/cockroach/pkg/sql.glob..func311
*   |   github.com/cockroachdb/cockroach/pkg/sql/pg_catalog.go:2981
*   | github.com/cockroachdb/cockroach/pkg/sql.(*virtualDefEntry).getPlanInfo.func1.1
*   |   github.com/cockroachdb/cockroach/pkg/sql/virtual_schema.go:573
*   | github.com/cockroachdb/cockroach/pkg/sql.setupGenerator.func3
*   |   github.com/cockroachdb/cockroach/pkg/sql/virtual_table.go:127
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   |   github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489
*   | runtime.goexit
*   |   GOROOT/src/runtime/asm_arm64.s:1133
* Wraps: (4) oid 0 couldn't be mapped to array oid
* Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
*

Atlas, the schema migration tool used by ent, runs some discovery queries to determine if any migrations are needed and triggers this panic.

90842: sql: add tsvector and tsquery types to SQL r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of `@@` (the tsquery match operator) by delegating to the tsearch.eval package. Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details. Epic: None 92611: upgrades: make a couple of permanent upgrades idempotent r=andreimatei a=andreimatei Some of the permanent upgrades (opting into telemetry and initializing the cluster secret) were not idempotent; this patch makes them be. In particular, this is important since these upgrades will run on 23.1 cluster that are being upgraded from 22.2 as they've been turned from startupmigrations to upgrades in #91627. For the telemetry one, there's no "natural" way to make it idempotent, so I've added a check for the old startupmigration key. Release note: None Epic: None 92620: upgrademanager: clarify job running code r=andreimatei a=andreimatei This patch makes the code clearer around running upgrade in jobs. Before this patch, the upgrade manager was looking for existing jobs corresponding to upgrades and, if it found them, it called jobsRegistry.Run() on them - just like it did for newly-created jobs. The behavior of Run() for jobs that are not already claimed by the registry is a bit under-specified. I believe it ended up printing an error [1] and then correctly waiting for the job to finish. This patch no longer calls Run() for existing jobs; instead, it calls WaitForJobs(), which is more suggestive. [1] https://github.com/cockroachdb/cockroach/blob/b2a6b80920324bd6b31cba9a6f622961979de600/pkg/jobs/adopt.go#L255 Release note: None Epic: None Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

jordanlewis · 2023-03-22T15:59:54Z

Superseded by a ton of individual PRs.

jordanlewis force-pushed the tsvector branch 2 times, most recently from bb3bcc9 to c5c50eb Compare July 16, 2022 00:09

jordanlewis force-pushed the tsvector branch from c5c50eb to db5d508 Compare July 22, 2022 22:49

jordanlewis changed the title ~~util: add ts package; tsvector parsing~~ util,sql: add tsvector and tsquery package and sql handling Jul 22, 2022

jordanlewis force-pushed the tsvector branch from db5d508 to 34520ce Compare July 24, 2022 02:25

jordanlewis added 12 commits July 24, 2022 21:51

util: add tsearch package; tsvector parsing

dd5c9c6

This commit adds a tsearch package which will contain text search algorithms for the tsvector/tsquery full text search capability. It also adds parsing of the tsvector "sub language". Release note: None

ts: add parsing for tsquery sublanguage

9cf2428

This commit adds lexing and parsing for the tsquery sublanguage, which is similar to tsvector but with a few different bells and whistles. The tsquery is lexed and then parsed into a simple AST. Release note: None

tsearch: add eval package

2c5ad20

The eval package contains a function to evaluate a tsquery against a tsvector, in memory. Release note: None

tsearch: add encodings for tsvector/tsquery

3188435

This commit adds value and pgwire encodings for tsvector and tsquery. Release note: None

builtins: add to_tsvector {phrase,plain,}to_tsquery

d38b4a4

Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively.

sql: permite inverted idx on tsvector

2248f67

Release note: None

sql: add regconfig/regdictionary

4f3db2c

Release note: None

tsearch: add stopwords

eb1c5cc

tsearch: add blevesearch snowball stemming

e67e161

sql: add default_text_search_config

cf59d2c

sql: inv idx accelerate tsvector@@tsquery queries

eb1429a

jordanlewis force-pushed the tsvector branch from 34520ce to eb1429a Compare July 25, 2022 01:53

jordanlewis mentioned this pull request Jul 28, 2022

util: add tsearch package #85185

Merged

jordanlewis mentioned this pull request Oct 28, 2022

sql: add tsvector and tsquery types to SQL #90842

Merged

jordanlewis closed this Mar 22, 2023

jordanlewis deleted the tsvector branch March 22, 2023 16:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

util,sql: add tsvector and tsquery package and sql handling#83736

util,sql: add tsvector and tsquery package and sql handling#83736
jordanlewis wants to merge 12 commits intocockroachdb:masterfrom
jordanlewis:tsvector

jordanlewis commented Jul 2, 2022 •

edited

Loading

Uh oh!

cockroach-teamcity commented Jul 2, 2022

Uh oh!

chrisseto commented Dec 2, 2022

Uh oh!

jordanlewis commented Mar 22, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jordanlewis commented Jul 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jul 2, 2022

Uh oh!

chrisseto commented Dec 2, 2022

Uh oh!

jordanlewis commented Mar 22, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jordanlewis commented Jul 2, 2022 •

edited

Loading