util,sql: add tsvector and tsquery package and sql handling#83736
Closed
jordanlewis wants to merge 12 commits intocockroachdb:masterfrom
Closed
util,sql: add tsvector and tsquery package and sql handling#83736jordanlewis wants to merge 12 commits intocockroachdb:masterfrom
jordanlewis wants to merge 12 commits intocockroachdb:masterfrom
Conversation
Member
bb3bcc9 to
c5c50eb
Compare
This commit adds a tsearch package which will contain text search algorithms for the tsvector/tsquery full text search capability. It also adds parsing of the tsvector "sub language". Release note: None
This commit adds lexing and parsing for the tsquery sublanguage, which is similar to tsvector but with a few different bells and whistles. The tsquery is lexed and then parsed into a simple AST. Release note: None
The eval package contains a function to evaluate a tsquery against a tsvector, in memory. Release note: None
This commit adds value and pgwire encodings for tsvector and tsquery. Release note: None
This commit adds the tsvector and tsquery types to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of @@ (the tsquery match operator) by delegating to the tsearch.eval package. Release note (sql change): add in-memory-only evaluation of tsvector and tsquery datatypes and the @@ matches operator.
Release note (sql change): add the to_tsvector, to_tsquery, phraseto_tsquery, and plainto_tsquery builtins which parse input documents into tsvectors and tsqueries respectively.
Release note: None
Release note: None
craig bot
pushed a commit
that referenced
this pull request
Oct 28, 2022
85185: util: add tsearch package r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds a tsearch package which contains text search algorithms for the tsvector/tsquery full text search capability. See https://www.postgresql.org/docs/current/datatype-textsearch.html for details. The package can: 1. parse the tsvector language, which consists of a list of terms surrounded by single quotes, optionally suffixed with a list of positions corresponding to the term's original ordinal position within a document. Each position may optionally come with a "weight", which is a letter A-D (defaulting to D, the lowest weight) that can be used to customize how important a word is in a document. 2. parse the tsquery language, which consists of a simple expression language with terms separating by operators `!`, `&`, `|`, and `<n>` where `n` is a number or `-` which is equivalent to 1. Expressions can be further grouped with parentheses. The first 3 operators are ordinary boolean operators over whether a term exists in a document. The `<n>` operator returns true if the left argument is `n` tokens to the left of the argument on the right in the document expressed by the vector. 3. evaluate a tsquery against a tsvector, which returns a boolean indicating whether the vector satisfied the query. So far, this package is standalone and not hooked up to SQL at all. Release note: None Epic: None 89650: allocatorimpl: Prioritize non-voters in voter additions r=kvoli a=KaiSun314 Fix #63810 Previously, when adding a voter via the allocator, we would not prioritize stores with non-voters when selecting candidate stores. This was inadequate because selecting a store without a non-voter would require sending a snapshot over the WAN in order to add a voter onto that store. On the other hand, selecting a store with a non-voter would only require promoting that non-voter to a voter, no snapshot needs to be sent over the WAN. To address this, this patch determines if candidate stores have a non-voter replica and prioritize this status when sorting candidate stores (priority lower than balance score, but higher than range count). By prioritizing selecting a store with a non-voter, we can reduce the number of snapshots that need to be sent over the WAN. Release note (ops change): We prioritized non-voters in voter additions, meaning that when selecting a store to add a voter on (in the allocator), we would prioritize candidate stores that contain a non-voter replica higher. This allows us to reduce the number of snapshots that need to be sent over the WAN. Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com> Co-authored-by: Kai Sun <kai.sun@cockroachlabs.com>
Contributor
|
I realize this is just a draft PR but I'm poking around at tsvector for funsies. I build this on an m1 by running You can trigger a panic on this branch by executing Atlas, the schema migration tool used by ent, runs some discovery queries to determine if any migrations are needed and triggers this panic. |
craig bot
pushed a commit
that referenced
this pull request
Dec 2, 2022
90842: sql: add tsvector and tsquery types to SQL r=jordanlewis a=jordanlewis Updates #41288. Broken off from #83736 This PR adds the tsvector and tsquery types and pgwire encodings to sql, and makes them available for use using the string::tsvector and string::tsquery casts. It also implements the in-memory evaluation of `@@` (the tsquery match operator) by delegating to the tsearch.eval package. Release note (sql change): implement in memory handling for the tsquery and tsvector Postgres datatypes, which provide text search capabilities. See https://www.postgresql.org/docs/current/datatype-textsearch.html for more details. Epic: None 92611: upgrades: make a couple of permanent upgrades idempotent r=andreimatei a=andreimatei Some of the permanent upgrades (opting into telemetry and initializing the cluster secret) were not idempotent; this patch makes them be. In particular, this is important since these upgrades will run on 23.1 cluster that are being upgraded from 22.2 as they've been turned from startupmigrations to upgrades in #91627. For the telemetry one, there's no "natural" way to make it idempotent, so I've added a check for the old startupmigration key. Release note: None Epic: None 92620: upgrademanager: clarify job running code r=andreimatei a=andreimatei This patch makes the code clearer around running upgrade in jobs. Before this patch, the upgrade manager was looking for existing jobs corresponding to upgrades and, if it found them, it called jobsRegistry.Run() on them - just like it did for newly-created jobs. The behavior of Run() for jobs that are not already claimed by the registry is a bit under-specified. I believe it ended up printing an error [1] and then correctly waiting for the job to finish. This patch no longer calls Run() for existing jobs; instead, it calls WaitForJobs(), which is more suggestive. [1] https://github.com/cockroachdb/cockroach/blob/b2a6b80920324bd6b31cba9a6f622961979de600/pkg/jobs/adopt.go#L255 Release note: None Epic: None Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Member
Author
|
Superseded by a ton of individual PRs. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Updates #41288.
This PR adds a subset of Postgres's full text search implementation, which uses the tsvector and tsquery datatypes. It aims to be as compatible as possible with Postgres. The documentation on Postgres's full text search lives here: https://www.postgresql.org/docs/current/textsearch.html
Most of the implementation lives in a new
util/tsearchpackage.Features (all pg-compatible, including the stop words and stemming for the supported languages):
tsquerydatatype and query parsing (supports&,|,!, and<->(followed by) operators)tsvectordatatype and vector parsing (supports position lists and priorities)to_tsvectorbuiltinto_tsquerybuiltinphraseto_tsquerybuiltinplainto_tsquerybuiltintsquery@@tsvectorevaluationdefault_text_search_configsession setting that defaults topg_catalog.englishtsvector @@ tsqueryevaluation, which works both on stored computed tsvector columns and computed tsvector indexesRelease note: None