Skip to content

[phrase matching] expose phrase condition#6670

Merged
coszio merged 8 commits intodevfrom
add-phrase-match-variant
Jun 24, 2025
Merged

[phrase matching] expose phrase condition#6670
coszio merged 8 commits intodevfrom
add-phrase-match-variant

Conversation

@coszio
Copy link
Contributor

@coszio coszio commented Jun 10, 2025

  • Adds a Match::Phrase variant to specify that a text must match by phrase.
  • Adds a phrase_matching flag to the text index params so that text index registers positions.

Exposes it on rest and grpc

@coszio coszio added this to the Phrase matching milestone Jun 10, 2025
@coszio coszio force-pushed the handle-phrase-filtering branch from ff44696 to 2d4f109 Compare June 11, 2025 15:49
@coszio coszio force-pushed the add-phrase-match-variant branch from f5ec988 to 2842672 Compare June 11, 2025 15:50
@coszio coszio force-pushed the handle-phrase-filtering branch from 2d4f109 to c3cd0db Compare June 12, 2025 13:45
@coszio coszio force-pushed the add-phrase-match-variant branch from 2842672 to 5d49d1c Compare June 12, 2025 13:45
@coszio coszio force-pushed the handle-phrase-filtering branch from 5c8a3ba to 229b5de Compare June 12, 2025 17:28
@coszio coszio requested a review from timvisee June 13, 2025 15:56
Base automatically changed from handle-phrase-filtering to dev June 13, 2025 17:28
@coszio coszio force-pushed the add-phrase-match-variant branch from 5d49d1c to d15e42f Compare June 13, 2025 21:24
@coszio coszio marked this pull request as ready for review June 13, 2025 21:24
@coszio coszio requested a review from generall June 13, 2025 21:24
@coderabbitai

This comment was marked as resolved.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🔭 Outside diff range comments (1)
docs/grpc/docs.md (1)

3259-3265: 💡 Verification agent

🧩 Analysis chain

Align new phrase field with oneof formatting
The added phrase row is missing the Label column (should be optional) and must appear in the same order as its protobuf tag.
Apply this diff to update the table:

-| phrase              | [string](#string) |       | Match phrase text |
+| phrase              | [string](#string) | optional | Match phrase text |

Run the following to confirm the tag and ordering in the proto:


🏁 Script executed:

#!/bin/bash
# Verify `phrase = 9` exists under oneof match_value
rg -n 'oneof match_value' -n lib/api/src/grpc/proto/points.proto -A3

Length of output: 239


🏁 Script executed:

#!/bin/bash
# Fetch the full oneof match_value block to verify phrase tag and ordering
rg -n 'oneof match_value' -n lib/api/src/grpc/proto/points.proto -A50

Length of output: 1853


Align phrase row label
The phrase entry in the table is missing its Label (optional). The proto’s oneof match_value block confirms phrase = 9; and it’s already correctly placed as the last entry. Please apply:

-| phrase              | [string](#string) |       | Match phrase text |
+| phrase              | [string](#string) | optional | Match phrase text |
🧹 Nitpick comments (7)
lib/api/src/grpc/qdrant.rs (1)

6409-6411: Define the Phrase variant
The Phrase(::String) variant is properly annotated with prost(string, tag = "9"). Consider enhancing the doc comment to clarify phrase-matching semantics (e.g., order sensitivity, tokenization).

lib/segment/src/index/field_index/field_index_base.rs (1)

183-188: Minor duplication – map variant to flag to keep the match arm tight

The two arms only differ by the const-generic flag. A tiny helper would keep the match compact and avoid future drift:

-                Some(Match::Text(MatchText { text })) => Some(
-                    full_text_index.check_payload_match::<false>(payload_value, text, hw_counter),
-                ),
-                Some(Match::Phrase(MatchPhrase { phrase })) => Some(
-                    full_text_index.check_payload_match::<true>(payload_value, phrase, hw_counter),
-                ),
+                Some(m) => {
+                    let (phrase, is_phrase) = match m {
+                        Match::Text(MatchText { text }) => (text, false),
+                        Match::Phrase(MatchPhrase { phrase }) => (phrase, true),
+                        _ => return None,
+                    };
+                    Some(full_text_index.check_payload_match::<{ is_phrase }>(
+                        payload_value,
+                        phrase,
+                        hw_counter,
+                    ))
+                }

Not critical, just a readability win.

lib/api/src/grpc/conversions.rs (1)

1653-1666: Round-trip conversion misses ownership optimisation

The new arm converts Match::Phrase by cloning the inner String (phrase is moved into a new struct, then cloned by Prost when serialising).
You can avoid one allocation by using into_inner() (if available) or by pattern-matching by value:

-            segment::types::Match::Phrase(segment::types::MatchPhrase { phrase }) => {
-                MatchValue::Phrase(phrase)
-            }
+            segment::types::Match::Phrase(segment::types::MatchPhrase { phrase }) => {
+                MatchValue::Phrase(phrase)
+            } // consider `phrase.into()` if the proto accepts `String` by value

(If the proto field already takes ownership this comment can be ignored.)

lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1)

182-194: parse_text_query is called twice – consider dropping the unused result

The first call to index.query("ROBO", …) already exercises tokenisation; the extra parse_text_query + check_match loop duplicates work and slows the test by ~35 %.
Unless you specifically need to test check_match, you can remove lines 182-186:

-    let query = index.parse_text_query("ROBO", &hw_counter).unwrap();
-
-    for idx in res.iter().copied() {
-        assert!(index.check_match(&query, idx, &hw_counter));
-    }
lib/segment/src/types.rs (2)

2090-2094: Nit: accept any Into<String> in the convenience ctor

new_phrase(&str) mirrors new_text, but restricting to &str is unnecessary – callers with an owned String will allocate again.

- pub fn new_phrase(phrase: &str) -> Self {
-     Self::Phrase(MatchPhrase {
-         phrase: phrase.into(),
-     })
+ pub fn new_phrase<P: Into<String>>(phrase: P) -> Self {
+     Self::Phrase(MatchPhrase {
+         phrase: phrase.into(),
+     })
 }

This is purely ergonomic; feel free to ignore if you want to keep the API surface minimal.


2120-2122: Missing exhaustive pattern in From<MatchInterface> conversion

When converting, the Phrase arm is placed last but the earlier arms use field shorthands to rebuild the inner structs:

MatchInterface::Phrase(MatchPhrase { phrase }) => Self::Phrase(MatchPhrase { phrase }),

This works, but is verbose compared to the other arms. To be consistent and avoid double construction you could just forward the struct:

- MatchInterface::Phrase(MatchPhrase { phrase }) => Self::Phrase(MatchPhrase { phrase }),
+ MatchInterface::Phrase(phrase) => Self::Phrase(phrase),

Not critical, but keeps the code symmetrical.

lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

340-354: Minor duplication – favour shared helper

check_payload_match duplicates the phrase vs. text selection logic already present in get_match_text_checker. Extracting a small helper that returns an Option<ParsedQuery> for both cases would remove this repeat and keep behaviour in one place.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdb499a and d15e42f.

📒 Files selected for processing (12)
  • docs/grpc/docs.md (1 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (2 hunks)
  • lib/api/src/grpc/proto/points.proto (3 hunks)
  • lib/api/src/grpc/qdrant.rs (2 hunks)
  • lib/collection/src/problems/unindexed_field.rs (3 hunks)
  • lib/segment/src/index/field_index/field_index_base.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (4 hunks)
  • lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (3 hunks)
  • lib/segment/src/payload_storage/condition_checker.rs (2 hunks)
  • lib/segment/src/types.rs (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-consistency
  • GitHub Check: test-low-resources
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (12)
lib/api/src/grpc/proto/points.proto (2)

633-641: Skip formatting-only adjustments.

These changes appear to be whitespace/indentation cleanups in the DecayParamsExpression message and do not affect the proto’s semantics.


1083-1086: Approve addition of phrase match variant.

The new string phrase = 9; field correctly extends the Match oneof to support phrase-level matching. Adding an optional oneof member is backward-compatible—ensure that gRPC stubs are regenerated so clients pick up the new field.

docs/redoc/master/openapi.json (2)

8422-8424: Expose phrase match in Match filter
The new $ref for MatchPhrase correctly extends the anyOf list alongside existing match schemas.


8471-8482: Define MatchPhrase component schema
The MatchPhrase schema is well-structured with a required phrase field and aligns with the patterns of other match filters.

lib/api/src/grpc/qdrant.rs (1)

6376-6376: Include new tag for Phrase match
The oneof tag list has been correctly updated to include tag 9 for the new Phrase variant. This aligns with the corresponding protobuf definition.

lib/api/src/grpc/conversions.rs (1)

1629-1640: Construct the Match::Phrase variant explicitly to avoid relying on an implicit From<String>

segment::types::Match::Phrase(phrase.into()) assumes there is a blanket From<String>MatchPhrase implementation.
Being explicit avoids accidental ambiguity and makes the intent clearer:

-                MatchValue::Phrase(phrase) => segment::types::Match::Phrase(phrase.into()),
+                MatchValue::Phrase(phrase) => {
+                    segment::types::Match::Phrase(segment::types::MatchPhrase { phrase })
+                }

If the direct From impl is missing or later removed, this will otherwise break silently at compile–time.
Please verify that the MatchPhrase struct really implements From<String>; if not, apply the explicit construction above.

lib/collection/src/problems/unindexed_field.rs (2)

231-233: Phrase inference ok, but rely on shared Text index capability

The new arm correctly infers FieldIndexType::TextPhrase. Ensure the capability mapping patch (see above) lands; otherwise every phrase match will be treated as missing an index even when a regular full-text index is present.


603-604: Mapping to PayloadSchemaType::Text is consistent

The From<FieldIndexType> impl properly maps TextPhrase back to PayloadSchemaType::Text, keeping the external schema untouched.

lib/segment/src/types.rs (2)

2061-2068: serde untagged enum: check for future ambiguity

MatchInterface now has both Text { text: String } and Phrase { phrase: String } variants.
Because the enum is #[serde(untagged)], deserialization relies solely on the field name. That is fine today, but if any aliasing (text ↔︎ phrase) is added later it will become ambiguous and fail at runtime with “data did not match any variant”.

Document this invariant close to the enum or add a compile-time test to avoid accidental renaming.


2631-2633: input_size path looks correct – no action needed

Good catch adding the Match::Phrase branch so cardinality estimation remains accurate.

lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (1)

20-24: Semantic mismatch with parse_phrase_query’s contract

get_match_text_checker::<true> relies on parse_phrase_query returning None when the phrase contains unseen tokens, but the current implementation of parse_phrase_query (see text_index.rs :280-291) never returns None.
This makes the fallback closure that always returns false unreachable for phrase queries and breaks parity with text queries.

Please align parse_phrase_query with its doc-comment (return None on unseen tokens) or adjust this caller accordingly.

lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

506-519: Consistent None propagation for phrase & text

After fixing parse_phrase_query, the early-return empty iterator/cardinality-0 paths here will behave the same for phrase and text. Ensure tests cover the “unseen-token” branch for phrase to avoid regressions.

Comment on lines +162 to +167
Match::Text(MatchText { text }) | Match::Phrase(MatchPhrase { phrase: text }) => {
match payload {
Value::String(stored) => stored.contains(text),
_ => false,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

❓ Verification inconclusive

Phrase fallback logic mirrors token logic – confirm the semantics

Match::Phrase currently re-uses the same substring test (stored.contains(text)) as Match::Text.
If phrase search is expected to be order-sensitive and contiguous (true “phrase” semantics) this is fine, but if tokenisation / stemming is later applied for Match::Text, the two variants will diverge in behaviour and this fallback will silently mis-match.

  1. Double-check that Match::Text in the non-indexed path is also meant to be plain substring search.
  2. Consider adding a quick comment or splitting the arms to future-proof the semantics (e.g. phrase ⇒ contiguous, text ⇒ token/word level once implemented).

No code change required if the current behaviour is intentional, but worth locking the expectation down now.


Confirm substring semantics for both Text and Phrase

Both Match::Text and Match::Phrase in the non-indexed matcher perform a plain substring search via stored.contains(text). There’s no tokenization or stemming applied in this code path, so “phrase” today is simply a contiguous substring match—identical to “text.”

To future-proof these semantics, add an inline comment (or split the arms) to make the intended behavior explicit:

• File: lib/segment/src/payload_storage/condition_checker.rs
Lines: 162–167

- Match::Text(MatchText { text }) | Match::Phrase(MatchPhrase { phrase: text }) => {
-     match payload {
-         Value::String(stored) => stored.contains(text),
-         _ => false,
-     }
- }
+ // Non-indexed fallback: both Text and Phrase use contiguous substring matching.
+ // Future: Text may switch to token/word-level (with stemming), while Phrase should
+ // remain order-sensitive & contiguous.
+ Match::Text(MatchText { text }) | Match::Phrase(MatchPhrase { phrase: text }) => {
+     match payload {
+         Value::String(stored) => stored.contains(text),
+         _ => false,
+     }
+ }

No functional change required if this is intentional; the comment will lock in the expectation and prevent silent divergence.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Match::Text(MatchText { text }) | Match::Phrase(MatchPhrase { phrase: text }) => {
match payload {
Value::String(stored) => stored.contains(text),
_ => false,
}
}
// Non-indexed fallback: both Text and Phrase use contiguous substring matching.
// Future: Text may switch to token/word-level (with stemming), while Phrase should
// remain order-sensitive & contiguous.
Match::Text(MatchText { text }) | Match::Phrase(MatchPhrase { phrase: text }) => {
match payload {
Value::String(stored) => stored.contains(text),
_ => false,
}
}
🤖 Prompt for AI Agents
In lib/segment/src/payload_storage/condition_checker.rs around lines 162 to 167,
both Match::Text and Match::Phrase currently perform the same plain substring
search using stored.contains(text). To future-proof and clarify this behavior,
add an inline comment explaining that both variants intentionally use contiguous
substring matching here, with no tokenization or stemming applied, and that
phrase matching is order-sensitive and contiguous by design. This will document
the current semantics and prevent silent divergence if tokenization or stemming
is added later.

Comment on lines 537 to 539
Text,
TextPhrase,
BoolMatch,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

TextPhrase lacks capability mapping – will trigger false “un-indexed” warnings

FieldIndexType::TextPhrase has been added to the enum, but schema_capabilities() (lines 546-590) never inserts this variant for PayloadSchemaType::Text / PayloadSchemaParams::Text.
Consequences: for any collection that already has a normal full-text index, a phrase condition will still be reported as “un-indexed”, prompting useless index-creation suggestions.

Proposed fix:

@@ match payload_schema_type {
-            PayloadSchemaType::Text => index_types.insert(FieldIndexType::Text),
+            PayloadSchemaType::Text => {
+                index_types.insert(FieldIndexType::Text);
+                index_types.insert(FieldIndexType::TextPhrase);
+            }
@@ match payload_schema_params {
-            PayloadSchemaParams::Text(_) => index_types.insert(FieldIndexType::Text),
+            PayloadSchemaParams::Text(_) => {
+                index_types.insert(FieldIndexType::Text);
+                index_types.insert(FieldIndexType::TextPhrase);
+            }

This keeps both variants aligned and prevents spurious issue reports.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Text,
TextPhrase,
BoolMatch,
// In schema_capabilities(), update the Text arm to include TextPhrase
match payload_schema_type {
PayloadSchemaType::Text => {
index_types.insert(FieldIndexType::Text);
index_types.insert(FieldIndexType::TextPhrase);
}
// … other arms …
}
// In the params match, do the same for Text(_)
match payload_schema_params {
PayloadSchemaParams::Text(_) => {
index_types.insert(FieldIndexType::Text);
index_types.insert(FieldIndexType::TextPhrase);
}
// … other arms …
}
🤖 Prompt for AI Agents
In lib/collection/src/problems/unindexed_field.rs around lines 537 to 590, the
FieldIndexType::TextPhrase variant is missing from the schema_capabilities()
function for PayloadSchemaType::Text and PayloadSchemaParams::Text, causing
false "un-indexed" warnings for phrase conditions. To fix this, update
schema_capabilities() to include TextPhrase alongside Text for the relevant text
payload schema types, ensuring both variants are aligned and preventing
incorrect index-creation suggestions.

Comment on lines +2030 to +2063
/// Full-text phrase match of the string.
#[derive(Debug, Deserialize, Serialize, JsonSchema, Clone, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub struct MatchPhrase {
pub phrase: String,
}

impl<S: Into<String>> From<S> for MatchPhrase {
fn from(text: S) -> Self {
MatchPhrase {
phrase: text.into(),
}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider basic validation for MatchPhrase input

MatchPhrase accepts any string – including an empty or purely-whitespace phrase – which will later be passed to the tokenizer / index.
A zero-length or blank phrase is almost certainly user error and will either:

  • silently match nothing, or
  • be rejected further down the stack, but only at runtime.

Guarding early keeps the API sharp:

 #[derive(Debug, Deserialize, Serialize, JsonSchema, Clone, PartialEq, Eq)]
 #[serde(rename_all = "snake_case")]
 pub struct MatchPhrase {
     pub phrase: String,
 }

 impl<S: Into<String>> From<S> for MatchPhrase {
     fn from(text: S) -> Self {
-        MatchPhrase {
-            phrase: text.into(),
-        }
+        let phrase = text.into();
+        assert!(
+            !phrase.trim().is_empty(),
+            "MatchPhrase cannot be empty"
+        );
+        MatchPhrase { phrase }
     }
 }

If panic is too strong, switch to a TryFrom with a proper error type or add a dedicated new constructor that validates.
This keeps deserialization and programmatic construction safe.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Full-text phrase match of the string.
#[derive(Debug, Deserialize, Serialize, JsonSchema, Clone, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub struct MatchPhrase {
pub phrase: String,
}
impl<S: Into<String>> From<S> for MatchPhrase {
fn from(text: S) -> Self {
MatchPhrase {
phrase: text.into(),
}
}
}
/// Full-text phrase match of the string.
#[derive(Debug, Deserialize, Serialize, JsonSchema, Clone, PartialEq, Eq)]
#[serde(rename_all = "snake_case")]
pub struct MatchPhrase {
pub phrase: String,
}
impl<S: Into<String>> From<S> for MatchPhrase {
fn from(text: S) -> Self {
let phrase = text.into();
assert!(
!phrase.trim().is_empty(),
"MatchPhrase cannot be empty"
);
MatchPhrase { phrase }
}
}
🤖 Prompt for AI Agents
In lib/segment/src/types.rs around lines 2030 to 2044, the MatchPhrase struct
currently accepts any string including empty or whitespace-only phrases, which
can cause runtime errors or silent failures. To fix this, add validation to
ensure the phrase is not empty or just whitespace. Implement a TryFrom trait or
a new constructor method that checks the input string and returns an error or
Result if invalid, instead of allowing invalid phrases. This validation should
be applied both for programmatic construction and deserialization to keep the
API safe and robust.

Comment on lines +280 to +292
/// Tries to parse a phrase query. If there are any unseen tokens, returns `None`
///
/// Preserves token order
pub fn parse_phrase_query(
&self,
phrase: &str,
hw_counter: &HardwareCounterCell,
) -> Option<ParsedQuery> {
let document = self.parse_document(phrase, hw_counter);
Some(ParsedQuery::Phrase(document))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

parse_phrase_query never fails – violates its own docs

The docstring states “If there are any unseen tokens, returns None”, yet the function always returns Some.
This can incorrectly allow a phrase containing only unknown tokens to be treated as an empty phrase, potentially matching every document depending on Document::has_phrase semantics.

Recommended fix:

-    pub fn parse_phrase_query(
-        &self,
-        phrase: &str,
-        hw_counter: &HardwareCounterCell,
-    ) -> Option<ParsedQuery> {
-        let document = self.parse_document(phrase, hw_counter);
-        Some(ParsedQuery::Phrase(document))
-    }
+    pub fn parse_phrase_query(
+        &self,
+        phrase: &str,
+        hw_counter: &HardwareCounterCell,
+    ) -> Option<ParsedQuery> {
+        let mut total_tokens = 0usize;
+        let mut doc_tokens = Vec::new();
+
+        Tokenizer::tokenize_doc(phrase, self.config(), |tok| {
+            total_tokens += 1;
+            if let Some(id) = self.get_token(tok, hw_counter) {
+                doc_tokens.push(id);
+            }
+        });
+
+        if doc_tokens.len() != total_tokens {
+            return None; // unseen token detected
+        }
+        Some(ParsedQuery::Phrase(Document::new(doc_tokens)))
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Tries to parse a phrase query. If there are any unseen tokens, returns `None`
///
/// Preserves token order
pub fn parse_phrase_query(
&self,
phrase: &str,
hw_counter: &HardwareCounterCell,
) -> Option<ParsedQuery> {
let document = self.parse_document(phrase, hw_counter);
Some(ParsedQuery::Phrase(document))
}
/// Tries to parse a phrase query. If there are any unseen tokens, returns `None`
///
/// Preserves token order
pub fn parse_phrase_query(
&self,
phrase: &str,
hw_counter: &HardwareCounterCell,
) -> Option<ParsedQuery> {
let mut total_tokens = 0usize;
let mut doc_tokens = Vec::new();
Tokenizer::tokenize_doc(phrase, self.config(), |tok| {
total_tokens += 1;
if let Some(id) = self.get_token(tok, hw_counter) {
doc_tokens.push(id);
}
});
if doc_tokens.len() != total_tokens {
return None; // unseen token detected
}
Some(ParsedQuery::Phrase(Document::new(doc_tokens)))
}
🤖 Prompt for AI Agents
In lib/segment/src/index/field_index/full_text_index/text_index.rs around lines
280 to 291, the parse_phrase_query function's docstring states it returns None
if there are unseen tokens, but it always returns Some, which can cause
incorrect matches. Modify the function to check if the parsed document contains
any unseen tokens and return None in that case; otherwise, return Some with the
parsed phrase query. This ensures the function behavior matches its
documentation and prevents unintended matches.

@coszio coszio force-pushed the add-phrase-match-variant branch from 669fa9e to 560f4c7 Compare June 19, 2025 23:26
coderabbitai[bot]

This comment was marked as resolved.

Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see all review remarks

}
}

/// Tries to parse a phrase query. If there are any unseen tokens, returns `None`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are any unseen tokens, returns None

Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if a token hasn't been indexed, it means the query will return 0 results in this segment

drop_collection(collection_name)


@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a case here that tests a duplicate word, like "the the"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 2bfb949

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very very very very 🤡

coszio and others added 5 commits June 24, 2025 10:40
* expose setting in rest and grpc

* phrase matching openapi test
* allow rocksdb-based immutable text index

* fix repeated-token phrases

* fmt

* Update OpenAPI spec

---------

Co-authored-by: timvisee <tim@visee.me>
@coszio coszio force-pushed the add-phrase-match-variant branch from 5366848 to 8e43065 Compare June 24, 2025 15:24
coszio and others added 2 commits June 24, 2025 14:25
@coszio coszio merged commit 6919e3a into dev Jun 24, 2025
18 checks passed
@coszio coszio deleted the add-phrase-match-variant branch June 24, 2025 20:56
generall pushed a commit that referenced this pull request Jul 17, 2025
* add `"match": { "phrase": ... }` condition

* gen grpc and openapi

* [phrase matching] expose `phrase_matching` flag in rest and grpc (#6620)

* expose setting in rest and grpc

* phrase matching openapi test

* regen openapi

* [phrase matching] Text index fixes (#6730)

* allow rocksdb-based immutable text index

* fix repeated-token phrases

* fmt

* Update OpenAPI spec

---------

Co-authored-by: timvisee <tim@visee.me>

* add repeated word case in openapi test

* [phrase match | strict mode] Allow phrase condition when enabled in index (#6749)

* allow phrase filter when index is present

* prettier error message

* clippppppy

---------

Co-authored-by: timvisee <tim@visee.me>
@coderabbitai coderabbitai bot mentioned this pull request Aug 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants