[phrase match] `ParsedQuery` as enum by coszio · Pull Request #6617 · qdrant/qdrant

coszio · 2025-05-30T16:12:49Z

Builds on top of #6486

This is merely a refactoring to allow (with less changes) introducing phrase filtering as a separate way to use the full-text index.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs

coderabbitai · 2025-06-02T17:29:21Z

📝 Walkthrough

Walkthrough

This set of changes refactors the full text inverted index implementation by introducing a new TokenSet abstraction and updating the query representation from a struct with a token vector to an enum with a TokenSet variant. Methods for filtering and matching documents or points based on token subsets are modularized into dedicated helper functions (filter_has_subset, check_has_subset) across the immutable, mmap, and mutable inverted index types. The intersection logic for posting lists is generalized to work with different posting value types. Test and helper functions are updated to use the new abstractions, with no changes to public API signatures.

Possibly related PRs

Full text index with generic posting list #6565: Refactors the full text index's immutable and mmap inverted index implementations to use the new generic posting list abstraction, which is directly related to the modularization and generalization of posting list intersection in this PR.

Suggested reviewers

timvisee
generall

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31384f1 and 1ecfcf9.

📒 Files selected for processing (6)

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (4 hunks)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (4 hunks)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4 hunks)
lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2 hunks)
lib/segment/src/index/field_index/full_text_index/text_index.rs (2 hunks)

🧰 Additional context used

🧬 Code Graph Analysis (4)

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4)

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (1)

filter_has_subset (24-57)

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1)

filter_has_subset (148-182)

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (4)

tokens (29-31)

new (70-72)

iter (52-56)

query (299-302)

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)

postings (37-40)

intersect_postings_iterator (6-22)

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (5)

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)

filter_has_subset (74-97)

filter (202-210)

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4)

filter_has_subset (148-182)

filter (301-309)

intersection (156-179)

check_has_subset (184-218)

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (5)

tokens (29-31)

filter (144-148)

new (70-72)

iter (52-56)

query (299-302)

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)

postings (37-40)

intersect_compressed_postings_iterator (24-58)

lib/posting_list/src/posting_list.rs (2)

view (89-107)

iter (114-116)

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (8)

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)

check_match (224-235)

points_count (246-248)

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (2)

check_match (143-152)

points_count (167-169)

lib/segment/src/index/field_index/full_text_index/text_index.rs (5)

check_match (173-190)

query (274-283)

query (617-620)

tokens (259-259)

points_count (114-120)

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (2)

check_match (329-338)

points_count (362-364)

lib/segment/src/index/field_index/geo_index/mutable_geo_index.rs (1)

points_count (148-150)

lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs (1)

points_count (284-286)

lib/segment/src/index/field_index/geo_index/mmap_geo_index.rs (1)

points_count (374-378)

lib/segment/src/index/field_index/geo_index/mod.rs (1)

points_count (89-95)

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4)

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)

filter_has_subset (74-97)

filter (202-210)

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (5)

tokens (29-31)

filter (144-148)

new (70-72)

iter (52-56)

query (299-302)

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)

postings (37-40)

intersect_compressed_postings_iterator (24-58)

lib/common/common/src/counter/hardware_counter.rs (1)

new (42-55)

⏰ Context from checks skipped due to timeout of 90000ms (13)

GitHub Check: lint
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: storage-compat-test
GitHub Check: integration-tests-consensus
GitHub Check: test-consistency
GitHub Check: integration-tests
GitHub Check: test-consensus-compose
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: rust-tests (windows-latest)
GitHub Check: rust-tests (ubuntu-latest)
GitHub Check: rust-tests (macos-latest)

🔇 Additional comments (21)

lib/segment/src/index/field_index/full_text_index/text_index.rs (2)

259-261: LGTM! Correctly implements the new ParsedQuery enum structure.

The conversion from collecting tokens into a vector to using TokenSet and wrapping in ParsedQuery::Tokens aligns perfectly with the refactoring objectives. The optional collection pattern handles missing tokens appropriately.

620-622: LGTM! Test helper updated consistently.

The test helper function is updated to match the same pattern as the main implementation, maintaining consistency across the codebase.

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)

74-97: LGTM! Well-structured subset filtering logic.

The new filter_has_subset method correctly:

Handles unseen tokens by returning empty iterator

Handles empty token sets by returning empty iterator

Uses intersect_postings_iterator for efficient intersection

Includes helpful comments explaining the logic

This centralizes the filtering logic and improves code organization.

207-209: LGTM! Clean delegation to the new subset filtering method.

The filter method correctly matches on the ParsedQuery::Tokens variant and delegates to the new filter_has_subset method, maintaining the same behavior while improving code structure.

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)

2-2: LGTM! Updated imports for the generic refactoring.

The import updates correctly include the types needed for the generic intersect_compressed_postings_iterator function.

24-25: LGTM! Good generalization of the intersection function.

Making intersect_compressed_postings_iterator generic over PostingValue improves reusability across different posting list types while maintaining the same intersection logic. The generic constraint V: PostingValue + 'a is appropriate.

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (4)

23-57: LGTM! Comprehensive subset filtering implementation.

The filter_has_subset method correctly:

Filters out deleted documents using point_to_tokens_count

Handles unseen tokens by returning empty iterator

Handles empty token sets by returning empty iterator

Uses intersect_compressed_postings_iterator for efficient intersection

Maintains consistency with the mutable implementation pattern

The nested intersection function provides good separation of concerns.

59-83: LGTM! Efficient subset checking logic.

The check_has_subset method efficiently checks if a specific point contains all tokens by:

Handling empty token sets appropriately

Checking document presence first

Using posting_list.visitor().contains() for efficient lookup

Leveraging the nested check_intersection function for clarity

This provides a more efficient alternative to filtering when checking a single point.

126-128: LGTM! Clean delegation pattern.

The filter method correctly delegates to filter_has_subset for the ParsedQuery::Tokens variant, maintaining consistency with other inverted index implementations.

149-151: LGTM! Consistent delegation for match checking.

The check_match method correctly delegates to check_has_subset for the ParsedQuery::Tokens variant, providing efficient single-point subset checking.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (5)

20-20: LGTM: Import addition is correct.

The TokenSet import is properly added to support the new abstraction.

148-182: LGTM: Well-implemented token subset filtering.

The filter_has_subset method correctly:

Filters out deleted points using the is_active check

Handles edge cases (empty queries, missing tokens) by returning empty iterators

Uses the established intersect_compressed_postings_iterator for efficient intersection

Properly threads the hardware counter through to posting list access

The implementation is consistent with similar methods in other index implementations.

184-218: LGTM: Correct implementation of subset checking.

The check_has_subset method properly:

Validates non-empty queries and point existence

Uses efficient contains checks on posting list visitors

Handles the hardware counter correctly for posting list access

Follows the same pattern as implementations in other index types

The use of &TokenSet (by reference) is appropriate for this read-only check operation.

306-308: LGTM: Correct delegation to the new helper method.

The filter method properly matches on the ParsedQuery enum and delegates to filter_has_subset for the Tokens variant, maintaining the existing interface while using the new abstraction.

335-337: LGTM: Consistent enum handling and delegation.

The check_match method correctly matches on the ParsedQuery enum and delegates to check_has_subset, maintaining consistency with the filter method's approach.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (6)

37-39: LGTM: Efficient and correct subset checking.

The has_subset method efficiently checks subset containment using all() and the existing contains() method (which uses binary search). The implementation is clear and performant.

76-82: LGTM: Well-designed enum preparing for future functionality.

The refactoring of ParsedQuery to an enum with the Tokens(TokenSet) variant is well-structured and clearly prepares for future phrase matching capabilities. The documentation clearly explains the current variant's behavior.

85-96: LGTM: Clean implementation using the new abstraction.

The updated check_match method correctly:

Matches on the ParsedQuery enum

Handles empty query edge cases appropriately

Leverages the new has_subset method for clean, readable logic

158-164: LGTM: Proper delegation pattern.

The estimate_cardinality method correctly matches on the ParsedQuery enum and delegates to the appropriate estimation method, maintaining clean separation of concerns.

166-171: LGTM: Consistent API usage and clear naming.

The method is appropriately renamed to estimate_has_subset_cardinality for clarity, and correctly uses tokens.tokens() to access the underlying token vector through the TokenSet API.

Also applies to: 174-176

302-303: LGTM: Test helper correctly updated for new enum.

The test helper function is properly updated to construct ParsedQuery::Tokens(tokens) instead of the previous struct format, maintaining test compatibility with the new enum structure.

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

* parsed query now specifies intersection match * use tokenset in parsed query * review nits

coszio mentioned this pull request May 30, 2025

[phrase match] Generic mmap postings #6619

Merged

coszio added this to the phrase match milestone May 30, 2025

coszio changed the title ~~ParsedQuery as enum~~ [phrase match] ParsedQuery as enum May 30, 2025

timvisee approved these changes Jun 2, 2025

View reviewed changes

lib/segment/src/index/field_index/full_text_index/inverted_index.rs Outdated Show resolved Hide resolved

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs Outdated Show resolved Hide resolved

coszio force-pushed the introduce-ordered-document branch 2 times, most recently from c06a6f7 to 6322f4b Compare June 2, 2025 17:02

coszio force-pushed the ParsedQuery-enum branch from 5637302 to 5391a3e Compare June 2, 2025 17:22

coszio marked this pull request as ready for review June 2, 2025 17:23

Base automatically changed from introduce-ordered-document to dev June 2, 2025 17:24

coszio added 3 commits June 2, 2025 13:25

parsed query now specifies intersection match

1825e28

use tokenset in parsed query

3f4b207

review nits

1ecfcf9

coszio force-pushed the ParsedQuery-enum branch from 5391a3e to 1ecfcf9 Compare June 2, 2025 17:26

coszio merged commit f0525a3 into dev Jun 2, 2025
24 of 25 checks passed

coszio deleted the ParsedQuery-enum branch June 2, 2025 18:26

coderabbitai bot mentioned this pull request Jun 3, 2025

[phrase match] Mmap full text index for phrase #6602

Merged

This was referenced Jun 12, 2025

[phrase matching] Handle phrase filtering #6654

Merged

[phrase matching] expose phrase condition #6670

Merged

coderabbitai bot mentioned this pull request Jun 27, 2025

Optimize Text payload matcher #6766

Closed

coderabbitai bot mentioned this pull request Jul 16, 2025

Replace some Box<T> occurrences with Either #6887

Merged

generall pushed a commit that referenced this pull request Jul 17, 2025

[phrase match] ParsedQuery as enum (#6617)

30ad86d

* parsed query now specifies intersection match * use tokenset in parsed query * review nits

coderabbitai bot mentioned this pull request Aug 19, 2025

filter text any #7100

Merged

coderabbitai bot mentioned this pull request Sep 15, 2025

Fix phrase match to not match with unknown tokens #7252

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[phrase match] `ParsedQuery` as enum#6617

[phrase match] `ParsedQuery` as enum#6617
coszio merged 3 commits intodevfrom
ParsedQuery-enum

coszio commented May 30, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Jun 2, 2025

Walkthrough

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

coszio commented May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Jun 2, 2025

Walkthrough

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coszio commented May 30, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)