Skip to content

[phrase match] ParsedQuery as enum#6617

Merged
coszio merged 3 commits intodevfrom
ParsedQuery-enum
Jun 2, 2025
Merged

[phrase match] ParsedQuery as enum#6617
coszio merged 3 commits intodevfrom
ParsedQuery-enum

Conversation

@coszio
Copy link
Contributor

@coszio coszio commented May 30, 2025

Builds on top of #6486

This is merely a refactoring to allow (with less changes) introducing phrase filtering as a separate way to use the full-text index.

@coszio coszio added this to the phrase match milestone May 30, 2025
@coszio coszio changed the title ParsedQuery as enum [phrase match] ParsedQuery as enum May 30, 2025
@coszio coszio force-pushed the introduce-ordered-document branch 2 times, most recently from c06a6f7 to 6322f4b Compare June 2, 2025 17:02
@coszio coszio force-pushed the ParsedQuery-enum branch from 5637302 to 5391a3e Compare June 2, 2025 17:22
@coszio coszio marked this pull request as ready for review June 2, 2025 17:23
Base automatically changed from introduce-ordered-document to dev June 2, 2025 17:24
@coszio coszio force-pushed the ParsedQuery-enum branch from 5391a3e to 1ecfcf9 Compare June 2, 2025 17:26
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jun 2, 2025

📝 Walkthrough

Walkthrough

This set of changes refactors the full text inverted index implementation by introducing a new TokenSet abstraction and updating the query representation from a struct with a token vector to an enum with a TokenSet variant. Methods for filtering and matching documents or points based on token subsets are modularized into dedicated helper functions (filter_has_subset, check_has_subset) across the immutable, mmap, and mutable inverted index types. The intersection logic for posting lists is generalized to work with different posting value types. Test and helper functions are updated to use the new abstractions, with no changes to public API signatures.

Possibly related PRs

  • Full text index with generic posting list #6565: Refactors the full text index's immutable and mmap inverted index implementations to use the new generic posting list abstraction, which is directly related to the modularization and generalization of posting list intersection in this PR.

Suggested reviewers

  • timvisee
  • generall

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31384f1 and 1ecfcf9.

📒 Files selected for processing (6)
  • lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4)
lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (1)
  • filter_has_subset (24-57)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1)
  • filter_has_subset (148-182)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (4)
  • tokens (29-31)
  • new (70-72)
  • iter (52-56)
  • query (299-302)
lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)
  • postings (37-40)
  • intersect_postings_iterator (6-22)
lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (5)
lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)
  • filter_has_subset (74-97)
  • filter (202-210)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4)
  • filter_has_subset (148-182)
  • filter (301-309)
  • intersection (156-179)
  • check_has_subset (184-218)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (5)
  • tokens (29-31)
  • filter (144-148)
  • new (70-72)
  • iter (52-56)
  • query (299-302)
lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)
  • postings (37-40)
  • intersect_compressed_postings_iterator (24-58)
lib/posting_list/src/posting_list.rs (2)
  • view (89-107)
  • iter (114-116)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (8)
lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)
  • check_match (224-235)
  • points_count (246-248)
lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (2)
  • check_match (143-152)
  • points_count (167-169)
lib/segment/src/index/field_index/full_text_index/text_index.rs (5)
  • check_match (173-190)
  • query (274-283)
  • query (617-620)
  • tokens (259-259)
  • points_count (114-120)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (2)
  • check_match (329-338)
  • points_count (362-364)
lib/segment/src/index/field_index/geo_index/mutable_geo_index.rs (1)
  • points_count (148-150)
lib/segment/src/index/field_index/geo_index/immutable_geo_index.rs (1)
  • points_count (284-286)
lib/segment/src/index/field_index/geo_index/mmap_geo_index.rs (1)
  • points_count (374-378)
lib/segment/src/index/field_index/geo_index/mod.rs (1)
  • points_count (89-95)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4)
lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)
  • filter_has_subset (74-97)
  • filter (202-210)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (5)
  • tokens (29-31)
  • filter (144-148)
  • new (70-72)
  • iter (52-56)
  • query (299-302)
lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)
  • postings (37-40)
  • intersect_compressed_postings_iterator (24-58)
lib/common/common/src/counter/hardware_counter.rs (1)
  • new (42-55)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: lint
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (21)
lib/segment/src/index/field_index/full_text_index/text_index.rs (2)

259-261: LGTM! Correctly implements the new ParsedQuery enum structure.

The conversion from collecting tokens into a vector to using TokenSet and wrapping in ParsedQuery::Tokens aligns perfectly with the refactoring objectives. The optional collection pattern handles missing tokens appropriately.


620-622: LGTM! Test helper updated consistently.

The test helper function is updated to match the same pattern as the main implementation, maintaining consistency across the codebase.

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (2)

74-97: LGTM! Well-structured subset filtering logic.

The new filter_has_subset method correctly:

  • Handles unseen tokens by returning empty iterator
  • Handles empty token sets by returning empty iterator
  • Uses intersect_postings_iterator for efficient intersection
  • Includes helpful comments explaining the logic

This centralizes the filtering logic and improves code organization.


207-209: LGTM! Clean delegation to the new subset filtering method.

The filter method correctly matches on the ParsedQuery::Tokens variant and delegates to the new filter_has_subset method, maintaining the same behavior while improving code structure.

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (2)

2-2: LGTM! Updated imports for the generic refactoring.

The import updates correctly include the types needed for the generic intersect_compressed_postings_iterator function.


24-25: LGTM! Good generalization of the intersection function.

Making intersect_compressed_postings_iterator generic over PostingValue improves reusability across different posting list types while maintaining the same intersection logic. The generic constraint V: PostingValue + 'a is appropriate.

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (4)

23-57: LGTM! Comprehensive subset filtering implementation.

The filter_has_subset method correctly:

  • Filters out deleted documents using point_to_tokens_count
  • Handles unseen tokens by returning empty iterator
  • Handles empty token sets by returning empty iterator
  • Uses intersect_compressed_postings_iterator for efficient intersection
  • Maintains consistency with the mutable implementation pattern

The nested intersection function provides good separation of concerns.


59-83: LGTM! Efficient subset checking logic.

The check_has_subset method efficiently checks if a specific point contains all tokens by:

  • Handling empty token sets appropriately
  • Checking document presence first
  • Using posting_list.visitor().contains() for efficient lookup
  • Leveraging the nested check_intersection function for clarity

This provides a more efficient alternative to filtering when checking a single point.


126-128: LGTM! Clean delegation pattern.

The filter method correctly delegates to filter_has_subset for the ParsedQuery::Tokens variant, maintaining consistency with other inverted index implementations.


149-151: LGTM! Consistent delegation for match checking.

The check_match method correctly delegates to check_has_subset for the ParsedQuery::Tokens variant, providing efficient single-point subset checking.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (5)

20-20: LGTM: Import addition is correct.

The TokenSet import is properly added to support the new abstraction.


148-182: LGTM: Well-implemented token subset filtering.

The filter_has_subset method correctly:

  • Filters out deleted points using the is_active check
  • Handles edge cases (empty queries, missing tokens) by returning empty iterators
  • Uses the established intersect_compressed_postings_iterator for efficient intersection
  • Properly threads the hardware counter through to posting list access

The implementation is consistent with similar methods in other index implementations.


184-218: LGTM: Correct implementation of subset checking.

The check_has_subset method properly:

  • Validates non-empty queries and point existence
  • Uses efficient contains checks on posting list visitors
  • Handles the hardware counter correctly for posting list access
  • Follows the same pattern as implementations in other index types

The use of &TokenSet (by reference) is appropriate for this read-only check operation.


306-308: LGTM: Correct delegation to the new helper method.

The filter method properly matches on the ParsedQuery enum and delegates to filter_has_subset for the Tokens variant, maintaining the existing interface while using the new abstraction.


335-337: LGTM: Consistent enum handling and delegation.

The check_match method correctly matches on the ParsedQuery enum and delegates to check_has_subset, maintaining consistency with the filter method's approach.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (6)

37-39: LGTM: Efficient and correct subset checking.

The has_subset method efficiently checks subset containment using all() and the existing contains() method (which uses binary search). The implementation is clear and performant.


76-82: LGTM: Well-designed enum preparing for future functionality.

The refactoring of ParsedQuery to an enum with the Tokens(TokenSet) variant is well-structured and clearly prepares for future phrase matching capabilities. The documentation clearly explains the current variant's behavior.


85-96: LGTM: Clean implementation using the new abstraction.

The updated check_match method correctly:

  • Matches on the ParsedQuery enum
  • Handles empty query edge cases appropriately
  • Leverages the new has_subset method for clean, readable logic

158-164: LGTM: Proper delegation pattern.

The estimate_cardinality method correctly matches on the ParsedQuery enum and delegates to the appropriate estimation method, maintaining clean separation of concerns.


166-171: LGTM: Consistent API usage and clear naming.

The method is appropriately renamed to estimate_has_subset_cardinality for clarity, and correctly uses tokens.tokens() to access the underlying token vector through the TokenSet API.

Also applies to: 174-176


302-303: LGTM: Test helper correctly updated for new enum.

The test helper function is properly updated to construct ParsedQuery::Tokens(tokens) instead of the previous struct format, maintaining test compatibility with the new enum structure.

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coszio coszio merged commit f0525a3 into dev Jun 2, 2025
24 of 25 checks passed
@coszio coszio deleted the ParsedQuery-enum branch June 2, 2025 18:26
generall pushed a commit that referenced this pull request Jul 17, 2025
* parsed query now specifies intersection match

* use tokenset in parsed query

* review nits
@coderabbitai coderabbitai bot mentioned this pull request Aug 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants