
[phrase match] Mmap full text index for phrase#6602

Merged
coszio merged 2 commits into dev from mmap-full-text-index-for-phrase
Jun 3, 2025

Conversation


@coszio coszio commented May 27, 2025

Builds on top of #6619

Enables positions in postings by introducing two new enums:

  • ImmutablePostings for the immutable index.
  • MmapPostingsEnum for the mmap index.
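The idea behind these enums can be sketched as follows. This is an illustrative stand-in, not the real qdrant types: plain `Vec`s replace the compressed and mmap-backed posting lists, and the variant names and `posting_len` signature are assumptions.

```rust
type PointOffsetType = u32;

// Hypothetical sketch: one enum variant per posting layout, so the same
// index code can hold postings with or without positional information.
pub enum ImmutablePostings {
    // Posting lists holding only point ids.
    Ids(Vec<Vec<PointOffsetType>>),
    // Posting lists holding point ids together with in-document positions.
    WithPositions(Vec<Vec<(PointOffsetType, Vec<u32>)>>),
}

impl ImmutablePostings {
    // Length of the posting list for a token, or None if out of bounds.
    pub fn posting_len(&self, token_id: u32) -> Option<usize> {
        match self {
            ImmutablePostings::Ids(lists) => lists.get(token_id as usize).map(|l| l.len()),
            ImmutablePostings::WithPositions(lists) => {
                lists.get(token_id as usize).map(|l| l.len())
            }
        }
    }
}
```

A MmapPostingsEnum would mirror the same two-variant shape over mmap-backed storage.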

@coszio coszio added this to the phrase match milestone May 27, 2025
@coszio coszio force-pushed the introduce-ordered-document branch from 74539b7 to ec4561c on May 28, 2025 13:30
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch from b4f8034 to 8513c9e on May 28, 2025 13:30
@coszio coszio force-pushed the introduce-ordered-document branch from ec4561c to e2452d0 on May 28, 2025 19:59
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch 2 times, most recently from 0ca392d to 651d0a8 on May 30, 2025 16:32
@coszio coszio changed the base branch from introduce-ordered-document to ParsedQuery-enum May 30, 2025 19:09
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch 2 times, most recently from 1728a44 to 6834d87 on May 30, 2025 20:27
@coszio coszio changed the base branch from ParsedQuery-enum to generic-mmap-postings May 30, 2025 20:31
@coszio coszio force-pushed the generic-mmap-postings branch 2 times, most recently from 9ef8a46 to 999c924 on June 2, 2025 19:13
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch from 6834d87 to e741a8f on June 2, 2025 19:13
#[cfg_attr(test, derive(Clone))]
#[derive(Default, Debug)]
#[derive(Debug)]
pub enum ImmutablePostings {
A reviewer (Member) commented:

Let's put in a new file

@generall generall left a comment


nit: let's add a comment for the filter_has_subset function to describe exactly what it is doing
nit: let's try to move big structures with implementations into dedicated files

@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch from e741a8f to da61ccf on June 3, 2025 16:35
Base automatically changed from generic-mmap-postings to dev June 3, 2025 19:21
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch from 3a51585 to d701880 on June 3, 2025 19:28
@coszio coszio marked this pull request as ready for review June 3, 2025 19:28

coderabbitai bot commented Jun 3, 2025

📝 Walkthrough

This change introduces a new optional boolean field, phrase_matching, to the TextIndexParams struct, annotated to be skipped during serialization for now. The conversion logic between internal and API representations of text index parameters is updated to handle this new field, though it is not yet actively used. The full text index subsystem is refactored to support postings with or without positional information by introducing new enums (ImmutablePostings, MmapPostingsEnum) and updating the construction, opening, and query logic accordingly. Several test cases and method signatures are updated to accommodate the new field and refactored types.

Possibly related PRs

  • qdrant/qdrant#6617: Refactors ParsedQuery and centralizes subset filtering logic in inverted indexes to prepare for phrase matching, closely related to the introduction of the phrase_matching field and phrase matching support.
  • qdrant/qdrant#6619: Generalizes MmapPostings to be generic over the stored value type (including positions), directly related to the enum-based approach for postings with or without positions introduced in this PR.
  • qdrant/qdrant#6486: Introduces ordered Document and TokenSet abstractions in the full-text inverted index, related to enhancements in full-text indexing data structures and phrase matching support.

Suggested reviewers

  • timvisee

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
lib/segment/src/index/field_index/full_text_index/immutable_postings_enum.rs (1)

23-32: Add bounds checking documentation.

The method correctly handles bounds checking by using Option::map, but this behavior should be documented.

Add documentation to clarify the bounds checking behavior:

+    /// Returns the length of the posting list for the given token.
+    /// Returns `None` if the token ID is out of bounds.
     pub fn posting_len(&self, token: TokenId) -> Option<usize> {
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings_enum.rs (1)

37-56: Consider extracting common iterator pattern.

The iter_ids implementation is very similar to the one in ImmutablePostings. Consider if a trait or macro could reduce this duplication.

Both enum implementations have nearly identical iter_ids methods. Consider creating a trait:

trait PostingsIterator {
    fn iter_ids(&self, token_id: TokenId) -> Option<Box<dyn Iterator<Item = PointOffsetType> + '_>>;
}

This would reduce code duplication and ensure consistent behavior across implementations.

lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)

28-28: Consider extracting the duplicate logic into a helper method.

The logic config.phrase_matching == Some(true) is duplicated in both the open and finalize methods. Consider extracting this into a helper method for better maintainability.

Add a helper method to the TextIndexParams struct:

impl TextIndexParams {
    pub fn has_positions(&self) -> bool {
        self.phrase_matching == Some(true)
    }
}

Then use it as:

-let has_positions = config.phrase_matching == Some(true);
+let has_positions = config.has_positions();

Also applies to: 182-182

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (2)

244-276: Consider adding documentation for this helper function.

The implementation is solid, but it would benefit from documentation explaining its purpose and the remapping logic.

+/// Filters out empty postings and remaps token IDs to create a compact representation.
+/// Returns the filtered postings, updated vocab with new token IDs, and a mapping from original to new token IDs.
 fn optimized_postings_and_vocab(

293-334: Complex but correct position extraction logic.

The function correctly builds postings with positions. The expect() at lines 326-328 is justified by the invariant that any document in a token's posting list must contain that token at least once.

Consider adding documentation to explain the algorithm.

+/// Creates compressed postings with position information.
+/// Processes documents to extract token positions and builds posting lists that include
+/// these positions for each token.
 fn create_compressed_postings_with_positions(
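The position-extraction step this comment describes can be illustrated with a toy helper. The function name `token_positions` and the plain `HashMap` output are assumptions for illustration; the real code builds compressed posting lists rather than a map.

```rust
use std::collections::HashMap;

// Hypothetical sketch: walk one ordered document (a slice of token ids,
// duplicates allowed) and record the in-document offsets at which each
// token occurs. Every token that appears maps to at least one position,
// which is the invariant that justifies the expect() mentioned above.
fn token_positions(doc: &[u32]) -> HashMap<u32, Vec<u32>> {
    let mut positions: HashMap<u32, Vec<u32>> = HashMap::new();
    for (pos, token) in doc.iter().enumerate() {
        // Offsets are pushed in order, so each Vec stays sorted.
        positions.entry(*token).or_default().push(pos as u32);
    }
    positions
}
```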
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 96ca31a and d701880.

📒 Files selected for processing (16)
  • lib/api/src/grpc/conversions.rs (2 hunks)
  • lib/segment/src/data_types/index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_postings_enum.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings_enum.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (8 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (22)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1)

167-167: LGTM: Test configuration updated correctly.

The addition of phrase_matching: None to the TextIndexParams struct initialization is consistent with the new field introduced in the broader phrase matching feature. This ensures the test remains compatible with the updated struct definition.

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (1)

74-74: LGTM: Helpful documentation added.

The documentation comment clearly describes the purpose of the filter_has_subset method, improving code readability and maintainability.

lib/segment/src/index/field_index/full_text_index/mod.rs (1)

1-1: LGTM: New module addition supports phrase matching functionality.

The addition of the immutable_postings_enum module is well-aligned with the PR objectives to introduce phrase matching capabilities through postings with positional information. This provides a clean architectural foundation for the new ImmutablePostings enum.

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)

258-258: LGTM: Test configuration consistently updated.

The addition of phrase_matching: None maintains test compatibility with the updated TextIndexParams struct definition and follows the consistent pattern applied across other test files.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1)

154-154: LGTM: Test configuration update completes consistent pattern.

The addition of phrase_matching: None to the TextIndexParams struct initialization maintains compatibility with the updated struct definition. This change, along with similar updates across other test files, demonstrates a well-coordinated and consistent approach to integrating the new phrase matching feature.

lib/segment/src/data_types/index.rs (1)

197-200: New field added for phrase matching support.

The addition of the phrase_matching field aligns with the PR objectives for introducing phrase matching capabilities. The implementation approach is sound:

  • Using Option<bool> maintains backwards compatibility
  • The #[serde(skip)] annotation prevents premature exposure in the API
  • Field placement at the end preserves struct layout

Remember to update the serialization behavior when removing the skip attribute in the future, and ensure proper API documentation and validation are added.

lib/segment/src/index/field_index/full_text_index/positions.rs (2)

5-5: Good addition of Default trait derivation.

Deriving Default for the Positions struct enables easy initialization of empty position lists, which is essential for building postings with positional information.


8-12: Clean API for building position lists.

The push method provides a straightforward interface for appending positions to the internal vector. The implementation correctly delegates to the underlying Vec<u32>.
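Assuming Positions is a newtype over Vec<u32> as this summary suggests, a minimal sketch of the Default derive and push method might look like this (the tuple-struct layout is an assumption):

```rust
// Hypothetical sketch of the Positions wrapper: Default gives an empty
// position list, and push delegates to the underlying Vec<u32>.
#[derive(Default, Debug, PartialEq)]
pub struct Positions(Vec<u32>);

impl Positions {
    // Append one in-document token position.
    pub fn push(&mut self, position: u32) {
        self.0.push(position);
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }
}
```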

lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

369-372: ✅ Verification successful

Method signature alignment for mmap remove_point.

The change aligns with the updated MmapFullTextIndex::remove_point method signature that now returns () instead of a Result. The explicit call followed by Ok(()) maintains the expected interface.

Verify that error handling is consistent across all remove_point implementations:


🏁 Script executed:

#!/bin/bash
# Description: Check remove_point method signatures across full text index implementations
# Expected: Consistent error handling patterns

# Search for remove_point method definitions in full text index modules
ast-grep --pattern $'impl $_ {
  $$$
  fn remove_point(&mut self, $_) -> $_ {
    $$$
  }
  $$$
}'

Length of output: 99


Couldn’t detect any remove_point definitions via AST—let’s fallback to a regex search to ensure we catch all implementations and match arms:


🏁 Script executed:

#!/bin/bash
# List all remove_point method definitions in full_text_index modules
rg -n "fn remove_point" -C2 lib/segment/src/index/field_index/full_text_index

# Show all match arms invoking remove_point in text_index.rs
rg -n "FullTextIndex::.*remove_point" -C2 lib/segment/src/index/field_index/full_text_index/text_index.rs

Length of output: 3139


Method signature alignment for mmap remove_point

Verified that:

  • MutableTextIndex::remove_point and ImmutableTextIndex::remove_point both return OperationResult<()>.
  • MmapTextIndex::remove_point is inherently infallible (returns ()), so the wrapper in FullTextIndex::Mmap correctly calls index.remove_point(id); and then returns Ok(()).

No inconsistencies found—error‐handling remains uniform across all variants.
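The wrapper pattern verified above can be illustrated with simplified stand-in types (OperationResult is reduced to a plain Result and the mmap index body is omitted, so this is a sketch, not the real qdrant code): the infallible call is made first, then Ok(()) preserves the uniform Result interface.

```rust
type PointOffsetType = u32;
type OperationResult<T> = Result<T, String>;

struct MmapTextIndex;

impl MmapTextIndex {
    // Infallible removal: marking a point as deleted cannot fail.
    fn remove_point(&mut self, _id: PointOffsetType) {}
}

enum FullTextIndex {
    Mmap(MmapTextIndex),
}

impl FullTextIndex {
    // The enum-level method keeps returning a Result so all variants
    // share one signature, even though the mmap arm cannot fail.
    fn remove_point(&mut self, id: PointOffsetType) -> OperationResult<()> {
        match self {
            FullTextIndex::Mmap(index) => {
                index.remove_point(id); // infallible call
                Ok(())
            }
        }
    }
}
```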

lib/segment/src/index/field_index/full_text_index/immutable_postings_enum.rs (2)

8-13: Well-designed enum for positional postings support.

The enum provides a clean abstraction for handling postings with or without positional information. The conditional Clone derive for tests is a good practice.


34-52: Excellent test-only iterator implementation.

The iter_ids method provides a clean interface for testing by abstracting over both enum variants. The trait object approach allows uniform handling in tests.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings_enum.rs (2)

9-12: Consistent enum design with ImmutablePostings.

The enum structure mirrors ImmutablePostings, providing consistency across different storage backends.


22-35: Proper hardware counter integration.

The method correctly passes the hardware counter to the underlying mmap operations for performance monitoring. The Option::map pattern handles bounds checking appropriately.

lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)

56-58: Verify the breaking API change is intentional.

The remove_point method signature has changed from returning OperationResult<()> to returning nothing. This is a breaking change that could affect existing callers expecting a Result type.

lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (1)

47-52: LGTM! Clean pattern matching for postings types.

The pattern matching correctly initializes the appropriate inverted index type based on whether the postings include positions or just IDs.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)

73-85: Good additions to the Document struct.

The len() method and IntoIterator implementation enhance the usability of the Document struct and enable more idiomatic Rust patterns.
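A hedged sketch of what the len() method and IntoIterator impl on Document might look like, assuming Document wraps an ordered Vec of token ids (the real field layout may differ):

```rust
type TokenId = u32;

// Hypothetical stand-in for the Document struct: tokens kept in
// document order, duplicates allowed.
pub struct Document {
    pub tokens: Vec<TokenId>,
}

impl Document {
    // Number of tokens in the document, counting duplicates.
    pub fn len(&self) -> usize {
        self.tokens.len()
    }
}

// Consuming iteration over the ordered tokens enables idiomatic
// `for token in document { ... }` loops.
impl IntoIterator for Document {
    type Item = TokenId;
    type IntoIter = std::vec::IntoIter<TokenId>;

    fn into_iter(self) -> Self::IntoIter {
        self.tokens.into_iter()
    }
}
```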

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (2)

105-112: Clean implementation of postings type selection.

The conditional logic correctly instantiates the appropriate postings variant based on the has_positions flag.


163-194: Well-structured polymorphic handling with helper functions.

The use of the generic intersection helper function effectively reduces code duplication while maintaining type safety across different postings variants. This is a good example of DRY principle application.
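A generic intersection helper in the spirit this comment praises could be sketched like this. The free function `intersect_sorted` and its Vec-based signature are hypothetical, not the actual qdrant helper:

```rust
// Hypothetical sketch: intersect several sorted posting lists by taking
// the first list as the candidate set and keeping only ids present in
// every remaining list. binary_search is valid because each posting
// list is sorted by point id.
fn intersect_sorted(lists: &[Vec<u32>]) -> Vec<u32> {
    let Some((first, rest)) = lists.split_first() else {
        return Vec::new();
    };
    first
        .iter()
        .copied()
        .filter(|id| rest.iter().all(|l| l.binary_search(id).is_ok()))
        .collect()
}
```

A real implementation would iterate the shortest list first and stream ids lazily, but the membership logic is the same.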

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (4)

30-47: LGTM! Clean factory methods for different posting types.

The two constructors provide clear, type-safe ways to create empty instances for different posting types.


49-85: Excellent refactoring to support multiple posting types.

The generic inner function elegantly handles both posting types without code duplication. The match statement provides clean dispatch based on the enum variant.


209-241: Well-structured conversion with proper handling of positions.

The implementation correctly:

  • Optimizes postings and vocab by filtering empty entries
  • Creates the appropriate posting type based on whether positions are available
  • Maintains consistency between data structures

336-415: Clean mmap conversion with good error checking.

The implementation correctly handles both posting types and includes helpful debug assertions to verify data consistency. The generic inner function is well-designed to handle different MmapPostingValue types.

fn test_mutable_to_immutable() {
let mutable = mutable_inverted_index(2000, 400);

// todo: test with phrase-enabled

@coderabbitai coderabbitai bot Jun 3, 2025


⚠️ Potential issue

Complete the phrase-enabled test coverage.

There are multiple TODO comments indicating missing test coverage for the phrase matching feature. This is a critical gap that should be addressed before merging.

Would you like me to help generate the phrase-enabled test cases or open an issue to track this work?

Also applies to: 385-385, 459-459

🤖 Prompt for AI Agents
In lib/segment/src/index/field_index/full_text_index/inverted_index.rs at lines
345, 385, and 459, there are TODO comments indicating missing test coverage for
the phrase-enabled feature. To fix this, write comprehensive test cases that
specifically cover phrase matching functionality, ensuring all edge cases and
typical usage scenarios are tested. This will close the critical test coverage
gap before merging.

@coszio (Contributor Author) replied:

Thanks for pointing it out, I have addressed these in the upcoming PR #6493

coderabbitai bot replied:

🧩 Analysis chain

@coszio Thanks for the quick response! I noticed you mentioned PR #6493, but the current PR is #6602. Could you double-check the PR number? I'd be happy to verify that the phrase-enabled test coverage has been properly addressed once I can confirm the correct PR reference.

It's great to hear that you're staying on top of the test coverage - comprehensive testing for the phrase matching feature will definitely strengthen the implementation! 🚀


🏁 Script executed:

#!/bin/bash
# Check if PR #6493 exists in the repository
gh pr view 6493 --json number,title,state 2>/dev/null || echo "PR #6493 not found"

Length of output: 137


@coszio Perfect! I can see that PR #6493 "[phrase matching] populate point_to_doc structures" does indeed exist and is open. That makes complete sense - addressing the phrase matching test coverage in a dedicated PR focused on populating the point_to_doc structures is a logical approach.

Thanks for handling the test coverage systematically across the related PRs. The phrase matching implementation will be much more robust with proper test coverage in place!

Feel free to tag me if you'd like a review on PR #6493 as well. 👍


🧬 Code Graph Analysis Results

The following code snippets are relevant to understanding the file lib/segment/src/index/field_index/full_text_index/inverted_index.rs and the user comment context:


From lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (lines 102-177):

fn get_vocab_mut(&mut self) -> &mut HashMap<String, TokenId> {
    &mut self.vocab
}

fn index_tokens(
    &mut self,
    point_id: PointOffsetType,
    tokens: TokenSet,
    hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
    self.points_count += 1;

    let mut hw_cell_wb = hw_counter
        .payload_index_io_write_counter()
        .write_back_counter();

    if self.point_to_tokens.len() <= point_id as usize {
        let new_len = point_id as usize + 1;

        // Only measure the overhead of `TokenSet` here since we account for the tokens a few lines below.
        hw_cell_wb
            .incr_delta((new_len - self.point_to_tokens.len()) * size_of::<Option<TokenSet>>());

        self.point_to_tokens.resize_with(new_len, Default::default);
    }

    for token_id in tokens.tokens() {
        let token_idx_usize = *token_id as usize;

        if self.postings.len() <= token_idx_usize {
            let new_len = token_idx_usize + 1;
            hw_cell_wb.incr_delta(new_len - self.postings.len());
            self.postings.resize_with(new_len, Default::default);
        }

        self.postings
            .get_mut(token_idx_usize)
            .expect("posting must exist")
            .insert(point_id);

        hw_cell_wb.incr_delta(size_of_val(&point_id));
    }
    self.point_to_tokens[point_id as usize] = Some(tokens);

    Ok(())
}

fn index_document(
    &mut self,
    point_id: PointOffsetType,
    ordered_document: Document,
    hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
    let Some(point_to_doc) = &mut self.point_to_doc else {
        // Phrase matching is not enabled
        return Ok(());
    };

    let mut hw_cell_wb = hw_counter
        .payload_index_io_write_counter()
        .write_back_counter();

    // Ensure container has enough capacity
    if point_to_doc.len() <= point_id as usize {
        let new_len = point_id as usize + 1;

        hw_cell_wb.incr_delta((new_len - point_to_doc.len()) * size_of::<Option<Document>>());

        point_to_doc.resize_with(new_len, Default::default);
    }

    // Store the ordered document
    point_to_doc[point_id as usize] = Some(ordered_document);

    Ok(())
}

fn remove(&mut self, idx: PointOffsetType) -> bool {
    if self.point_to_tokens.len() <= idx as usize {
        return false; // Already removed or never actually existed
    }

    let Some(removed_token_set) = self.point_to_tokens[idx as usize].take() else {
        return false;
    };

    self.points_count -= 1;

    for removed_token in removed_token_set.tokens() {
        // unwrap safety: posting list exists and contains the point idx
        let posting = self.postings.get_mut(*removed_token as usize).unwrap();
        posting.remove(idx);
    }

    if let Some(point_to_doc) = &mut self.point_to_doc {
        point_to_doc[idx as usize] = None;
    }

    true
}

fn filter(
    &self,
    query: ParsedQuery,
    _hw_counter: &HardwareCounterCell,
) -> Box<dyn Iterator<Item = PointOffsetType> + '_> {
    match query {
        ParsedQuery::Tokens(tokens) => self.filter_has_subset(tokens),
    }
}

fn get_posting_len(&self, token_id: TokenId, _: &HardwareCounterCell) -> Option<usize> {
    self.postings.get(token_id as usize).map(|x| x.len())
}

fn points_count(&self) -> usize {
    self.points_count
}

fn vocab_with_postings_len_iter(&self) -> impl Iterator<Item = (&str, usize)> + '_ {
    self.vocab.iter().filter_map(|(token, &posting_idx)| {
        self.postings
            .get(posting_idx as usize)
            .map(|postings| (token.as_str(), postings.len()))
    })
}

fn values_is_empty(&self, point_id: PointOffsetType) -> bool {
    self.get_tokens(point_id).is_none_or(|x| x.is_empty())
}

fn values_count(&self, point_id: PointOffsetType) -> usize {
    // Maybe we want number of documents in the future?
    self.get_tokens(point_id).map(|x| x.len()).unwrap_or(0)
}

fn get_token_id(&self, token: &str, _hw_counter: &HardwareCounterCell) -> Option<TokenId> {
    self.vocab.get(token).copied()
}

From lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (lines 119-198):

fn get_vocab_mut(&mut self) -> &mut HashMap<String, TokenId> {
    &mut self.vocab
}

fn index_tokens(
    &mut self,
    _idx: PointOffsetType,
    _tokens: super::inverted_index::TokenSet,
    _hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
    Err(OperationError::service_error(
        "Can't add values to immutable text index",
    ))
}

fn index_document(
    &mut self,
    _idx: PointOffsetType,
    _document: super::inverted_index::Document,
    _hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
    Err(OperationError::service_error(
        "Can't add values to immutable text index",
    ))
}

fn remove(&mut self, idx: PointOffsetType) -> bool {
    if self.values_is_empty(idx) {
        return false; // Already removed or never actually existed
    }
    self.point_to_tokens_count[idx as usize] = None;
    self.points_count -= 1;
    true
}

fn filter<'a>(
    &'a self,
    query: ParsedQuery,
    _hw_counter: &'a HardwareCounterCell,
) -> Box<dyn Iterator<Item = PointOffsetType> + 'a> {
    match query {
        ParsedQuery::Tokens(tokens) => self.filter_has_subset(tokens),
    }
}

fn get_posting_len(&self, token_id: TokenId, _: &HardwareCounterCell) -> Option<usize> {
    self.postings.posting_len(token_id)
}

fn points_count(&self) -> usize {
    self.points_count
}

fn vocab_with_postings_len_iter(&self) -> impl Iterator<Item = (&str, usize)> + '_ {
    self.vocab.iter().filter_map(|(token, &token_id)| {
        self.postings
            .posting_len(token_id)
            .map(|len| (token.as_str(), len))
    })
}

fn values_is_empty(&self, point_id: PointOffsetType) -> bool {
    self.point_to_tokens_count
        .get(point_id as usize)
        .is_none_or(|count| count.is_none())
}

fn values_count(&self, point_id: PointOffsetType) -> usize {
    self.point_to_tokens_count
        .get(point_id as usize)
        .and_then(|&count| count)
        .unwrap_or(0)
}

fn get_token_id(&self, token: &str, _: &HardwareCounterCell) -> Option<TokenId> {
    self.vocab.get(token).copied()
}

From lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (lines 273-379):

fn get_vocab_mut(&mut self) -> &mut HashMap<String, TokenId> {
    unreachable!("MmapInvertedIndex does not support mutable operations")
}

fn index_tokens(
    &mut self,
    _idx: PointOffsetType,
    _tokens: super::inverted_index::TokenSet,
    _hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
    Err(OperationError::service_error(
        "Can't add values to mmap immutable text index",
    ))
}

fn index_document(
    &mut self,
    _idx: PointOffsetType,
    _document: super::inverted_index::Document,
    _hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
    Err(OperationError::service_error(
        "Can't add values to mmap immutable text index",
    ))
}

fn remove(&mut self, idx: PointOffsetType) -> bool {
    let Some(is_deleted) = self.deleted_points.get(idx as usize) else {
        return false; // Never existed
    };

    if is_deleted {
        return false; // Already removed
    }

    self.deleted_points.set(idx as usize, true);
    if let Some(count) = self.point_to_tokens_count.get_mut(idx as usize) {
        *count = 0;

        // `deleted_points`'s length can be larger than `point_to_tokens_count`'s length.
        // Only if the index is within bounds of `point_to_tokens_count`, we decrement the active points count.
        self.active_points_count -= 1;
    }

    true
}

fn filter<'a>(
    &'a self,
    query: ParsedQuery,
    hw_counter: &'a HardwareCounterCell,
) -> Box<dyn Iterator<Item = PointOffsetType> + 'a> {
    match query {
        ParsedQuery::Tokens(tokens) => self.filter_has_subset(tokens, hw_counter),
    }
}

fn get_posting_len(
    &self,
    token_id: TokenId,
    hw_counter: &HardwareCounterCell,
) -> Option<usize> {
    self.postings.posting_len(token_id, hw_counter)
}

fn points_count(&self) -> usize {
    self.active_points_count
}

fn vocab_with_postings_len_iter(&self) -> impl Iterator<Item = (&str, usize)> + '_ {
    let hw_counter = HardwareCounterCell::disposable(); // No propagation needed here because this function is only used for building HNSW index.

    self.iter_vocab().filter_map(move |(token, &token_id)| {
        self.postings
            .posting_len(token_id, &hw_counter)
            .map(|posting_len| (token, posting_len))
    })
}
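`vocab_with_postings_len_iter` pairs each vocabulary token with the length of its posting list, silently dropping tokens for which no posting list exists. A hedged sketch of that shape with plain `HashMap`s in place of the real vocab and postings storage:

```rust
use std::collections::HashMap;

/// Illustrative stand-ins for the real vocab and postings storage:
/// yields (token, posting_len) pairs, skipping tokens with no posting list.
fn vocab_with_posting_len<'a>(
    vocab: &'a HashMap<String, u32>,
    posting_lens: &'a HashMap<u32, usize>,
) -> impl Iterator<Item = (&'a str, usize)> + 'a {
    vocab.iter().filter_map(move |(token, token_id)| {
        posting_lens
            .get(token_id)
            .map(|&len| (token.as_str(), len))
    })
}
```

As the comment in the real code notes, this iterator only feeds HNSW index construction, so a disposable hardware counter suffices there.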

fn values_is_empty(&self, point_id: PointOffsetType) -> bool {
    if self.deleted_points.get(point_id as usize).unwrap_or(true) {
        return true;
    }
    self.point_to_tokens_count
        .get(point_id as usize)
        .map(|count| *count == 0)
        // if the point does not exist, it is considered empty
        .unwrap_or(true)
}

fn values_count(&self, point_id: PointOffsetType) -> usize {
    if self.deleted_points.get(point_id as usize).unwrap_or(true) {
        return 0;
    }
    self.point_to_tokens_count
        .get(point_id as usize)
        .copied()
        // if the point does not exist, it is considered empty
        .unwrap_or(0)
}

fn get_token_id(&self, token: &str, hw_counter: &HardwareCounterCell) -> Option<TokenId> {
    if self.is_on_disk {
        hw_counter.payload_index_io_read_counter().incr_delta(
            READ_ENTRY_OVERHEAD + size_of::<TokenId>(), // Avoid check overhead and assume token is always read
        );
    }

    self.vocab
        .get(token)
        .ok()
        .flatten()
        .and_then(<[TokenId]>::first)
        .copied()
}
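The mmap variant of `get_token_id` charges an IO read to the hardware counter before the lookup, but only when the index lives on disk; the in-memory path pays nothing. A self-contained sketch of that accounting pattern (the `READ_ENTRY_OVERHEAD` value and the `IoCounter` type are assumptions standing in for the real constant and `HardwareCounterCell`):

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Hypothetical overhead constant; the real value lives in the qdrant codebase.
const READ_ENTRY_OVERHEAD: usize = 8;

/// Stand-in for HardwareCounterCell's payload-index IO read counter.
struct IoCounter {
    bytes_read: Cell<usize>,
}

/// Charge the IO counter only for on-disk indexes, assuming the token
/// entry is always read (avoids per-hit bookkeeping), then look up the id.
fn get_token_id(
    vocab: &HashMap<String, u32>,
    token: &str,
    is_on_disk: bool,
    counter: &IoCounter,
) -> Option<u32> {
    if is_on_disk {
        let delta = READ_ENTRY_OVERHEAD + std::mem::size_of::<u32>();
        counter.bytes_read.set(counter.bytes_read.get() + delta);
    }
    vocab.get(token).copied()
}
```

Charging unconditionally on the disk path trades a slight overcount on misses for not having to branch on the lookup result.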

From lib/segment/src/index/field_index/full_text_index/inverted_index.rs (the main file):

  • The file defines the TokenSet struct which holds a sorted unique vector of token IDs with methods like contains, has_subset, and conversions from sets or iterators.

  • The Document struct holds tokens in order, allowing duplicates.

  • The ParsedQuery enum currently has a single Tokens(TokenSet) variant, with a check_match method that verifies whether a given token set satisfies the query.

  • The InvertedIndex trait defines methods for managing vocabulary, indexing tokens and documents, removing points, filtering by queries, estimating cardinality, and other utility methods.

  • The trait provides default implementations for register_tokens (mapping string tokens to token IDs, adding new tokens to vocabulary), estimate_cardinality (delegates to estimate_has_subset_cardinality), and payload_blocks (generates payload block conditions for tokens with postings length above a threshold).

  • The trait methods include:

    • get_vocab_mut to get mutable vocabulary map.
    • index_tokens and index_document to index tokens or documents.
    • remove to remove a point.
    • filter to filter points by a parsed query.
    • get_posting_len to get posting list length for a token.
    • estimate_cardinality and estimate_has_subset_cardinality for cardinality estimation.
    • vocab_with_postings_len_iter to iterate over vocabulary tokens with their postings length.
    • check_match to check if a point matches a parsed query.
    • values_is_empty and values_count to check if a point has values and count them.
    • points_count to get total points count.
    • get_token_id to get token ID for a string token.
  • The file also contains extensive tests for mutable, immutable, and mmap inverted indexes, checking consistency and congruence of filtering and indexing.
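The sorted-unique invariant of `TokenSet` is what makes subset checks cheap: both sides can be walked in one linear merge pass. A sketch under that assumption (the merge-style `has_subset` is an illustration; any algorithm respecting the invariant would do):

```rust
/// Sketch of a sorted, deduplicated token set, mirroring the described TokenSet.
struct TokenSet(Vec<u32>);

impl TokenSet {
    /// Build from arbitrary tokens, enforcing the sorted-unique invariant.
    fn from_tokens(tokens: impl IntoIterator<Item = u32>) -> Self {
        let mut v: Vec<u32> = tokens.into_iter().collect();
        v.sort_unstable();
        v.dedup();
        TokenSet(v)
    }

    fn contains(&self, token: u32) -> bool {
        self.0.binary_search(&token).is_ok()
    }

    /// True if every token of `other` also appears in `self`.
    /// Single forward pass: both vectors are sorted, so the shared
    /// iterator never needs to back up.
    fn has_subset(&self, other: &TokenSet) -> bool {
        let mut it = self.0.iter();
        other.0.iter().all(|t| it.any(|s| s == t))
    }
}
```

This is the check that `ParsedQuery::Tokens` matching and `filter_has_subset` ultimately rely on: a point matches when its token set has the query's tokens as a subset.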


These snippets give the full context of the InvertedIndex trait and its mutable, immutable, and mmap-backed implementations: indexing, filtering, removal, and cardinality estimation, plus how token sets and documents are represented and how queries are parsed and matched.

This context is relevant to understanding the code in inverted_index.rs and any changes or fixes mentioned in the user comment regarding PR #6493.

@coszio coszio merged commit 9692614 into dev Jun 3, 2025
17 checks passed
@coszio coszio deleted the mmap-full-text-index-for-phrase branch June 3, 2025 20:16
generall pushed a commit that referenced this pull request Jul 17, 2025
* enumify `.positions` field in immutable and mmap inverted index

* review fixes