
[phrase matching] populate point_to_doc structures #6493

Merged
coszio merged 4 commits into dev from persist-ordered-documents on Jun 10, 2025

Conversation


@coszio coszio commented May 5, 2025

Builds on top of #6602

Adds a phrase_matching bool param to the text index parameterization, which controls how the tokens of each text are stored.
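As a rough sketch of the flag's shape: only the `phrase_matching` name comes from this PR's description; the surrounding struct and the helper below are illustrative, not Qdrant's actual API.

```rust
// Illustrative sketch only: `TextIndexParams` here is a stand-in, and
// `with_positions` mirrors the `== Some(true)` conversion used in the PR.
struct TextIndexParams {
    phrase_matching: Option<bool>,
}

// An absent flag is treated as disabled.
fn with_positions(params: &TextIndexParams) -> bool {
    params.phrase_matching == Some(true)
}
```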


generall commented May 6, 2025

This PR is expected to only affect the mutable index, right?

coszio commented May 6, 2025

> This PR is expected to only affect the mutable index, right?

Mutable and immutable, since the immutable index loads from the db too.

@coszio coszio force-pushed the introduce-ordered-document branch 3 times, most recently from ec4561c to e2452d0 Compare May 28, 2025 19:59
@coszio coszio force-pushed the persist-ordered-documents branch 2 times, most recently from b807adc to 4bfff19 Compare May 30, 2025 17:25
@coszio coszio changed the title [phrase matching] choose persistance format [phrase matching] populate point_to_doc structures May 30, 2025
@coszio coszio changed the base branch from introduce-ordered-document to mmap-full-text-index-for-phrase May 30, 2025 17:30
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch 2 times, most recently from 1728a44 to 6834d87 Compare May 30, 2025 20:27
@coszio coszio force-pushed the persist-ordered-documents branch from 4bfff19 to b9e747c Compare May 30, 2025 20:35
@coszio coszio added this to the Phrase matching milestone May 30, 2025
@coszio coszio requested a review from generall May 30, 2025 20:48
Comment on lines +218 to +240
struct StoredTokens {
tokens: BTreeSet<String>,
}
let doc = StoredTokens { tokens };
Member

Could you check if we can simply use a Vec here? IIRC, the serialized format is compatible. If possible, we probably want an assertion that checks the ordering, just in case.

@coszio coszio Jun 3, 2025

That's correct; I have added a test and cfg(test) attributes for this in 017a70f. We can remove the extra implementations in a cleanup PR.
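The compatibility claim can be illustrated with a toy CBOR encoder: a `BTreeSet` iterates in sorted order, so a sorted, deduplicated `Vec` of the same strings encodes to identical bytes. This is a sketch covering only short arrays of short strings, not qdrant's actual serialization code.

```rust
use std::collections::BTreeSet;

// Toy CBOR encoder for an array (< 24 items) of short text strings
// (< 24 bytes each). CBOR major type 4 is an array, major type 3 a text
// string; for small lengths the length fits in the header byte itself.
fn encode_cbor_strings<'a>(items: impl ExactSizeIterator<Item = &'a str>) -> Vec<u8> {
    let mut out = vec![0x80 | items.len() as u8]; // array header
    for s in items {
        out.push(0x60 | s.len() as u8); // text string header
        out.extend_from_slice(s.as_bytes());
    }
    out
}
```

Because both a `BTreeSet` and a sorted `Vec` yield the same element sequence, the encoder produces byte-identical output for both, which is the property the cfg(test) assertion relies on.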

@timvisee timvisee left a comment

Could you add a congruence test for the phrase index functionality?

Also, maybe it's worth adding a basic unit test that just asserts Document/TokenSet are handled correctly when filtering/checking: in one we use ordering, in the other we don't.

@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch from 6834d87 to e741a8f Compare June 2, 2025 19:13
@coszio coszio force-pushed the persist-ordered-documents branch from b9e747c to 342ac36 Compare June 2, 2025 23:31
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch 2 times, most recently from 3a51585 to d701880 Compare June 3, 2025 19:28
Base automatically changed from mmap-full-text-index-for-phrase to dev June 3, 2025 20:16
@coszio coszio force-pushed the persist-ordered-documents branch from 342ac36 to 9bf8e90 Compare June 3, 2025 20:18
coszio commented Jun 3, 2025

@timvisee

> Could you add a congruence test for the phrase index functionality?

I have now added cases with phrase matching enabled in 9bf8e90.

> Also, maybe it's worth adding a basic unit test that just asserts Document/TokenSet are handled correctly when filtering/checking: in one we use ordering, in the other we don't.

We haven't implemented phrase checking yet, but will definitely include tests for it.

@coszio coszio marked this pull request as ready for review June 3, 2025 20:20
@coszio coszio requested a review from timvisee June 3, 2025 20:20
coderabbitai bot commented Jun 3, 2025

📝 Walkthrough

This set of changes introduces conditional phrase matching support across the full-text index components:

  • The code now distinguishes between ordered documents (for phrase matching) and token sets (for standard indexing), controlled by a `phrase_matching` configuration flag.
  • Serialization and deserialization logic is split between token sets and documents, with improved error handling and a migration to `serde_cbor`.
  • The `MutableInvertedIndex` gains an explicit constructor for enabling or disabling positional information, and related methods are updated to use iterators over tokens rather than sets.
  • Tests are extended and parameterized to cover both phrase matching and non-phrase matching scenarios.
  • Additional minor improvements include documentation comments for traits, simplified iterator usage, and stylistic refactoring.

## Possibly related PRs

- qdrant/qdrant#6486: Introduces the foundational distinction between the ordered `Document` and unique `TokenSet` data structures and updates the inverted index trait and implementations, which this PR builds upon.
- qdrant/qdrant#6602: Introduces phrase matching support in the full-text index by adding positional-information handling, with changes to the mutable and immutable inverted indexes, the mmap indexes, and the related serialization and test code.
- qdrant/qdrant#6528: Adds documentation comments to the `UnsizedValue` trait methods in `lib/posting_list/src/lib.rs`; that trait was introduced together with the compressed posting list implementation in the same module.

## Suggested reviewers

- timvisee

📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between ae56018 and cc5d3d9.

📒 Files selected for processing (2)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (2)
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

216-260: 💡 Verification agent

🧩 Analysis chain

Verify backward compatibility for serialization format change.

The refactoring improves code organization by separating token set and document serialization logic. However, the switch from bincode to serde_cbor is a breaking change that could affect existing stored data.

Please confirm:

  1. Is the serialization format change intentional?
  2. How is backward compatibility handled for existing data serialized with bincode?
  3. Is there a migration strategy for existing installations?

The improved error handling with detailed service error messages is a good enhancement.


🏁 Script executed:

#!/bin/bash
# Description: Check if there are any migration utilities or backward compatibility handling for the serialization format change

# Search for migration code or compatibility handling
rg -A 5 -B 5 "bincode|serde_cbor" --type rust

# Look for version handling or migration utilities
fd "migrat" --type f
fd "compat" --type f

Length of output: 61416


🏁 Script executed:

#!/bin/bash
# Show beginning of segment's compat module
sed -n '1,200p' lib/segment/src/compat.rs

# Search for full_text or text_index mentions
grep -R "text_index" -n lib/segment/src/compat.rs || echo "No text_index-specific compatibility code"

Length of output: 7397


Breaking Change: Full-text Index Serialization Now CBOR Only
There’s no fallback or migration path in lib/segment/src/index/field_index/full_text_index/text_index.rs (lines 216–260) or in lib/segment/src/compat.rs for data previously serialized with bincode. As a result, any existing index files written with bincode will no longer deserialize correctly.

Please add one of the following before merging:

  • A versioned wrapper (e.g. prefixing data with a format version byte) and a fallback to bincode::deserialize in the deserialize_* functions to maintain compatibility.
  • A one-time migration utility that reads old bincode blobs, converts to the new CBOR format, and rewrites storage.
  • Documentation/upgrade guide clearly stating that all existing indices must be rebuilt.

Fix Required: ensure backward compatibility or provide a migration strategy for stored full-text index data.

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (1)

36-90: 🛠️ Refactor suggestion

Comprehensive build_index implementation with proper structure management.

The method correctly handles both phrase matching and non-phrase matching modes. The resizing logic for both point_to_tokens and point_to_doc structures is properly synchronized.

However, there's a potential issue with the resizing condition on line 48:

    if index.point_to_tokens.len() <= idx as usize {
        index
            .point_to_tokens
            .resize_with(idx as usize + 1, Default::default);

        if let Some(point_to_doc) = &mut index.point_to_doc {
            point_to_doc.resize_with(idx as usize + 1, Default::default);
        }
    }

The logic is correct, but consider adding bounds checking for very large idx values to prevent excessive memory allocation.
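The synchronized resize discussed above can be sketched as a self-contained function. Types are simplified (token ids as plain u32) and `ensure_point_capacity` is an illustrative name, not the PR's actual code.

```rust
// Grow both structures to cover point id `idx`, keeping the optional
// document storage the same length as the token storage so a point id
// indexes both consistently.
fn ensure_point_capacity(
    point_to_tokens: &mut Vec<Vec<u32>>,
    point_to_doc: &mut Option<Vec<Vec<u32>>>,
    idx: u32,
) {
    if point_to_tokens.len() <= idx as usize {
        point_to_tokens.resize_with(idx as usize + 1, Default::default);
        if let Some(point_to_doc) = point_to_doc {
            point_to_doc.resize_with(idx as usize + 1, Default::default);
        }
    }
}
```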

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)

318-349: Clarify the dual indexing logic and consider renaming the function.

The conditional indexing logic raises a few questions:

  1. Dual indexing: When with_positions is true, both index_document and index_tokens are called. Is this intentional? It seems redundant to index the same data in two different formats.

  2. Function naming: The function name mutable_inverted_index doesn't reflect that it now supports phrase matching modes. Consider renaming to mutable_inverted_index_with_options or similar.

Could you clarify if the dual indexing is necessary, or should it be either document OR token set indexing based on the with_positions flag?

-            if with_positions {
-                index
-                    .index_document(idx, Document(token_ids.clone()), &hw_counter)
-                    .unwrap();
-            }
-            let token_set = TokenSet::from_iter(token_ids);
-            index.index_tokens(idx, token_set, &hw_counter).unwrap();
+            if with_positions {
+                index
+                    .index_document(idx, Document(token_ids), &hw_counter)
+                    .unwrap();
+            } else {
+                let token_set = TokenSet::from_iter(token_ids);
+                index.index_tokens(idx, token_set, &hw_counter).unwrap();
+            }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9692614 and 9bf8e90.

📒 Files selected for processing (10)
  • lib/posting_list/src/lib.rs (1 hunks)
  • lib/posting_list/src/value_handler.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/posting_list/src/lib.rs (2)
lib/segment/src/index/field_index/full_text_index/positions.rs (2)
  • write_len (19-21)
  • write_to (23-28)
lib/posting_list/src/tests.rs (2)
  • write_len (18-20)
  • write_to (22-24)
🔇 Additional comments (25)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1)

105-111: Good refactoring from if-else to match.

The refactor to use match instead of if-else for the binary has_positions condition improves readability and is more idiomatic Rust. The logic remains functionally equivalent while supporting the conditional phrase matching feature.

lib/segment/src/index/field_index/full_text_index/positions.rs (1)

20-20: Improved byte length calculation.

The change to use self.0.as_bytes().len() directly aligns with the UnsizedValue trait's documented behavior of returning "the length of the serialized value in bytes." This is more accurate and eliminates potential arithmetic errors from manual calculation.

lib/posting_list/src/lib.rs (1)

25-32: Excellent documentation improvements.

The added documentation comments clearly explain the purpose and expected behavior of each UnsizedValue trait method. This improves code clarity and helps implementers understand the contract, which is particularly valuable given the broader serialization changes in this PR.

lib/posting_list/src/value_handler.rs (1)

112-112: Good simplification of iterator construction.

The change to use last_offset.map(|&last| ...) directly is more idiomatic and concise than the previous approach. It maintains the same functionality while reducing verbosity in the iterator chain construction.

lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (4)

4-4: LGTM: Clean import addition for Either type

The itertools::Either import is appropriately added to support the conditional iterator logic introduced in the phrase matching implementation.


82-95: LGTM: Correct conditional deserialization logic

The implementation correctly handles phrase matching by:

  • Extracting the phrase_matching flag from the configuration
  • Conditionally deserializing as documents (preserving order) vs token sets
  • Using Either to unify the iterator types for downstream processing
  • Passing the flag to MutableInvertedIndex::build_index

This aligns well with the PR objectives of supporting ordered token storage for phrase matching.
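The Either-based unification described above can be sketched with a minimal stand-in for `itertools::Either`; `load_tokens` and its element types are illustrative, not the PR's code.

```rust
use std::collections::BTreeSet;

// Minimal stand-in for itertools::Either: one concrete type that can wrap
// either of two iterator types yielding the same item.
enum Either<L, R> {
    Left(L),
    Right(R),
}

impl<L, R, T> Iterator for Either<L, R>
where
    L: Iterator<Item = T>,
    R: Iterator<Item = T>,
{
    type Item = T;
    fn next(&mut self) -> Option<T> {
        match self {
            Either::Left(l) => l.next(),
            Either::Right(r) => r.next(),
        }
    }
}

// Sketch of the conditional loading: the Document path keeps insertion
// order (and duplicates), the TokenSet path is sorted and deduplicated,
// yet both paths expose the same iterator type downstream.
fn load_tokens(phrase_matching: bool, raw: Vec<String>) -> impl Iterator<Item = String> {
    if phrase_matching {
        Either::Left(raw.into_iter())
    } else {
        let set: BTreeSet<String> = raw.into_iter().collect();
        Either::Right(set.into_iter())
    }
}
```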


4-4: Good addition of Either import for conditional deserialization.

The import is appropriately used to handle the conditional deserialization logic between documents and token sets.


82-95:

Details

✅ Verification successful

Well-implemented conditional loading logic for phrase matching.

The implementation correctly switches between document and token set deserialization based on the phrase_matching flag. The use of Either provides a clean abstraction for handling both iterator types uniformly.

However, ensure that the deserialization methods handle errors appropriately and that the iterator behavior is consistent between both paths.

Let me verify that the deserialization methods exist and handle errors properly:


🏁 Script executed:

#!/bin/bash
# Description: Verify that deserialize_document and deserialize_token_set methods exist and handle errors

# Search for the deserialization method implementations
ast-grep --pattern 'fn deserialize_document($$$) -> $_ {
  $$$
}'

ast-grep --pattern 'fn deserialize_token_set($$$) -> $_ {
  $$$
}'

Length of output: 136


Let's search for the actual definitions with ripgrep to confirm their signatures and error handling:


🏁 Script executed:

#!/bin/bash
set -e
# Find deserialize_document definition and its return type/context
rg -n "fn deserialize_document" -C3 .

# Find deserialize_token_set definition and its return type/context
rg -n "fn deserialize_token_set" -C3 .

Length of output: 1478


Verification Complete: Error Handling and Iterator Consistency Confirmed

Both deserialize_document and deserialize_token_set are defined in text_index.rs and return OperationResult, so the ? operator in the iterator closure will correctly propagate any deserialization errors. Each branch (Vec<String> vs. BTreeSet<String>) is converted into an iterator over String, ensuring uniform behavior when passed to MutableInvertedIndex::build_index. No further changes needed.

lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (4)

91-94: LGTM: Correct initialization with phrase matching support

The constructor correctly extracts the with_positions flag from the phrase matching configuration and initializes the MutableInvertedIndex accordingly. This ensures that positional information is enabled when phrase matching is requested.


129-129:

Details

❓ Verification inconclusive

Verify register_tokens interface consistency

The change from passing an iterator to passing a slice reference (&str_tokens) looks intentional but should be verified for consistency with the updated register_tokens method signature across the codebase.


🏁 Script executed:

#!/bin/bash
# Description: Verify register_tokens method signatures across the codebase
# Expect: Consistent method signatures for register_tokens

echo "Searching for register_tokens method definitions:"
ast-grep --pattern 'fn register_tokens($$$) { $$$ }'

echo -e "\nSearching for register_tokens method calls:"
rg -A 2 "\.register_tokens\(" 

Length of output: 1659


🏁 Script executed:

#!/usr/bin/env bash
set -e

echo "== Definition in mutable_inverted_index.rs =="
rg -C3 "fn register_tokens" -n lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs

echo -e "\n== Definition in inverted_index.rs =="
rg -C3 "fn register_tokens" -n lib/segment/src/index/field_index/full_text_index/inverted_index.rs

Length of output: 218


Verify register_tokens interface consistency

The switch from passing an iterator to passing a slice reference needs to align with the method’s signature in both MutableInvertedIndex and InvertedIndex. Please confirm that register_tokens is declared to accept the new argument type and update call sites as needed:

• lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs:129
let tokens = self.mutable_index.register_tokens(&str_tokens);

• lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs
let tokens = index.register_tokens(str_tokens_iter);

• lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
let tokens = self.inverted_index.register_tokens(&str_tokens);

• lib/segment/src/index/field_index/full_text_index/inverted_index.rs
let token_ids = index.register_tokens(&tokens);

Double-check the signature definitions and harmonize all usages (either always an iterator or always a slice).


91-94: Clean implementation of phrase matching flag handling.

The conversion from Option<bool> to bool using == Some(true) is correct and the flag is properly passed to the MutableInvertedIndex constructor.


129-129:

Details

✅ Verification successful

Verify the register_tokens signature change.

The method call has been simplified to pass &str_tokens directly. Ensure this aligns with the updated method signature in the InvertedIndex trait.


🏁 Script executed:

#!/bin/bash
# Description: Verify the register_tokens method signature in the InvertedIndex trait

# Search for register_tokens method definition
ast-grep --pattern 'fn register_tokens($$$) $$$'

Length of output: 50


🏁 Script executed:

#!/bin/bash
# Search for the InvertedIndex trait definition
rg "trait InvertedIndex" -n -A5 -B5

# Search for any register_tokens method definitions
rg "fn register_tokens" -n -A3 -B3

Length of output: 2764


#!/bin/bash
# Display context around the call site in mmap_text_index.rs
sed -n '120,160p' lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs

# Show declaration of `str_tokens`
rg 'str_tokens' -n -A3 -B3 lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs

register_tokens signature aligns with trait
The InvertedIndex::register_tokens<'a>(&mut self, str_tokens: impl IntoIterator<Item = impl AsRef<str>> + 'a) definition in full_text_index/inverted_index.rs accepts any iterator over items that implement AsRef<str>. Passing &str_tokens (e.g. a &Vec<String> or &[&str]) produces an IntoIterator<Item=&String> (or &str) where each item implements AsRef<str>. No adjustments are needed.
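The flexibility of the quoted signature can be demonstrated with a stripped-down version; the body here is a stand-in (mapping tokens to their lengths instead of registering token ids).

```rust
// Sketch of the trait signature discussed above: any iterator of
// string-like items is accepted, so both `&Vec<String>` (Item = &String)
// and an array of `&str` satisfy the bound.
fn register_tokens(str_tokens: impl IntoIterator<Item = impl AsRef<str>>) -> Vec<usize> {
    str_tokens.into_iter().map(|t| t.as_ref().len()).collect()
}
```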

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (6)

1-1: LGTM: Required import for HashMap usage

The std::collections::HashMap import is correctly added to support the explicit HashMap usage in the constructor.


25-34: LGTM: Improved constructor design

The explicit new(with_positions: bool) constructor is a significant improvement over relying on Default. It provides:

  • Clear intent about positional indexing capabilities
  • Explicit control over the point_to_doc structure initialization
  • Better API design for the phrase matching feature

This is a cleaner and more maintainable approach.
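A minimal sketch of that constructor shape, with field element types simplified relative to the real index:

```rust
struct MutableInvertedIndex {
    point_to_tokens: Vec<Vec<u32>>,
    // Present only when positional information (phrase matching) is enabled.
    point_to_doc: Option<Vec<Vec<u32>>>,
}

impl MutableInvertedIndex {
    fn new(with_positions: bool) -> Self {
        Self {
            point_to_tokens: Vec::new(),
            // `then_some` yields Some(Vec::new()) only for `true`.
            point_to_doc: with_positions.then_some(Vec::new()),
        }
    }
}
```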


36-90: LGTM: Enhanced build_index method with phrase matching support

The updated build_index method correctly:

  • Accepts an iterator over tokens and a phrase matching flag
  • Uses the new constructor with proper positional indexing configuration
  • Implements improved resizing logic for both point_to_tokens and point_to_doc structures
  • Conditionally handles document storage based on phrase matching requirements
  • Maintains backward compatibility with token set storage

The implementation is well-structured and supports the phrase matching feature requirements.


187-187: LGTM: Improved boundary checking and conditional handling

The changes improve the implementation by:

  • Using >= instead of <= for proper boundary checking (line 187)
  • Adding proper conditional handling for point_to_doc cleanup during removal (lines 210-212)

These changes enhance robustness and ensure consistent state management for both indexing modes.

Also applies to: 202-212


25-34: Well-designed constructor with explicit positional control.

The new constructor with with_positions parameter provides clear control over whether positional information should be stored. The conditional initialization of point_to_doc using then_some() is clean and efficient.


202-212: Proper cleanup in remove method.

The remove method correctly handles cleanup of both point_to_tokens and point_to_doc structures when a point is removed. The bounds checking prevents out-of-bounds access.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (2)

117-139: LGTM! Improved API flexibility and addressed past feedback.

The changes enhance the trait's usability by:

  • Accepting any iterator of string-like items instead of just string slices
  • Adding the requested register_token method that reflects the mutable nature
  • Maintaining backward compatibility through the existing register_tokens method

The implementation correctly maps tokens through the new register_token helper method.


352-352: Excellent test coverage improvement!

Parameterizing the tests with phrase_matching ensures both indexing modes are properly tested. This follows good testing practices and will help catch regressions in either configuration.

Also applies to: 394-394, 463-463

lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

562-571: Well-implemented test parameterization for phrase matching.

The changes properly:

  • Add phrase_matching parameter to the builder creation function
  • Pass the setting through to TextIndexParams configuration
  • Parameterize the congruence test to validate both phrase matching modes

This ensures comprehensive test coverage across different indexing configurations.

Also applies to: 679-682

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (4)

21-23: Proper initialization with phrase matching support.

The constructor correctly converts the phrase_matching configuration to a boolean flag and passes it to the MutableInvertedIndex constructor.


38-52: Consistent conditional loading logic.

The deserialization logic matches the implementation in immutable_text_index.rs, ensuring consistency across mutable and immutable indexes. The use of Either maintains a uniform interface for both token types.


79-83: Conditional document indexing for phrase matching.

The logic correctly indexes documents with positional information only when phrase matching is enabled. This preserves token order for phrase queries while avoiding unnecessary overhead for standard token matching.


91-95:

Details

✅ Verification successful

Conditional serialization based on phrase matching mode.

The serialization logic appropriately switches between document and token set serialization. However, verify that the serialization methods handle empty collections and edge cases properly.


🏁 Script executed:

#!/bin/bash
# Description: Verify serialization methods handle edge cases

# Search for serialize_document and serialize_token_set implementations
ast-grep --pattern 'fn serialize_document($$$) -> $_ {
  $$$
}'

ast-grep --pattern 'fn serialize_token_set($$$) -> $_ {
  $$$
}'

Length of output: 132


🏁 Script executed:

#!/bin/bash
# Re-verify serialize_document and serialize_token_set implementations

# Search for any occurrences (definitions or references) across the segment crate
grep -R "serialize_document" -n lib/segment/src
grep -R "serialize_token_set" -n lib/segment/src

Length of output: 1675


🏁 Script executed:

#!/bin/bash
# Display serialize_token_set and serialize_document implementations with context
sed -n '200,300p' lib/segment/src/index/field_index/full_text_index/text_index.rs

Length of output: 3853


Serialization edge cases verified

The serialize_document and serialize_token_set methods simply wrap their token collections in a small #[derive(Serialize)] struct and hand off to serde_cbor. In practice, CBOR serialization handles empty Vec<String> and empty BTreeSet<String> out of the box, and both methods map any serialization errors into OperationError::service_error. No additional edge‐case handling is required here. Great to merge as is.

@coszio coszio force-pushed the persist-ordered-documents branch 2 times, most recently from 017a70f to 7a1af3b Compare June 4, 2025 14:54
@coszio coszio force-pushed the persist-ordered-documents branch from 7a1af3b to 2b9d2b4 Compare June 4, 2025 15:57
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/text_index.rs (3)

217-231: Consider adding a formal TODO comment for cleanup tracking.

The function is correctly marked as temporary with detailed comments, but lacks a formal TODO comment that can be tracked.

Consider adding a TODO comment for better tracking:

+ /// TODO: Remove this function after cleanup PR following https://github.com/qdrant/qdrant/pull/6493
  /// CBOR representation is the same for BTreeSet<String> and Vec<String> if the elements are sorted, thus, we can resort to
  /// the vec implementation always. Let's just keep this to prove this works fine during https://github.com/qdrant/qdrant/pull/6493
  ///
  /// We can remove this afterwards

233-248: Consider adding a formal TODO comment for cleanup tracking.

Similar to the serialization function, this temporary deserialization function would benefit from a formal TODO comment for tracking.

Consider adding a TODO comment:

+ /// TODO: Remove this function after cleanup PR following https://github.com/qdrant/qdrant/pull/6493
  /// CBOR representation is the same for BTreeSet<String> and Vec<String> if the elements are sorted, thus, we can resort to
  /// the vec implementation always. Let's just keep this to prove this works fine during https://github.com/qdrant/qdrant/pull/6493

833-874: Remove debug print statements or use proper test logging.

The test correctly validates CBOR serialization compatibility between BTreeSet and Vec, but includes unnecessary eprintln! statements that will always print during test runs.

Consider removing the debug print statements or replacing them with proper test logging:

- eprintln!(
-     "Serialized set: {:?}",
-     serialized_set
-         .iter()
-         .map(|&b| b as char)
-         .collect::<String>()
- );
- eprintln!(
-     "Serialized vec: {:?}",
-     serialized_vec
-         .iter()
-         .map(|&b| b as char)
-         .collect::<String>()
- );

Or use conditional debug logging:

+ if std::env::var("RUST_LOG").is_ok() {
+     eprintln!("Serialized set: {:?}", /* ... */);
+     eprintln!("Serialized vec: {:?}", /* ... */);
+ }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a1af3b and 2b9d2b4.

📒 Files selected for processing (10)
  • lib/posting_list/src/lib.rs (1 hunks)
  • lib/posting_list/src/value_handler.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • lib/segment/src/index/field_index/full_text_index/positions.rs
  • lib/posting_list/src/value_handler.rs
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
  • lib/posting_list/src/lib.rs
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: storage-compat-test
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: lint
  • GitHub Check: integration-tests-consensus
🔇 Additional comments (4)
lib/segment/src/index/field_index/full_text_index/text_index.rs (4)

1-1: LGTM: Proper conditional compilation for test-only import.

The #[cfg(test)] attribute correctly restricts the BTreeSet import to test compilation only, which aligns with its usage in the temporary test functions.


250-271: LGTM: Correct transition from BTreeSet to Vec for phrase matching.

The change from BTreeSet<String> to Vec<String> in the document serialization functions is appropriate for supporting phrase matching where token order matters. The error handling and structure remain consistent.
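The reason order matters can be shown in a few lines: a BTreeSet of tokens discards both duplicates and original positions, so two texts with the same token multiset become indistinguishable, while a Vec keeps the sequence phrase matching needs. A stdlib-only sketch (the whitespace tokenizer is a stand-in for the index's real tokenizer):

```rust
use std::collections::BTreeSet;

/// Tokenizes on whitespace, preserving order and duplicates — the information
/// phrase matching needs to reconstruct token positions.
fn tokenize_ordered(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

fn main() {
    let ordered = tokenize_ordered("to be or not to be");
    // A BTreeSet of the same tokens loses both order and repetitions,
    // so "to be" and "be to" would serialize identically.
    let set: BTreeSet<String> = ordered.iter().cloned().collect();

    assert_eq!(ordered.len(), 6); // all six tokens, in document order
    assert_eq!(set.len(), 4); // "to" and "be" deduplicated
}
```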


573-582: LGTM: Proper test infrastructure for phrase matching support.

The addition of the phrase_matching parameter to the test builder function correctly configures the TextIndexParams to test both phrase matching and non-phrase matching scenarios.


690-693: LGTM: Comprehensive parameterized testing.

The test function is properly parameterized to cover all combinations of deleted and phrase_matching flags, ensuring thorough test coverage for the new functionality.
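The exhaustive-combination pattern praised here can be sketched in plain Rust: loop over both boolean flags so every (deleted, phrase_matching) pair is exercised. The check function below is a hypothetical stand-in for the index round-trip the real test performs:

```rust
/// Hypothetical stand-in for the index round-trip under test. The real test
/// builds the text index with the given flags and compares query results.
fn index_roundtrip_ok(deleted: bool, phrase_matching: bool) -> bool {
    let _ = (deleted, phrase_matching);
    true
}

fn main() {
    // Cover all four (deleted, phrase_matching) combinations.
    for deleted in [false, true] {
        for phrase_matching in [false, true] {
            assert!(index_roundtrip_ok(deleted, phrase_matching));
        }
    }
}
```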

@coszio coszio force-pushed the persist-ordered-documents branch from 2b9d2b4 to ae56018 Compare June 5, 2025 17:44
Member

@timvisee timvisee left a comment


Nice!

No significant remarks, only nits.

@coszio coszio merged commit 052ce9b into dev Jun 10, 2025
17 checks passed
@coszio coszio deleted the persist-ordered-documents branch June 10, 2025 17:07
generall pushed a commit that referenced this pull request Jul 17, 2025
* populate point_to_doc

* add phrase_matching=true case to tests

* CBOR ordered Vec and BTreeSet have same representation

* review nits