
[phrase matching] populate point_to_doc structures #6493

Merged
coszio merged 4 commits into dev from persist-ordered-documents on Jun 10, 2025

Conversation


@coszio coszio commented May 5, 2025

Builds on top of #6602

Adds a phrase_matching bool param to the text index parameterization, which controls how the tokens of each text are stored.
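As a rough sketch of the flag's shape: only the `phrase_matching` name comes from this PR's description; the surrounding struct and the helper below are illustrative, not Qdrant's actual API.

```rust
// Illustrative sketch only: `TextIndexParams` here is a stand-in, and
// `with_positions` mirrors the `== Some(true)` conversion used in the PR.
struct TextIndexParams {
    phrase_matching: Option<bool>,
}

// An absent flag is treated as disabled.
fn with_positions(params: &TextIndexParams) -> bool {
    params.phrase_matching == Some(true)
}
```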


generall commented May 6, 2025

This PR is expected to only affect the mutable index, right?

coszio commented May 6, 2025

> This PR is expected to only affect the mutable index, right?

Mutable and immutable, since the immutable index loads from the db too.

@coszio coszio force-pushed the introduce-ordered-document branch 3 times, most recently from ec4561c to e2452d0 Compare May 28, 2025 19:59
@coszio coszio force-pushed the persist-ordered-documents branch 2 times, most recently from b807adc to 4bfff19 Compare May 30, 2025 17:25
@coszio coszio changed the title [phrase matching] choose persistance format [phrase matching] populate point_to_doc structures May 30, 2025
@coszio coszio changed the base branch from introduce-ordered-document to mmap-full-text-index-for-phrase May 30, 2025 17:30
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch 2 times, most recently from 1728a44 to 6834d87 Compare May 30, 2025 20:27
@coszio coszio force-pushed the persist-ordered-documents branch from 4bfff19 to b9e747c Compare May 30, 2025 20:35
@coszio coszio added this to the Phrase matching milestone May 30, 2025
@coszio coszio requested a review from generall May 30, 2025 20:48
Comment on lines +218 to +240
struct StoredTokens {
tokens: BTreeSet<String>,
}
let doc = StoredTokens { tokens };
Member

Could you check if we can simply use a Vec here? IIRC, the serialized format is compatible. If possible, we probably want an assertion that checks the ordering, just in case.

@coszio coszio Jun 3, 2025

That's correct; I have added a test and cfg(test) attributes for this in 017a70f. We can remove the extra implementations in a cleanup PR.
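The compatibility claim can be illustrated with a toy CBOR encoder: a `BTreeSet` iterates in sorted order, so a sorted, deduplicated `Vec` of the same strings encodes to identical bytes. This is a sketch covering only short arrays of short strings, not qdrant's actual serialization code.

```rust
use std::collections::BTreeSet;

// Toy CBOR encoder for an array (< 24 items) of short text strings
// (< 24 bytes each). CBOR major type 4 is an array, major type 3 a text
// string; for small lengths the length fits in the header byte itself.
fn encode_cbor_strings<'a>(items: impl ExactSizeIterator<Item = &'a str>) -> Vec<u8> {
    let mut out = vec![0x80 | items.len() as u8]; // array header
    for s in items {
        out.push(0x60 | s.len() as u8); // text string header
        out.extend_from_slice(s.as_bytes());
    }
    out
}
```

Because both a `BTreeSet` and a sorted `Vec` yield the same element sequence, the encoder produces byte-identical output for both, which is the property the cfg(test) assertion relies on.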

@timvisee timvisee left a comment

Could you add a congruence test for the phrase index functionality?

Also, maybe it's worth adding a basic unit test that just asserts Document/TokenSet are handled correctly when filtering/checking: in one we use ordering, in the other we don't.

@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch from 6834d87 to e741a8f Compare June 2, 2025 19:13
@coszio coszio force-pushed the persist-ordered-documents branch from b9e747c to 342ac36 Compare June 2, 2025 23:31
@coszio coszio force-pushed the mmap-full-text-index-for-phrase branch 2 times, most recently from 3a51585 to d701880 Compare June 3, 2025 19:28
Base automatically changed from mmap-full-text-index-for-phrase to dev June 3, 2025 20:16
@coszio coszio force-pushed the persist-ordered-documents branch from 342ac36 to 9bf8e90 Compare June 3, 2025 20:18
coszio commented Jun 3, 2025

@timvisee

> Could you add a congruence test for the phrase index functionality?

I have now added cases with phrase matching enabled in 9bf8e90.

> Also, maybe it's worth adding a basic unit test that just asserts Document/TokenSet are handled correctly when filtering/checking: in one we use ordering, in the other we don't.

We haven't implemented phrase checking yet, but will definitely include tests for it.

@coszio coszio marked this pull request as ready for review June 3, 2025 20:20
@coszio coszio requested a review from timvisee June 3, 2025 20:20
coderabbitai bot commented Jun 3, 2025

📝 Walkthrough

This set of changes introduces conditional phrase matching support across the full-text index components:

  • The code now distinguishes between ordered documents (for phrase matching) and token sets (for standard indexing), controlled by a `phrase_matching` configuration flag.
  • Serialization and deserialization logic is split between token sets and documents, with improved error handling and a migration to `serde_cbor`.
  • The `MutableInvertedIndex` gains an explicit constructor for enabling or disabling positional information, and related methods are updated to use iterators over tokens rather than sets.
  • Tests are extended and parameterized to cover both phrase matching and non-phrase matching scenarios.
  • Additional minor improvements include documentation comments for traits, simplified iterator usage, and stylistic refactoring.

## Possibly related PRs

- qdrant/qdrant#6486: Introduces the foundational distinction between the ordered `Document` and unique `TokenSet` data structures and updates the inverted index trait and implementations, which this PR builds upon.
- qdrant/qdrant#6602: Introduces phrase matching support in the full-text index by adding positional-information handling, with changes to the mutable and immutable inverted indexes, the mmap indexes, and the related serialization and test code.
- qdrant/qdrant#6528: Adds documentation comments to the `UnsizedValue` trait methods in `lib/posting_list/src/lib.rs`; that trait was introduced together with the compressed posting list implementation in the same module.

## Suggested reviewers

- timvisee

📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between ae56018 and cc5d3d9.

📒 Files selected for processing (2)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (2)
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

216-260: 💡 Verification agent

🧩 Analysis chain

Verify backward compatibility for serialization format change.

The refactoring improves code organization by separating token set and document serialization logic. However, the switch from bincode to serde_cbor is a breaking change that could affect existing stored data.

Please confirm:

  1. Is the serialization format change intentional?
  2. How is backward compatibility handled for existing data serialized with bincode?
  3. Is there a migration strategy for existing installations?

The improved error handling with detailed service error messages is a good enhancement.


🏁 Script executed:

#!/bin/bash
# Description: Check if there are any migration utilities or backward compatibility handling for the serialization format change

# Search for migration code or compatibility handling
rg -A 5 -B 5 "bincode|serde_cbor" --type rust

# Look for version handling or migration utilities
fd "migrat" --type f
fd "compat" --type f

Length of output: 61416


🏁 Script executed:

#!/bin/bash
# Show beginning of segment's compat module
sed -n '1,200p' lib/segment/src/compat.rs

# Search for full_text or text_index mentions
grep -R "text_index" -n lib/segment/src/compat.rs || echo "No text_index-specific compatibility code"

Length of output: 7397


Breaking Change: Full-text Index Serialization Now CBOR Only
There’s no fallback or migration path in lib/segment/src/index/field_index/full_text_index/text_index.rs (lines 216–260) or in lib/segment/src/compat.rs for data previously serialized with bincode. As a result, any existing index files written with bincode will no longer deserialize correctly.

Please add one of the following before merging:

  • A versioned wrapper (e.g. prefixing data with a format version byte) and a fallback to bincode::deserialize in the deserialize_* functions to maintain compatibility.
  • A one-time migration utility that reads old bincode blobs, converts to the new CBOR format, and rewrites storage.
  • Documentation/upgrade guide clearly stating that all existing indices must be rebuilt.

Fix Required: ensure backward compatibility or provide a migration strategy for stored full-text index data.

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (1)

36-90: 🛠️ Refactor suggestion

Comprehensive build_index implementation with proper structure management.

The method correctly handles both phrase matching and non-phrase matching modes. The resizing logic for both point_to_tokens and point_to_doc structures is properly synchronized.

However, there's a potential issue with the resizing condition on line 48:

    if index.point_to_tokens.len() <= idx as usize {
        index
            .point_to_tokens
            .resize_with(idx as usize + 1, Default::default);

        if let Some(point_to_doc) = &mut index.point_to_doc {
            point_to_doc.resize_with(idx as usize + 1, Default::default);
        }
    }

The logic is correct, but consider adding bounds checking for very large idx values to prevent excessive memory allocation.
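The synchronized resize discussed above can be sketched as a self-contained function. Types are simplified (token ids as plain u32) and `ensure_point_capacity` is an illustrative name, not the PR's actual code.

```rust
// Grow both structures to cover point id `idx`, keeping the optional
// document storage the same length as the token storage so a point id
// indexes both consistently.
fn ensure_point_capacity(
    point_to_tokens: &mut Vec<Vec<u32>>,
    point_to_doc: &mut Option<Vec<Vec<u32>>>,
    idx: u32,
) {
    if point_to_tokens.len() <= idx as usize {
        point_to_tokens.resize_with(idx as usize + 1, Default::default);
        if let Some(point_to_doc) = point_to_doc {
            point_to_doc.resize_with(idx as usize + 1, Default::default);
        }
    }
}
```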

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)

318-349: Clarify the dual indexing logic and consider renaming the function.

The conditional indexing logic raises a few questions:

  1. Dual indexing: When with_positions is true, both index_document and index_tokens are called. Is this intentional? It seems redundant to index the same data in two different formats.

  2. Function naming: The function name mutable_inverted_index doesn't reflect that it now supports phrase matching modes. Consider renaming to mutable_inverted_index_with_options or similar.

Could you clarify if the dual indexing is necessary, or should it be either document OR token set indexing based on the with_positions flag?

-            if with_positions {
-                index
-                    .index_document(idx, Document(token_ids.clone()), &hw_counter)
-                    .unwrap();
-            }
-            let token_set = TokenSet::from_iter(token_ids);
-            index.index_tokens(idx, token_set, &hw_counter).unwrap();
+            if with_positions {
+                index
+                    .index_document(idx, Document(token_ids), &hw_counter)
+                    .unwrap();
+            } else {
+                let token_set = TokenSet::from_iter(token_ids);
+                index.index_tokens(idx, token_set, &hw_counter).unwrap();
+            }
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9692614 and 9bf8e90.

📒 Files selected for processing (10)
  • lib/posting_list/src/lib.rs (1 hunks)
  • lib/posting_list/src/value_handler.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/posting_list/src/lib.rs (2)
lib/segment/src/index/field_index/full_text_index/positions.rs (2)
  • write_len (19-21)
  • write_to (23-28)
lib/posting_list/src/tests.rs (2)
  • write_len (18-20)
  • write_to (22-24)
🔇 Additional comments (25)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1)

105-111: Good refactoring from if-else to match.

The refactor to use match instead of if-else for the binary has_positions condition improves readability and is more idiomatic Rust. The logic remains functionally equivalent while supporting the conditional phrase matching feature.

lib/segment/src/index/field_index/full_text_index/positions.rs (1)

20-20: Improved byte length calculation.

The change to use self.0.as_bytes().len() directly aligns with the UnsizedValue trait's documented behavior of returning "the length of the serialized value in bytes." This is more accurate and eliminates potential arithmetic errors from manual calculation.

lib/posting_list/src/lib.rs (1)

25-32: Excellent documentation improvements.

The added documentation comments clearly explain the purpose and expected behavior of each UnsizedValue trait method. This improves code clarity and helps implementers understand the contract, which is particularly valuable given the broader serialization changes in this PR.

lib/posting_list/src/value_handler.rs (1)

112-112: Good simplification of iterator construction.

The change to use last_offset.map(|&last| ...) directly is more idiomatic and concise than the previous approach. It maintains the same functionality while reducing verbosity in the iterator chain construction.

lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (4)

4-4: LGTM: Clean import addition for Either type

The itertools::Either import is appropriately added to support the conditional iterator logic introduced in the phrase matching implementation.


82-95: LGTM: Correct conditional deserialization logic

The implementation correctly handles phrase matching by:

  • Extracting the phrase_matching flag from the configuration
  • Conditionally deserializing as documents (preserving order) vs token sets
  • Using Either to unify the iterator types for downstream processing
  • Passing the flag to MutableInvertedIndex::build_index

This aligns well with the PR objectives of supporting ordered token storage for phrase matching.
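The Either-based unification described above can be sketched with a minimal stand-in for `itertools::Either`; `load_tokens` and its element types are illustrative, not the PR's code.

```rust
use std::collections::BTreeSet;

// Minimal stand-in for itertools::Either: one concrete type that can wrap
// either of two iterator types yielding the same item.
enum Either<L, R> {
    Left(L),
    Right(R),
}

impl<L, R, T> Iterator for Either<L, R>
where
    L: Iterator<Item = T>,
    R: Iterator<Item = T>,
{
    type Item = T;
    fn next(&mut self) -> Option<T> {
        match self {
            Either::Left(l) => l.next(),
            Either::Right(r) => r.next(),
        }
    }
}

// Sketch of the conditional loading: the Document path keeps insertion
// order (and duplicates), the TokenSet path is sorted and deduplicated,
// yet both paths expose the same iterator type downstream.
fn load_tokens(phrase_matching: bool, raw: Vec<String>) -> impl Iterator<Item = String> {
    if phrase_matching {
        Either::Left(raw.into_iter())
    } else {
        let set: BTreeSet<String> = raw.into_iter().collect();
        Either::Right(set.into_iter())
    }
}
```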


4-4: Good addition of Either import for conditional deserialization.

The import is appropriately used to handle the conditional deserialization logic between documents and token sets.


82-95:

Details

✅ Verification successful

Well-implemented conditional loading logic for phrase matching.

The implementation correctly switches between document and token set deserialization based on the phrase_matching flag. The use of Either provides a clean abstraction for handling both iterator types uniformly.

However, ensure that the deserialization methods handle errors appropriately and that the iterator behavior is consistent between both paths.

Let me verify that the deserialization methods exist and handle errors properly:


🏁 Script executed:

#!/bin/bash
# Description: Verify that deserialize_document and deserialize_token_set methods exist and handle errors

# Search for the deserialization method implementations
ast-grep --pattern 'fn deserialize_document($$$) -> $_ {
  $$$
}'

ast-grep --pattern 'fn deserialize_token_set($$$) -> $_ {
  $$$
}'

Length of output: 136


Let's search for the actual definitions with ripgrep to confirm their signatures and error handling:


🏁 Script executed:

#!/bin/bash
set -e
# Find deserialize_document definition and its return type/context
rg -n "fn deserialize_document" -C3 .

# Find deserialize_token_set definition and its return type/context
rg -n "fn deserialize_token_set" -C3 .

Length of output: 1478


Verification Complete: Error Handling and Iterator Consistency Confirmed

Both deserialize_document and deserialize_token_set are defined in text_index.rs and return OperationResult, so the ? operator in the iterator closure will correctly propagate any deserialization errors. Each branch (Vec<String> vs. BTreeSet<String>) is converted into an iterator over String, ensuring uniform behavior when passed to MutableInvertedIndex::build_index. No further changes needed.

lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (4)

91-94: LGTM: Correct initialization with phrase matching support

The constructor correctly extracts the with_positions flag from the phrase matching configuration and initializes the MutableInvertedIndex accordingly. This ensures that positional information is enabled when phrase matching is requested.


129-129:

Details

❓ Verification inconclusive

Verify register_tokens interface consistency

The change from passing an iterator to passing a slice reference (&str_tokens) looks intentional but should be verified for consistency with the updated register_tokens method signature across the codebase.


🏁 Script executed:

#!/bin/bash
# Description: Verify register_tokens method signatures across the codebase
# Expect: Consistent method signatures for register_tokens

echo "Searching for register_tokens method definitions:"
ast-grep --pattern 'fn register_tokens($$$) { $$$ }'

echo -e "\nSearching for register_tokens method calls:"
rg -A 2 "\.register_tokens\(" 

Length of output: 1659


🏁 Script executed:

#!/usr/bin/env bash
set -e

echo "== Definition in mutable_inverted_index.rs =="
rg -C3 "fn register_tokens" -n lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs

echo -e "\n== Definition in inverted_index.rs =="
rg -C3 "fn register_tokens" -n lib/segment/src/index/field_index/full_text_index/inverted_index.rs

Length of output: 218


Verify register_tokens interface consistency

The switch from passing an iterator to passing a slice reference needs to align with the method’s signature in both MutableInvertedIndex and InvertedIndex. Please confirm that register_tokens is declared to accept the new argument type and update call sites as needed:

• lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs:129
let tokens = self.mutable_index.register_tokens(&str_tokens);

• lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs
let tokens = index.register_tokens(str_tokens_iter);

• lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
let tokens = self.inverted_index.register_tokens(&str_tokens);

• lib/segment/src/index/field_index/full_text_index/inverted_index.rs
let token_ids = index.register_tokens(&tokens);

Double-check the signature definitions and harmonize all usages (either always an iterator or always a slice).


91-94: Clean implementation of phrase matching flag handling.

The conversion from Option<bool> to bool using == Some(true) is correct and the flag is properly passed to the MutableInvertedIndex constructor.


129-129:

Details

✅ Verification successful

Verify the register_tokens signature change.

The method call has been simplified to pass &str_tokens directly. Ensure this aligns with the updated method signature in the InvertedIndex trait.


🏁 Script executed:

#!/bin/bash
# Description: Verify the register_tokens method signature in the InvertedIndex trait

# Search for register_tokens method definition
ast-grep --pattern 'fn register_tokens($$$) $$$'

Length of output: 50


🏁 Script executed:

#!/bin/bash
# Search for the InvertedIndex trait definition
rg "trait InvertedIndex" -n -A5 -B5

# Search for any register_tokens method definitions
rg "fn register_tokens" -n -A3 -B3

Length of output: 2764


#!/bin/bash
# Display context around the call site in mmap_text_index.rs
sed -n '120,160p' lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs

# Show declaration of `str_tokens`
rg 'str_tokens' -n -A3 -B3 lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs

register_tokens signature aligns with trait
The InvertedIndex::register_tokens<'a>(&mut self, str_tokens: impl IntoIterator<Item = impl AsRef<str>> + 'a) definition in full_text_index/inverted_index.rs accepts any iterator over items that implement AsRef<str>. Passing &str_tokens (e.g. a &Vec<String> or &[&str]) produces an IntoIterator<Item=&String> (or &str) where each item implements AsRef<str>. No adjustments are needed.
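The flexibility of the quoted signature can be demonstrated with a stripped-down version; the body here is a stand-in (mapping tokens to their lengths instead of registering token ids).

```rust
// Sketch of the trait signature discussed above: any iterator of
// string-like items is accepted, so both `&Vec<String>` (Item = &String)
// and an array of `&str` satisfy the bound.
fn register_tokens(str_tokens: impl IntoIterator<Item = impl AsRef<str>>) -> Vec<usize> {
    str_tokens.into_iter().map(|t| t.as_ref().len()).collect()
}
```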

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (6)

1-1: LGTM: Required import for HashMap usage

The std::collections::HashMap import is correctly added to support the explicit HashMap usage in the constructor.


25-34: LGTM: Improved constructor design

The explicit new(with_positions: bool) constructor is a significant improvement over relying on Default. It provides:

  • Clear intent about positional indexing capabilities
  • Explicit control over the point_to_doc structure initialization
  • Better API design for the phrase matching feature

This is a cleaner and more maintainable approach.
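A minimal sketch of that constructor shape, with field element types simplified relative to the real index:

```rust
struct MutableInvertedIndex {
    point_to_tokens: Vec<Vec<u32>>,
    // Present only when positional information (phrase matching) is enabled.
    point_to_doc: Option<Vec<Vec<u32>>>,
}

impl MutableInvertedIndex {
    fn new(with_positions: bool) -> Self {
        Self {
            point_to_tokens: Vec::new(),
            // `then_some` yields Some(Vec::new()) only for `true`.
            point_to_doc: with_positions.then_some(Vec::new()),
        }
    }
}
```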


36-90: LGTM: Enhanced build_index method with phrase matching support

The updated build_index method correctly:

  • Accepts an iterator over tokens and a phrase matching flag
  • Uses the new constructor with proper positional indexing configuration
  • Implements improved resizing logic for both point_to_tokens and point_to_doc structures
  • Conditionally handles document storage based on phrase matching requirements
  • Maintains backward compatibility with token set storage

The implementation is well-structured and supports the phrase matching feature requirements.


187-187: LGTM: Improved boundary checking and conditional handling

The changes improve the implementation by:

  • Using >= instead of <= for proper boundary checking (line 187)
  • Adding proper conditional handling for point_to_doc cleanup during removal (lines 210-212)

These changes enhance robustness and ensure consistent state management for both indexing modes.

Also applies to: 202-212


25-34: Well-designed constructor with explicit positional control.

The new constructor with with_positions parameter provides clear control over whether positional information should be stored. The conditional initialization of point_to_doc using then_some() is clean and efficient.


202-212: Proper cleanup in remove method.

The remove method correctly handles cleanup of both point_to_tokens and point_to_doc structures when a point is removed. The bounds checking prevents out-of-bounds access.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (2)

117-139: LGTM! Improved API flexibility and addressed past feedback.

The changes enhance the trait's usability by:

  • Accepting any iterator of string-like items instead of just string slices
  • Adding the requested register_token method that reflects the mutable nature
  • Maintaining backward compatibility through the existing register_tokens method

The implementation correctly maps tokens through the new register_token helper method.


352-352: Excellent test coverage improvement!

Parameterizing the tests with phrase_matching ensures both indexing modes are properly tested. This follows good testing practices and will help catch regressions in either configuration.

Also applies to: 394-394, 463-463

lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

562-571: Well-implemented test parameterization for phrase matching.

The changes properly:

  • Add phrase_matching parameter to the builder creation function
  • Pass the setting through to TextIndexParams configuration
  • Parameterize the congruence test to validate both phrase matching modes

This ensures comprehensive test coverage across different indexing configurations.

Also applies to: 679-682

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (4)

21-23: Proper initialization with phrase matching support.

The constructor correctly converts the phrase_matching configuration to a boolean flag and passes it to the MutableInvertedIndex constructor.


38-52: Consistent conditional loading logic.

The deserialization logic matches the implementation in immutable_text_index.rs, ensuring consistency across mutable and immutable indexes. The use of Either maintains a uniform interface for both token types.


79-83: Conditional document indexing for phrase matching.

The logic correctly indexes documents with positional information only when phrase matching is enabled. This preserves token order for phrase queries while avoiding unnecessary overhead for standard token matching.


91-95:

Details

✅ Verification successful

Conditional serialization based on phrase matching mode.

The serialization logic appropriately switches between document and token set serialization. However, verify that the serialization methods handle empty collections and edge cases properly.


🏁 Script executed:

#!/bin/bash
# Description: Verify serialization methods handle edge cases

# Search for serialize_document and serialize_token_set implementations
ast-grep --pattern 'fn serialize_document($$$) -> $_ {
  $$$
}'

ast-grep --pattern 'fn serialize_token_set($$$) -> $_ {
  $$$
}'

Length of output: 132


🏁 Script executed:

#!/bin/bash
# Re-verify serialize_document and serialize_token_set implementations

# Search for any occurrences (definitions or references) across the segment crate
grep -R "serialize_document" -n lib/segment/src
grep -R "serialize_token_set" -n lib/segment/src

Length of output: 1675


🏁 Script executed:

#!/bin/bash
# Display serialize_token_set and serialize_document implementations with context
sed -n '200,300p' lib/segment/src/index/field_index/full_text_index/text_index.rs

Length of output: 3853


Serialization edge cases verified

The serialize_document and serialize_token_set methods simply wrap their token collections in a small #[derive(Serialize)] struct and hand off to serde_cbor. In practice, CBOR serialization handles empty Vec<String> and empty BTreeSet<String> out of the box, and both methods map any serialization errors into OperationError::service_error. No additional edge‐case handling is required here. Great to merge as is.

@coszio coszio force-pushed the persist-ordered-documents branch 2 times, most recently from 017a70f to 7a1af3b Compare June 4, 2025 14:54
@coszio coszio force-pushed the persist-ordered-documents branch from 7a1af3b to 2b9d2b4 Compare June 4, 2025 15:57
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/text_index.rs (3)

217-231: Consider adding a formal TODO comment for cleanup tracking.

The function is correctly marked as temporary with detailed comments, but lacks a formal TODO comment that can be tracked.

Consider adding a TODO comment for better tracking:

+ /// TODO: Remove this function after cleanup PR following https://github.com/qdrant/qdrant/pull/6493
  /// CBOR representation is the same for BTreeSet<String> and Vec<String> if the elements are sorted, thus, we can resort to
  /// the vec implementation always. Let's just keep this to prove this works fine during https://github.com/qdrant/qdrant/pull/6493
  ///
  /// We can remove this afterwards

233-248: Consider adding a formal TODO comment for cleanup tracking.

Similar to the serialization function, this temporary deserialization function would benefit from a formal TODO comment for tracking.

Consider adding a TODO comment:

+ /// TODO: Remove this function after cleanup PR following https://github.com/qdrant/qdrant/pull/6493
  /// CBOR representation is the same for BTreeSet<String> and Vec<String> if the elements are sorted, thus, we can resort to
  /// the vec implementation always. Let's just keep this to prove this works fine during https://github.com/qdrant/qdrant/pull/6493

833-874: Remove debug print statements or use proper test logging.

The test correctly validates CBOR serialization compatibility between BTreeSet and Vec, but includes unnecessary eprintln! statements that will always print during test runs.

Consider removing the debug print statements or replacing them with proper test logging:

- eprintln!(
-     "Serialized set: {:?}",
-     serialized_set
-         .iter()
-         .map(|&b| b as char)
-         .collect::<String>()
- );
- eprintln!(
-     "Serialized vec: {:?}",
-     serialized_vec
-         .iter()
-         .map(|&b| b as char)
-         .collect::<String>()
- );

Or use conditional debug logging:

+ if std::env::var("RUST_LOG").is_ok() {
+     eprintln!("Serialized set: {:?}", /* ... */);
+     eprintln!("Serialized vec: {:?}", /* ... */);
+ }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a1af3b and 2b9d2b4.

📒 Files selected for processing (10)
  • lib/posting_list/src/lib.rs (1 hunks)
  • lib/posting_list/src/value_handler.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
  • lib/segment/src/index/field_index/full_text_index/positions.rs
  • lib/posting_list/src/value_handler.rs
  • lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
  • lib/posting_list/src/lib.rs
  • lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index.rs
  • lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: storage-compat-test
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: lint
  • GitHub Check: integration-tests-consensus
🔇 Additional comments (4)
lib/segment/src/index/field_index/full_text_index/text_index.rs (4)

1-1: LGTM: Proper conditional compilation for test-only import.

The #[cfg(test)] attribute correctly restricts the BTreeSet import to test compilation only, which aligns with its usage in the temporary test functions.


250-271: LGTM: Correct transition from BTreeSet to Vec for phrase matching.

The change from BTreeSet<String> to Vec<String> in the document serialization functions is appropriate for supporting phrase matching where token order matters. The error handling and structure remain consistent.
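The reason order matters can be shown in a few lines: a BTreeSet of tokens discards both duplicates and original positions, so two texts with the same token multiset become indistinguishable, while a Vec keeps the sequence phrase matching needs. A stdlib-only sketch (the whitespace tokenizer is a stand-in for the index's real tokenizer):

```rust
use std::collections::BTreeSet;

/// Tokenizes on whitespace, preserving order and duplicates — the information
/// phrase matching needs to reconstruct token positions.
fn tokenize_ordered(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_string).collect()
}

fn main() {
    let ordered = tokenize_ordered("to be or not to be");
    // A BTreeSet of the same tokens loses both order and repetitions,
    // so "to be" and "be to" would serialize identically.
    let set: BTreeSet<String> = ordered.iter().cloned().collect();

    assert_eq!(ordered.len(), 6); // all six tokens, in document order
    assert_eq!(set.len(), 4); // "to" and "be" deduplicated
}
```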


573-582: LGTM: Proper test infrastructure for phrase matching support.

The addition of the phrase_matching parameter to the test builder function correctly configures the TextIndexParams to test both phrase matching and non-phrase matching scenarios.


690-693: LGTM: Comprehensive parameterized testing.

The test function is properly parameterized to cover all combinations of deleted and phrase_matching flags, ensuring thorough test coverage for the new functionality.
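The exhaustive-combination pattern praised here can be sketched in plain Rust: loop over both boolean flags so every (deleted, phrase_matching) pair is exercised. The check function below is a hypothetical stand-in for the index round-trip the real test performs:

```rust
/// Hypothetical stand-in for the index round-trip under test. The real test
/// builds the text index with the given flags and compares query results.
fn index_roundtrip_ok(deleted: bool, phrase_matching: bool) -> bool {
    let _ = (deleted, phrase_matching);
    true
}

fn main() {
    // Cover all four (deleted, phrase_matching) combinations.
    for deleted in [false, true] {
        for phrase_matching in [false, true] {
            assert!(index_roundtrip_ok(deleted, phrase_matching));
        }
    }
}
```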

@coszio coszio force-pushed the persist-ordered-documents branch from 2b9d2b4 to ae56018 Compare June 5, 2025 17:44
Member

@timvisee timvisee left a comment


Nice!

No significant remarks, only nits.

@coszio coszio merged commit 052ce9b into dev Jun 10, 2025
17 checks passed
@coszio coszio deleted the persist-ordered-documents branch June 10, 2025 17:07
generall pushed a commit that referenced this pull request Jul 17, 2025
* populate point_to_doc

* add phrase_matching=true case to tests

* CBOR ordered Vec and BTreeSet have same representation

* review nits