[phrase matching] populate point_to_doc structures #6493
Conversation
lib/segment/src/index/field_index/full_text_index/inverted_index.rs
This PR is expected to only affect the mutable index, right?
Mutable and immutable, since immutable loads from db too. |
    struct StoredTokens {
        tokens: BTreeSet<String>,
    }
    let doc = StoredTokens { tokens };
Could you check if we can simply use a Vec here. IIRC, the serialized format is compatible. If possible, we probably want an assertion here that asserts ordering just in case.
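The suggested ordering assertion could look like the sketch below. It is illustrative only, assuming hypothetical helper names (`is_btree_order`, `tokens_to_vec`) rather than Qdrant's actual code: a sorted, deduplicated `Vec<String>` iterates in the same order a `BTreeSet<String>` would, which is what keeps the serialized format compatible.

```rust
use std::collections::BTreeSet;

// Hypothetical helper: strictly increasing means sorted and deduplicated,
// i.e. the same order a BTreeSet<String> iteration produces.
fn is_btree_order(tokens: &[String]) -> bool {
    tokens.windows(2).all(|w| w[0] < w[1])
}

// Convert a token set into a Vec while keeping BTreeSet ordering, with the
// assertion the review asks for guarding the invariant.
fn tokens_to_vec(set: BTreeSet<String>) -> Vec<String> {
    let v: Vec<String> = set.into_iter().collect();
    debug_assert!(is_btree_order(&v), "tokens must stay sorted for format compatibility");
    v
}
```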
That's correct; I have added a test and cfg(test) attributes for this in 017a70f. We can remove the extra implementations in a cleanup PR.
timvisee left a comment:
Could you add a congruence test for the phrase index functionality?
Also, maybe it's worth adding a basic unit test that just asserts Document/TokenSet are handled correctly when filtering/checking: in one we use ordering, in the other we don't.
I have now added cases with phrase matching enabled in 9bf8e90.
We haven't implemented phrase checking yet, but will definitely include tests for it.
## Walkthrough
This set of changes introduces conditional phrase matching support across the full-text index components. The code now distinguishes between handling ordered documents (for phrase matching) and token sets (for standard indexing), controlled by a `phrase_matching` configuration flag. Serialization and deserialization logic is split between token sets and documents, with improved error handling and migration to `serde_cbor`. The `MutableInvertedIndex` gains an explicit constructor for enabling or disabling positional information, and related methods are updated to use iterators over tokens rather than sets. Tests are extended and parameterized to cover both phrase matching and non-phrase matching scenarios. Additional minor improvements include documentation comments for traits, simplification of iterator usage, and stylistic refactoring.
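The Document/TokenSet distinction the walkthrough describes can be sketched roughly as follows. The field types and constructors are simplified assumptions for illustration, not the PR's actual definitions: a `Document` keeps token order (and duplicates) for phrase matching, while a `TokenSet` deduplicates for standard filtering.

```rust
use std::collections::BTreeSet;

// Ordered token ids: preserves the token sequence of the original text,
// which is what phrase matching needs.
struct Document(Vec<u32>);

// Unordered, deduplicated tokens: enough for "contains all tokens" filtering.
struct TokenSet(BTreeSet<u32>);

impl Document {
    fn from_tokens(tokens: &[u32]) -> Self {
        Document(tokens.to_vec())
    }
}

impl TokenSet {
    fn from_tokens(tokens: &[u32]) -> Self {
        TokenSet(tokens.iter().copied().collect())
    }
}
```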
## Possibly related PRs
- qdrant/qdrant#6486: Introduces the foundational distinction between ordered `Document` and unique `TokenSet` data structures, and updates the inverted index trait and implementations, which the current PR builds upon to extend phrase matching and indexing interfaces.
- qdrant/qdrant#6602: Both PRs introduce and support phrase matching in the full-text index by adding positional information handling, with changes to the mutable and immutable inverted indexes, the mmap indexes, and the related serialization and test code.
- qdrant/qdrant#6528: The main PR adds documentation comments to the UnsizedValue trait methods in lib/posting_list/src/lib.rs, while the retrieved PR introduced the new compressed posting list implementation, including the UnsizedValue trait and related posting list structures, in the same module.
## Suggested reviewers
- timvisee
Actionable comments posted: 1
🔭 Outside diff range comments (2)
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)
216-260: 💡 Verification agent 🧩 Analysis chain
Verify backward compatibility for serialization format change.
The refactoring improves code organization by separating token set and document serialization logic. However, the switch from bincode to serde_cbor is a breaking change that could affect existing stored data. Please confirm:
- Is the serialization format change intentional?
- How is backward compatibility handled for existing data serialized with bincode?
- Is there a migration strategy for existing installations?
The improved error handling with detailed service error messages is a good enhancement.
🏁 Script executed:
#!/bin/bash
# Description: Check if there are any migration utilities or backward
# compatibility handling for the serialization format change

# Search for migration code or compatibility handling
rg -A 5 -B 5 "bincode|serde_cbor" --type rust
# Look for version handling or migration utilities
fd "migrat" --type f
fd "compat" --type f

Length of output: 61416
🏁 Script executed:
#!/bin/bash
# Show beginning of segment's compat module
sed -n '1,200p' lib/segment/src/compat.rs
# Search for full_text or text_index mentions
grep -R "text_index" -n lib/segment/src/compat.rs || echo "No text_index-specific compatibility code"

Length of output: 7397
Breaking Change: Full-text Index Serialization Now CBOR Only
There's no fallback or migration path in lib/segment/src/index/field_index/full_text_index/text_index.rs (lines 216–260) or in lib/segment/src/compat.rs for data previously serialized with bincode. As a result, any existing index files written with bincode will no longer deserialize correctly. Please add one of the following before merging:
- A versioned wrapper (e.g. prefixing data with a format version byte) and a fallback to bincode::from_slice in deserialize_* to maintain compatibility.
- A one-time migration utility that reads old bincode blobs, converts to the new CBOR format, and rewrites storage.
- Documentation/upgrade guide clearly stating that all existing indices must be rebuilt.
Fix Required: ensure backward compatibility or provide a migration strategy for stored full-text index data.
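The version-byte wrapper suggested above could be sketched as below. The tag value and the dispatch behavior are assumptions for illustration, not Qdrant's format; note also that a legacy blob whose first byte happens to equal the tag would be misclassified, so a real implementation needs a more robust discriminator or a one-time migration.

```rust
// Illustrative format tag for new CBOR writes; legacy bincode blobs are untagged.
const FORMAT_CBOR: u8 = 0x01;

// Prefix a serialized payload with the format version byte.
fn wrap_versioned(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 1);
    out.push(FORMAT_CBOR);
    out.extend_from_slice(payload);
    out
}

// Returns (is_new_format, payload): callers would route the payload to the
// CBOR decoder, or fall back to the legacy bincode decoder for untagged data.
fn unwrap_versioned(data: &[u8]) -> (bool, &[u8]) {
    match data.split_first() {
        Some((&FORMAT_CBOR, rest)) => (true, rest),
        _ => (false, data), // assume legacy blob, hand the whole slice to bincode
    }
}
```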
lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (1)
36-90: 🛠️ Refactor suggestion
Comprehensive build_index implementation with proper structure management.
The method correctly handles both phrase matching and non-phrase matching modes. The resizing logic for both point_to_tokens and point_to_doc structures is properly synchronized. However, there's a potential issue with the resizing condition on line 48:
    if index.point_to_tokens.len() <= idx as usize {
        index
            .point_to_tokens
            .resize_with(idx as usize + 1, Default::default);
        if let Some(point_to_doc) = &mut index.point_to_doc {
            point_to_doc.resize_with(idx as usize + 1, Default::default);
        }
    }

The logic is correct, but consider adding bounds checking for very large idx values to prevent excessive memory allocation.
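The bounds-checked variant of that synchronized resize could look like the sketch below. The cap, function name, and element types are illustrative assumptions; real code would derive the limit from configuration or the known point count.

```rust
// Arbitrary illustrative cap on point offsets.
const MAX_POINTS: usize = 1 << 24;

// Grow both parallel structures to hold `idx`, mirroring the synchronized
// resizing described in the review, with a guard against absurd offsets.
fn ensure_capacity(
    point_to_tokens: &mut Vec<Vec<u32>>,
    point_to_doc: &mut Option<Vec<Vec<u32>>>,
    idx: usize,
) -> Result<(), String> {
    if idx >= MAX_POINTS {
        return Err(format!("point offset {idx} exceeds limit {MAX_POINTS}"));
    }
    if point_to_tokens.len() <= idx {
        point_to_tokens.resize_with(idx + 1, Default::default);
        if let Some(docs) = point_to_doc {
            docs.resize_with(idx + 1, Default::default);
        }
    }
    Ok(())
}
```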
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)
318-349: Clarify the dual indexing logic and consider renaming the function.
The conditional indexing logic raises a few questions:
- Dual indexing: When with_positions is true, both index_document and index_tokens are called. Is this intentional? It seems redundant to index the same data in two different formats.
- Function naming: The function name mutable_inverted_index doesn't reflect that it now supports phrase matching modes. Consider renaming to mutable_inverted_index_with_options or similar.

Could you clarify if the dual indexing is necessary, or should it be either document OR token set indexing based on the with_positions flag?
with_positionsflag?- if with_positions { - index - .index_document(idx, Document(token_ids.clone()), &hw_counter) - .unwrap(); - } - let token_set = TokenSet::from_iter(token_ids); - index.index_tokens(idx, token_set, &hw_counter).unwrap(); + if with_positions { + index + .index_document(idx, Document(token_ids), &hw_counter) + .unwrap(); + } else { + let token_set = TokenSet::from_iter(token_ids); + index.index_tokens(idx, token_set, &hw_counter).unwrap(); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- lib/posting_list/src/lib.rs (1 hunks)
- lib/posting_list/src/value_handler.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (2 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index.rs (6 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (6 hunks)
- lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (5 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/posting_list/src/lib.rs (2)
- lib/segment/src/index/field_index/full_text_index/positions.rs (2): write_len (19-21), write_to (23-28)
- lib/posting_list/src/tests.rs (2): write_len (18-20), write_to (22-24)
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: integration-tests
- GitHub Check: test-consistency
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: lint
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (25)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1)

105-111: Good refactoring from if-else to match.
The refactor to use match instead of if-else for the binary has_positions condition improves readability and is more idiomatic Rust. The logic remains functionally equivalent while supporting the conditional phrase matching feature.

lib/segment/src/index/field_index/full_text_index/positions.rs (1)

20-20: Improved byte length calculation.
The change to use self.0.as_bytes().len() directly aligns with the UnsizedValue trait's documented behavior of returning "the length of the serialized value in bytes." This is more accurate and eliminates potential arithmetic errors from manual calculation.

lib/posting_list/src/lib.rs (1)

25-32: Excellent documentation improvements.
The added documentation comments clearly explain the purpose and expected behavior of each UnsizedValue trait method. This improves code clarity and helps implementers understand the contract, which is particularly valuable given the broader serialization changes in this PR.

lib/posting_list/src/value_handler.rs (1)

112-112: Good simplification of iterator construction.
The change to use last_offset.map(|&last| ...) directly is more idiomatic and concise than the previous approach. It maintains the same functionality while reducing verbosity in the iterator chain construction.

lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (4)
4-4: LGTM: Clean import addition for Either type.
The itertools::Either import is appropriately added to support the conditional iterator logic introduced in the phrase matching implementation.

82-95: LGTM: Correct conditional deserialization logic.
The implementation correctly handles phrase matching by:
- Extracting the phrase_matching flag from the configuration
- Conditionally deserializing as documents (preserving order) vs token sets
- Using Either to unify the iterator types for downstream processing
- Passing the flag to MutableInvertedIndex::build_index

This aligns well with the PR objectives of supporting ordered token storage for phrase matching.
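The Either-based unification can be sketched with a minimal stand-in for itertools::Either, shown below. The `load_tokens` function and its types are illustrative assumptions, not Qdrant's code: the two deserialization branches produce different concrete iterator types, and wrapping them in one enum gives downstream code a single iterator type.

```rust
use std::collections::BTreeSet;

// Minimal stand-in for itertools::Either with a unifying Iterator impl.
enum Either<L, R> {
    Left(L),
    Right(R),
}

impl<L, R, T> Iterator for Either<L, R>
where
    L: Iterator<Item = T>,
    R: Iterator<Item = T>,
{
    type Item = T;
    fn next(&mut self) -> Option<T> {
        match self {
            Either::Left(l) => l.next(),
            Either::Right(r) => r.next(),
        }
    }
}

// phrase_matching=true yields ordered tokens (Vec-like, duplicates kept),
// false yields a deduplicated set; both come back as one iterator type.
fn load_tokens<'a>(phrase_matching: bool, raw: &'a [&'a str]) -> impl Iterator<Item = String> + 'a {
    if phrase_matching {
        Either::Left(raw.iter().map(|s| s.to_string()))
    } else {
        let set: BTreeSet<String> = raw.iter().map(|s| s.to_string()).collect();
        Either::Right(set.into_iter())
    }
}
```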
4-4: Good addition of Either import for conditional deserialization.
The import is appropriately used to handle the conditional deserialization logic between documents and token sets.

82-95: ✅ Verification successful
Well-implemented conditional loading logic for phrase matching.
The implementation correctly switches between document and token set deserialization based on the phrase_matching flag. The use of Either provides a clean abstraction for handling both iterator types uniformly. However, ensure that the deserialization methods handle errors appropriately and that the iterator behavior is consistent between both paths.
Let me verify that the deserialization methods exist and handle errors properly:
🏁 Script executed:
#!/bin/bash
# Description: Verify that deserialize_document and deserialize_token_set
# methods exist and handle errors
ast-grep --pattern 'fn deserialize_document($$$) -> $_ { $$$ }'
ast-grep --pattern 'fn deserialize_token_set($$$) -> $_ { $$$ }'

Length of output: 136
Let's search for the actual definitions with ripgrep to confirm their signatures and error handling:
🏁 Script executed:
#!/bin/bash
set -e
# Find deserialize_document definition and its return type/context
rg -n "fn deserialize_document" -C3 .
# Find deserialize_token_set definition and its return type/context
rg -n "fn deserialize_token_set" -C3 .

Length of output: 1478
Verification Complete: Error Handling and Iterator Consistency Confirmed
Both deserialize_document and deserialize_token_set are defined in text_index.rs and return OperationResult, so the ? operator in the iterator closure will correctly propagate any deserialization errors. Each branch (Vec<String> vs. BTreeSet<String>) is converted into an iterator over String, ensuring uniform behavior when passed to MutableInvertedIndex::build_index. No further changes needed.

lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (4)
91-94: LGTM: Correct initialization with phrase matching support.
The constructor correctly extracts the with_positions flag from the phrase matching configuration and initializes the MutableInvertedIndex accordingly. This ensures that positional information is enabled when phrase matching is requested.
129-129: ❓ Verification inconclusive
Verify register_tokens interface consistency.
The change from passing an iterator to passing a slice reference (&str_tokens) looks intentional but should be verified for consistency with the updated register_tokens method signature across the codebase.
🏁 Script executed:
#!/bin/bash
# Description: Verify register_tokens method signatures across the codebase
# Expect: Consistent method signatures for register_tokens
echo "Searching for register_tokens method definitions:"
ast-grep --pattern 'fn register_tokens($$$) { $$$ }'
echo -e "\nSearching for register_tokens method calls:"
rg -A 2 "\.register_tokens\("

Length of output: 1659
🏁 Script executed:
#!/usr/bin/env bash
set -e
echo "== Definition in mutable_inverted_index.rs =="
rg -C3 "fn register_tokens" -n lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs
echo -e "\n== Definition in inverted_index.rs =="
rg -C3 "fn register_tokens" -n lib/segment/src/index/field_index/full_text_index/inverted_index.rs

Length of output: 218
Verify register_tokens interface consistency.
The switch from passing an iterator to passing a slice reference needs to align with the method's signature in both MutableInvertedIndex and InvertedIndex. Please confirm that register_tokens is declared to accept the new argument type and update call sites as needed:
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs:129: let tokens = self.mutable_index.register_tokens(&str_tokens);
- lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs: let tokens = index.register_tokens(str_tokens_iter);
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs: let tokens = self.inverted_index.register_tokens(&str_tokens);
- lib/segment/src/index/field_index/full_text_index/inverted_index.rs: let token_ids = index.register_tokens(&tokens);

Double-check the signature definitions and harmonize all usages (either always an iterator or always a slice).
91-94: Clean implementation of phrase matching flag handling.
The conversion from Option<bool> to bool using == Some(true) is correct and the flag is properly passed to the MutableInvertedIndex constructor.
129-129: ✅ Verification successful
Verify the register_tokens signature change.
The method call has been simplified to pass &str_tokens directly. Ensure this aligns with the updated method signature in the InvertedIndex trait.
🏁 Script executed:
#!/bin/bash
# Description: Verify the register_tokens method signature in the InvertedIndex trait
ast-grep --pattern 'fn register_tokens($$$) $$$'

Length of output: 50
🏁 Script executed:
#!/bin/bash
# Search for the InvertedIndex trait definition
rg "trait InvertedIndex" -n -A5 -B5
# Search for any register_tokens method definitions
rg "fn register_tokens" -n -A3 -B3

Length of output: 2764
#!/bin/bash
# Display context around the call site in mmap_text_index.rs
sed -n '120,160p' lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
# Show declaration of `str_tokens`
rg 'str_tokens' -n -A3 -B3 lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
register_tokens signature aligns with trait.
The InvertedIndex::register_tokens<'a>(&mut self, str_tokens: impl IntoIterator<Item = impl AsRef<str>> + 'a) definition in full_text_index/inverted_index.rs accepts any iterator over items that implement AsRef<str>. Passing &str_tokens (e.g. a &Vec<String> or &[&str]) produces an IntoIterator<Item = &String> (or &str) where each item implements AsRef<str>. No adjustments are needed.

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (6)
1-1: LGTM: Required import for HashMap usage.
The std::collections::HashMap import is correctly added to support the explicit HashMap usage in the constructor.

25-34: LGTM: Improved constructor design.
The explicit new(with_positions: bool) constructor is a significant improvement over relying on Default. It provides:
- Clear intent about positional indexing capabilities
- Explicit control over the point_to_doc structure initialization
- Better API design for the phrase matching feature

This is a cleaner and more maintainable approach.
36-90: LGTM: Enhanced build_index method with phrase matching support.
The updated build_index method correctly:
- Accepts an iterator over tokens and a phrase matching flag
- Uses the new constructor with proper positional indexing configuration
- Implements improved resizing logic for both point_to_tokens and point_to_doc structures
- Conditionally handles document storage based on phrase matching requirements
- Maintains backward compatibility with token set storage

The implementation is well-structured and supports the phrase matching feature requirements.
187-187: LGTM: Improved boundary checking and conditional handling.
The changes improve the implementation by:
- Using >= instead of <= for proper boundary checking (line 187)
- Adding proper conditional handling for point_to_doc cleanup during removal (lines 210-212)

These changes enhance robustness and ensure consistent state management for both indexing modes.
Also applies to: 202-212
25-34: Well-designed constructor with explicit positional control.
The new constructor with a with_positions parameter provides clear control over whether positional information should be stored. The conditional initialization of point_to_doc using then_some() is clean and efficient.
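The constructor pattern praised here can be sketched as below; the struct and field types are simplified assumptions based on the PR's description, not the actual definitions. `bool::then_some` initializes the optional positional store only when it will be used.

```rust
// Simplified sketch of the constructor shape described in the review.
struct MutableIndex {
    point_to_tokens: Vec<Vec<u32>>,
    point_to_doc: Option<Vec<Vec<u32>>>,
}

impl MutableIndex {
    fn new(with_positions: bool) -> Self {
        Self {
            point_to_tokens: Vec::new(),
            // allocate the positional store only when phrase matching needs it
            point_to_doc: with_positions.then_some(Vec::new()),
        }
    }
}
```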
202-212: Proper cleanup in remove method.
The remove method correctly handles cleanup of both point_to_tokens and point_to_doc structures when a point is removed. The bounds checking prevents out-of-bounds access.

lib/segment/src/index/field_index/full_text_index/inverted_index.rs (2)
117-139: LGTM! Improved API flexibility and addressed past feedback.
The changes enhance the trait's usability by:
- Accepting any iterator of string-like items instead of just string slices
- Adding the requested register_token method that reflects the mutable nature
- Maintaining backward compatibility through the existing register_tokens method

The implementation correctly maps tokens through the new register_token helper method.
352-352: Excellent test coverage improvement!
Parameterizing the tests with phrase_matching ensures both indexing modes are properly tested. This follows good testing practices and will help catch regressions in either configuration.
Also applies to: 394-394, 463-463
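The parameterization pattern can be illustrated without any fixture crate by looping over both flag values. The `build_index` stand-in below is a hypothetical placeholder; real tests would construct the actual inverted index with the phrase_matching flag.

```rust
// Illustrative stand-in for the index under test: positions are recorded
// only when phrase matching is enabled.
fn build_index(phrase_matching: bool, tokens: &[&str]) -> usize {
    if phrase_matching { tokens.len() } else { 0 }
}

// Plain-Rust parameterization over both modes, equivalent in spirit to the
// parameterized test cases the PR adds.
fn check_both_modes() {
    for phrase_matching in [false, true] {
        let positions = build_index(phrase_matching, &["a", "b", "c"]);
        assert_eq!(positions > 0, phrase_matching);
    }
}
```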
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)
562-571: Well-implemented test parameterization for phrase matching.
The changes properly:
- Add a phrase_matching parameter to the builder creation function
- Pass the setting through to the TextIndexParams configuration
- Parameterize the congruence test to validate both phrase matching modes

This ensures comprehensive test coverage across different indexing configurations.
Also applies to: 679-682
lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (4)
21-23: Proper initialization with phrase matching support.
The constructor correctly converts the phrase_matching configuration to a boolean flag and passes it to the MutableInvertedIndex constructor.
38-52: Consistent conditional loading logic.
The deserialization logic matches the implementation in immutable_text_index.rs, ensuring consistency across mutable and immutable indexes. The use of Either maintains a uniform interface for both token types.
79-83: Conditional document indexing for phrase matching.
The logic correctly indexes documents with positional information only when phrase matching is enabled. This preserves token order for phrase queries while avoiding unnecessary overhead for standard token matching.
91-95: ✅ Verification successful
Conditional serialization based on phrase matching mode.
The serialization logic appropriately switches between document and token set serialization. However, verify that the serialization methods handle empty collections and edge cases properly.
🏁 Script executed:
#!/bin/bash
# Description: Verify serialization methods handle edge cases
# Search for serialize_document and serialize_token_set implementations
ast-grep --pattern 'fn serialize_document($$$) -> $_ { $$$ }'
ast-grep --pattern 'fn serialize_token_set($$$) -> $_ { $$$ }'

Length of output: 132
🏁 Script executed:
#!/bin/bash
# Re-verify serialize_document and serialize_token_set implementations
# Search for any occurrences (definitions or references) across the segment crate
grep -R "serialize_document" -n lib/segment/src
grep -R "serialize_token_set" -n lib/segment/src

Length of output: 1675
🏁 Script executed:
#!/bin/bash
# Display serialize_token_set and serialize_document implementations with context
sed -n '200,300p' lib/segment/src/index/field_index/full_text_index/text_index.rs

Length of output: 3853
Serialization edge cases verified
The serialize_document and serialize_token_set methods simply wrap their token collections in a small #[derive(Serialize)] struct and hand off to serde_cbor. In practice, CBOR serialization handles empty Vec<String> and empty BTreeSet<String> out of the box, and both methods map any serialization errors into OperationError::service_error. No additional edge-case handling is required here. Great to merge as is.
lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/text_index.rs (3)
217-231: Consider adding a formal TODO comment for cleanup tracking.
The function is correctly marked as temporary with detailed comments, but lacks a formal TODO comment that can be tracked.
Consider adding a TODO comment for better tracking:
    + /// TODO: Remove this function after cleanup PR following https://github.com/qdrant/qdrant/pull/6493
      /// CBOR representation is the same for BTreeSet<String> and Vec<String> if the elements are sorted, thus, we can resort to
      /// the vec implementation always. Let's just keep this to prove this works fine during https://github.com/qdrant/qdrant/pull/6493
      ///
      /// We can remove this afterwards
233-248: Consider adding a formal TODO comment for cleanup tracking.
Similar to the serialization function, this temporary deserialization function would benefit from a formal TODO comment for tracking.
Consider adding a TODO comment:
    + /// TODO: Remove this function after cleanup PR following https://github.com/qdrant/qdrant/pull/6493
      /// CBOR representation is the same for BTreeSet<String> and Vec<String> if the elements are sorted, thus, we can resort to
      /// the vec implementation always. Let's just keep this to prove this works fine during https://github.com/qdrant/qdrant/pull/6493
833-874: Remove debug print statements or use proper test logging.
The test correctly validates CBOR serialization compatibility between BTreeSet and Vec, but includes unnecessary eprintln! statements that will always print during test runs. Consider removing the debug print statements or replacing them with proper test logging:
    - eprintln!(
    -     "Serialized set: {:?}",
    -     serialized_set
    -         .iter()
    -         .map(|&b| b as char)
    -         .collect::<String>()
    - );
    - eprintln!(
    -     "Serialized vec: {:?}",
    -     serialized_vec
    -         .iter()
    -         .map(|&b| b as char)
    -         .collect::<String>()
    - );

Or use conditional debug logging:
    + #[cfg(test)]
    + if std::env::var("RUST_LOG").is_ok() {
    +     eprintln!("Serialized set: {:?}", /* ... */);
    +     eprintln!("Serialized vec: {:?}", /* ... */);
    + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- lib/posting_list/src/lib.rs (1 hunks)
- lib/posting_list/src/value_handler.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index.rs (7 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (2 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (4 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (6 hunks)
- lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- lib/segment/src/index/field_index/full_text_index/positions.rs
- lib/posting_list/src/value_handler.rs
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs
- lib/posting_list/src/lib.rs
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index.rs
- lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: integration-tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: storage-compat-test
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: lint
- GitHub Check: integration-tests-consensus
🔇 Additional comments (4)
lib/segment/src/index/field_index/full_text_index/text_index.rs (4)
1-1: LGTM: Proper conditional compilation for test-only import.
The #[cfg(test)] attribute correctly restricts the BTreeSet import to test compilation only, which aligns with its usage in the temporary test functions.

250-271: LGTM: Correct transition from BTreeSet to Vec for phrase matching.
The change from BTreeSet<String> to Vec<String> in the document serialization functions is appropriate for supporting phrase matching where token order matters. The error handling and structure remain consistent.

573-582: LGTM: Proper test infrastructure for phrase matching support.
The addition of the phrase_matching parameter to the test builder function correctly configures the TextIndexParams to test both phrase matching and non-phrase matching scenarios.

690-693: LGTM: Comprehensive parameterized testing.
The test function is properly parameterized to cover all combinations of the deleted and phrase_matching flags, ensuring thorough test coverage for the new functionality.
timvisee left a comment:
Nice!
No significant remarks, only nits.
lib/segment/src/index/field_index/full_text_index/inverted_index.rs
lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
* populate point_to_doc
* add phrase_matching=true case to tests
* CBOR ordered Vec and BTreeSet have same representation
* review nits
Builds on top of #6602
Adds a phrase_matching bool param to the text index parameters, which controls how the tokens of each text are stored.