[phrase match] Mmap full text index for phrase #6602
Conversation
```diff
 #[cfg_attr(test, derive(Clone))]
-#[derive(Default, Debug)]
+#[derive(Debug)]
 pub enum ImmutablePostings {
```
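For context, a hedged sketch of what a two-variant postings enum like this might look like. The variant names and the `Vec`-backed storage here are assumptions for illustration; the real type in the PR wraps compressed posting lists.

```rust
// Illustrative stand-in for the postings enum discussed in this PR.
// Variant names and storage are assumptions, not the exact code.
type PointOffsetType = u32;

#[derive(Debug)]
pub enum ImmutablePostings {
    /// Posting lists that store only point IDs.
    Ids(Vec<Vec<PointOffsetType>>),
    /// Posting lists that also store token positions per point.
    WithPositions(Vec<Vec<(PointOffsetType, Vec<u32>)>>),
}

impl ImmutablePostings {
    /// Length of the posting list for `token_id`, or `None` if out of bounds.
    pub fn posting_len(&self, token_id: usize) -> Option<usize> {
        match self {
            ImmutablePostings::Ids(postings) => postings.get(token_id).map(Vec::len),
            ImmutablePostings::WithPositions(postings) => postings.get(token_id).map(Vec::len),
        }
    }
}
```

Dispatching on the variant keeps a single call site for consumers that do not care whether positions are stored.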
generall left a comment

nit: let's add a comment for the `filter_has_subset` function to describe exactly what it is doing

nit: let's try to move big structures with implementations into dedicated files
📝 Walkthrough

This change introduces a new optional boolean field, `phrase_matching`.
Actionable comments posted: 3
🧹 Nitpick comments (5)
lib/segment/src/index/field_index/full_text_index/immutable_postings_enum.rs (1)
23-32: Add bounds checking documentation.

The method correctly handles bounds checking by using `Option::map`, but this behavior should be documented. Add documentation to clarify the bounds checking behavior:

```diff
+ /// Returns the length of the posting list for the given token.
+ /// Returns `None` if the token ID is out of bounds.
  pub fn posting_len(&self, token: TokenId) -> Option<usize> {
```

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings_enum.rs (1)
37-56: Consider extracting common iterator pattern.

The `iter_ids` implementation is very similar to the one in `ImmutablePostings`; consider whether a trait or macro could reduce this duplication. Both enum implementations have nearly identical `iter_ids` methods. Consider creating a trait:

```rust
trait PostingsIterator {
    fn iter_ids(&self, token_id: TokenId) -> Option<Box<dyn Iterator<Item = PointOffsetType> + '_>>;
}
```

This would reduce code duplication and ensure consistent behavior across implementations.
lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)
28-28: Consider extracting the duplicate logic into a helper method.

The logic `config.phrase_matching == Some(true)` is duplicated in both the `open` and `finalize` methods. Consider extracting this into a helper method for better maintainability. Add a helper method to the `TextIndexParams` struct:

```rust
impl TextIndexParams {
    pub fn has_positions(&self) -> bool {
        self.phrase_matching == Some(true)
    }
}
```

Then use it as:

```diff
-let has_positions = config.phrase_matching == Some(true);
+let has_positions = config.has_positions();
```

Also applies to: 182-182
lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (2)
244-276: Consider adding documentation for this helper function.

The implementation is solid, but it would benefit from documentation explaining its purpose and the remapping logic.

```diff
+/// Filters out empty postings and remaps token IDs to create a compact representation.
+/// Returns the filtered postings, updated vocab with new token IDs, and a mapping from original to new token IDs.
 fn optimized_postings_and_vocab(
```
293-334: Complex but correct position extraction logic.

The function correctly builds postings with positions. The `expect()` at lines 326-328 is justified by the invariant that any document in a token's posting list must contain that token at least once. Consider adding documentation to explain the algorithm.

```diff
+/// Creates compressed postings with position information.
+/// Processes documents to extract token positions and builds posting lists that include
+/// these positions for each token.
 fn create_compressed_postings_with_positions(
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- lib/api/src/grpc/conversions.rs (2 hunks)
- lib/segment/src/data_types/index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/immutable_postings_enum.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (4 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index.rs (6 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings_enum.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (8 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/positions.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-low-resources
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: storage-compat-test
- GitHub Check: lint
🔇 Additional comments (22)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1)
167-167: LGTM: Test configuration updated correctly.

The addition of `phrase_matching: None` to the `TextIndexParams` struct initialization is consistent with the new field introduced in the broader phrase matching feature. This ensures the test remains compatible with the updated struct definition.

lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (1)
74-74: LGTM: Helpful documentation added.

The documentation comment clearly describes the purpose of the `filter_has_subset` method, improving code readability and maintainability.

lib/segment/src/index/field_index/full_text_index/mod.rs (1)
1-1: LGTM: New module addition supports phrase matching functionality.

The addition of the `immutable_postings_enum` module is well aligned with the PR objectives to introduce phrase matching capabilities through postings with positional information. This provides a clean architectural foundation for the new `ImmutablePostings` enum.

lib/segment/src/index/field_index/full_text_index/tokenizers.rs (1)
258-258: LGTM: Test configuration consistently updated.

The addition of `phrase_matching: None` maintains test compatibility with the updated `TextIndexParams` struct definition and follows the consistent pattern applied across other test files.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1)
154-154: LGTM: Test configuration update completes the consistent pattern.

The addition of `phrase_matching: None` to the `TextIndexParams` struct initialization maintains compatibility with the updated struct definition. This change, along with similar updates across other test files, demonstrates a well coordinated and consistent approach to integrating the new phrase matching feature.

lib/segment/src/data_types/index.rs (1)
197-200: New field added for phrase matching support.

The addition of the `phrase_matching` field aligns with the PR objectives for introducing phrase matching capabilities. The implementation approach is sound:

- Using `Option<bool>` maintains backwards compatibility
- The `#[serde(skip)]` annotation prevents premature exposure in the API
- Field placement at the end preserves struct layout

Remember to update the serialization behavior when removing the `skip` attribute in the future, and ensure proper API documentation and validation are added.

lib/segment/src/index/field_index/full_text_index/positions.rs (2)
5-5: Good addition of Default trait derivation.

Deriving `Default` for the `Positions` struct enables easy initialization of empty position lists, which is essential for building postings with positional information.

8-12: Clean API for building position lists.

The `push` method provides a straightforward interface for appending positions to the internal vector. The implementation correctly delegates to the underlying `Vec<u32>`.
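Putting the two comments above together, a minimal sketch of such a wrapper, assuming it is a newtype over `Vec<u32>` (the field layout is an assumption made for this illustration):

```rust
// Minimal sketch of a positions wrapper with a delegating `push`.
#[derive(Default, Debug)]
pub struct Positions(Vec<u32>);

impl Positions {
    /// Append a token position, delegating to the inner vector.
    pub fn push(&mut self, position: u32) {
        self.0.push(position);
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
}
```

`Positions::default()` then gives the empty position list that the `Default` derivation comment refers to.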
369-372: Details

✅ Verification successful

Method signature alignment for mmap remove_point.

The change aligns with the updated `MmapFullTextIndex::remove_point` method signature, which now returns `()` instead of a `Result`. The explicit call followed by `Ok(())` maintains the expected interface.

Verify that error handling is consistent across all `remove_point` implementations:
🏁 Script executed:
```shell
#!/bin/bash
# Description: Check remove_point method signatures across full text index implementations
# Expected: Consistent error handling patterns

# Search for remove_point method definitions in full text index modules
ast-grep --pattern $'impl $_ { $$$ fn remove_point(&mut self, $_) -> $_ { $$$ } $$$ }'
```

Length of output: 99
Couldn't detect any `remove_point` definitions via AST. Let's fall back to a regex search to ensure we catch all implementations and match arms:
🏁 Script executed:
```shell
#!/bin/bash
# List all remove_point method definitions in full_text_index modules
rg -n "fn remove_point" -C2 lib/segment/src/index/field_index/full_text_index

# Show all match arms invoking remove_point in text_index.rs
rg -n "FullTextIndex::.*remove_point" -C2 lib/segment/src/index/field_index/full_text_index/text_index.rs
```

Length of output: 3139
Method signature alignment for mmap remove_point
Verified that:

- `MutableTextIndex::remove_point` and `ImmutableTextIndex::remove_point` both return `OperationResult<()>`.
- `MmapTextIndex::remove_point` is inherently infallible (it returns `()`), so the wrapper in `FullTextIndex::Mmap` correctly calls `index.remove_point(id);` and then returns `Ok(())`.

No inconsistencies found; error handling remains uniform across all variants.
lib/segment/src/index/field_index/full_text_index/immutable_postings_enum.rs (2)
8-13: Well-designed enum for positional postings support.

The enum provides a clean abstraction for handling postings with or without positional information. The conditional `Clone` derive for tests is a good practice.

34-52: Excellent test-only iterator implementation.

The `iter_ids` method provides a clean interface for testing by abstracting over both enum variants. The trait object approach allows uniform handling in tests.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings_enum.rs (2)
9-12: Consistent enum design with ImmutablePostings.

The enum structure mirrors `ImmutablePostings`, providing consistency across different storage backends.

22-35: Proper hardware counter integration.

The method correctly passes the hardware counter to the underlying mmap operations for performance monitoring. The `Option::map` pattern handles bounds checking appropriately.

lib/segment/src/index/field_index/full_text_index/mmap_text_index.rs (1)
56-58: Verify the breaking API change is intentional.

The `remove_point` method signature has changed from returning `OperationResult<()>` to returning nothing. This is a breaking change that could affect existing callers expecting a `Result` type.

lib/segment/src/index/field_index/full_text_index/immutable_text_index.rs (1)
47-52: LGTM! Clean pattern matching for postings types.

The pattern matching correctly initializes the appropriate inverted index type based on whether the postings include positions or just IDs.
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)
73-85: Good additions to the Document struct.

The `len()` method and `IntoIterator` implementation enhance the usability of the `Document` struct and enable more idiomatic Rust patterns.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (2)
105-112: Clean implementation of postings type selection.

The conditional logic correctly instantiates the appropriate postings variant based on the `has_positions` flag.

163-194: Well-structured polymorphic handling with helper functions.

The use of the generic `intersection` helper function effectively reduces code duplication while maintaining type safety across different postings variants. This is a good example of DRY principle application.

lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (4)
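The generic `intersection` helper praised for mmap_inverted_index/mod.rs is not shown in this excerpt; as a hypothetical sketch only, the idea of intersecting sorted posting lists could look like this (the real helper is generic over mmap posting readers, not plain slices):

```rust
// Hypothetical sketch: intersect several sorted ID lists.
// Stand-in for the PR's generic `intersection` helper; slices must be sorted
// so `binary_search` is valid.
fn intersection(lists: &[&[u32]]) -> Vec<u32> {
    let Some((first, rest)) = lists.split_first() else {
        return Vec::new();
    };
    first
        .iter()
        .copied()
        // Keep an ID only if every other sorted list also contains it.
        .filter(|id| rest.iter().all(|list| list.binary_search(id).is_ok()))
        .collect()
}
```

Keeping this logic in one generic function is what lets both postings variants share a single filtering path.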
30-47: LGTM! Clean factory methods for different posting types.

The two constructors provide clear, type-safe ways to create empty instances for different posting types.

49-85: Excellent refactoring to support multiple posting types.

The generic inner function elegantly handles both posting types without code duplication. The match statement provides clean dispatch based on the enum variant.

209-241: Well-structured conversion with proper handling of positions.

The implementation correctly:
- Optimizes postings and vocab by filtering empty entries
- Creates the appropriate posting type based on whether positions are available
- Maintains consistency between data structures

336-415: Clean mmap conversion with good error checking.

The implementation correctly handles both posting types and includes helpful debug assertions to verify data consistency. The generic inner function is well-designed to handle different `MmapPostingValue` types.
```rust
fn test_mutable_to_immutable() {
    let mutable = mutable_inverted_index(2000, 400);

    // todo: test with phrase-enabled
```
Complete the phrase-enabled test coverage.
There are multiple TODO comments indicating missing test coverage for the phrase matching feature. This is a critical gap that should be addressed before merging.
Would you like me to help generate the phrase-enabled test cases or open an issue to track this work?
Also applies to: 385-385, 459-459
🤖 Prompt for AI Agents
In lib/segment/src/index/field_index/full_text_index/inverted_index.rs at lines
345, 385, and 459, there are TODO comments indicating missing test coverage for
the phrase-enabled feature. To fix this, write comprehensive test cases that
specifically cover phrase matching functionality, ensuring all edge cases and
typical usage scenarios are tested. This will close the critical test coverage
gap before merging.
Thanks for pointing it out, I have addressed these in the upcoming PR #6493
🧩 Analysis chain
@coszio Thanks for the quick response! I noticed you mentioned PR #6493, but the current PR is #6602. Could you double-check the PR number? I'd be happy to verify that the phrase-enabled test coverage has been properly addressed once I can confirm the correct PR reference.
It's great to hear that you're staying on top of the test coverage - comprehensive testing for the phrase matching feature will definitely strengthen the implementation! 🚀
🏁 Script executed:
#!/bin/bash
# Check if PR #6493 exists in the repository
gh pr view 6493 --json number,title,state 2>/dev/null || echo "PR #6493 not found"
Length of output: 137
@coszio Perfect! I can see that PR #6493 "[phrase matching] populate point_to_doc structures" does indeed exist and is open. That makes complete sense - addressing the phrase matching test coverage in a dedicated PR focused on populating the point_to_doc structures is a logical approach.
Thanks for handling the test coverage systematically across the related PRs. The phrase matching implementation will be much more robust with proper test coverage in place!
Feel free to tag me if you'd like a review on PR #6493 as well. 👍
🧬 Code Graph Analysis Results
The following code snippets are relevant to understanding the file lib/segment/src/index/field_index/full_text_index/inverted_index.rs and the user comment context:
From lib/segment/src/index/field_index/full_text_index/mutable_inverted_index.rs (lines 102-177):
fn get_vocab_mut(&mut self) -> &mut HashMap<String, TokenId> {
&mut self.vocab
}
fn index_tokens(
&mut self,
point_id: PointOffsetType,
tokens: TokenSet,
hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
self.points_count += 1;
let mut hw_cell_wb = hw_counter
.payload_index_io_write_counter()
.write_back_counter();
if self.point_to_tokens.len() <= point_id as usize {
let new_len = point_id as usize + 1;
// Only measure the overhead of `TokenSet` here since we account for the tokens a few lines below.
hw_cell_wb
.incr_delta((new_len - self.point_to_tokens.len()) * size_of::<Option<TokenSet>>());
self.point_to_tokens.resize_with(new_len, Default::default);
}
for token_id in tokens.tokens() {
let token_idx_usize = *token_id as usize;
if self.postings.len() <= token_idx_usize {
let new_len = token_idx_usize + 1;
hw_cell_wb.incr_delta(new_len - self.postings.len());
self.postings.resize_with(new_len, Default::default);
}
self.postings
.get_mut(token_idx_usize)
.expect("posting must exist")
.insert(point_id);
hw_cell_wb.incr_delta(size_of_val(&point_id));
}
self.point_to_tokens[point_id as usize] = Some(tokens);
Ok(())
}
fn index_document(
&mut self,
point_id: PointOffsetType,
ordered_document: Document,
hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
let Some(point_to_doc) = &mut self.point_to_doc else {
// Phrase matching is not enabled
return Ok(());
};
let mut hw_cell_wb = hw_counter
.payload_index_io_write_counter()
.write_back_counter();
// Ensure container has enough capacity
if point_to_doc.len() <= point_id as usize {
let new_len = point_id as usize + 1;
hw_cell_wb.incr_delta((new_len - point_to_doc.len()) * size_of::<Option<Document>>());
point_to_doc.resize_with(new_len, Default::default);
}
// Store the ordered document
point_to_doc[point_id as usize] = Some(ordered_document);
Ok(())
}
fn remove(&mut self, idx: PointOffsetType) -> bool {
if self.point_to_tokens.len() <= idx as usize {
return false; // Already removed or never actually existed
}
let Some(removed_token_set) = self.point_to_tokens[idx as usize].take() else {
return false;
};
self.points_count -= 1;
for removed_token in removed_token_set.tokens() {
// unwrap safety: posting list exists and contains the point idx
let posting = self.postings.get_mut(*removed_token as usize).unwrap();
posting.remove(idx);
}
if let Some(point_to_doc) = &mut self.point_to_doc {
point_to_doc[idx as usize] = None;
}
true
}
fn filter(
&self,
query: ParsedQuery,
_hw_counter: &HardwareCounterCell,
) -> Box<dyn Iterator<Item = PointOffsetType> + '_> {
match query {
ParsedQuery::Tokens(tokens) => self.filter_has_subset(tokens),
}
}
fn get_posting_len(&self, token_id: TokenId, _: &HardwareCounterCell) -> Option<usize> {
self.postings.get(token_id as usize).map(|x| x.len())
}
fn points_count(&self) -> usize {
self.points_count
}
fn vocab_with_postings_len_iter(&self) -> impl Iterator<Item = (&str, usize)> + '_ {
self.vocab.iter().filter_map(|(token, &posting_idx)| {
self.postings
.get(posting_idx as usize)
.map(|postings| (token.as_str(), postings.len()))
})
}
fn values_is_empty(&self, point_id: PointOffsetType) -> bool {
self.get_tokens(point_id).is_none_or(|x| x.is_empty())
}
fn values_count(&self, point_id: PointOffsetType) -> usize {
// Maybe we want number of documents in the future?
self.get_tokens(point_id).map(|x| x.len()).unwrap_or(0)
}
fn get_token_id(&self, token: &str, _hw_counter: &HardwareCounterCell) -> Option<TokenId> {
self.vocab.get(token).copied()
}

From lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (lines 119-198):
fn get_vocab_mut(&mut self) -> &mut HashMap<String, TokenId> {
&mut self.vocab
}
fn index_tokens(
&mut self,
_idx: PointOffsetType,
_tokens: super::inverted_index::TokenSet,
_hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
Err(OperationError::service_error(
"Can't add values to immutable text index",
))
}
fn index_document(
&mut self,
_idx: PointOffsetType,
_document: super::inverted_index::Document,
_hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
Err(OperationError::service_error(
"Can't add values to immutable text index",
))
}
fn remove(&mut self, idx: PointOffsetType) -> bool {
if self.values_is_empty(idx) {
return false; // Already removed or never actually existed
}
self.point_to_tokens_count[idx as usize] = None;
self.points_count -= 1;
true
}
fn filter<'a>(
&'a self,
query: ParsedQuery,
_hw_counter: &'a HardwareCounterCell,
) -> Box<dyn Iterator<Item = PointOffsetType> + 'a> {
match query {
ParsedQuery::Tokens(tokens) => self.filter_has_subset(tokens),
}
}
fn get_posting_len(&self, token_id: TokenId, _: &HardwareCounterCell) -> Option<usize> {
self.postings.posting_len(token_id)
}
fn points_count(&self) -> usize {
self.points_count
}
fn vocab_with_postings_len_iter(&self) -> impl Iterator<Item = (&str, usize)> + '_ {
self.vocab.iter().filter_map(|(token, &token_id)| {
self.postings
.posting_len(token_id)
.map(|len| (token.as_str(), len))
})
}
fn values_is_empty(&self, point_id: PointOffsetType) -> bool {
self.point_to_tokens_count
.get(point_id as usize)
.is_none_or(|count| count.is_none())
}
fn values_count(&self, point_id: PointOffsetType) -> usize {
self.point_to_tokens_count
.get(point_id as usize)
.and_then(|&count| count)
.unwrap_or(0)
}
fn get_token_id(&self, token: &str, _: &HardwareCounterCell) -> Option<TokenId> {
self.vocab.get(token).copied()
}

From lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (lines 273-379):
fn get_vocab_mut(&mut self) -> &mut HashMap<String, TokenId> {
unreachable!("MmapInvertedIndex does not support mutable operations")
}
fn index_tokens(
&mut self,
_idx: PointOffsetType,
_tokens: super::inverted_index::TokenSet,
_hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
Err(OperationError::service_error(
"Can't add values to mmap immutable text index",
))
}
fn index_document(
&mut self,
_idx: PointOffsetType,
_document: super::inverted_index::Document,
_hw_counter: &HardwareCounterCell,
) -> OperationResult<()> {
Err(OperationError::service_error(
"Can't add values to mmap immutable text index",
))
}
fn remove(&mut self, idx: PointOffsetType) -> bool {
let Some(is_deleted) = self.deleted_points.get(idx as usize) else {
return false; // Never existed
};
if is_deleted {
return false; // Already removed
}
self.deleted_points.set(idx as usize, true);
if let Some(count) = self.point_to_tokens_count.get_mut(idx as usize) {
*count = 0;
// `deleted_points`'s length can be larger than `point_to_tokens_count`'s length.
// Only if the index is within bounds of `point_to_tokens_count`, we decrement the active points count.
self.active_points_count -= 1;
}
true
}
fn filter<'a>(
&'a self,
query: ParsedQuery,
hw_counter: &'a HardwareCounterCell,
) -> Box<dyn Iterator<Item = PointOffsetType> + 'a> {
match query {
ParsedQuery::Tokens(tokens) => self.filter_has_subset(tokens, hw_counter),
}
}
fn get_posting_len(
&self,
token_id: TokenId,
hw_counter: &HardwareCounterCell,
) -> Option<usize> {
self.postings.posting_len(token_id, hw_counter)
}
fn points_count(&self) -> usize {
self.active_points_count
}
fn vocab_with_postings_len_iter(&self) -> impl Iterator<Item = (&str, usize)> + '_ {
let hw_counter = HardwareCounterCell::disposable(); // No propagation needed here because this function is only used for building HNSW index.
self.iter_vocab().filter_map(move |(token, &token_id)| {
self.postings
.posting_len(token_id, &hw_counter)
.map(|posting_len| (token, posting_len))
})
}
fn values_is_empty(&self, point_id: PointOffsetType) -> bool {
if self.deleted_points.get(point_id as usize).unwrap_or(true) {
return true;
}
self.point_to_tokens_count
.get(point_id as usize)
.map(|count| *count == 0)
// if the point does not exist, it is considered empty
.unwrap_or(true)
}
fn values_count(&self, point_id: PointOffsetType) -> usize {
if self.deleted_points.get(point_id as usize).unwrap_or(true) {
return 0;
}
self.point_to_tokens_count
.get(point_id as usize)
.copied()
// if the point does not exist, it is considered empty
.unwrap_or(0)
}
fn get_token_id(&self, token: &str, hw_counter: &HardwareCounterCell) -> Option<TokenId> {
if self.is_on_disk {
hw_counter.payload_index_io_read_counter().incr_delta(
READ_ENTRY_OVERHEAD + size_of::<TokenId>(), // Avoid check overhead and assume token is always read
);
}
self.vocab
.get(token)
.ok()
.flatten()
.and_then(<[TokenId]>::first)
.copied()
}

From lib/segment/src/index/field_index/full_text_index/inverted_index.rs (the main file):
- The file defines the `TokenSet` struct, which holds a sorted unique vector of token IDs with methods like `contains` and `has_subset`, and conversions from sets or iterators.
- The `Document` struct holds tokens in order, allowing duplicates.
- The `ParsedQuery` enum currently supports the `Tokens(TokenSet)` variant, with a method `check_match` to verify whether a token set matches the query.
- The `InvertedIndex` trait defines methods for managing vocabulary, indexing tokens and documents, removing points, filtering by queries, estimating cardinality, and other utility methods.
- The trait provides default implementations for `register_tokens` (mapping string tokens to token IDs, adding new tokens to the vocabulary), `estimate_cardinality` (delegates to `estimate_has_subset_cardinality`), and `payload_blocks` (generates payload block conditions for tokens with postings length above a threshold).
- The trait methods include:
  - `get_vocab_mut` to get the mutable vocabulary map.
  - `index_tokens` and `index_document` to index tokens or documents.
  - `remove` to remove a point.
  - `filter` to filter points by a parsed query.
  - `get_posting_len` to get the posting list length for a token.
  - `estimate_cardinality` and `estimate_has_subset_cardinality` for cardinality estimation.
  - `vocab_with_postings_len_iter` to iterate over vocabulary tokens with their postings length.
  - `check_match` to check if a point matches a parsed query.
  - `values_is_empty` and `values_count` to check if a point has values and count them.
  - `points_count` to get the total points count.
  - `get_token_id` to get the token ID for a string token.
- The file also contains extensive tests for mutable, immutable, and mmap inverted indexes, checking consistency and congruence of filtering and indexing.
These snippets provide the full context of the InvertedIndex trait and its implementations for mutable, immutable, and mmap-backed inverted indexes, including methods for indexing, filtering, removing, and cardinality estimation, which are central to the file inverted_index.rs. They also show how token sets and documents are represented and how queries are parsed and matched.
This context is relevant to understanding the code in inverted_index.rs and any changes or fixes mentioned in the user comment regarding PR #6493.
* enumify `.positions` field in immutable and mmap inverted index
* review fixes

Builds on top of #6619

Enables positions in postings by introducing two new enums:

- `ImmutablePostings` for the immutable index.
- `MmapPostingsEnum` for the mmap index.
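A hedged sketch of how the variant could be selected from the index config, following the `config.phrase_matching == Some(true)` pattern quoted in the review above. The enum body and the `select_postings` function are illustrative stand-ins, not the exact code from the PR.

```rust
// Illustrative variant selection: positions are only stored when phrase
// matching is explicitly enabled in the index config.
#[derive(Debug, PartialEq)]
enum MmapPostingsEnum {
    Ids,
    WithPositions,
}

fn select_postings(phrase_matching: Option<bool>) -> MmapPostingsEnum {
    // `None` and `Some(false)` both mean phrase matching is off.
    let has_positions = phrase_matching == Some(true);
    if has_positions {
        MmapPostingsEnum::WithPositions
    } else {
        MmapPostingsEnum::Ids
    }
}
```

Using `Option<bool>` here mirrors the new `phrase_matching` field, so existing configs without the field keep the IDs-only behavior.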