Full text index with generic posting list#6565
```rust
let smallest_posting_idx = postings
    .iter()
    .enumerate()
    .min_by_key(|(_idx, posting)| posting.len())
    .map(|(idx, _posting)| idx)
    .unwrap();
```
Unrelated to this PR, but have we considered sorting all posting lists, so that the smallest posting lists are checked first?
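The idea behind this suggestion can be sketched as follows. This is a stdlib-only illustration, assuming postings are plain sorted `Vec<u32>` id lists rather than the crate's actual posting list types; `order_for_intersection` and `intersect` are hypothetical names:

```rust
// Sort posting lists so the smallest one drives the intersection: it bounds
// the number of candidate ids, and each larger list is only probed for those.
fn order_for_intersection(mut postings: Vec<Vec<u32>>) -> Vec<Vec<u32>> {
    postings.sort_unstable_by_key(|p| p.len());
    postings
}

// Intersect sorted id lists, iterating the smallest and probing the rest.
fn intersect(postings: &[Vec<u32>]) -> Vec<u32> {
    let (smallest, rest) = match postings.split_first() {
        Some(split) => split,
        None => return Vec::new(),
    };
    smallest
        .iter()
        .copied()
        .filter(|id| rest.iter().all(|p| p.binary_search(id).is_ok()))
        .collect()
}

fn main() {
    let postings = order_for_intersection(vec![
        vec![1, 2, 3, 4, 5, 6],
        vec![2, 4],
        vec![2, 3, 4, 5],
    ]);
    assert_eq!(intersect(&postings), vec![2, 4]);
    println!("{:?}", intersect(&postings));
}
```

With the smallest list first, the filter runs once per candidate in the shortest list instead of once per element of an arbitrary list.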
improve iterator intersection
This file is a copy of mmap_postings.rs before the changes in this PR.
This file, along with the entire compressed_posting module can safely be deleted later. For now we keep them to ensure compatibility of file format
📝 Walkthrough

This change refactors the posting list implementation in the full-text index system by introducing a generic, zero-copy, little-endian-compliant posting list structure built on the zerocopy crate.
Actionable comments posted: 9
🔭 Outside diff range comments (1)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
1-17: ⚠️ Potential issue: missing `size_of` import mirrors the issue in the legacy file.

`size_of` is referenced dozens of times but never brought into scope.

```diff
 use std::io::Write;
 use std::path::PathBuf;
+use std::mem::size_of; // ← add
```
🧹 Nitpick comments (8)
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)
51-59: Consider using test assertion libraries for better error messages.

While the current assertion is functional, consider using a testing library like `pretty_assertions` for more readable output when assertions fail, especially when comparing complex data structures.

```diff
-assert!(
-    id_model == id_old && id_model == elem_new.id,
-    "Mismatch at token_id {token_id}: offset: {offset}, old: {id_old}, new: {}, model: {id_model}",
-    elem_new.id
-);
+use pretty_assertions::assert_eq;
+assert_eq!(id_model, id_old, "Model vs Old mismatch at token_id {token_id}, offset: {offset}");
+assert_eq!(id_model, elem_new.id, "Model vs New mismatch at token_id {token_id}, offset: {offset}");
```

lib/posting_list/src/visitor.rs (1)
159-163: Unnecessary temporary when re-emitting the remainder element.

You allocate `id` and then immediately call `id.get()`; the binding is redundant.

```diff
-let id = e.id;
-…
-PostingElement { id: id.get(), value }
+PostingElement {
+    id: e.id.get(),
+    value,
+}
```

lib/posting_list/src/iterator.rs (1)
70-77: Avoid double-cloning `PostingElement` in `next()`.

`next()` currently clones the element once to feed `current_elem` and again to hand the value to the caller (`Option::inspect` + `clone`). For `H::Value` types larger than zero bytes this is unnecessary churn. A lighter variant keeps a reference or performs a single move:

```diff
-let next_opt = self.visitor.get_by_offset(self.offset).inspect(|_| {
-    self.offset += 1;
-});
-self.current_elem = next_opt.clone();
-next_opt
+let next_opt = self.visitor.get_by_offset(self.offset);
+if let Some(ref elem) = next_opt {
+    self.offset += 1;
+    self.current_elem = Some(elem.clone());
+}
+next_opt
```

This preserves semantics while eliminating one clone per element.
lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (1)

24-30: Rename the `filter` parameter for clarity.

Inside the iterator pipeline the parameter `filter` is invoked inside a `.filter(...)` adaptor, which easily leads to "filter inside filter" mental gymnastics. Consider renaming the argument to something like `doc_filter` or `pre_filter` to make the intent crystal clear.

```diff
 fn intersect_compressed_postings_iterator<'a>(
     mut postings: Vec<IdsPostingListView<'a>>,
-    filter: impl Fn(PointOffsetType) -> bool + 'a,
+    doc_filter: impl Fn(PointOffsetType) -> bool + 'a,
```

…and update the usage accordingly.
lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (1)

164-173: Builder loop could be replaced by a zero-copy or bulk constructor.

The conversion allocates a new `PostingBuilder` for every posting and pushes IDs one by one:

```rust
for id in posting.iter() {
    builder.add_id(id);
}
```

If `PostingBuilder` (or `IdsPostingList`) already provides a `from_iter`, `extend`, or zero-copy `from_sorted_slice` API, using it will:
- Avoid the per-ID method dispatch overhead.
- Allow the builder to pre-allocate the exact capacity, reducing reallocations.
- Shorten the conversion code.
Consider exposing such an API in `posting_list` and using it here.

lib/posting_list/src/posting_list.rs (1)
105-110: `iter()` forces `Clone` unnecessarily.

Requiring `H::Value: Clone` prevents iterating over lists whose values are non-cloneable (e.g., `Box<dyn …>`). Consider yielding `&H::Value` (or `Cow`) from `PostingIterator` so that cloning is not mandatory. This widens API usability without performance drawbacks for `Copy` types.

lib/posting_list/src/view.rs (2)
68-78: `IntoIterator` imposes a `Clone` bound on values: is it necessary?

Requiring `H::Value: Clone` propagates to every caller that wants to iterate over a `PostingListView`, even if they only need borrowed data. If the value type is large (e.g. a weighted payload), the extra `Clone` can be costly or impossible. If the iterator already yields references (or `Copy` types), you can drop the bound and avoid the constraint:

```diff
-impl<'a, H: ValueHandler> IntoIterator for PostingListView<'a, H>
-where
-    H::Value: Clone,
+impl<'a, H: ValueHandler> IntoIterator for PostingListView<'a, H>
```
85-96: `to_owned` performs full allocations; consider zero-copy alternatives.

Cloning `id_data`, `var_size_data`, and the entire `chunks` / `remainders` arrays can be expensive for large posting lists. If the only goal is to obtain an owned version that lives longer than the source slice, a cheap `Arc<[u8]>` / `Arc<[T]>` (or `Box<[_]>`) could provide ownership without data duplication. This will reduce peak memory and improve conversion speed for large indexes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (20)
- lib/posting_list/Cargo.toml (1 hunks)
- lib/posting_list/src/builder.rs (4 hunks)
- lib/posting_list/src/iterator.rs (2 hunks)
- lib/posting_list/src/lib.rs (2 hunks)
- lib/posting_list/src/posting_list.rs (4 hunks)
- lib/posting_list/src/tests.rs (1 hunks)
- lib/posting_list/src/value_handler.rs (3 hunks)
- lib/posting_list/src/view.rs (9 hunks)
- lib/posting_list/src/visitor.rs (5 hunks)
- lib/segment/Cargo.toml (1 hunks)
- lib/segment/src/index/field_index/full_text_index/compressed_posting/compressed_chunks_reader.rs (0 hunks)
- lib/segment/src/index/field_index/full_text_index/immutable_inverted_index.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index.rs (2 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (7 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mod.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/posting_list.rs (0 hunks)
- lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (5 hunks)
💤 Files with no reviewable changes (2)
- lib/segment/src/index/field_index/full_text_index/posting_list.rs
- lib/segment/src/index/field_index/full_text_index/compressed_posting/compressed_chunks_reader.rs
🔇 Additional comments (26)
lib/posting_list/Cargo.toml (1)
15-15: Good addition of zerocopy dependency.

Adding the zerocopy crate as a dependency aligns with the PR's goal of implementing a generic, zero-copy posting list structure. This enables stronger memory layout guarantees and safer serialization/deserialization operations.
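To illustrate what the little-endian guarantee buys (zerocopy's `U32` stores bytes in a fixed on-disk order and converts on access), here is a stdlib-only sketch; `read_le_u32` is a hypothetical helper, not part of the crate:

```rust
// mmap'd posting data must read identically on any host endianness, so the
// on-disk representation is fixed little-endian and every access converts
// explicitly, the way zerocopy's `U32::get()` does.
fn read_le_u32(bytes: &[u8], offset: usize) -> u32 {
    let raw: [u8; 4] = bytes[offset..offset + 4].try_into().expect("4 bytes");
    u32::from_le_bytes(raw) // explicit conversion, never a native-endian cast
}

fn main() {
    // Ids serialized in little-endian order, as a posting file would store them.
    let mut buf = Vec::new();
    for id in [7u32, 42, 100_000] {
        buf.extend_from_slice(&id.to_le_bytes());
    }
    assert_eq!(read_le_u32(&buf, 0), 7);
    assert_eq!(read_le_u32(&buf, 4), 42);
    assert_eq!(read_le_u32(&buf, 8), 100_000);
    println!("ok");
}
```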
lib/segment/Cargo.toml (1)
120-120: Appropriate integration of posting_list crate.

Adding the posting_list crate as a dependency is a good approach for integrating the new generic posting list implementation. This dependency enables the segment library to use the improved posting list abstractions and data structures.
lib/segment/src/index/field_index/full_text_index/mod.rs (1)
11-12: Good approach for maintaining backward compatibility.

Adding the `#[cfg(test)]` attribute to the compressed_posting module is a sound strategy during this transition phase. It ensures the old implementation is available for compatibility tests while removing it from production code, aligning with the PR objective of replacing the compressed posting list implementation while maintaining compatibility.

lib/posting_list/src/tests.rs (1)
112-112: Appropriate trait bound addition for testing.

Adding the `std::fmt::Debug` trait bound to `H::Value` in the test function is beneficial for producing meaningful assertion failure messages. This aligns with the broader changes where `Debug` was removed from production code trait bounds but kept for testing purposes.

lib/posting_list/src/value_handler.rs (4)
14-14: Appropriate removal of Debug constraint.

Removing the `Debug` trait bound from the `Value` type makes the `ValueHandler` trait more flexible and less restrictive, which is good practice for generic abstractions.

16-16: Appropriate removal of Debug constraint.

Similar to the `Value` type, removing the `Debug` bound from the `Sized` type increases flexibility while maintaining the essential `Sized + Copy` bounds needed for the type's functionality.

34-34: Good addition of trait derivations.

Adding `#[derive(Debug, Clone, Copy)]` to `SizedHandler` ensures the struct implements these traits, which is appropriate since it's meant to be debuggable and copyable.

54-54: Good addition of trait derivations.

Adding `#[derive(Debug, Clone, Copy)]` to `UnsizedHandler` ensures the struct implements these traits, which is appropriate since it's meant to be debuggable and copyable.

lib/posting_list/src/lib.rs (3)
18-18: Appropriate removal of Debug constraint.

Removing the `Debug` trait bound from `SizedValue` aligns with the changes in value_handler.rs, creating a more flexible trait that focuses only on the essential requirements (`Sized + Copy`).

24-24: Appropriate removal of Debug constraint.

Removing the `Debug` trait bound from `UnsizedValue` makes this trait more flexible and consistent with the changes to other traits in this refactoring.

51-54: Good expansion of public API.

Exporting additional types (`RemainderPosting`, `ValueHandler`, `PostingListComponents`, and `PostingVisitor`) appropriately expands the public API surface to support the new generic posting list architecture. This enables users of the library to work with these abstractions directly.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mod.rs (4)
13-13: Good replacement of ChunkReader import.

Replacing the import of `ChunkReader` with `IdsPostingListView` correctly implements the refactoring goal of using the generic posting list instead of the specialized compressed posting list.

24-28: Good approach for maintaining backward compatibility.

Adding test-only modules for old implementations is a prudent approach for ensuring backward compatibility during this refactoring. This aligns with the PR objective of maintaining compatibility with existing data structures.

137-137: Appropriate return type update.

Changing the return type of `iter_postings` to `impl Iterator<Item = Option<IdsPostingListView<'a>>>` correctly reflects the refactoring to use the generic posting list types.

282-283: Method call updated correctly.

Changing from a method like `.reader()` to `.visitor()` properly adapts to the new API of the generic posting list. The `visitor()` method provides access to the functionality needed to check if a posting list contains a specific point ID.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/tests.rs (1)
13-61: Well-designed compatibility test.

This test effectively verifies compatibility between old and new posting list implementations by:
- Generating random sets of document IDs with various lengths
- Creating posting lists using the old implementation
- Writing them to a file using the old format
- Reading them back using both old and new implementations
- Verifying that both implementations return identical IDs
This comprehensive test is crucial for ensuring backward compatibility during this refactoring.
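The shape of such a round-trip test, reduced to a stdlib-only sketch (the real test exercises the old and new MmapPostings formats on an actual file; `write_ids`, `read_old`, and `read_new` are hypothetical stand-ins):

```rust
// Serialize sorted ids once, decode them through two independent "readers",
// and assert both produce the same sequence.
fn write_ids(ids: &[u32]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(ids.len() * 4);
    for id in ids {
        buf.extend_from_slice(&id.to_le_bytes());
    }
    buf
}

// "Old" reader: explicit index arithmetic per element.
fn read_old(buf: &[u8]) -> Vec<u32> {
    (0..buf.len() / 4)
        .map(|i| u32::from_le_bytes(buf[i * 4..i * 4 + 4].try_into().unwrap()))
        .collect()
}

// "New" reader: chunked iteration over the same bytes.
fn read_new(buf: &[u8]) -> Vec<u32> {
    buf.chunks_exact(4)
        .map(|c| u32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

fn main() {
    let ids: Vec<u32> = (0..1000).map(|i| i * 3).collect(); // sorted, as postings are
    let buf = write_ids(&ids);
    assert_eq!(read_old(&buf), ids);
    assert_eq!(read_new(&buf), ids);
    println!("compatible");
}
```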
lib/segment/src/index/field_index/full_text_index/inverted_index.rs (1)

289-293: Borrow checker issue mirrors the previous one.

`new_posting.iter()` yields `PostingElement`; you then map the `id` and call `orig_posting.contains(point_id)`, which is fine. However the `id` field is a `U32` wrapper in the new code. Call `.get()` to obtain a `PointOffsetType` explicitly and make the intent obvious:

```diff
-.map(|elem| elem.id)
+.map(|elem| elem.id.get())
```

lib/posting_list/src/builder.rs (1)
83-87: Overflow check for the `PointOffsetType → U32` conversion.

`id` is a `PointOffsetType` (aliased to `u32` today) but the codebase has started experimenting with `u64` in places. If that change lands, a silent wrap-around here would corrupt postings. Add an explicit check:

```diff
-remainders.push(RemainderPosting {
-    id: U32::from(id),
+remainders.push(RemainderPosting {
+    id: U32::from(
+        u32::try_from(id).expect("point id must fit into 32 bits")
+    ),
     value,
 });
```

lib/posting_list/src/visitor.rs (3)
33-35: `is_empty` is fine but duplicates `len() == 0`.

The helper is handy; consider delegating to `self.len() == 0` or marking the method `#[inline]` to make the intention explicit. No action required.
50-98: Edge case: `search_greater_or_equal` skips the first remainder.

When `id` is smaller than every element in the posting list (`find_chunk` returns `None`) the remainder search starts immediately. If the list contains chunks, `search_in_remainders` looks into remainders only, so the very first element (initial id of chunk 0) is ignored. Consider short-circuiting before the remainder search:

```rust
// after `find_chunk`
if chunk_index.is_none() {
    // the first element in the list is the smallest id
    return Some(0);
}
```

Otherwise a query for an id below the posting list range yields `None` instead of offset 0.
167-170: Extra `Clone` bound correctly added.

The iterator now yields cloned values; the new bound is required and properly scoped. Looks good.

lib/segment/src/index/field_index/full_text_index/postings_iterator.rs (1)
42-55: Duplicate document IDs can slip through the intersection.

The algorithm emits one value per occurrence in the smallest posting list. If that list contains duplicates, the resulting stream will also contain duplicates even though an intersection is typically expected to behave like a mathematical set. If duplicates are not meaningful for downstream consumers, deduplicate here:

```rust
let and_iter = smallest_posting_iterator
    .map(|elem| elem.id)
    .dedup() // from itertools, or roll your own simple `scan`-based dedup
    .filter(move |id| { /* unchanged */ })
```

Alternatively, guarantee uniqueness earlier when building `IdsPostingList`.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs (1)
1-13: ⚠️ Potential issue: `size_of` is used but never imported, so compilation will fail.

`size_of::<T>()` is referenced multiple times in the file, yet `std::mem::size_of` is not in scope. Rust will not implicitly import it, producing an unresolved-import error during build.

```diff
 use std::io::Write;
 use std::path::PathBuf;
+use std::mem::size_of; // ← add
 // …
```

⛔ Skipped due to learnings:

Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.

lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs (1)
181-187: Risk of silent overflow when `remainder_count` exceeds 255.

`remainder_count` is stored as `u8`, but `remainders.len()` is cast without a check. While current builders keep remainders below 128 (`CHUNK_LEN`), a future change could exceed this and silently wrap. Consider changing the header field to `u16` (and adjust serialization accordingly) or at least assert:

```rust
assert!(remainders.len() <= u8::MAX as usize);
```

lib/posting_list/src/posting_list.rs (1)
112-118: `len()` can over-report when the last chunk is not completely filled.

`len()` assumes every stored chunk contains `CHUNK_LEN` elements:

```rust
self.chunks.len() * CHUNK_LEN + self.remainders.len()
```

If the builder ever stores a partially filled chunk (e.g., for an all-full compression path in the future), this will overestimate. Either document that each chunk is guaranteed full, or compute the actual count by decoding the last chunk's header/offset.
24-30: Type asymmetry for `last_id` between view and components is confusing.

`PostingListView` stores `last_id` as `Option<PointOffsetType>` while `PostingListComponents` uses `Option<U32>`, requiring ad-hoc conversions in both directions. Aligning these types (e.g. using `U32` everywhere) would simplify the API and prevent accidental loss of endianness guarantees.
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/old_mmap_postings.rs
lib/segment/src/index/field_index/full_text_index/mmap_inverted_index/mmap_postings.rs
```diff
 let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
-BitPackerImpl::new().decompress_strictly_sorted(
-    chunk.initial_id.checked_sub(1),
-    &self.id_data[chunk.offset as usize..chunk.offset as usize + compressed_size],
+BitPackerImpl::new().decompress_sorted(
+    chunk.initial_id.get(),
+    &self.id_data
+        [chunk.offset.get() as usize..chunk.offset.get() as usize + compressed_size],
     decompressed_chunk,
     chunk_bits as u8,
 );
```
🛠️ Refactor suggestion
Add overflow / range-check when recovering chunk_bits
chunk_bits is reconstructed via integer arithmetic and then cast to u8.
If the compressed size is corrupted or larger than expected, the computed
value may exceed u8::MAX, leading to silent truncation and UB-adjacent
behaviour inside bitpacking.
Consider guarding the cast in debug builds (or even release) to fail fast:

```diff
 let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+debug_assert!(
+    chunk_bits <= u8::MAX as usize,
+    "bit width {} exceeds u8::MAX – corrupted posting list?",
+    chunk_bits
+);
```
protects against the case where the iterator's current element hasn't been extracted yet
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/posting_list/src/view.rs (1)
143-143: Add overflow protection when calculating chunk_bits.

This calculation can overflow and cause silent truncation if `compressed_size` is corrupted or unexpectedly large, leading to undefined behavior in bitpacking operations. The same overflow issue flagged in previous reviews still exists. Consider adding the suggested range check:

```diff
 let chunk_bits = compressed_size * u8::BITS as usize / CHUNK_LEN;
+debug_assert!(
+    chunk_bits <= u8::MAX as usize,
+    "bit width {} exceeds u8::MAX – corrupted posting list?",
+    chunk_bits
+);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- lib/posting_list/src/iterator.rs (2 hunks)
- lib/posting_list/src/posting_list.rs (4 hunks)
- lib/posting_list/src/tests.rs (9 hunks)
- lib/posting_list/src/view.rs (9 hunks)
- lib/posting_list/src/visitor.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- lib/posting_list/src/tests.rs
- lib/posting_list/src/posting_list.rs
- lib/posting_list/src/iterator.rs
- lib/posting_list/src/visitor.rs
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.482Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
lib/posting_list/src/view.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.482Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
🔇 Additional comments (5)
lib/posting_list/src/view.rs (5)
69-79: IntoIterator implementation looks well-designed.

The `IntoIterator` implementation properly delegates to the visitor pattern and includes appropriate trait bounds requiring `H::Value: Clone`. This maintains consistency with the iterator design.

86-97: Good design choice for to_owned method.

The comment correctly explains why this isn't implemented as the `ToOwned` trait: the borrow relationship would be problematic. The implementation properly clones all components to create an owned `PostingList`.

145-153: Correct usage of zerocopy types in decompress_chunk.

The changes properly use `.get()` methods to access fields from zerocopy types (`U32`), ensuring correct little-endian byte order handling. The slice bounds are calculated correctly.

163-174: Improved type safety in ids_range method.

The return type change to `RangeInclusive<u32>` provides better type safety. The logic correctly handles empty posting lists and properly extracts IDs using `.get()` methods from zerocopy types.

184-219: Consistent zerocopy type usage in search methods.

All ID comparisons now properly use `.get()` to extract values from `U32` types, maintaining consistency with the zerocopy migration. The binary search logic remains correct.
```diff
 /// A non-owning view of [`PostingList`].
-#[derive(Debug, Clone)]
+#[derive(Debug)]
```
💡 Verification agent
❓ Verification inconclusive
Verify impact of removing Clone derive from PostingListView.
The Clone derive has been removed from PostingListView. This is a breaking change that may affect code depending on cloning this type.
🏁 Script executed:

```shell
#!/bin/bash
# Description: Check for code that depends on cloning PostingListView
# Expected: Find any usage of .clone() on PostingListView instances
rg -A 3 -B 3 "PostingListView.*\.clone\(\)"
rg -A 3 -B 3 "\.clone\(\).*PostingListView"
ast-grep --pattern 'let $_ = $var.clone()' | grep -i postinglistview
```

Length of output: 162
No internal .clone() usage found on PostingListView
We ran ripgrep and ast-grep across the repo to locate any .clone() calls on PostingListView—none were found. Removing Clone from PostingListView does not break internal code. However, since this type is part of the public API, this is a breaking change for downstream consumers:
- Bump the crate’s major version.
- Document the removal of `Clone` in the changelog or migration guide.
```diff
 pub struct PostingIterator<'a, H: ValueHandler> {
     visitor: PostingVisitor<'a, H>,
-    current_id: Option<PointOffsetType>,
+    current_elem: Option<PostingElement<H::Value>>,
```
This change now requires `Copy`. What is the reason for this?
So that we can cache the value and avoid going into the posting list if not necessary
```rust
/// Internal structure to store the remainder of the posting list. So that IntoBytes can be derived
#[derive(Clone, Debug, FromBytes, Immutable, IntoBytes, KnownLayout)]
#[repr(C)]
pub struct RemainderPosting<S: Sized> {
```
If this is internal, can it be not pub? Or at least pub(crate)?
It needs to be public because we need to construct a PostingListView from the specific mmap postings implementation. I guess the correct description would be
Structure to store (...) The difference with PostingElement is that this is zero-copy friendly
```rust
#[repr(C)]
pub struct RemainderPosting<S: Sized> {
    pub id: U32,
    pub value: S,
```
probably a nitpick, but why S here and V in PostingElement?
Because this is not necessarily the actual value for unsized values, but rather whatever is used as fixed-sized value (in case of unsized values, it is a u32 pointing into var-sized-data)
* integrate generic posting list in full-text index
* improve intersection fn (improve iterator intersection)
* make old module just for test
* recover old implementation of MmapPostings for tests
* add compatibility test
* fix compression
* clippy
* nits
* check if first id is already greater (protects against the case where the iterator's current element hasn't been extracted yet)
* update current element when there is no greater or equal
* reuse visitor in test
* optimize alignment
* bound checks and refactor is_in_range
* review
* edit comment

Co-authored-by: generall <andrey@vasnetsov.com>
builds on top of #6528
This PR:
todo