Graph links serializer: avoid large intermediate allocations#7118
Conversation
📝 WalkthroughWalkthroughIntroduces a generic atomic_save API and refactors binary/JSON saves to use it. Replaces GraphLinksSerializer with a streaming writer-based serialize_graph_links function and removes the GraphLinksSerializer type. Adds GraphLinks::new_from_edges and GraphLinks::as_bytes, updates GraphLayers and GraphLayersBuilder to construct and persist links via edge vectors and serialize_graph_links (using atomic_save for on-disk writes). Serializer module rewritten to stream serialized graph data to a Write+Seek writer. Tests and imports adjusted to the new APIs. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/segment/src/index/hnsw_index/graph_links.rs (1)
231-233: Bug: on_disk() returns true for RAM-backed data (inverted condition).This will confuse any callers relying on on_disk() to know if data is memory-mapped. It should report true when the backing is Mmap.
Apply this diff:
- pub fn on_disk(&self) -> bool { - matches!(self.borrow_owner(), GraphLinksEnum::Ram(_)) - } + pub fn on_disk(&self) -> bool { + matches!(self.borrow_owner(), GraphLinksEnum::Mmap(_)) + }
🧹 Nitpick comments (5)
lib/common/io/src/file_operations.rs (1)
31-41: Optional: consider a “sync directory” option for stronger crash-safety guarantees.On some filesystems, durability of the rename may require syncing the parent directory. atomicwrites handles the temp->final rename, but if you need “survives power loss” guarantees for these files, an additional (optional) fsync of the parent directory can be exposed by atomic_save (or a variant).
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
234-239: Future improvement: stream directly from LinksContainer → writer to avoid building edges Vec.links_layers_to_edges materializes Vec<Vec<Vec>> before writing. To squeeze peak RAM further for huge graphs, you could stream per-point/per-level from the RwLocks into serialize_graph_links (e.g., an iterator-based adapter that yields slices without forming a nested Vec). Not a blocker for this PR.
lib/segment/src/index/hnsw_index/graph_links.rs (1)
268-271: Doc nit: reference to removed GraphLinksSerializer.The comment mentions GraphLinksSerializer::new, which no longer exists. Point to serialize_graph_links instead.
Apply this diff:
- /// Convert the graph links to a vector of edges, suitable for passing into - /// [`GraphLinksSerializer::new`] or using in tests. + /// Convert the graph links to a vector of edges, suitable for passing into + /// [`serialize_graph_links`] or using in tests.lib/segment/src/index/hnsw_index/graph_layers.rs (1)
374-377: Optional: consider syncing the compressed file on conversion.Currently sync=false in convert_to_compressed. If conversion happens in contexts where crash-consistency matters (even though not production as per comment), setting sync=true would be safer before deleting the source.
- atomic_save(&compressed_path, false, |writer| { + atomic_save(&compressed_path, true, |writer| {lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
102-114: Minor: reuse pre-checkedvector_layoutwithout unwrap in loop.You already computed vector_layout above. Capture it once and reuse to avoid repeated unwraps. Micro, but removes a branch in the hot path.
- GraphLinksFormatParam::CompressedWithVectors(vectors) => { - // Unwrap safety: `vector_layout` is `Some` for `CompressedWithVectors`. - let vector_layout = vector_layout.unwrap(); + GraphLinksFormatParam::CompressedWithVectors(vectors) => { + let vector_layout = vector_layout.expect("pre-checked at function start");
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
lib/common/io/src/file_operations.rs(1 hunks)lib/segment/src/index/hnsw_index/graph_layers.rs(4 hunks)lib/segment/src/index/hnsw_index/graph_layers_builder.rs(4 hunks)lib/segment/src/index/hnsw_index/graph_links.rs(6 hunks)lib/segment/src/index/hnsw_index/graph_links/serializer.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-06T09:56:59.311Z
Learnt from: xzfc
PR: qdrant/qdrant#6982
File: lib/segment/src/index/hnsw_index/graph_links/view.rs:217-220
Timestamp: 2025-08-06T09:56:59.311Z
Learning: In Qdrant's GraphLinksView implementation (lib/segment/src/index/hnsw_index/graph_links/view.rs), methods like links() and links_with_vectors() intentionally use unwrap() and can panic for performance reasons, maintaining consistency across similar methods. Error handling improvements are considered secondary to feature completion and require benchmarking before implementation.
Applied to files:
lib/segment/src/index/hnsw_index/graph_layers_builder.rslib/segment/src/index/hnsw_index/graph_links.rslib/segment/src/index/hnsw_index/graph_links/serializer.rslib/segment/src/index/hnsw_index/graph_layers.rs
🧬 Code graph analysis (1)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (2)
lib/common/common/src/bitpacking.rs (1)
packed_bits(190-192)lib/common/common/src/bitpacking_links.rs (1)
pack_links(23-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: storage-compat-test
- GitHub Check: e2e-tests
- GitHub Check: lint
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
🔇 Additional comments (7)
lib/common/io/src/file_operations.rs (1)
10-29: Atomic writer API looks solid (flush + optional fsync + precise error mapping).
- Wrapping the provided file in BufWriter, calling into_inner() to flush, and optionally syncing the file is correct.
- Mapping atomicwrites::Error::{Internal, User} into the caller’s E is neat and keeps the API generic.
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
193-203: Nice: streaming serializer through atomic_save eliminates the large neighbors Vec.This directly addresses the OOM scenario in CompressedWithVectors by removing the big intermediate in-memory neighbors section. The on-disk branch writes once and mmaps back; the RAM branch builds once and reuses as_bytes() to persist. Good balance of memory and I/O.
lib/segment/src/index/hnsw_index/graph_links.rs (2)
199-211: Good addition: new_from_edges() + as_bytes() simplify RAM construction and persistence.
- Using a Cursor to stream into RAM mirrors the on-disk path and keeps a single source of truth.
- Exposing as_bytes() is pragmatic for persistence and testing.
Also applies to: 217-220
231-233: No external calls toGraphLinks::on_disk()detected
A repo-wide grep foron_disk(turned up only the method’s own definition ingraph_links.rsand no other call sites. Since no downstream code invokes this (inverted) method, correcting its logic poses no risk of breaking existing functionality.lib/segment/src/index/hnsw_index/graph_layers.rs (2)
374-377: Conversion path now streams: good change.Switching to atomic_save + serialize_graph_links avoids building a large intermediate buffer during “plain → compressed” conversion.
398-407: Test-only compress_ram: dummy swap is a clean self_cell workaround.Swapping in a zero-sized RAM GraphLinks to satisfy self_cell’s ownership guarantees is fine for tests and keeps the code simple.
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
63-73: Ensure consistent little-endian encoding for level offsetsThe current serializer writes each
u64offset withwriter.write_all(total_offsets_len.as_bytes())?;which emits native-endian bytes, whereas all header fields use explicit little-endian via
LittleU64. The reader’sread_level_offsetsthen doesget_slice::<u64>(bytes, levels_count)?which assumes host endianness. This will only work correctly on little-endian platforms.
Please reconcile the on-disk spec:
• Option A (portable):
– In serializer.rs (around “// 2. Write level offsets”), replace
rust writer.write_all(total_offsets_len.as_bytes())?;
with
rust writer.write_all(LittleU64::new(total_offsets_len).as_bytes())?;
– In view.rs’sread_level_offsets, read the offsets asLittleU64(e.g. viaget_slice::<LittleU64>) and convert tou64so every stored offset is defined as little-endian.• Option B (documented host constraint):
– Clearly document in the GraphLinks file format that all multi-byte integers, including level offsets, are little-endian and that only little-endian platforms are supported; leave the native-endian writes as-is.Let me know which approach matches the file-format requirements so we can make the change (or add the documentation).
0b33387 to
e8d87ac
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
50-54: Guard against empty per-point levels; point.len() - 1 will underflow.If any point has zero levels, indexing will panic. Even if current producers create at least one level, the serializer should defend against malformed inputs (especially now that it’s a public API).
Apply near here:
- let levels_count = back_index.first().map_or(0, |&id| edges[id].len()); - let mut point_count_by_level = vec![0; levels_count]; - for point in &edges { - point_count_by_level[point.len() - 1] += 1; - } + let levels_count = back_index.first().map_or(0, |&id| edges[id].len()); + if !edges.is_empty() && edges.iter().any(|levels| levels.is_empty()) { + return Err(OperationError::service_error( + "Each point must have at least one level", + )); + } + let mut point_count_by_level = vec![0; levels_count]; + for point in &edges { + // Safe due to the check above. + point_count_by_level[point.len() - 1] += 1; + }
🧹 Nitpick comments (4)
lib/common/io/src/file_operations.rs (2)
28-34: Minor: Keep the closure style consistent (wrap inner writes in Ok(...)?).For consistency with Lines [29] and [33], prefer the
Ok(inner_call()?)pattern across all call sites. This also encourages callers to return a unified error type when useful.Apply in callers as shown in graph_layers_builder.rs below.
74-84: From<atomicwrites::Error> for Error is correct, but currently unused by atomic_save.atomic_save maps errors manually (Lines [22-25]). Keeping this impl is fine for other sites that may want
?onatomicwrites::Error<E>, but it’s redundant for atomic_save itself.lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)
193-203: Unify error type in atomic_save write closures; else-branch currently infers io::Error.In the on-disk branch (Lines [196-199]) the closure returns OperationResult, so E=OperationError. In the in-RAM branch (Lines [202-203]) the closure returns io::Result, so E=io::Error, relying on Fromio::Error for OperationError at the call site. Make both branches return OperationResult for consistency and clearer error flows.
Apply:
- links = GraphLinks::new_from_edges(edges, format_param, self.hnsw_m)?; - atomic_save(&links_path, |writer| writer.write_all(links.as_bytes()))?; + links = GraphLinks::new_from_edges(edges, format_param, self.hnsw_m)?; + atomic_save(&links_path, |writer| Ok(writer.write_all(links.as_bytes())?))?;
234-239: links_layers_to_edges: clear and minimal copies; consider reserving levels Vecs.You might shave small reallocs by pre-allocating per-point levels when known, but current form is fine.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
lib/common/io/src/file_operations.rs(1 hunks)lib/segment/src/index/hnsw_index/graph_layers.rs(4 hunks)lib/segment/src/index/hnsw_index/graph_layers_builder.rs(4 hunks)lib/segment/src/index/hnsw_index/graph_links.rs(9 hunks)lib/segment/src/index/hnsw_index/graph_links/serializer.rs(1 hunks)lib/segment/src/index/hnsw_index/graph_links/view.rs(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- lib/segment/src/index/hnsw_index/graph_links/view.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/index/hnsw_index/graph_layers.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-06T09:56:59.311Z
Learnt from: xzfc
PR: qdrant/qdrant#6982
File: lib/segment/src/index/hnsw_index/graph_links/view.rs:217-220
Timestamp: 2025-08-06T09:56:59.311Z
Learning: In Qdrant's GraphLinksView implementation (lib/segment/src/index/hnsw_index/graph_links/view.rs), methods like links() and links_with_vectors() intentionally use unwrap() and can panic for performance reasons, maintaining consistency across similar methods. Error handling improvements are considered secondary to feature completion and require benchmarking before implementation.
Applied to files:
lib/segment/src/index/hnsw_index/graph_layers_builder.rslib/segment/src/index/hnsw_index/graph_links/serializer.rslib/segment/src/index/hnsw_index/graph_links.rs
📚 Learning: 2025-06-14T20:35:10.603Z
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.
Applied to files:
lib/segment/src/index/hnsw_index/graph_links.rs
🧬 Code graph analysis (3)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (3)
lib/common/io/src/file_operations.rs (2)
atomic_save(10-26)atomic_save_bin(28-30)lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
serialize_graph_links(22-214)lib/segment/src/index/hnsw_index/graph_links.rs (4)
links(249-251)on_disk(231-233)load_from_file(187-197)new_from_edges(199-211)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (3)
lib/common/common/src/bitpacking.rs (1)
packed_bits(190-192)lib/common/common/src/bitpacking_links.rs (1)
pack_links(23-66)lib/segment/src/index/hnsw_index/graph_links.rs (3)
vector_layout(74-74)vector_layout(82-84)vector_layout(93-99)
lib/segment/src/index/hnsw_index/graph_links.rs (4)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
serialize_graph_links(22-214)lib/segment/src/index/hnsw_index/graph_links/view.rs (2)
load(103-112)links(210-241)lib/segment/src/index/hnsw_index/mod.rs (1)
new(30-32)lib/common/io/src/file_operations.rs (1)
atomic_save(10-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: lint
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: e2e-tests
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
🔇 Additional comments (14)
lib/common/io/src/file_operations.rs (3)
10-26: Generic atomic_save API looks solid; BufWriter<&mut File> enables Seek for rewrites.The design works for streaming writers (e.g., serialize_graph_links) that need Seek to rewrite headers. Explicit flush on Line [19] is good to ensure bytes hit disk before AtomicFile commit.
86-90: Correct conversion from bincode::Error to Error.De-referencing the boxed ErrorKind avoids extra allocation and matches your enum layout.
92-101: Io back-conversion is helpful for interop.Mapping non-io variants via
io::Error::otheris pragmatic and keeps layering simple.lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
223-232: Test-only path mirrors the main path; LGTM.The RAM-only builder keeps symmetry with production code by going through new_from_edges().
lib/segment/src/index/hnsw_index/graph_links.rs (6)
177-183: as_bytes plumbing is correct and avoids extra copies.Exposing the slice for both RAM and Mmap via a single method simplifies persistence call sites.
186-197: load_from_file now delegates to GraphLinksView::load with bytes; good simplification.Using a common loader minimizes code paths and keeps view logic centralized.
199-211: new_from_edges: streaming through a Cursor is correct; shrink_to_fit helps peak memory.This aligns RAM construction with the streaming serializer and reduces over-allocation.
267-271: to_edges docs correctly reference serialize_graph_links; API is coherent.Keeps tests and round-trips aligned with the new serializer.
380-406: Streaming save/load test validates headers rewrite and offsets path; nice coverage.Good to see both Plain/Compressed/WithVectors and both residency modes exercised.
421-425: Tests exercising new_from_edges cover edge shapes well.Diverse shapes help guard against layout regressions.
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (4)
84-88: Seek-before-neighbor padding with BufWriter: OK (Seek impl flushes).BufWriter’s Seek impl flushes before repositioning; using stream_position and write_zeros here is fine.
102-147: CompressedWithVectors path correctly preserves link order for vectors; alignment handled.Varint length, compressed links, alignment padding, then vectors in raw_links order matches the spec. Good reuse of links_buf and explicit offset tracking.
151-165: Offsets encoding: plain vs bitpacked split is correct; padding for u64 alignment respected.This maintains existing on-disk compatibility expectations.
167-211: Header rewrite covers all formats; unwraps are safe in their match arms.The version fields and layout metadata look consistent, including u8 alignment conversion with explicit expect.
| GraphLinks::load_from_file(&links_path, true, format_param.as_format())? | ||
| let edges = Self::links_layers_to_edges(self.links_layers); | ||
| let links; | ||
| if on_disk { |
There was a problem hiding this comment.
it feels like on_disk param requires a comment.
There was a problem hiding this comment.
Added a comment.
Also, @IvanPleshkov noted that we don't need HNSW links after construction. So, probably this part will be gone in this or subsequent PRs.
There was a problem hiding this comment.
I propose a separate PR to make it more readable for reviewers and avoid blocking these changes to merge
IvanPleshkov
left a comment
There was a problem hiding this comment.
I don't have any complains, just a couple of questions.
I find new serializing more readable than before, cool
e8d87ac to
d1a5690
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/segment/src/index/hnsw_index/graph_links.rs (1)
231-233: Bug: on_disk() returns true for RAM-backed data (inverted logic).This misreports residency and may affect caching/prefetch decisions. This was noted previously and acknowledged as pre-existing; still flagging because the method lives in this file and is easy to fix.
Apply:
pub fn on_disk(&self) -> bool { - matches!(self.borrow_owner(), GraphLinksEnum::Ram(_)) + matches!(self.borrow_owner(), GraphLinksEnum::Mmap(_)) }If you prefer removal instead (as discussed earlier), I can open a follow-up issue.
🧹 Nitpick comments (3)
lib/segment/src/index/hnsw_index/graph_links.rs (1)
199-211: Streaming in-memory constructor looks good; minor nit on shrink_to_fit placement.The RAM constructor via Cursor + serializer is clean and aligns with the streaming design. Minor nit: shrink_to_fit() happens before GraphLinksView::load() borrows the bytes; that’s fine, but calling it after try_new would avoid resizing the Vec prior to validation. Low impact either way.
- let mut bytes = cursor.into_inner(); - bytes.shrink_to_fit(); - Self::try_new(GraphLinksEnum::Ram(bytes), |x| { + let mut bytes = cursor.into_inner(); + let graph = Self::try_new(GraphLinksEnum::Ram(bytes), |x| { GraphLinksView::load(x.as_bytes(), format_param.as_format()) - }) + })?; + // Now that the view is built, shrink if you still want to reduce capacity. + if let GraphLinksEnum::Ram(ref mut buf) = self_cell::Owner::borrow_owner(&graph) { + buf.shrink_to_fit(); + } + Ok(graph)If you prefer to keep it as-is for simplicity, ignore. The functional behavior is correct.
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (2)
84-88: Avoid truncating large positions when computing neighbors padding.Casting
stream_position()(u64) tousizecan truncate on 32-bit targets and hides potential overflows. Compute padding in u64 and downcast once.- let pos = writer.stream_position()? as usize; - writer.write_zeros(pos.next_multiple_of(vector_layout.align()) - pos)?; + let pos = writer.stream_position()?; + let align = vector_layout.align() as u64; + let padding = ((pos + align - 1) / align) * align - pos; + writer.write_zeros(usize::try_from(padding).expect("padding fits into usize"))?;
151-159: Same truncation concern for offsets padding.Use u64 math for alignment and only convert to usize at write time.
- let len = writer.stream_position()? as usize; - let offsets_padding = len.next_multiple_of(size_of::<u64>()) - len; - writer.write_zeros(offsets_padding)?; + let len = writer.stream_position()?; + let align = size_of::<u64>() as u64; + let offsets_padding = ((len + align - 1) / align) * align - len; + writer.write_zeros(usize::try_from(offsets_padding).expect("padding fits into usize"))?;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
lib/segment/src/index/hnsw_index/graph_layers.rs(4 hunks)lib/segment/src/index/hnsw_index/graph_layers_builder.rs(4 hunks)lib/segment/src/index/hnsw_index/graph_links.rs(9 hunks)lib/segment/src/index/hnsw_index/graph_links/serializer.rs(1 hunks)lib/segment/src/index/hnsw_index/graph_links/view.rs(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- lib/segment/src/index/hnsw_index/graph_links/view.rs
- lib/segment/src/index/hnsw_index/graph_layers_builder.rs
- lib/segment/src/index/hnsw_index/graph_layers.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-06T09:56:59.311Z
Learnt from: xzfc
PR: qdrant/qdrant#6982
File: lib/segment/src/index/hnsw_index/graph_links/view.rs:217-220
Timestamp: 2025-08-06T09:56:59.311Z
Learning: In Qdrant's GraphLinksView implementation (lib/segment/src/index/hnsw_index/graph_links/view.rs), methods like links() and links_with_vectors() intentionally use unwrap() and can panic for performance reasons, maintaining consistency across similar methods. Error handling improvements are considered secondary to feature completion and require benchmarking before implementation.
Applied to files:
lib/segment/src/index/hnsw_index/graph_links/serializer.rslib/segment/src/index/hnsw_index/graph_links.rs
📚 Learning: 2025-06-14T20:35:10.603Z
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.
Applied to files:
lib/segment/src/index/hnsw_index/graph_links.rs
🧬 Code graph analysis (2)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (3)
lib/common/common/src/bitpacking.rs (1)
packed_bits(190-192)lib/common/common/src/bitpacking_links.rs (1)
pack_links(23-66)lib/segment/src/index/hnsw_index/graph_links.rs (3)
vector_layout(74-74)vector_layout(82-84)vector_layout(93-99)
lib/segment/src/index/hnsw_index/graph_links.rs (3)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
serialize_graph_links(22-214)lib/segment/src/index/hnsw_index/graph_links/view.rs (2)
load(103-112)links(210-241)lib/common/io/src/file_operations.rs (1)
atomic_save(10-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: e2e-tests
- GitHub Check: test-consistency
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: integration-tests
🔇 Additional comments (5)
lib/segment/src/index/hnsw_index/graph_links.rs (2)
217-220: Public as_bytes() is a useful addition.Providing a unified byte view over RAM/Mmap simplifies downstream consumers and testing. No issues spotted.
380-406: Tests: atomic_save + streaming path LGTM.The test mirrors the new on-disk path (atomic_save + serialize_graph_links) and validates both on-disk and RAM loading. Good coverage across formats.
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (3)
22-31: API shift to a streaming serializer is solid.Write+Seek requirement is justified by the header backpatch. Signature is minimal and composes well with in-memory (Cursor) and on-disk (BufWriter) use.
118-126: Accounting for varint length inoffsetis correct.Including
VarInt::required_space(...)ensures offsets for CompressedWithVectors reflect the actual byte footprint. Good catch; this addresses a prior question about “u64::required_space” in dev.
167-211: Header backpatching is consistent across formats.Versioning, neighbor/offset counts, and vector metadata look correct; using LittleU64 for the compressed headers is consistent with readers. No issues spotted.
* Introduce io::file_operations::atomic_save * Replace GraphLinksSerializer with serialize_graph_links
Problem
The current
GraphLinksSerializerflow consists of two steps:new()method serializes each section (reindex,neighbors,offsets, etc.) into its ownVec.serialize_to_writer()writes these sections in sequence.The drawback of this flow is that the
neighborssection for theCompressedWithVectorsformat can consume a lot of RAM. During testing, I had the process killed by the OOM killer during graph conversion.Solution
This PR replaces the
GraphLinksSerializerstruct and its methods with a new function,serialize_graph_links, which serializes to the writer directly without an intermediate in-memory representation.Details
GraphLinksSerializer::to_graph_links_ramdid proper total size calculation forVec::with_capacity(). In this PR, this logic is dropped (otherwise it would require two compression passes: the first time to get length, the second time to write bytes). This is compensated by calling.shrink_to_fit()afterwards, and while this is still a bit sloppy, I expect that the overall amount of reallocations/peak memory usage is not worse than before.GraphLinksSerializer::save_as()logic is moved to a new general-purpose function,io::file_operations::atomic_save(), which can be combined withserialize_graph_links().GraphLinksSerializer::to_graph_links_ram()logic is replaced withGraphLinks::new_from_edges().