Skip to content

Graph links serializer: avoid large intermediate allocations#7118

Merged
xzfc merged 2 commits intodevfrom
serialize_graph_links
Aug 22, 2025
Merged

Graph links serializer: avoid large intermediate allocations#7118
xzfc merged 2 commits intodevfrom
serialize_graph_links

Conversation

@xzfc
Copy link
Member

@xzfc xzfc commented Aug 22, 2025

Problem

The current GraphLinksSerializer flow consists of two steps:

  1. The new() method serializes each section (reindex, neighbors, offsets, etc.) into its own Vec.
  2. Then serialize_to_writer() writes these sections in sequence.

The drawback of this flow is that the neighbors section for the CompressedWithVectors format can consume a lot of RAM. During testing, I had the process killed by the OOM killer during graph conversion.

Solution

This PR replaces the GraphLinksSerializer struct and its methods with a new function, serialize_graph_links, which serializes to the writer directly without an intermediate in-memory representation.

Details

  • Previously, GraphLinksSerializer::to_graph_links_ram did proper total size calculation for Vec::with_capacity(). In this PR, this logic is dropped (otherwise it would require two compression passes: the first time to get length, the second time to write bytes). This is compensated by calling .shrink_to_fit() afterwards, and while this is still a bit sloppy, I expect that the overall amount of reallocations/peak memory usage is not worse than before.
  • The GraphLinksSerializer::save_as() logic is moved to a new general-purpose function, io::file_operations::atomic_save(), which can be combined with serialize_graph_links().
  • The GraphLinksSerializer::to_graph_links_ram() logic is replaced with GraphLinks::new_from_edges().

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 22, 2025

📝 Walkthrough

Walkthrough

Introduces a generic atomic_save API and refactors binary/JSON saves to use it. Replaces GraphLinksSerializer with a streaming writer-based serialize_graph_links function and removes the GraphLinksSerializer type. Adds GraphLinks::new_from_edges and GraphLinks::as_bytes, updates GraphLayers and GraphLayersBuilder to construct and persist links via edge vectors and serialize_graph_links (using atomic_save for on-disk writes). Serializer module rewritten to stream serialized graph data to a Write+Seek writer. Tests and imports adjusted to the new APIs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

chore

Suggested reviewers

  • timvisee
  • generall

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch serialize_graph_links

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/segment/src/index/hnsw_index/graph_links.rs (1)

231-233: Bug: on_disk() returns true for RAM-backed data (inverted condition).

This will confuse any callers relying on on_disk() to know if data is memory-mapped. It should report true when the backing is Mmap.

Apply this diff:

-    pub fn on_disk(&self) -> bool {
-        matches!(self.borrow_owner(), GraphLinksEnum::Ram(_))
-    }
+    pub fn on_disk(&self) -> bool {
+        matches!(self.borrow_owner(), GraphLinksEnum::Mmap(_))
+    }
🧹 Nitpick comments (5)
lib/common/io/src/file_operations.rs (1)

31-41: Optional: consider a “sync directory” option for stronger crash-safety guarantees.

On some filesystems, durability of the rename may require syncing the parent directory. atomicwrites handles the temp->final rename, but if you need “survives power loss” guarantees for these files, an additional (optional) fsync of the parent directory can be exposed by atomic_save (or a variant).

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

234-239: Future improvement: stream directly from LinksContainer → writer to avoid building edges Vec.

links_layers_to_edges materializes Vec<Vec<Vec>> before writing. To squeeze peak RAM further for huge graphs, you could stream per-point/per-level from the RwLocks into serialize_graph_links (e.g., an iterator-based adapter that yields slices without forming a nested Vec). Not a blocker for this PR.

lib/segment/src/index/hnsw_index/graph_links.rs (1)

268-271: Doc nit: reference to removed GraphLinksSerializer.

The comment mentions GraphLinksSerializer::new, which no longer exists. Point to serialize_graph_links instead.

Apply this diff:

-    /// Convert the graph links to a vector of edges, suitable for passing into
-    /// [`GraphLinksSerializer::new`] or using in tests.
+    /// Convert the graph links to a vector of edges, suitable for passing into
+    /// [`serialize_graph_links`] or using in tests.
lib/segment/src/index/hnsw_index/graph_layers.rs (1)

374-377: Optional: consider syncing the compressed file on conversion.

Currently sync=false in convert_to_compressed. If conversion happens in contexts where crash-consistency matters (even though not production as per comment), setting sync=true would be safer before deleting the source.

-        atomic_save(&compressed_path, false, |writer| {
+        atomic_save(&compressed_path, true, |writer| {
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)

102-114: Minor: reuse pre-checked vector_layout without unwrap in loop.

You already computed vector_layout above. Capture it once and reuse to avoid repeated unwraps. Micro, but removes a branch in the hot path.

-                GraphLinksFormatParam::CompressedWithVectors(vectors) => {
-                    // Unwrap safety: `vector_layout` is `Some` for `CompressedWithVectors`.
-                    let vector_layout = vector_layout.unwrap();
+                GraphLinksFormatParam::CompressedWithVectors(vectors) => {
+                    let vector_layout = vector_layout.expect("pre-checked at function start");
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5eb4011 and 0b33387.

📒 Files selected for processing (5)
  • lib/common/io/src/file_operations.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers.rs (4 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (4 hunks)
  • lib/segment/src/index/hnsw_index/graph_links.rs (6 hunks)
  • lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-06T09:56:59.311Z
Learnt from: xzfc
PR: qdrant/qdrant#6982
File: lib/segment/src/index/hnsw_index/graph_links/view.rs:217-220
Timestamp: 2025-08-06T09:56:59.311Z
Learning: In Qdrant's GraphLinksView implementation (lib/segment/src/index/hnsw_index/graph_links/view.rs), methods like links() and links_with_vectors() intentionally use unwrap() and can panic for performance reasons, maintaining consistency across similar methods. Error handling improvements are considered secondary to feature completion and require benchmarking before implementation.

Applied to files:

  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs
  • lib/segment/src/index/hnsw_index/graph_links.rs
  • lib/segment/src/index/hnsw_index/graph_links/serializer.rs
  • lib/segment/src/index/hnsw_index/graph_layers.rs
🧬 Code graph analysis (1)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (2)
lib/common/common/src/bitpacking.rs (1)
  • packed_bits (190-192)
lib/common/common/src/bitpacking_links.rs (1)
  • pack_links (23-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: e2e-tests
  • GitHub Check: lint
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
🔇 Additional comments (7)
lib/common/io/src/file_operations.rs (1)

10-29: Atomic writer API looks solid (flush + optional fsync + precise error mapping).

  • Wrapping the provided file in BufWriter, calling into_inner() to flush, and optionally syncing the file is correct.
  • Mapping atomicwrites::Error::{Internal, User} into the caller’s E is neat and keeps the API generic.
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

193-203: Nice: streaming serializer through atomic_save eliminates the large neighbors Vec.

This directly addresses the OOM scenario in CompressedWithVectors by removing the big intermediate in-memory neighbors section. The on-disk branch writes once and mmaps back; the RAM branch builds once and reuses as_bytes() to persist. Good balance of memory and I/O.

lib/segment/src/index/hnsw_index/graph_links.rs (2)

199-211: Good addition: new_from_edges() + as_bytes() simplify RAM construction and persistence.

  • Using a Cursor to stream into RAM mirrors the on-disk path and keeps a single source of truth.
  • Exposing as_bytes() is pragmatic for persistence and testing.

Also applies to: 217-220


231-233: No external calls to GraphLinks::on_disk() detected
A repo-wide grep for on_disk( turned up only the method’s own definition in graph_links.rs and no other call sites. Since no downstream code invokes this (inverted) method, correcting its logic poses no risk of breaking existing functionality.

lib/segment/src/index/hnsw_index/graph_layers.rs (2)

374-377: Conversion path now streams: good change.

Switching to atomic_save + serialize_graph_links avoids building a large intermediate buffer during “plain → compressed” conversion.


398-407: Test-only compress_ram: dummy swap is a clean self_cell workaround.

Swapping in a zero-sized RAM GraphLinks to satisfy self_cell’s ownership guarantees is fine for tests and keeps the code simple.

lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)

63-73: Ensure consistent little-endian encoding for level offsets

The current serializer writes each u64 offset with

writer.write_all(total_offsets_len.as_bytes())?;

which emits native-endian bytes, whereas all header fields use explicit little-endian via LittleU64. The reader’s read_level_offsets then does

get_slice::<u64>(bytes, levels_count)?

which assumes host endianness. This will only work correctly on little-endian platforms.

Please reconcile the on-disk spec:

• Option A (portable):
– In serializer.rs (around “// 2. Write level offsets”), replace
rust writer.write_all(total_offsets_len.as_bytes())?;
with
rust writer.write_all(LittleU64::new(total_offsets_len).as_bytes())?;
– In view.rs’s read_level_offsets, read the offsets as LittleU64 (e.g. via get_slice::<LittleU64>) and convert to u64 so every stored offset is defined as little-endian.

• Option B (documented host constraint):
– Clearly document in the GraphLinks file format that all multi-byte integers, including level offsets, are little-endian and that only little-endian platforms are supported; leave the native-endian writes as-is.

Let me know which approach matches the file-format requirements so we can make the change (or add the documentation).

@xzfc xzfc force-pushed the serialize_graph_links branch from 0b33387 to e8d87ac Compare August 22, 2025 05:29
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)

50-54: Guard against empty per-point levels; point.len() - 1 will underflow.

If any point has zero levels, indexing will panic. Even if current producers create at least one level, the serializer should defend against malformed inputs (especially now that it’s a public API).

Apply near here:

-    let levels_count = back_index.first().map_or(0, |&id| edges[id].len());
-    let mut point_count_by_level = vec![0; levels_count];
-    for point in &edges {
-        point_count_by_level[point.len() - 1] += 1;
-    }
+    let levels_count = back_index.first().map_or(0, |&id| edges[id].len());
+    if !edges.is_empty() && edges.iter().any(|levels| levels.is_empty()) {
+        return Err(OperationError::service_error(
+            "Each point must have at least one level",
+        ));
+    }
+    let mut point_count_by_level = vec![0; levels_count];
+    for point in &edges {
+        // Safe due to the check above.
+        point_count_by_level[point.len() - 1] += 1;
+    }
🧹 Nitpick comments (4)
lib/common/io/src/file_operations.rs (2)

28-34: Minor: Keep the closure style consistent (wrap inner writes in Ok(...)?).

For consistency with Lines [29] and [33], prefer the Ok(inner_call()?) pattern across all call sites. This also encourages callers to return a unified error type when useful.

Apply in callers as shown in graph_layers_builder.rs below.


74-84: From<atomicwrites::Error> for Error is correct, but currently unused by atomic_save.

atomic_save maps errors manually (Lines [22-25]). Keeping this impl is fine for other sites that may want ? on atomicwrites::Error<E>, but it’s redundant for atomic_save itself.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (2)

193-203: Unify error type in atomic_save write closures; else-branch currently infers io::Error.

In the on-disk branch (Lines [196-199]) the closure returns OperationResult, so E=OperationError. In the in-RAM branch (Lines [202-203]) the closure returns io::Result, so E=io::Error, relying on Fromio::Error for OperationError at the call site. Make both branches return OperationResult for consistency and clearer error flows.

Apply:

-            links = GraphLinks::new_from_edges(edges, format_param, self.hnsw_m)?;
-            atomic_save(&links_path, |writer| writer.write_all(links.as_bytes()))?;
+            links = GraphLinks::new_from_edges(edges, format_param, self.hnsw_m)?;
+            atomic_save(&links_path, |writer| Ok(writer.write_all(links.as_bytes())?))?;

234-239: links_layers_to_edges: clear and minimal copies; consider reserving levels Vecs.

You might shave small reallocs by pre-allocating per-point levels when known, but current form is fine.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0b33387 and e8d87ac.

📒 Files selected for processing (6)
  • lib/common/io/src/file_operations.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers.rs (4 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (4 hunks)
  • lib/segment/src/index/hnsw_index/graph_links.rs (9 hunks)
  • lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_links/view.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • lib/segment/src/index/hnsw_index/graph_links/view.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/src/index/hnsw_index/graph_layers.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-06T09:56:59.311Z
Learnt from: xzfc
PR: qdrant/qdrant#6982
File: lib/segment/src/index/hnsw_index/graph_links/view.rs:217-220
Timestamp: 2025-08-06T09:56:59.311Z
Learning: In Qdrant's GraphLinksView implementation (lib/segment/src/index/hnsw_index/graph_links/view.rs), methods like links() and links_with_vectors() intentionally use unwrap() and can panic for performance reasons, maintaining consistency across similar methods. Error handling improvements are considered secondary to feature completion and require benchmarking before implementation.

Applied to files:

  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs
  • lib/segment/src/index/hnsw_index/graph_links/serializer.rs
  • lib/segment/src/index/hnsw_index/graph_links.rs
📚 Learning: 2025-06-14T20:35:10.603Z
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.

Applied to files:

  • lib/segment/src/index/hnsw_index/graph_links.rs
🧬 Code graph analysis (3)
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (3)
lib/common/io/src/file_operations.rs (2)
  • atomic_save (10-26)
  • atomic_save_bin (28-30)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
  • serialize_graph_links (22-214)
lib/segment/src/index/hnsw_index/graph_links.rs (4)
  • links (249-251)
  • on_disk (231-233)
  • load_from_file (187-197)
  • new_from_edges (199-211)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (3)
lib/common/common/src/bitpacking.rs (1)
  • packed_bits (190-192)
lib/common/common/src/bitpacking_links.rs (1)
  • pack_links (23-66)
lib/segment/src/index/hnsw_index/graph_links.rs (3)
  • vector_layout (74-74)
  • vector_layout (82-84)
  • vector_layout (93-99)
lib/segment/src/index/hnsw_index/graph_links.rs (4)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
  • serialize_graph_links (22-214)
lib/segment/src/index/hnsw_index/graph_links/view.rs (2)
  • load (103-112)
  • links (210-241)
lib/segment/src/index/hnsw_index/mod.rs (1)
  • new (30-32)
lib/common/io/src/file_operations.rs (1)
  • atomic_save (10-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: lint
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: e2e-tests
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
🔇 Additional comments (14)
lib/common/io/src/file_operations.rs (3)

10-26: Generic atomic_save API looks solid; BufWriter<&mut File> enables Seek for rewrites.

The design works for streaming writers (e.g., serialize_graph_links) that need Seek to rewrite headers. Explicit flush on Line [19] is good to ensure bytes hit disk before AtomicFile commit.


86-90: Correct conversion from bincode::Error to Error.

De-referencing the boxed ErrorKind avoids extra allocation and matches your enum layout.


92-101: Io back-conversion is helpful for interop.

Mapping non-io variants via io::Error::other is pragmatic and keeps layering simple.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

223-232: Test-only path mirrors the main path; LGTM.

The RAM-only builder keeps symmetry with production code by going through new_from_edges().

lib/segment/src/index/hnsw_index/graph_links.rs (6)

177-183: as_bytes plumbing is correct and avoids extra copies.

Exposing the slice for both RAM and Mmap via a single method simplifies persistence call sites.


186-197: load_from_file now delegates to GraphLinksView::load with bytes; good simplification.

Using a common loader minimizes code paths and keeps view logic centralized.


199-211: new_from_edges: streaming through a Cursor is correct; shrink_to_fit helps peak memory.

This aligns RAM construction with the streaming serializer and reduces over-allocation.


267-271: to_edges docs correctly reference serialize_graph_links; API is coherent.

Keeps tests and round-trips aligned with the new serializer.


380-406: Streaming save/load test validates headers rewrite and offsets path; nice coverage.

Good to see both Plain/Compressed/WithVectors and both residency modes exercised.


421-425: Tests exercising new_from_edges cover edge shapes well.

Diverse shapes help guard against layout regressions.

lib/segment/src/index/hnsw_index/graph_links/serializer.rs (4)

84-88: Seek-before-neighbor padding with BufWriter: OK (Seek impl flushes).

BufWriter’s Seek impl flushes before repositioning; using stream_position and write_zeros here is fine.


102-147: CompressedWithVectors path correctly preserves link order for vectors; alignment handled.

Varint length, compressed links, alignment padding, then vectors in raw_links order matches the spec. Good reuse of links_buf and explicit offset tracking.


151-165: Offsets encoding: plain vs bitpacked split is correct; padding for u64 alignment respected.

This maintains existing on-disk compatibility expectations.


167-211: Header rewrite covers all formats; unwraps are safe in their match arms.

The version fields and layout metadata look consistent, including u8 alignment conversion with explicit expect.

GraphLinks::load_from_file(&links_path, true, format_param.as_format())?
let edges = Self::links_layers_to_edges(self.links_layers);
let links;
if on_disk {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feels like on_disk param requires a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment.

Also, @IvanPleshkov noted that we don't need HNSW links after construction. So, probably this part will be gone in this or subsequent PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose a separate PR to make it more readable for reviewers and avoid blocking these changes to merge

Copy link
Contributor

@IvanPleshkov IvanPleshkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any complains, just a couple of questions.
I find new serializing more readable than before, cool

@xzfc xzfc force-pushed the serialize_graph_links branch from e8d87ac to d1a5690 Compare August 22, 2025 13:29
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/segment/src/index/hnsw_index/graph_links.rs (1)

231-233: Bug: on_disk() returns true for RAM-backed data (inverted logic).

This misreports residency and may affect caching/prefetch decisions. This was noted previously and acknowledged as pre-existing; still flagging because the method lives in this file and is easy to fix.

Apply:

 pub fn on_disk(&self) -> bool {
-    matches!(self.borrow_owner(), GraphLinksEnum::Ram(_))
+    matches!(self.borrow_owner(), GraphLinksEnum::Mmap(_))
 }

If you prefer removal instead (as discussed earlier), I can open a follow-up issue.

🧹 Nitpick comments (3)
lib/segment/src/index/hnsw_index/graph_links.rs (1)

199-211: Streaming in-memory constructor looks good; minor nit on shrink_to_fit placement.

The RAM constructor via Cursor + serializer is clean and aligns with the streaming design. Minor nit: shrink_to_fit() happens before GraphLinksView::load() borrows the bytes; that’s fine, but calling it after try_new would avoid resizing the Vec prior to validation. Low impact either way.

-        let mut bytes = cursor.into_inner();
-        bytes.shrink_to_fit();
-        Self::try_new(GraphLinksEnum::Ram(bytes), |x| {
+        let mut bytes = cursor.into_inner();
+        let graph = Self::try_new(GraphLinksEnum::Ram(bytes), |x| {
             GraphLinksView::load(x.as_bytes(), format_param.as_format())
-        })
+        })?;
+        // Now that the view is built, shrink if you still want to reduce capacity.
+        if let GraphLinksEnum::Ram(ref mut buf) = self_cell::Owner::borrow_owner(&graph) {
+            buf.shrink_to_fit();
+        }
+        Ok(graph)

If you prefer to keep it as-is for simplicity, ignore. The functional behavior is correct.

lib/segment/src/index/hnsw_index/graph_links/serializer.rs (2)

84-88: Avoid truncating large positions when computing neighbors padding.

Casting stream_position() (u64) to usize can truncate on 32-bit targets and hides potential overflows. Compute padding in u64 and downcast once.

-    let pos = writer.stream_position()? as usize;
-    writer.write_zeros(pos.next_multiple_of(vector_layout.align()) - pos)?;
+    let pos = writer.stream_position()?;
+    let align = vector_layout.align() as u64;
+    let padding = ((pos + align - 1) / align) * align - pos;
+    writer.write_zeros(usize::try_from(padding).expect("padding fits into usize"))?;

151-159: Same truncation concern for offsets padding.

Use u64 math for alignment and only convert to usize at write time.

-    let len = writer.stream_position()? as usize;
-    let offsets_padding = len.next_multiple_of(size_of::<u64>()) - len;
-    writer.write_zeros(offsets_padding)?;
+    let len = writer.stream_position()?;
+    let align = size_of::<u64>() as u64;
+    let offsets_padding = ((len + align - 1) / align) * align - len;
+    writer.write_zeros(usize::try_from(offsets_padding).expect("padding fits into usize"))?;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e8d87ac and d1a5690.

📒 Files selected for processing (5)
  • lib/segment/src/index/hnsw_index/graph_layers.rs (4 hunks)
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs (4 hunks)
  • lib/segment/src/index/hnsw_index/graph_links.rs (9 hunks)
  • lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1 hunks)
  • lib/segment/src/index/hnsw_index/graph_links/view.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • lib/segment/src/index/hnsw_index/graph_links/view.rs
  • lib/segment/src/index/hnsw_index/graph_layers_builder.rs
  • lib/segment/src/index/hnsw_index/graph_layers.rs
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-06T09:56:59.311Z
Learnt from: xzfc
PR: qdrant/qdrant#6982
File: lib/segment/src/index/hnsw_index/graph_links/view.rs:217-220
Timestamp: 2025-08-06T09:56:59.311Z
Learning: In Qdrant's GraphLinksView implementation (lib/segment/src/index/hnsw_index/graph_links/view.rs), methods like links() and links_with_vectors() intentionally use unwrap() and can panic for performance reasons, maintaining consistency across similar methods. Error handling improvements are considered secondary to feature completion and require benchmarking before implementation.

Applied to files:

  • lib/segment/src/index/hnsw_index/graph_links/serializer.rs
  • lib/segment/src/index/hnsw_index/graph_links.rs
📚 Learning: 2025-06-14T20:35:10.603Z
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.

Applied to files:

  • lib/segment/src/index/hnsw_index/graph_links.rs
🧬 Code graph analysis (2)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (3)
lib/common/common/src/bitpacking.rs (1)
  • packed_bits (190-192)
lib/common/common/src/bitpacking_links.rs (1)
  • pack_links (23-66)
lib/segment/src/index/hnsw_index/graph_links.rs (3)
  • vector_layout (74-74)
  • vector_layout (82-84)
  • vector_layout (93-99)
lib/segment/src/index/hnsw_index/graph_links.rs (3)
lib/segment/src/index/hnsw_index/graph_links/serializer.rs (1)
  • serialize_graph_links (22-214)
lib/segment/src/index/hnsw_index/graph_links/view.rs (2)
  • load (103-112)
  • links (210-241)
lib/common/io/src/file_operations.rs (1)
  • atomic_save (10-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: e2e-tests
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: lint
  • GitHub Check: integration-tests
🔇 Additional comments (5)
lib/segment/src/index/hnsw_index/graph_links.rs (2)

217-220: Public as_bytes() is a useful addition.

Providing a unified byte view over RAM/Mmap simplifies downstream consumers and testing. No issues spotted.


380-406: Tests: atomic_save + streaming path LGTM.

The test mirrors the new on-disk path (atomic_save + serialize_graph_links) and validates both on-disk and RAM loading. Good coverage across formats.

lib/segment/src/index/hnsw_index/graph_links/serializer.rs (3)

22-31: API shift to a streaming serializer is solid.

Write+Seek requirement is justified by the header backpatch. Signature is minimal and composes well with in-memory (Cursor) and on-disk (BufWriter) use.


118-126: Accounting for varint length in offset is correct.

Including VarInt::required_space(...) ensures offsets for CompressedWithVectors reflect the actual byte footprint. Good catch; this addresses a prior question about “u64::required_space” in dev.


167-211: Header backpatching is consistent across formats.

Versioning, neighbor/offset counts, and vector metadata look correct; using LittleU64 for the compressed headers is consistent with readers. No issues spotted.

@xzfc xzfc merged commit 95d7936 into dev Aug 22, 2025
16 checks passed
@xzfc xzfc deleted the serialize_graph_links branch August 22, 2025 17:08
@coderabbitai coderabbitai bot mentioned this pull request Aug 26, 2025
timvisee pushed a commit that referenced this pull request Aug 26, 2025
* Introduce io::file_operations::atomic_save

* Replace GraphLinksSerializer with serialize_graph_links
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants