Conversation
📝 Walkthrough: This set of changes adds a new boolean feature flag, `hnsw_healing`, gating the new HNSW graph-healing behaviour.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
lib/segment/src/index/hnsw_index/hnsw.rs (1)
1402-1478: 💡 Verification agent — 🧩 Analysis chain

**Consider tolerant vector comparison to reduce the false-negative reuse rate**

`old_vector == new_vector` performs an exact floating-point comparison. Even when a vector has not semantically changed, different storage back-ends (e.g. serialization, quantization round-trip, SIMD alignment) may re-emit the very same numbers with minor binary drift, causing values that are numerically equal to fail an `==` test. As soon as that happens the point is treated as "missing", forcing the healing path and potentially discarding a large portion of an otherwise good graph.

A cheap alternative is to fall back to an approximate check that is already available in the code base (e.g. cosine/L2 similarity ≥ 1 - 1e-6):

```diff
- if old_vector == new_vector {
+ // Treat vectors as equal if the distance is below a tiny epsilon
+ if metric::cosine::distance(&old_vector, &new_vector) <= 1e-6 {
```

This keeps the incremental-build fast path effective while still
**Use an approximate-equality check instead of `==` for floating-point vectors**

Exact `old_vector == new_vector` will drop points whose values only differ by tiny fp noise. To avoid spuriously entering the healing path, replace it with an element-wise epsilon test. For example, in lib/segment/src/index/hnsw_index/hnsw.rs around line 1435:

```diff
- if old_vector == new_vector {
+ // Treat vectors as equal if all components differ by at most EPS
+ const EPS: f32 = 1e-6;
+ if old_vector.len() == new_vector.len()
+     && old_vector
+         .iter()
+         .zip(new_vector.iter())
+         .all(|(a, b)| (a - b).abs() <= EPS)
+ {
      old_to_new[old_offset as usize] = Some(new_offset);
      valid_points += 1;
```

This simple check avoids pulling in non-existent `metric::cosine::distance` functions and keeps the fast path intact while only rejecting truly changed vectors.
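As a standalone sketch, the reviewer's element-wise test could look like the following (the `EPS` value and the function name are illustrative, not taken from the Qdrant code base):

```rust
/// Illustrative tolerance; a real value would depend on the storage precision.
const EPS: f32 = 1e-6;

/// Compare two vectors element-wise with a small tolerance, so that tiny
/// binary drift between storage back-ends does not force the healing path.
fn vectors_approx_equal(old_vector: &[f32], new_vector: &[f32]) -> bool {
    old_vector.len() == new_vector.len()
        && old_vector
            .iter()
            .zip(new_vector.iter())
            .all(|(a, b)| (a - b).abs() <= EPS)
}
```

Unlike `==`, this still rejects vectors of different lengths or with genuinely changed components.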
🧹 Nitpick comments (2)
lib/segment/src/index/hnsw_index/hnsw.rs (1)
386-407: **Possible data race between concurrent healing tasks**

```rust
to_heal.into_par_iter().try_for_each(|old_offset| {
    old_index.migrate_point(old_offset, &graph_layers_builder)
})
```

`migrate_point` mutates shared state inside `graph_layers_builder` (ready_list, link containers, entry points). Internally the builder relies on `RwLock`/`Mutex` for safety, so the code compiles, but:

- every healed point writes the global `ready_list`; with `par_iter` there will be heavy cross-thread contention on the same lock, significantly degrading throughput for large batches;
- a thread may call `search_shortcuts_on_level` before another thread marks a neighbour as "ready", therefore exploring a slower "old-graph" path and producing sub-optimal shortcuts.

If practical, heal sequentially per level or switch to a two-phase approach:

- build a local `Vec<(point_id, links)>` in parallel;
- merge the results into the builder in a single-threaded pass.

That removes lock contention and eliminates the ready-list timing race.
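The suggested two-phase approach can be sketched with plain `std::thread` (all names and the dummy link rule below are illustrative, not Qdrant code):

```rust
use std::sync::mpsc;
use std::thread;

type PointId = u32;

/// Phase 1: each worker computes links for its own chunk of points locally,
/// with no shared mutable state. Phase 2: a single thread merges all results,
/// so the builder never sees concurrent writes.
fn two_phase_build(points: Vec<PointId>, workers: usize) -> Vec<(PointId, Vec<PointId>)> {
    let (tx, rx) = mpsc::channel();
    let chunk = ((points.len() + workers - 1) / workers).max(1);
    let mut handles = Vec::new();
    for part in points.chunks(chunk) {
        let part = part.to_vec();
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Local, lock-free computation (dummy link rule for the sketch).
            let local: Vec<(PointId, Vec<PointId>)> =
                part.iter().map(|&p| (p, vec![p.wrapping_add(1)])).collect();
            tx.send(local).unwrap();
        }));
    }
    drop(tx); // close the channel once all workers hold their own senders

    // Single-threaded merge pass: no lock contention here.
    let mut merged: Vec<(PointId, Vec<PointId>)> = rx.into_iter().flatten().collect();
    for h in handles {
        h.join().unwrap();
    }
    merged.sort_by_key(|(p, _)| *p);
    merged
}
```

The merge pass is where `add_new_point`-style mutation would happen in the real builder, on one thread only.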
lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)

497-608: **Heavy lock usage & per-call allocations may hurt healing performance**

`search_shortcuts_on_level` is invoked for every healed point and:

- acquires `ready_list.read()` on every neighbour inspection, potentially thousands of times per call;
- allocates new `Vec`s (`old_offsets`, `scores_buffer`, `pending`) for each call, although their maximum length is bounded by `2 * M`.

Under large update batches this becomes a hot path. Recommended micro-optimisations:

```diff
- let mut old_offsets: Vec<PointOffsetType> = Vec::with_capacity(2 * limit);
- let mut scores_buffer = Vec::with_capacity(limit);
+ // Re-use small vec stacks on the caller thread to avoid heap traffic
+ let mut old_offsets = smallvec::SmallVec::<[PointOffsetType; 32]>::new();
+ let mut scores_buffer = smallvec::SmallVec::<[f32; 32]>::new();

- let mut pending = Vec::new();
+ let mut pending = VecDeque::with_capacity(limit);
```

and cache the `ready_list` guard once per call:

```rust
let ready = self.ready_list.read();
...
if ready[new_offset as usize] { ... }
```

This removes a large chunk of lock acquisitions and allocations without changing semantics.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- `lib/common/common/src/fixed_length_priority_queue.rs` (2 hunks)
- `lib/common/common/src/flags.rs` (3 hunks)
- `lib/segment/src/index/hnsw_index/graph_layers.rs` (1 hunk)
- `lib/segment/src/index/hnsw_index/graph_layers_builder.rs` (5 hunks)
- `lib/segment/src/index/hnsw_index/hnsw.rs` (11 hunks)
- `lib/segment/src/index/hnsw_index/search_context.rs` (1 hunk)
🧰 Additional context used
🧬 Code Graph Analysis (1)
lib/segment/src/index/hnsw_index/graph_layers.rs (1)
lib/segment/src/index/hnsw_index/search_context.rs (1)
- `new` (16-21)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-consistency
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: storage-compat-test
- GitHub Check: lint
🔇 Additional comments (8)
lib/common/common/src/fixed_length_priority_queue.rs (2)
48-51: **Good refactoring using the new `is_full()` method.** The `push` method now leverages the newly added `is_full()` method instead of directly comparing the length. This improves code readability and maintainability.

94-97: **Well-designed helper method for the queue capacity check.** The new `is_full()` method encapsulates the logic for checking whether the queue has reached its fixed capacity. This is a clean extraction of a common operation that was previously embedded in the `push` method, improving code reusability and readability.
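A minimal sketch of the extracted helper (the struct below is an illustrative stand-in, not the real `FixedLengthPriorityQueue`):

```rust
/// Toy fixed-capacity queue demonstrating the `is_full()` extraction:
/// `push` reads as intent instead of a raw length comparison.
struct FixedLengthQueue<T> {
    items: Vec<T>,
    capacity: usize,
}

impl<T: Ord> FixedLengthQueue<T> {
    fn new(capacity: usize) -> Self {
        Self { items: Vec::with_capacity(capacity), capacity }
    }

    /// The helper: one place to express "at capacity".
    fn is_full(&self) -> bool {
        self.items.len() >= self.capacity
    }

    fn push(&mut self, value: T) {
        if self.is_full() {
            // At capacity: replace the smallest element if the new one is better.
            if let Some(min_idx) = self
                .items
                .iter()
                .enumerate()
                .min_by(|(_, a), (_, b)| a.cmp(b))
                .map(|(i, _)| i)
            {
                if self.items[min_idx] < value {
                    self.items[min_idx] = value;
                }
            }
        } else {
            self.items.push(value);
        }
    }
}
```

The real queue is heap-based; only the `is_full()` extraction is the point here.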
lib/common/common/src/flags.rs (4)

25-26: **LGTM: New feature flag for HNSW healing.** The new boolean flag is well-documented with a clear purpose statement.

35-35: **LGTM: Appropriate default value for the new feature flag.** Setting the default value to `false` for the new `hnsw_healing` flag is appropriate for a new experimental feature.

54-54: **LGTM: Updated destructuring pattern.** Correctly added the new `hnsw_healing` flag to the destructuring pattern in `init_feature_flags`.

61-61: **LGTM: Updated "all" flag handling.** Properly enables the new `hnsw_healing` flag when the "all" flag is set to `true`.
lib/segment/src/index/hnsw_index/graph_layers.rs (1)

106-107: **Good decoupling of search-context creation from candidate addition.** This change separates the creation of the `SearchContext` from the addition of the initial candidate point. This approach is more flexible and aligns with modifications in the `SearchContext` constructor, improving control over the search initialization process.
lib/segment/src/index/hnsw_index/search_context.rs (1)

16-20: **Improved `SearchContext` constructor design.** The constructor has been simplified to only initialize an empty context with the specified capacity, removing the requirement for an initial entry point. This change supports more flexible usage patterns where candidates can be added explicitly after initialization, which is particularly useful for the new HNSW healing functionality.
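The revised pattern can be sketched as follows (a toy stand-in for the real `SearchContext`; the heap element type is illustrative):

```rust
use std::collections::BinaryHeap;

/// (score, point id) — a simplified scored-point stand-in.
type ScoredPoint = (i64, u32);

/// Sketch of the revised design: `new` builds an empty context,
/// and entry points are pushed explicitly afterwards.
struct SearchContext {
    candidates: BinaryHeap<ScoredPoint>,
}

impl SearchContext {
    /// Creates an empty context; no entry point is required up front.
    fn new(capacity: usize) -> Self {
        Self { candidates: BinaryHeap::with_capacity(capacity) }
    }

    /// Candidates are added explicitly by the caller after construction.
    fn process_candidate(&mut self, point: ScoredPoint) {
        self.candidates.push(point);
    }
}
```

Separating construction from seeding lets callers such as the healing path decide when (and whether) to inject entry points.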
```diff
 fn migrate_point(
     &self,
     src_old: PointOffsetType,
     builder: &GraphLayersBuilder,
 ) -> OperationResult<()> {
     let links = &self.index.graph.links;
     let point_level = links.point_level(src_old);

+    let hardware_counter = HardwareCounterCell::disposable(); // Internal operation. No measurements needed
+    let vector_storage = self.index.vector_storage.borrow();
+    let old_scorer = new_raw_scorer(
+        vector_storage.get_vector(src_old).as_vec_ref().into(),
+        &vector_storage,
+        hardware_counter,
+    )?;
+
     let links = (0..=point_level).map(|level| {
-        links
-            .links(src_old, level)
-            .take(if level == 0 { self.m0 } else { self.m })
-            .filter_map(|dst_old| self.old_to_new[dst_old as usize])
-            .collect()
+        let level_m = if level == 0 { self.m0 } else { self.m };
+        let mut new_links = Vec::new();
+        let mut need_fixing = 0;
+
+        for link in links.links(src_old, level).take(level_m) {
+            if let Some(new_link) = self.old_to_new[link as usize] {
+                new_links.push(new_link);
+            } else {
+                need_fixing += 1;
+            }
+        }
+
+        if need_fixing > 0 {
+            let shortcuts = builder
+                .search_shortcuts_on_level(src_old, level, old_scorer.as_ref(), self)
+                .into_iter_sorted();
+            let mut container = LinksContainer::with_capacity(level_m);
+            let new_scorer = |a, b| {
+                let a_old = self.new_to_old[a as usize].unwrap();
+                let b_old = self.new_to_old[b as usize].unwrap();
+                old_scorer.score_internal(a_old, b_old)
+            };
+            container.fill_from_sorted_with_heuristic(shortcuts, need_fixing, new_scorer);
+            new_links.extend_from_slice(container.links());
+        }
+
+        new_links
     });
-    Some(links.collect())
+
+    builder.add_new_point(self.old_to_new[src_old as usize].unwrap(), links.collect());
+
+    Ok(())
 }
```
**`unwrap()` on internal mapping can panic in inconsistent states**

`new_scorer` assumes that every shortcut returned by `search_shortcuts_on_level` has a valid back-mapping:

```rust
let a_old = self.new_to_old[a as usize].unwrap();
```

While the surrounding logic should guarantee this, any future change in `search_shortcuts_on_level` or misuse of `LinksContainer` could break the invariant and bring the whole build down with an unexpected panic. A safer pattern is to propagate the error explicitly:

```diff
- let a_old = self.new_to_old[a as usize].unwrap();
+ let a_old = self.new_to_old[a as usize]
+     .ok_or_else(|| OperationError::service_error("Inconsistent mapping (a)"))?;
```

Do this for both `a_old` and `b_old`, or at minimum replace `unwrap` with `debug_assert!` so that release builds stay panic-free.
🤖 Prompt for AI Agents
In lib/segment/src/index/hnsw_index/hnsw.rs around lines 1555 to 1604, the use
of unwrap() on self.new_to_old[a as usize] and self.new_to_old[b as usize]
inside the new_scorer closure can cause panics if the internal mapping is
inconsistent. To fix this, replace unwrap() with explicit error handling by
propagating the error or, at minimum, replace unwrap() calls with debug_assert!
checks to ensure release builds do not panic while still catching issues in
debug builds.
generall left a comment

There are a lot of things which require more detailed explanation. I propose to schedule a call to comment on the stuff together.
```rust
    vector_storage: &VectorStorageEnum,
    id_tracker: &IdTrackerSS,
) -> Option<Self> {
    feature_flags.incremental_hnsw_building.then_some(())?;
```
I would rather make this more explicit.
```diff
         // Old index has deleted points.
-        // FIXME: Not supported yet.
-        return None;
+        feature_flags.hnsw_healing.then_some(())?;
```
Same; this might be compact, but it's not an obvious use of `?`.
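For comparison, the compact gate and an explicit early return behave identically (a hypothetical toy function, not the PR's code):

```rust
/// Compact form used in the PR: `then_some(())` yields `None` when the
/// flag is false, and `?` propagates that as an early exit.
fn gated_compact(enabled: bool) -> Option<u32> {
    enabled.then_some(())?; // terse, but the control flow is easy to miss
    Some(42)
}

/// Explicit form the reviewer is asking for: the early exit is visible
/// at a glance.
fn gated_explicit(enabled: bool) -> Option<u32> {
    if !enabled {
        return None;
    }
    Some(42)
}
```

Both compile to the same behaviour; the explicit version just makes the `None` path obvious to a reader.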
```diff
-            // FIXME: Not supported yet.
-            return None;
+            missing_points += 1;
+            feature_flags.hnsw_healing.then_some(())?;
```
```diff
-            // FIXME: Not supported yet.
-            return None;
+            missing_points += 1;
+            feature_flags.hnsw_healing.then_some(())?;
```
```rust
        {
            // Vector is deleted, but still present in the graph.
            missing_points += 1;
            feature_flags.hnsw_healing.then_some(())?;
```
```diff
-            .filter_map(|dst_old| self.old_to_new[dst_old as usize])
-            .collect()
+        let level_m = if level == 0 { self.m0 } else { self.m };
+        let mut new_links = Vec::new();
```

Suggested change:

```diff
- let mut new_links = Vec::new();
+ let mut new_links = Vec::with_capacity(level_m);
```
```diff
-        if let Some(links) = old_index.get_links(vector_id) {
-            graph_layers_builder.add_new_point(vector_id, links);
+        if let Some(old_offset) = old_index.new_to_old[vector_id as usize] {
+            to_heal.push(old_offset);
```
It looks like the only condition for `to_heal` is that the point exists in the other graph. Which is confusing, because we could already re-use points before. Do we call all migrations healing, or am I missing something?
```rust
            }
        }

        for old_offset in 0..old_to_new.len() {
```
This section requires a comment
```rust
        let hardware_counter = HardwareCounterCell::disposable(); // Internal operation. No measurements needed
        let vector_storage = self.index.vector_storage.borrow();
        let old_scorer = new_raw_scorer(
            vector_storage.get_vector(src_old).as_vec_ref().into(),
            &vector_storage,
            hardware_counter,
        )?;

        let links = (0..=point_level).map(|level| {
```
`links` -> `links_per_level`; otherwise it is not clear without a type hint.
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/segment/src/index/hnsw_index/hnsw.rs (1)
1621-1623: ⚠️ Potential issue

**`unwrap()` on internal mapping can panic in inconsistent states**

`new_scorer` assumes that every shortcut returned by `search_shortcuts_on_level` has a valid back-mapping:

```rust
let a_old = self.new_to_old[a as usize].unwrap();
```

While the surrounding logic should guarantee this, any future change in `search_shortcuts_on_level` or misuse of `LinksContainer` could break the invariant and bring the whole build down with an unexpected panic. A safer pattern is to propagate the error explicitly:

```diff
 let new_scorer = |a, b| {
-    let a_old = self.new_to_old[a as usize].unwrap();
-    let b_old = self.new_to_old[b as usize].unwrap();
-    old_scorer.score_internal(a_old, b_old)
+    match (self.new_to_old[a as usize], self.new_to_old[b as usize]) {
+        (Some(a_old), Some(b_old)) => old_scorer.score_internal(a_old, b_old),
+        _ => {
+            // This should not happen, but handle it gracefully
+            debug_assert!(false, "Missing mapping in new_scorer");
+            0.0 // Return worst possible score for invalid mappings
+        }
+    }
 };
```
🧹 Nitpick comments (3)

lib/segment/src/index/hnsw_index/hnsw.rs (2)

1496-1506: **This section requires a comment**

This block identifies points that once existed in the old index but were deleted in its tracker, marking them as missing points for healing purposes. Add a more detailed comment explaining this loop's purpose:

```diff
 // Loop 1: use both id trackers to find matching external ids.
 for item in itertools::merge_join_by(
     id_tracker.iter_from(None),
     old_id_tracker.iter_from(None),
     |(new_external_id, _), (old_external_id, _)| new_external_id.cmp(old_external_id),
 ) {
-    // Loop 2: find points that once were in the old index, but deleted in
-    // its id tracker.
+    // Loop 2: Find points that were deleted in the old index's tracker
+    // but still exist in the graph structure. These points need healing
+    // since they have connections in the graph but no valid vectors.
     for old_offset in 0..old_to_new.len() {
         let old_offset = old_offset as PointOffsetType;
         if old_id_tracker.is_deleted_point(old_offset) && old_graph_has_point(old_offset) {
             missing_points += 1;
             if !feature_flags.hnsw_healing {
                 return None;
             }
         }
     }
 }
```
1593-1630: **This requires a comment**

The `links_on_level` method is complex and would benefit from a clear explanation of how it identifies and fixes links when migrating points from an old index. Add a detailed comment explaining the method's purpose and approach:

```diff
+/// Reconstructs links for a point on a specific level during migration.
+///
+/// This method performs two key operations:
+/// 1. Preserves valid links that point to points still existing in the new index
+/// 2. For links pointing to deleted/missing points, finds alternative shortcuts
+///    using a specialized graph search to maintain connectivity
 fn links_on_level(
     &self,
     src_old: PointOffsetType,
     level: usize,
     builder: &GraphLayersBuilder,
     old_scorer: &dyn RawScorer,
 ) -> Vec<PointOffsetType> {
     let links = &self.index.graph.links;
     let level_m = if level == 0 { self.m0 } else { self.m };
     let mut new_links = Vec::with_capacity(level_m);
     let mut need_fixing = 0;

     for link in links.links(src_old, level).take(level_m) {
         if let Some(new_link) = self.old_to_new[link as usize] {
             new_links.push(new_link);
         } else {
             need_fixing += 1;
         }
     }

     if need_fixing > 0 {
         let shortcuts = builder
             .search_shortcuts_on_level(src_old, level, old_scorer, self)
             .into_iter_sorted();
         let mut container = LinksContainer::with_capacity(level_m);
         let new_scorer = |a, b| {
             let a_old = self.new_to_old[a as usize].unwrap();
             let b_old = self.new_to_old[b as usize].unwrap();
             old_scorer.score_internal(a_old, b_old)
         };
         container.fill_from_sorted_with_heuristic(shortcuts, need_fixing, new_scorer);
         new_links.extend_from_slice(container.links());
     }

     new_links
 }
```

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (1)
526-612: **Comprehensive implementation of shortcut search**

The `search_shortcuts_on_level` method effectively implements a specialized search that can traverse deleted points to find valid shortcuts. The implementation is thorough and covers edge cases well. A few debug assertions could be improved with more informative error messages to help diagnose issues in development:

```diff
 debug_assert!(
     false,
-    "The new graph points expected to be a subset of the old graph points"
+    "Found a point in new graph that doesn't exist in old graph. Point ID: {new_link}"
 );

 debug_assert!(
     offset_new_ready.is_none(),
-    "For ready points, every neighbor expected to be in the new graph"
+    "Found a ready point with neighbors missing from new graph. Point ID: {candidate_old.idx}"
 );
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `lib/segment/src/index/hnsw_index/graph_layers_builder.rs` (6 hunks)
- `lib/segment/src/index/hnsw_index/hnsw.rs` (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-low-resources
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests-consensus
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: lint
- GitHub Check: storage-compat-test
🔇 Additional comments (16)
lib/segment/src/index/hnsw_index/hnsw.rs (9)
15-15: **Good addition of the `FeatureFlags` import.** The code now imports the `FeatureFlags` struct, which will be used to control the new healing functionality.

261-267: **Feature flag gating for incremental building.** Properly gates the old index evaluation with feature flags, ensuring compatibility with existing configurations.

387-406: **Good implementation of parallel migration.** The migration process has been parallelized using the thread pool, which should significantly improve performance when reusing the index. The timers also help with performance monitoring.

1383-1384: **Well-structured data model for tracking point status.** Adding the `missing_points` field allows tracking points that need healing, which is essential for determining whether healing is more efficient than rebuilding.

1408-1411: **Proper feature flag check.** Early return if incremental building is disabled, which prevents unnecessary computation.

1437-1441: **Improved point validation.** The new helper function `old_graph_has_point` provides a clear way to check whether a point is included in the old graph by looking for outgoing links at level 0.

1512-1519: **Well-designed healing threshold.** Limiting healing to cases where less than 10% of points are missing is a good optimization. If too many points need healing, rebuilding from scratch is likely more efficient.

1542-1546: **Improved logging for operational visibility.** Added detailed logging of reused and healed points, which helps with monitoring and debugging the healing process.

1564-1566: **Useful accessor method.** The `graph()` method provides clean access to the internal graph structure, which is needed for the healing process.

lib/segment/src/index/hnsw_index/graph_layers_builder.rs (7)
6-8: **Appropriate import updates.** Updated imports to support the new functionality while maintaining a clean dependency structure.

354-358: **Improved parameter naming.** Changed the parameter name from `levels` to `links_by_level` to better reflect the actual content of the parameter, making the code more self-documenting.

387-388: **Updated initialization pattern for `SearchContext`.** Split the search-context initialization into separate creation and processing steps, improving consistency with other parts of the codebase and making the flow clearer.

501-525: **Well-documented dual-graph search algorithm.** The documentation for the `search_shortcuts_on_level` method clearly explains the dual-graph search approach and how it differs from regular search. This is essential for understanding this complex algorithm.

554-559: **Efficient pruning of the search space.** This early-termination check improves efficiency by skipping candidates that won't improve the results, which is particularly important for a search algorithm that might traverse many deleted points.

564-578: **Conditional graph traversal based on point readiness.** This section intelligently decides whether to use the new or old graph for traversal based on whether the point is ready in the new graph, implementing the dual-graph search strategy described in the comments.

601-608: **Careful handling of search results.** The code distinguishes between points that should be added to the search results (points with mappings to the new index) and points that should be queued for further exploration (points without mappings), implementing the traversal strategy described in the comments.
```diff
 /// Migrate point from old index to new index.
 fn migrate_point(
     &self,
     src_old: PointOffsetType,
     builder: &GraphLayersBuilder,
 ) -> OperationResult<()> {
     let links = &self.index.graph.links;
     let point_level = links.point_level(src_old);

-    let links = (0..=point_level).map(|level| {
-        links
-            .links(src_old, level)
-            .take(if level == 0 { self.m0 } else { self.m })
-            .filter_map(|dst_old| self.old_to_new[dst_old as usize])
-            .collect()
-    });
-    Some(links.collect())
+    let hardware_counter = HardwareCounterCell::disposable(); // Internal operation. No measurements needed
+    let vector_storage = self.index.vector_storage.borrow();
+    let old_scorer = new_raw_scorer(
+        vector_storage.get_vector(src_old).as_vec_ref().into(),
+        &vector_storage,
+        hardware_counter,
+    )?;
+
+    let links_by_level: Vec<Vec<PointOffsetType>> = (0..=point_level)
+        .map(|level| self.links_on_level(src_old, level, builder, old_scorer.as_ref()))
+        .collect();
+
+    builder.add_new_point(self.old_to_new[src_old as usize].unwrap(), links_by_level);
+
+    Ok(())
 }
```
**This migration function needs additional error handling**

The `migrate_point` function is a core part of the healing process, migrating a point from the old index to the new one. However, it uses `.unwrap()` on the mapping access, which could panic if the mapping becomes inconsistent. Safer handling of potential mapping errors can prevent runtime panics:
```diff
 fn migrate_point(
     &self,
     src_old: PointOffsetType,
     builder: &GraphLayersBuilder,
 ) -> OperationResult<()> {
     let links = &self.index.graph.links;
     let point_level = links.point_level(src_old);
     let hardware_counter = HardwareCounterCell::disposable(); // Internal operation. No measurements needed
     let vector_storage = self.index.vector_storage.borrow();
     let old_scorer = new_raw_scorer(
         vector_storage.get_vector(src_old).as_vec_ref().into(),
         &vector_storage,
         hardware_counter,
     )?;
     let links_by_level: Vec<Vec<PointOffsetType>> = (0..=point_level)
         .map(|level| self.links_on_level(src_old, level, builder, old_scorer.as_ref()))
         .collect();
-    builder.add_new_point(self.old_to_new[src_old as usize].unwrap(), links_by_level);
+    if let Some(new_id) = self.old_to_new[src_old as usize] {
+        builder.add_new_point(new_id, links_by_level);
+    } else {
+        return Err(OperationError::service_error(
+            format!("Inconsistent mapping for point {}", src_old)
+        ));
+    }
     Ok(())
 }
```
🤖 Prompt for AI Agents
In lib/segment/src/index/hnsw_index/hnsw.rs around lines 1568 to 1592, the
migrate_point function uses unwrap() on self.old_to_new[src_old as usize], which
can cause a panic if the mapping is inconsistent. Replace unwrap() with proper
error handling by checking if the mapping exists and returning an error if it
does not, to prevent runtime panics and ensure safer migration.
* Let `SearchContext::new` create empty context
* `FixedLengthPriorityQueue`: add `is_full()`
* Implement HNSW healing
* Apply review suggestions
* Review comments and docstrings

---------

Co-authored-by: generall <andrey@vasnetsov.com>
This PR implements a graph healing algorithm, allowing the HNSW index to reuse the old graph even if some of the vectors are updated or deleted.

This is done by the new graph search function `GraphLayersBuilder::search_shortcuts_on_level`, which finds shortcut links around the deleted points (check its documentation comment for more info). The main graph building is still done in two stages (migrate phase and main phase).

But in this PR,

By default, this feature is hidden behind the new `hnsw_healing` feature flag. The threshold is 10% of missing points (if more than that, the old index is discarded).
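The threshold decision described above can be sketched as follows (the constant and function names are illustrative, not taken from the PR):

```rust
/// Illustrative threshold: heal only while missing points stay under 10%
/// of the total; otherwise discard the old index and rebuild from scratch.
const MAX_MISSING_PERCENT: usize = 10;

fn should_heal(missing_points: usize, total_points: usize) -> bool {
    // Integer arithmetic avoids float rounding at the boundary.
    total_points > 0 && missing_points * 100 < total_points * MAX_MISSING_PERCENT
}
```

The integer comparison `missing * 100 < total * 10` is exact, so the cutoff behaves predictably at the boundary.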
Benchmark on random data:

- Index build time from scratch: 4519 ms
- Index build time with healing (2% updated): 117 + 363 = 480 ms (117 ms for the migrate phase, 363 ms for the main phase)
- Index build time with healing (4% updated): 245 + 506 = 751 ms
- Index build time with healing (8% updated): 660 + 725 = 1385 ms