Skip to content

Fix deadlock on Actix RT in Collection Info#7277

Merged
agourlay merged 1 commit intodevfrom
local-shard-info-blocking-thread
Sep 22, 2025
Merged

Fix deadlock on Actix RT in Collection Info#7277
agourlay merged 1 commit intodevfrom
local-shard-info-blocking-thread

Conversation

@agourlay
Copy link
Member

Follow up on #7267

Same rational and same fix to remove blocking computation from Actix runtime.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 22, 2025

📝 Walkthrough

Walkthrough

  • In lib/collection/src/shards/local_shard/mod.rs, per-segment statistics for shard info are moved into a spawned blocking task that iterates segments and accumulates payload schema, indexed_vectors_count, points_count, and segments_count. The async caller awaits the task, logs on failure, and falls back to defaults. A cloned segments handle is used within the blocking task. Status and optimizer_status retrieval remain unchanged.
  • In tests/consensus_tests/test_shard_snapshot_deadlock.py, a background polling helper loop loop_collection_info(peer_url) is added and launched during the test. ThreadPoolExecutor max_workers is increased from 5 to 6.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • timvisee
  • ffuugoo
  • KShivendu

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "Fix deadlock on Actix RT in Collection Info" succinctly and accurately summarizes the primary change in this PR: moving blocking computation out of the Actix runtime to prevent deadlocks in the collection info path. It is concise, focused on the main issue and subsystem, and contains no extraneous noise. This makes the intent clear to reviewers and to teammates scanning the project history.
Description Check ✅ Passed The description states this is a follow-up to PR #7267 and that it applies the same rationale and fix to remove blocking computation from the Actix runtime, which directly matches the changes in the collection info code and accompanying test adjustments. Although brief, the description is on-topic and conveys the intent of the changes. It is therefore acceptable for this lenient check.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch local-shard-info-blocking-thread

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/consensus_tests/test_shard_snapshot_deadlock.py (1)

27-30: Make loop_collection_info resilient and stoppable

As written, any transient RequestException will terminate the loop and reduce the intended pressure. Also consider an optional stop flag to allow graceful shutdown in finally.

Apply:

-def loop_collection_info(peer_url):
-    while True:
-        requests.get(f"{peer_url}/collections/test_collection", timeout = 0.5)
+def loop_collection_info(peer_url, stop=None, session=None):
+    sess = session or requests.Session()
+    try:
+        while stop is None or not stop.is_set():
+            try:
+                sess.get(f"{peer_url}/collections/test_collection", timeout=0.5)
+            except requests.RequestException:
+                # ignore and continue to keep pressure on the endpoint
+                pass
+    finally:
+        if session is None:
+            sess.close()

Optionally mirror this pattern in the other loop_* helpers and pass a shared threading.Event you set in the finally block before executor shutdown. Happy to provide a full diff across the file if you want that change.

lib/collection/src/shards/local_shard/mod.rs (2)

1103-1129: Good: moves per‑segment aggregation off Actix via spawn_blocking

This removes blocking work from the RT thread while keeping segment access under a blocking read. LGTM. One minor improvement: consider taking a snapshot of segment handles (if feasible) under the short read lock and then iterating them after dropping the holder read lock to minimize contention.


1130-1136: Streamline error handling and keep context

Current pattern logs Err then unwrap_or_default(). A compact match avoids double handling and lets you add context once.

Apply:

-        let segment_info = tokio::task::spawn_blocking(move || {
+        let segment_info = tokio::task::spawn_blocking(move || {
             // ... unchanged ...
-        })
-        .await;
-
-        if let Err(err) = &segment_info {
-            log::error!("Failed to get local shard info: {err}");
-        }
-
-        let (schema, indexed_vectors_count, points_count, segments_count) =
-            segment_info.unwrap_or_default();
+        }).await;
+
+        let (schema, indexed_vectors_count, points_count, segments_count) = match segment_info {
+            Ok(res) => res,
+            Err(err) => {
+                log::error!("Failed to get local shard info: {err}");
+                Default::default()
+            }
+        };

Also, please confirm that merging PayloadIndexInfo by summing only points (keeping other fields from the first seen segment) matches intended semantics across segments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e86d60e and 2de4e2d.

📒 Files selected for processing (2)
  • lib/collection/src/shards/local_shard/mod.rs (2 hunks)
  • tests/consensus_tests/test_shard_snapshot_deadlock.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust code
Do not use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice in new code; use bytemuck or zerocopy instead

Files:

  • lib/collection/src/shards/local_shard/mod.rs
**/src/**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/src/**/*.rs: Prefer exhaustive match arms over a catch-all _ arm to avoid missing new enum variants (except in tests/benchmarks or when provably safe)
Prefer explicit field ignoring with : _ over .. in struct patterns (except in tests/benchmarks or when provably safe)

Files:

  • lib/collection/src/shards/local_shard/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: e2e-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (2)
tests/consensus_tests/test_shard_snapshot_deadlock.py (2)

59-59: Bumping ThreadPoolExecutor to 6 workers is appropriate

Matches the additional background task. No concerns.


76-77: Adding collection info pressure is spot on

This exercises the newly offloaded Collection Info path during the deadlock scenario. LGTM.

@agourlay agourlay requested a review from timvisee September 22, 2025 09:10
@agourlay agourlay merged commit 04d8d05 into dev Sep 22, 2025
21 of 22 checks passed
@agourlay agourlay deleted the local-shard-info-blocking-thread branch September 22, 2025 09:14
@timvisee timvisee mentioned this pull request Sep 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants