Fix deadlock on Actix RT in Collection Info#7277
Conversation
📝 WalkthroughWalkthrough
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/consensus_tests/test_shard_snapshot_deadlock.py (1)
27-30: Make loop_collection_info resilient and stoppableAs written, any transient RequestException will terminate the loop and reduce the intended pressure. Also consider an optional stop flag to allow graceful shutdown in finally.
Apply:
-def loop_collection_info(peer_url): - while True: - requests.get(f"{peer_url}/collections/test_collection", timeout = 0.5) +def loop_collection_info(peer_url, stop=None, session=None): + sess = session or requests.Session() + try: + while stop is None or not stop.is_set(): + try: + sess.get(f"{peer_url}/collections/test_collection", timeout=0.5) + except requests.RequestException: + # ignore and continue to keep pressure on the endpoint + pass + finally: + if session is None: + sess.close()Optionally mirror this pattern in the other loop_* helpers and pass a shared threading.Event you set in the finally block before executor shutdown. Happy to provide a full diff across the file if you want that change.
lib/collection/src/shards/local_shard/mod.rs (2)
1103-1129: Good: moves per‑segment aggregation off Actix via spawn_blockingThis removes blocking work from the RT thread while keeping segment access under a blocking read. LGTM. One minor improvement: consider taking a snapshot of segment handles (if feasible) under the short read lock and then iterating them after dropping the holder read lock to minimize contention.
1130-1136: Streamline error handling and keep contextCurrent pattern logs Err then unwrap_or_default(). A compact match avoids double handling and lets you add context once.
Apply:
- let segment_info = tokio::task::spawn_blocking(move || { + let segment_info = tokio::task::spawn_blocking(move || { // ... unchanged ... - }) - .await; - - if let Err(err) = &segment_info { - log::error!("Failed to get local shard info: {err}"); - } - - let (schema, indexed_vectors_count, points_count, segments_count) = - segment_info.unwrap_or_default(); + }).await; + + let (schema, indexed_vectors_count, points_count, segments_count) = match segment_info { + Ok(res) => res, + Err(err) => { + log::error!("Failed to get local shard info: {err}"); + Default::default() + } + };Also, please confirm that merging
PayloadIndexInfoby summing onlypoints(keeping other fields from the first seen segment) matches intended semantics across segments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
lib/collection/src/shards/local_shard/mod.rs(2 hunks)tests/consensus_tests/test_shard_snapshot_deadlock.py(3 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust code
Do not use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice in new code; use bytemuck or zerocopy instead
Files:
lib/collection/src/shards/local_shard/mod.rs
**/src/**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/src/**/*.rs: Prefer exhaustive match arms over a catch-all _ arm to avoid missing new enum variants (except in tests/benchmarks or when provably safe)
Prefer explicit field ignoring with : _ over .. in struct patterns (except in tests/benchmarks or when provably safe)
Files:
lib/collection/src/shards/local_shard/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: lint
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (2)
tests/consensus_tests/test_shard_snapshot_deadlock.py (2)
59-59: Bumping ThreadPoolExecutor to 6 workers is appropriateMatches the additional background task. No concerns.
76-77: Adding collection info pressure is spot onThis exercises the newly offloaded Collection Info path during the deadlock scenario. LGTM.
Follow up on #7267
Same rational and same fix to remove blocking computation from Actix runtime.