
ignore shards with no active replicas#7707

Merged
generall merged 1 commit into dev from handle-unavailable-shard-key on Dec 9, 2025

Conversation

@generall
Member

@generall generall commented Dec 6, 2025

Motivation

For tiered multitenancy we recommend creating a shard_key with Partial state.
This works; however, while the Partial shard key exists, the collection info API stops being available, because one of the shard ids has no active replicas.

Solution

This PR introduces exception logic for the case of custom sharding with an unavailable shard.

@KShivendu KShivendu self-requested a review December 8, 2025 10:04
@generall generall force-pushed the refactor-replica-set branch from 4628518 to 27040c4 Compare December 9, 2025 12:34
Base automatically changed from refactor-replica-set to dev December 9, 2025 13:09
@generall generall force-pushed the handle-unavailable-shard-key branch from 8af2106 to 61bb4f2 Compare December 9, 2025 13:10
@coderabbitai
Contributor

coderabbitai bot commented Dec 9, 2025

📝 Walkthrough


This change adds shard activity checking functionality across the replica set and shard holder modules. A new public method shard_is_active() is introduced to ShardReplicaSet that checks if any replica among peers is active. This method is then integrated into ShardHolder::select_shards to filter out inactive shards when using custom sharding mode. A test helper function scroll_points() is added to facilitate testing the scroll API endpoint during shard validation scenarios.
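The any-replica-active check described in the walkthrough can be sketched with a hypothetical, self-contained model; `ReplicaState`, its variants, and the peer map below are simplified stand-ins for the real types in lib/collection/src/shards/replica_set/mod.rs, not the actual Qdrant implementation:

```rust
use std::collections::HashMap;

// Hypothetical, simplified replica states; the real enum in Qdrant has more
// variants and richer semantics around shard transfers.
#[derive(Debug, Clone, Copy)]
#[allow(dead_code)]
enum ReplicaState {
    Active,
    ActiveRead,
    Partial,
    Dead,
}

impl ReplicaState {
    // Only `Active` counts as active, so `ActiveRead` replicas are treated
    // as inactive, matching how the review describes `is_active`.
    fn is_active(self) -> bool {
        matches!(self, ReplicaState::Active)
    }
}

// Sketch of the described check: a shard is active if any peer replica is active.
fn shard_is_active(replicas: &HashMap<u64, ReplicaState>) -> bool {
    replicas.values().any(|state| state.is_active())
}

fn main() {
    let mut replicas = HashMap::new();
    replicas.insert(1_u64, ReplicaState::Partial);
    replicas.insert(2, ReplicaState::Dead);
    assert!(!shard_is_active(&replicas)); // only Partial/Dead: shard inactive

    replicas.insert(3, ReplicaState::Active);
    assert!(shard_is_active(&replicas)); // one active replica is enough
    println!("ok");
}
```

A shard holding only a Partial tenant replica, as in the PR's motivating scenario, reports inactive under this check and can then be skipped by the caller.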

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • The new shard_is_active() method contains straightforward iteration logic with a boolean return
  • The integration in ShardHolder uses a simple conditional check based on the sharding method flag
  • Test helper function is minimal with basic HTTP request and assertion logic
  • Changes are localized and follow existing patterns
  • Key area for attention: verify that the is_custom_sharding flag derivation correctly identifies custom sharding scenarios and that the filtering logic applies appropriately only in that case


Suggested reviewers

  • timvisee
  • KShivendu

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title 'ignore shards with no active replicas' is directly related to the main change: adding logic to filter out shards without active replicas during custom sharding.
  • Description check (✅ Passed): The description explains the motivation and solution, correctly identifying that the changes address collection info API failures when shards have no active replicas under custom sharding.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/collection/src/shards/replica_set/mod.rs (1)

1083-1093: Shard-wide activity check is correct; could reuse existing helpers

The implementation correctly reports the shard as active if any peer is in an is_active state, which is exactly what you need to distinguish “all Partial/Dead” shards in custom sharding. Note that this uses ReplicaState::is_active, so shards that are only ActiveRead will be treated as inactive, which aligns with how ActiveRead is treated as unhealthy in other health paths. Based on learnings, this is desirable.

For consistency with the rest of the codebase you could optionally implement this in terms of existing helpers, e.g.:

pub fn shard_is_active(&self) -> bool {
    let state = self.replica_state.read();
    !state.active_peers().is_empty()
    // or: !self.active_shards(false).is_empty()
}
lib/collection/src/shards/shard_holder/mod.rs (1)

588-591: Custom-sharding-only filtering of inactive shards in select_shards(All) looks right

This is_custom_sharding flag plus the !shard.shard_is_active() guard cleanly restricts the “skip inactive shards” behavior to custom sharding, leaving auto sharding unchanged and keeping the existing resharding-up guard intact. That matches the intent to ignore shard keys that currently have no active replicas without masking unexpected full-shard outages in auto mode.

As a tiny style nit, you could write the flag as let is_custom_sharding = matches!(self.sharding_method, ShardingMethod::Custom); but the current match is perfectly fine.

Also applies to: 606-613
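The guard described above can be illustrated with a minimal sketch; `Shard`, its `active` flag, and `select_all_shards` are hypothetical stand-ins for `ShardHolder::select_shards` and the surrounding Qdrant types, kept only to show the filtering shape:

```rust
#[derive(Clone, Copy)]
enum ShardingMethod {
    Auto,
    Custom,
}

// Hypothetical stand-in for a shard whose replica set may have no active replicas.
struct Shard {
    id: u32,
    active: bool,
}

// Under custom sharding, shards with no active replicas are skipped;
// under auto sharding every shard is kept, so full-shard outages stay visible.
fn select_all_shards(shards: &[Shard], method: ShardingMethod) -> Vec<u32> {
    let is_custom_sharding = matches!(method, ShardingMethod::Custom);
    shards
        .iter()
        .filter(|shard| !is_custom_sharding || shard.active)
        .map(|shard| shard.id)
        .collect()
}

fn main() {
    let shards = vec![
        Shard { id: 0, active: true },
        Shard { id: 1, active: false }, // e.g. a tenant shard key in Partial state
    ];
    assert_eq!(select_all_shards(&shards, ShardingMethod::Custom), vec![0]);
    assert_eq!(select_all_shards(&shards, ShardingMethod::Auto), vec![0, 1]);
    println!("ok");
}
```

The `matches!` form is the one the style nit suggests; a `match` over `ShardingMethod` with explicit arms behaves identically.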

tests/consensus_tests/test_tenant_promotion.py (1)

29-42: scroll_points helper correctly exercises the scroll API

The helper is straightforward and matches the existing HTTP-test style: it hits /points/scroll, asserts HTTP OK via assert_http_ok, and returns the points, with shard_key passed through so you can test both global and keyed reads. That should be enough to surface the regression around shards with no active replicas.

If you ever want Ruff F405 to be fully quiet in this file, you could replace the star imports with explicit ones (e.g., import requests and from .assertions import assert_http_ok / from .utils import get_collection_info), but that’s purely a linting concern and consistent with other consensus tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 152172f and 61bb4f2.

📒 Files selected for processing (3)
  • lib/collection/src/shards/replica_set/mod.rs (1 hunks)
  • lib/collection/src/shards/shard_holder/mod.rs (2 hunks)
  • tests/consensus_tests/test_tenant_promotion.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • lib/collection/src/shards/shard_holder/mod.rs
  • lib/collection/src/shards/replica_set/mod.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.
📚 Learning: 2025-11-10T14:50:02.722Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.

Applied to files:

  • lib/collection/src/shards/replica_set/mod.rs
🧬 Code graph analysis (1)
tests/consensus_tests/test_tenant_promotion.py (2)
tests/consensus_tests/assertions.py (1)
  • assert_http_ok (6-13)
tests/consensus_tests/utils.py (1)
  • get_collection_info (314-318)
🪛 Ruff (0.14.8)
tests/consensus_tests/test_tenant_promotion.py

30-30: requests may be undefined, or defined from star imports

(F405)


39-39: assert_http_ok may be undefined, or defined from star imports

(F405)


140-140: get_collection_info may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: Build Qdrant Edge Python bindings
  • GitHub Check: e2e-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: lint
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
🔇 Additional comments (1)
tests/consensus_tests/test_tenant_promotion.py (1)

136-141: New global scroll and collection-info checks cover the intended regression

Calling scroll_points(..., shard_key=None, ...) followed by get_collection_info(...) right after creating a Partial tenant shard ensures both the global scroll path and collection info endpoint stay available in the presence of shard keys without active replicas, which directly validates the behavior introduced in this PR.

@generall generall requested a review from timvisee December 9, 2025 15:44
Member

@KShivendu KShivendu left a comment


LGTM and working as expected. Thanks!

@generall generall merged commit 12815e8 into dev Dec 9, 2025
15 checks passed
@generall generall deleted the handle-unavailable-shard-key branch December 9, 2025 15:55
timvisee pushed a commit that referenced this pull request Dec 18, 2025
