
ignore shards with no active replicas#7707

Merged
generall merged 1 commit into dev from handle-unavailable-shard-key on Dec 9, 2025

Conversation

@generall
Member

@generall generall commented Dec 6, 2025

Motivation

For tiered multitenancy we recommend creating a shard_key with Partial state.
This works; however, while the Partial shard key exists, the collection info API stops being available, because one of the shard ids has no active replicas.

Solution

This PR introduces exception logic for the case of custom sharding with an unavailable shard.

@KShivendu KShivendu self-requested a review December 8, 2025 10:04
@generall generall force-pushed the refactor-replica-set branch from 4628518 to 27040c4 Compare December 9, 2025 12:34
Base automatically changed from refactor-replica-set to dev December 9, 2025 13:09
@generall generall force-pushed the handle-unavailable-shard-key branch from 8af2106 to 61bb4f2 Compare December 9, 2025 13:10
@coderabbitai
Contributor

coderabbitai bot commented Dec 9, 2025

📝 Walkthrough


This change adds shard activity checking functionality across the replica set and shard holder modules. A new public method shard_is_active() is introduced to ShardReplicaSet that checks if any replica among peers is active. This method is then integrated into ShardHolder::select_shards to filter out inactive shards when using custom sharding mode. A test helper function scroll_points() is added to facilitate testing the scroll API endpoint during shard validation scenarios.
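The any-replica-active check described in the walkthrough can be sketched with a hypothetical, self-contained model; `ReplicaState`, its variants, and the peer map below are simplified stand-ins for the real types in lib/collection/src/shards/replica_set/mod.rs, not the actual Qdrant implementation:

```rust
use std::collections::HashMap;

// Hypothetical, simplified replica states; the real enum in Qdrant has more
// variants and richer semantics around shard transfers.
#[derive(Debug, Clone, Copy)]
#[allow(dead_code)]
enum ReplicaState {
    Active,
    ActiveRead,
    Partial,
    Dead,
}

impl ReplicaState {
    // Only `Active` counts as active, so `ActiveRead` replicas are treated
    // as inactive, matching how the review describes `is_active`.
    fn is_active(self) -> bool {
        matches!(self, ReplicaState::Active)
    }
}

// Sketch of the described check: a shard is active if any peer replica is active.
fn shard_is_active(replicas: &HashMap<u64, ReplicaState>) -> bool {
    replicas.values().any(|state| state.is_active())
}

fn main() {
    let mut replicas = HashMap::new();
    replicas.insert(1_u64, ReplicaState::Partial);
    replicas.insert(2, ReplicaState::Dead);
    assert!(!shard_is_active(&replicas)); // only Partial/Dead: shard inactive

    replicas.insert(3, ReplicaState::Active);
    assert!(shard_is_active(&replicas)); // one active replica is enough
    println!("ok");
}
```

A shard holding only a Partial tenant replica, as in the PR's motivating scenario, reports inactive under this check and can then be skipped by the caller.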

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • The new shard_is_active() method contains straightforward iteration logic with a boolean return
  • The integration in ShardHolder uses a simple conditional check based on the sharding method flag
  • Test helper function is minimal with basic HTTP request and assertion logic
  • Changes are localized and follow existing patterns
  • Key area for attention: verify that the is_custom_sharding flag derivation correctly identifies custom sharding scenarios and that the filtering logic applies appropriately only in that case


Suggested reviewers

  • timvisee
  • KShivendu

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title 'ignore shards with no active replicas' is directly related to the main change: adding logic to filter out shards without active replicas during custom sharding.
  • Description check (✅ Passed): The description explains the motivation and solution, correctly identifying that the changes address collection info API failures when shards have no active replicas under custom sharding.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/collection/src/shards/replica_set/mod.rs (1)

1083-1093: Shard-wide activity check is correct; could reuse existing helpers

The implementation correctly reports the shard as active if any peer is in an is_active state, which is exactly what you need to distinguish “all Partial/Dead” shards in custom sharding. Note that this uses ReplicaState::is_active, so shards that are only ActiveRead will be treated as inactive, which aligns with how ActiveRead is treated as unhealthy in other health paths. Based on learnings, this is desirable.

For consistency with the rest of the codebase you could optionally implement this in terms of existing helpers, e.g.:

pub fn shard_is_active(&self) -> bool {
    let state = self.replica_state.read();
    !state.active_peers().is_empty()
    // or: !self.active_shards(false).is_empty()
}
lib/collection/src/shards/shard_holder/mod.rs (1)

588-591: Custom-sharding-only filtering of inactive shards in select_shards(All) looks right

This is_custom_sharding flag plus the !shard.shard_is_active() guard cleanly restricts the “skip inactive shards” behavior to custom sharding, leaving auto sharding unchanged and keeping the existing resharding-up guard intact. That matches the intent to ignore shard keys that currently have no active replicas without masking unexpected full-shard outages in auto mode.

As a tiny style nit, you could write the flag as let is_custom_sharding = matches!(self.sharding_method, ShardingMethod::Custom); but the current match is perfectly fine.

Also applies to: 606-613
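The guard described above can be illustrated with a minimal sketch; `Shard`, its `active` flag, and `select_all_shards` are hypothetical stand-ins for `ShardHolder::select_shards` and the surrounding Qdrant types, kept only to show the filtering shape:

```rust
#[derive(Clone, Copy)]
enum ShardingMethod {
    Auto,
    Custom,
}

// Hypothetical stand-in for a shard whose replica set may have no active replicas.
struct Shard {
    id: u32,
    active: bool,
}

// Under custom sharding, shards with no active replicas are skipped;
// under auto sharding every shard is kept, so full-shard outages stay visible.
fn select_all_shards(shards: &[Shard], method: ShardingMethod) -> Vec<u32> {
    let is_custom_sharding = matches!(method, ShardingMethod::Custom);
    shards
        .iter()
        .filter(|shard| !is_custom_sharding || shard.active)
        .map(|shard| shard.id)
        .collect()
}

fn main() {
    let shards = vec![
        Shard { id: 0, active: true },
        Shard { id: 1, active: false }, // e.g. a tenant shard key in Partial state
    ];
    assert_eq!(select_all_shards(&shards, ShardingMethod::Custom), vec![0]);
    assert_eq!(select_all_shards(&shards, ShardingMethod::Auto), vec![0, 1]);
    println!("ok");
}
```

The `matches!` form is the one the style nit suggests; a `match` over `ShardingMethod` with explicit arms behaves identically.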

tests/consensus_tests/test_tenant_promotion.py (1)

29-42: scroll_points helper correctly exercises the scroll API

The helper is straightforward and matches the existing HTTP-test style: it hits /points/scroll, asserts HTTP OK via assert_http_ok, and returns the points, with shard_key passed through so you can test both global and keyed reads. That should be enough to surface the regression around shards with no active replicas.

If you ever want Ruff F405 to be fully quiet in this file, you could replace the star imports with explicit ones (e.g., import requests and from .assertions import assert_http_ok / from .utils import get_collection_info), but that’s purely a linting concern and consistent with other consensus tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 152172f and 61bb4f2.

📒 Files selected for processing (3)
  • lib/collection/src/shards/replica_set/mod.rs (1 hunks)
  • lib/collection/src/shards/shard_holder/mod.rs (2 hunks)
  • tests/consensus_tests/test_tenant_promotion.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • lib/collection/src/shards/shard_holder/mod.rs
  • lib/collection/src/shards/replica_set/mod.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.
📚 Learning: 2025-11-10T14:50:02.722Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.

Applied to files:

  • lib/collection/src/shards/replica_set/mod.rs
🧬 Code graph analysis (1)
tests/consensus_tests/test_tenant_promotion.py (2)
tests/consensus_tests/assertions.py (1)
  • assert_http_ok (6-13)
tests/consensus_tests/utils.py (1)
  • get_collection_info (314-318)
🪛 Ruff (0.14.8)
tests/consensus_tests/test_tenant_promotion.py

30-30: requests may be undefined, or defined from star imports

(F405)


39-39: assert_http_ok may be undefined, or defined from star imports

(F405)


140-140: get_collection_info may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: Build Qdrant Edge Python bindings
  • GitHub Check: e2e-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: lint
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
🔇 Additional comments (1)
tests/consensus_tests/test_tenant_promotion.py (1)

136-141: New global scroll and collection-info checks cover the intended regression

Calling scroll_points(..., shard_key=None, ...) followed by get_collection_info(...) right after creating a Partial tenant shard ensures both the global scroll path and collection info endpoint stay available in the presence of shard keys without active replicas, which directly validates the behavior introduced in this PR.

@generall generall requested a review from timvisee December 9, 2025 15:44
Member

@KShivendu KShivendu left a comment


LGTM and working as expected. Thanks!

@generall generall merged commit 12815e8 into dev Dec 9, 2025
15 checks passed
@generall generall deleted the handle-unavailable-shard-key branch December 9, 2025 15:55
timvisee pushed a commit that referenced this pull request Dec 18, 2025
