Skip to content

Dead shards in /metrics#7310

Merged
timvisee merged 2 commits intodevfrom
metrics-dead-shards
Oct 29, 2025
Merged

Dead shards in /metrics#7310
timvisee merged 2 commits intodevfrom
metrics-dead-shards

Conversation

@JojiiOfficial
Copy link
Contributor

Depends on #7307

Adds dead-shards metrics to /metrics API

@JojiiOfficial JojiiOfficial force-pushed the metrics-unindexed-vectors branch from 892735f to f4288d7 Compare September 26, 2025 15:39
@JojiiOfficial JojiiOfficial force-pushed the metrics-dead-shards branch 2 times, most recently from 1d650d0 to 210fc80 Compare September 26, 2025 15:47
@JojiiOfficial JojiiOfficial force-pushed the metrics-unindexed-vectors branch from 51f9bf3 to c97e534 Compare October 22, 2025 14:23
@JojiiOfficial JojiiOfficial force-pushed the metrics-dead-shards branch 2 times, most recently from 9dd2e19 to 97fea54 Compare October 22, 2025 15:05
@JojiiOfficial JojiiOfficial force-pushed the metrics-unindexed-vectors branch from 7001789 to a580671 Compare October 23, 2025 09:06
Copy link
Member

@timvisee timvisee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review remark below, I think that description is a bit more on-point:

Other than that, all good!

@JojiiOfficial JojiiOfficial force-pushed the metrics-unindexed-vectors branch from 2f25d7d to d382289 Compare October 29, 2025 14:14
@JojiiOfficial JojiiOfficial force-pushed the metrics-unindexed-vectors branch 2 times, most recently from 6a6386c to e7cbba6 Compare October 29, 2025 15:41
Base automatically changed from metrics-unindexed-vectors to dev October 29, 2025 15:44
JojiiOfficial and others added 2 commits October 29, 2025 16:44
Co-authored-by: Tim Visée <tim+github@visee.me>
@timvisee timvisee merged commit c1e5b17 into dev Oct 29, 2025
12 checks passed
@timvisee timvisee deleted the metrics-dead-shards branch October 29, 2025 15:48
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 29, 2025

📝 Walkthrough

Walkthrough

The change adds global dead shard tracking to the metrics system in src/common/metrics.rs. A new total_dead_shards accumulator is introduced and incremented while iterating through collections. A new gauge metric dead_shards_total is populated with this accumulated value, providing a global view of non-active shard replicas across all collections alongside existing per-collection metrics.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single file modification with straightforward metric addition
  • Accumulator pattern is likely consistent with existing code style
  • No public API changes or complex logic density

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "Dead shards in /metrics" is directly and specifically related to the main change in the pull request. The changeset adds a new metric "dead_shards_total" to track dead shards across collections in the /metrics API, which is exactly what the title communicates. The title is concise, clear, and specific enough for a developer scanning history to understand the primary change without being overly verbose.
Description Check ✅ Passed The pull request description states "Adds dead-shards metrics to /metrics API" which is directly related to the changeset that introduces a new "dead_shards_total" metric for tracking total dead shards across collections. The description also appropriately notes the dependency on PR #7307. The description is clearly on-topic and directly addresses what is being changed in this pull request.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch metrics-dead-shards

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c890d3a and 9f39ddf.

📒 Files selected for processing (1)
  • src/common/metrics.rs (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust code
Do not use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice in new code; use bytemuck or zerocopy instead

Files:

  • src/common/metrics.rs
**/src/**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/src/**/*.rs: Prefer exhaustive match arms over a catch-all _ arm to avoid missing new enum variants (except in tests/benchmarks or when provably safe)
Prefer explicit field ignoring with : _ over .. in struct patterns (except in tests/benchmarks or when provably safe)

Files:

  • src/common/metrics.rs
🧬 Code graph analysis (1)
src/common/metrics.rs (2)
lib/collection/src/collection/mod.rs (1)
  • state (542-562)
src/actix/api/service_api.rs (1)
  • metrics (67-96)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: lint
  • GitHub Check: e2e-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
🔇 Additional comments (2)
src/common/metrics.rs (2)

197-197: LGTM - Variable initialization is correct.

The accumulator is properly initialized and placed alongside other metric accumulators.


355-360: Metric definition structure is correct.

The metric family definition is properly structured with an appropriate GAUGE type and clear description. However, ensure the calculation logic at lines 297-302 is fixed to match this description.

Comment on lines +297 to +302
total_dead_shards += collection
.shards
.iter()
.flatten()
.filter(|i| i.replicate_states.values().any(|state| !state.is_active()))
.count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Code counts shards, not replicas as described.

The current implementation counts shards that have at least one non-active replica, but the metric description at line 357 states "total amount of shard replicas in non-active state". This mismatch will produce incorrect metric values.

For example, if a shard has 3 replicas with 2 inactive, this code counts 1 (one problematic shard) rather than 2 (two inactive replicas).

Apply this diff to count actual non-active replicas:

-            total_dead_shards += collection
-                .shards
-                .iter()
-                .flatten()
-                .filter(|i| i.replicate_states.values().any(|state| !state.is_active()))
-                .count();
+            total_dead_shards += collection
+                .shards
+                .iter()
+                .flatten()
+                .flat_map(|shard| shard.replicate_states.values())
+                .filter(|state| !state.is_active())
+                .count();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
total_dead_shards += collection
.shards
.iter()
.flatten()
.filter(|i| i.replicate_states.values().any(|state| !state.is_active()))
.count();
total_dead_shards += collection
.shards
.iter()
.flatten()
.flat_map(|shard| shard.replicate_states.values())
.filter(|state| !state.is_active())
.count();
🤖 Prompt for AI Agents
In src/common/metrics.rs around lines 297 to 302, the code currently counts
shards that contain any non-active replica instead of counting the actual number
of non-active replicas; replace the shard-level filter/count with logic that
iterates the shard replica states and sums each replica where state.is_active()
is false (i.e., count all non-active replicate_states values across all shards
and collections) so the metric reflects the total number of non-active replicas
rather than the number of affected shards.

@coderabbitai coderabbitai bot mentioned this pull request Oct 30, 2025
3 tasks
timvisee added a commit that referenced this pull request Nov 14, 2025
* Dead shards in /metrics

* Update src/common/metrics.rs

Co-authored-by: Tim Visée <tim+github@visee.me>

---------

Co-authored-by: Tim Visée <tim+github@visee.me>
@timvisee timvisee mentioned this pull request Nov 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants