Skip to content

Conversation

@hanabi1224
Copy link
Contributor

@hanabi1224 hanabi1224 commented Sep 15, 2025

Summary of changes

Changes introduced in this pull request:

  • add prometheus metrics actor_version, network_version and network_version_revision

Reference issue to close (if applicable)

Closes #5141

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Summary by CodeRabbit

  • New Features
    • Added Prometheus metrics: network_version, network_version_revision, actor_version; refined network height reporting (head & expected).
  • Public API
    • New network_version_revision(epoch) API; new method to retrieve actor bundle metadata; added conversion NetworkVersion → u32.
  • Documentation
    • Normalized CHANGELOG wording and updated Metrics reference with examples for the new metrics.
  • Tests
    • Added tests validating network version and revision behavior.
  • Chores
    • Removed legacy head_epoch gauge metric.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 15, 2025

Walkthrough

Adds Prometheus metrics for network_version, network_version_revision, and actor_version; threads Arc‑wrapped Weak-backed closures from the daemon into metrics initialization; introduces NetworkVersionCollector and DescriptorEncoder encoding; makes network version logic height-driven; adds StateTree actor-bundle lookup and NetworkVersion→u32 conversion.

Changes

Cohort / File(s) Summary
Changelog & Docs
CHANGELOG.md, docs/docs/users/reference/metrics.md
Minor changelog formatting; document and add examples for network_version, network_version_revision, and actor_version; update build_info example.
Daemon wiring
src/daemon/mod.rs
Add Arc-wrapped Weak-backed closures get_chain_head_height and get_chain_head_actor_version; pass chain_config and closures into init_prometheus; import StateTree for actor-version retrieval; reorganize imports.
Metrics init
src/metrics/mod.rs
Extend init_prometheus signature to accept Arc<ChainConfig>, head-height and actor-version getters; register NetworkVersionCollector; keep existing collectors and HTTP routes.
Network collectors
src/networks/metrics.rs
Make NetworkHeightCollector generic over a head-height closure and move encoding to DescriptorEncoder; add new generic NetworkVersionCollector<F1,F2> that emits network_version, network_version_revision, and actor_version.
Network version logic & enums
src/networks/mod.rs
Derive EnumIter for Height and add many variants; convert version mapping to height-driven logic; add network_version_revision(&self, epoch) -> i64; add unit tests for calibnet; re-export ActorBundleMetadata.
Shim / state tree
src/shim/state_tree.rs
Add pub fn get_actor_bundle_metadata(&self) -> anyhow::Result<&ActorBundleMetadata> to resolve actor bundle metadata from the system actor.
Shim / version conversion
src/shim/version.rs
Add impl From<NetworkVersion> for u32.
Chain store & chain_sync metrics
src/chain/store/chain_store.rs, src/chain_sync/metrics.rs
Remove HEAD_EPOCH gauge registration and its set call; simplify imports in chain_store; adjust metrics surface.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Daemon
  participant ChainStore
  participant Metrics as metrics::init_prometheus
  participant Registry
  participant HeightCollector as NetworkHeightCollector
  participant VersionCollector as NetworkVersionCollector
  participant Prom as Prometheus

  Daemon->>ChainStore: create Weak reference
  Daemon->>Daemon: define Arc-wrapped closures get_chain_head_height(), get_chain_head_actor_version()
  Daemon->>Metrics: call init_prometheus(..., chain_config, get_chain_head_height, get_chain_head_actor_version)
  activate Metrics
  Metrics->>Registry: register HeightCollector(get_chain_head_height)
  Metrics->>Registry: register VersionCollector(chain_config, get_chain_head_height, get_chain_head_actor_version)
  deactivate Metrics

  Prom->>Registry: scrape /metrics
  Registry->>HeightCollector: encode()
  HeightCollector->>Daemon: invoke get_chain_head_height()
  Registry-->>Prom: export head_epoch, expected_network_height
  Registry->>VersionCollector: encode()
  VersionCollector->>Daemon: invoke get_chain_head_height()
  VersionCollector->>Daemon: invoke get_chain_head_actor_version()
  VersionCollector->>Registry: compute network_version, network_version_revision, actor_version
  Registry-->>Prom: export network_version, network_version_revision, actor_version
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • akaladarshi
  • LesnyRumcajs
  • sudo-shashank

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Linked Issues Check ⚠️ Warning The PR implements the core requirement of issue #5141 by adding Prometheus metrics (network_version, network_version_revision, actor_version), updating init_prometheus and collectors, and adding docs and tests to expose these values. However, the linked issue also required adding the metric to the repository's monitoring/ directory, and the provided summary shows no modifications to that monitoring/ path. Because that repository-level monitoring inclusion is part of the issue's completion criteria, the issue is not fully satisfied by the current changeset. Add the necessary monitoring artifacts into the repository's monitoring/ directory (e.g., Prometheus scrape/recording rules, dashboards, or examples) that reference the new metric names and update any monitoring documentation; if those files exist elsewhere, document their location in the PR description and link them so reviewers can verify completeness.
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Out of Scope Changes Check ❓ Inconclusive Most modifications appear aligned with exposing network and actor version metrics (new collectors, init_prometheus args, StateTree helper, docs, and tests), but the PR also contains several non-trivial public/API changes—notably a large expansion of the public Height enum variants and the removal of the HEAD_EPOCH gauge updates in chain_sync/metrics and chain_store—that are not clearly required by the linked issue. From the provided summary alone it is unclear whether those changes are intentional refactors required for the metric work or unrelated scope creep, so I cannot conclusively mark them as acceptable or out-of-scope without author clarification. Request that the author document the rationale for the Height enum additions and the HEAD_EPOCH metric removal in the PR description and, if they are unrelated to the metric objective, split them into separate PRs so reviewers can evaluate them independently; if they are required, add a brief note explaining why they are necessary for the new metrics.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "feat: add prometheus metrics network_version and actor_version" is concise and correctly highlights the primary change (adding Prometheus metrics for network and actor versions); it directly relates to the core code and docs changes in the PR. It is specific and readable, and while it omits the added network_version_revision metric, that omission does not make the title misleading. Overall the title gives a clear, high-level summary suitable for scanning history.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch hm/metric-network-version

Comment @coderabbitai help to get the list of available commands and usage tips.

@hanabi1224 hanabi1224 marked this pull request as ready for review September 15, 2025 10:22
@hanabi1224 hanabi1224 requested a review from a team as a code owner September 15, 2025 10:22
@hanabi1224 hanabi1224 requested review from LesnyRumcajs and akaladarshi and removed request for a team September 15, 2025 10:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
CHANGELOG.md (1)

34-36: Capitalize “Prometheus” for consistency with prior entries.

Earlier entries use “Prometheus metrics”. Minor wording tweak for consistency.

- - [#6079](https://github.com/ChainSafe/forest/pull/6079) Added prometheus metrics `network_height` and `network_version`.
+ - [#6079](https://github.com/ChainSafe/forest/pull/6079) Added Prometheus metrics `network_height` and `network_version`.
src/shim/version.rs (1)

91-95: Match conversion style used elsewhere in this file.

Other impls use u32::from(...). Align for readability.

 impl From<NetworkVersion> for u32 {
     fn from(value: NetworkVersion) -> Self {
-        value.0.into()
+        u32::from(value.0)
     }
 }
src/daemon/mod.rs (1)

207-213: Avoid shadowing chain_store; reuse the same clone for both closures.

Small readability tweak; same behavior.

-        let chain_store = ctx.state_manager.chain_store().clone();
-        let get_chain_head_height = move || chain_store.heaviest_tipset().epoch();
-        let chain_store = ctx.state_manager.chain_store().clone();
-        let get_chain_head_network_version = move || {
-            let epoch = chain_store.heaviest_tipset().epoch();
-            chain_store.chain_config.network_version(epoch)
-        };
+        let chain_store = ctx.state_manager.chain_store().clone();
+        let get_chain_head_height = {
+            let cs = chain_store.clone();
+            move || cs.heaviest_tipset().epoch()
+        };
+        let get_chain_head_network_version = {
+            let cs = chain_store.clone();
+            move || {
+                let epoch = cs.heaviest_tipset().epoch();
+                cs.chain_config.network_version(epoch)
+            }
+        };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da70b92 and d1c8129.

📒 Files selected for processing (5)
  • CHANGELOG.md (1 hunks)
  • src/daemon/mod.rs (3 hunks)
  • src/metrics/mod.rs (3 hunks)
  • src/networks/metrics.rs (1 hunks)
  • src/shim/version.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (4)
src/shim/version.rs (1)
src/networks/mod.rs (1)
  • from (184-222)
src/daemon/mod.rs (3)
src/state_manager/mod.rs (1)
  • chain_store (296-298)
src/blocks/tipset.rs (2)
  • epoch (306-308)
  • epoch (543-545)
src/metrics/mod.rs (1)
  • init_prometheus (68-103)
src/metrics/mod.rs (1)
src/networks/metrics.rs (2)
  • new (30-36)
  • new (93-97)
src/networks/metrics.rs (2)
src/networks/mod.rs (3)
  • calculate_expected_epoch (571-577)
  • epoch (449-462)
  • from (184-222)
src/shim/version.rs (8)
  • from (86-88)
  • from (92-94)
  • from (98-100)
  • from (104-106)
  • from (110-112)
  • from (116-118)
  • from (122-124)
  • from (128-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build MacOS
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build Ubuntu
  • GitHub Check: All lint checks
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: tests
  • GitHub Check: tests-release
🔇 Additional comments (5)
src/metrics/mod.rs (1)

68-74: Wiring for network_version collector looks good.

Signature extension and registration are clear and minimally invasive.

Also applies to: 90-93

src/daemon/mod.rs (1)

215-223: init_prometheus call updated correctly.

Closure handoff matches the new API.

src/networks/metrics.rs (3)

14-37: Generic NetworkHeightCollector with live getters — LGTM.

Encodes network_height and expected_network_height correctly; uses saturating math and ChainEpoch gauge.

Also applies to: 39-77


79-117: New NetworkVersionCollector — LGTM.

Numeric encoding via u32::from(NetworkVersion) is appropriate; descriptor/help strings are clear.


4-9: Resolved — educe is declared in Cargo.toml (line 66) with the Debug feature enabled.
No further action required.

Copy link
Member

@LesnyRumcajs LesnyRumcajs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 210 to 213
let get_chain_head_network_version = move || {
let epoch = chain_store.heaviest_tipset().epoch();
chain_store.chain_config.network_version(epoch)
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it resolves the issue we had with calibnet nodes not updated? #5141 (comment)

Copy link
Contributor Author

@hanabi1224 hanabi1224 Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember using a string Gauge value breaks the Grafana scrapper. How about adding a network_version_revision to distinguish the same network version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me rephrase; you're checking what the network is supposed to be according to the schedule. What is more useful for monitoring the health of the nodes, is checking the actual network version (though the expected version is also good to have, so consumers can set up an alarm if there's a mismatch).

Copy link
Contributor Author

@hanabi1224 hanabi1224 Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LesnyRumcajs I think with network_version_revision, the issue could be resolved. Please let me know your thoughts.

Watermelon -> v21, rev 0
WatermelonFix -> v21, rev 1
WatermelonFix2 -> v21, rev2

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/daemon/mod.rs (1)

207-224: Unify Weak handle, avoid hardcoded V0 fallback, and tighten wording

  • Reuse a single Weak for both closures to avoid duplicate downgrades.
  • Prefer a fallback derived from chain_config instead of hardcoding NetworkVersion::V0 to reduce misleading scrapes when ChainStore is unavailable.
  • Minor: tweak comment phrasing “dead lock” → “deadlock”.

Apply this diff:

-        // Use `Weak` to not dead lock GC.
-        let chain_store = Arc::downgrade(ctx.state_manager.chain_store());
-        let get_chain_head_height = move || {
-            chain_store
-                .upgrade()
-                .map(|cs| cs.heaviest_tipset().epoch())
-                .unwrap_or_default()
-        };
-        // Use `Weak` to not dead lock GC.
-        let chain_store = Arc::downgrade(ctx.state_manager.chain_store());
-        let get_chain_head_network_version = move || {
-            if let Some(cs) = chain_store.upgrade() {
-                let epoch = cs.heaviest_tipset().epoch();
-                cs.chain_config.network_version(epoch)
-            } else {
-                NetworkVersion::V0
-            }
-        };
+        // Use a Weak to avoid deadlocking GC and to not keep ChainStore alive.
+        let chain_store_weak = Arc::downgrade(ctx.state_manager.chain_store());
+        let chain_config = ctx.state_manager.chain_config().clone();
+
+        let get_chain_head_height = {
+            let chain_store_weak = chain_store_weak.clone();
+            move || {
+                chain_store_weak
+                    .upgrade()
+                    .map(|cs| cs.heaviest_tipset().epoch())
+                    .unwrap_or_default()
+            }
+        };
+
+        let get_chain_head_network_version = {
+            let chain_store_weak = chain_store_weak.clone();
+            let chain_config = chain_config.clone();
+            move || {
+                chain_store_weak
+                    .upgrade()
+                    .map(|cs| chain_config.network_version(cs.heaviest_tipset().epoch()))
+                    .unwrap_or_else(|| chain_config.network_version(ChainEpoch::default()))
+            }
+        };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1c8129 and 7eda899.

📒 Files selected for processing (1)
  • src/daemon/mod.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/daemon/mod.rs (3)
src/state_manager/mod.rs (1)
  • chain_store (296-298)
src/blocks/tipset.rs (2)
  • epoch (306-308)
  • epoch (543-545)
src/metrics/mod.rs (1)
  • init_prometheus (68-103)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: All lint checks
🔇 Additional comments (3)
src/daemon/mod.rs (3)

18-21: Import consolidation LGTM

Grouped use improves readability without behavioral change.


243-244: NetworkHeightCollector wiring LGTM

Passing the live head-height getter should make network_height accurate at scrape time.


226-234: Resolved — init_prometheus updated; no remaining 3‑arg call sites found

Found: src/daemon/mod.rs:226 (init_prometheus call); src/metrics/mod.rs:91 (NetworkVersionCollector::new called with the getter). No other init_prometheus / old 3‑arg usages detected.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/networks/mod.rs (1)

436-444: Consider caching Height iteration results for performance.

The network_version_revision function iterates through all Heights each time it's called. Since Heights are static and their NetworkVersion mappings don't change, consider caching the first height for each NetworkVersion to avoid repeated iteration in hot paths.

This could be optimized with a lazy static lookup table:

+use std::collections::HashMap;
+
+static FIRST_HEIGHT_BY_VERSION: LazyLock<HashMap<NetworkVersion, Height>> = LazyLock::new(|| {
+    let mut map = HashMap::new();
+    for height in Height::iter() {
+        let nv = NetworkVersion::from(height);
+        map.entry(nv).or_insert(height);
+    }
+    map
+});

 pub fn network_version_revision(&self, epoch: ChainEpoch) -> i64 {
     if let Some(height) = self.network_height(epoch) {
         let nv = NetworkVersion::from(height);
-        if let Some(rev0_height) = Height::iter().find(|h| NetworkVersion::from(*h) == nv) {
+        if let Some(&rev0_height) = FIRST_HEIGHT_BY_VERSION.get(&nv) {
             return (height as i64) - (rev0_height as i64);
         }
     }
     0
 }
src/networks/metrics.rs (1)

53-64: Consider documenting metric units and ranges.

The network_height metric would benefit from documentation about its units (epochs) and expected range in the metric description.

             let metric_encoder = encoder.encode_descriptor(
                 "network_height",
-                "The current network height",
+                "The current network height in epochs (blocks since genesis)",
                 None,
                 network_height.metric_type(),
             )?;
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c49a095 and 4a5eb4f.

📒 Files selected for processing (5)
  • docs/docs/users/reference/metrics.md (1 hunks)
  • src/daemon/mod.rs (3 hunks)
  • src/metrics/mod.rs (3 hunks)
  • src/networks/metrics.rs (1 hunks)
  • src/networks/mod.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/docs/users/reference/metrics.md
  • src/daemon/mod.rs
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (3)
src/metrics/mod.rs (4)
src/networks/mod.rs (1)
  • chain_config (737-742)
src/state_manager/mod.rs (2)
  • chain_config (244-246)
  • new (181-186)
src/rpc/mod.rs (1)
  • chain_config (474-476)
src/networks/metrics.rs (2)
  • new (32-42)
  • new (100-105)
src/networks/mod.rs (2)
src/shim/version.rs (8)
  • from (86-88)
  • from (92-94)
  • from (98-100)
  • from (104-106)
  • from (110-112)
  • from (116-118)
  • from (122-124)
  • from (128-130)
src/rpc/methods/chain.rs (1)
  • calibnet (1251-1256)
src/networks/metrics.rs (1)
src/networks/mod.rs (7)
  • calculate_expected_epoch (590-596)
  • network_height (416-423)
  • epoch (468-481)
  • network_version (427-432)
  • network_version (745-755)
  • from (186-224)
  • network_version_revision (436-444)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: tests-release
  • GitHub Check: tests
  • GitHub Check: All lint checks
  • GitHub Check: Deploy to Cloudflare Pages
🔇 Additional comments (11)
src/metrics/mod.rs (3)

6-6: LGTM! Clean import additions for network metrics support.

The new imports are appropriate for the network metrics functionality being added.


72-73: Good API design using generic closures for decoupling.

The function signature update properly extends the API with the required parameters for network metrics. Using Arc<impl Fn() -> ChainEpoch + Send + Sync + 'static> provides good flexibility and thread-safety for the height getter closure.


91-93: LGTM! Collector registration follows established patterns.

The NetworkVersionCollector registration is properly integrated into the existing metrics initialization flow.

src/networks/mod.rs (5)

13-15: LGTM! Required imports for Height iteration support.

The strum imports enable the EnumIter functionality needed for the new network version revision logic.


139-177: Comprehensive Height enum expansion for all network upgrades.

The Height enum now includes all network upgrade heights through GoldenWeek (V27), properly supporting the full upgrade history. The EnumIter derive enables iteration for the revision calculation logic.


426-432: Good defensive programming with genesis network version fallback.

The network version calculation properly handles edge cases by falling back to genesis version and ensuring the result is never below the genesis network version using .max().


744-755: Excellent test coverage for network version boundaries.

The tests properly validate both network version transitions and revision calculations at critical epoch boundaries, ensuring the metrics will report accurate values.


416-423: Resolved: strict > in network_height is intentional.

Tests in src/networks/mod.rs assert network_version(1_013_134) == NetworkVersion::V20 and network_version(1_013_134 + 1) == NetworkVersion::V21, confirming upgrades activate on the epoch after the listed upgrade epoch; no change required.

src/networks/metrics.rs (3)

18-26: Good use of generics for flexible height retrieval.

The generic design with F: Fn() -> ChainEpoch provides excellent flexibility for different sources of chain height while maintaining type safety.


127-136: Clear metric for distinguishing same-version upgrades.

The network_version_revision metric elegantly handles the case where multiple upgrades don't bump the network version (e.g., WatermelonFix, WatermelonFix2 all at V21), providing valuable operational insight.


117-117: Conversion safe — no action required

impl From for u32 delegates to the inner value (value.0.into()), and NetworkVersion is created from arbitrary u32 via NetworkVersion_latest::new(...) (see src/shim/version.rs), so u32::from(network_version) is safe for all NetworkVersion values.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/daemon/mod.rs (2)

210-215: Head-height getter fallback: confirm semantics and observability.

If the Weak can’t upgrade, the metric reports height 0. That could silently skew network_height/network_version during teardown/restart windows. Consider logging once on fallback or documenting this behavior in the metric HELP so operators know 0 can mean “store unavailable.”


208-209: Nit: tighten the comment.

Change “dead lock” → “deadlock” and clarify the reason (avoiding keeping the store alive and blocking GC).

-        // Use `Weak` to not dead lock GC.
+        // Use a Weak to avoid keeping the chain store alive and risking GC deadlock.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b5bd97 and 1bea7fe.

📒 Files selected for processing (1)
  • src/daemon/mod.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/daemon/mod.rs (4)
src/state_manager/mod.rs (3)
  • chain_store (296-298)
  • new (181-186)
  • chain_config (244-246)
src/networks/metrics.rs (2)
  • new (32-42)
  • new (100-105)
src/metrics/mod.rs (3)
  • new (150-152)
  • init_prometheus (68-104)
  • default_registry (26-28)
src/daemon/context.rs (1)
  • chain_config (65-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build Ubuntu
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build MacOS
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: All lint checks
  • GitHub Check: Check
🔇 Additional comments (3)
src/daemon/mod.rs (3)

18-21: LGTM: import consolidation reads clearer.


232-241: LGTM: registering NetworkHeightCollector with live head-height source.

Passing the getter through aligns the collector with the node’s current head.


558-560: LGTM: reset the default Prometheus registry before (re)startup.

Prevents duplicate collectors across restarts within the same process.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/daemon/mod.rs (1)

234-249: Fix: spawned future must be 'static — use async move and clone captured Arcs

As noted in the earlier review, JoinSet::spawn requires a 'static future. The current async block borrows stack locals and will fail to compile or is fragile.

Apply this diff:

         services.spawn({
             let chain_config = ctx.chain_config().clone();
-            let get_chain_head_height = get_chain_head_height.clone();
-            async {
+            let get_chain_head_height = std::sync::Arc::clone(&get_chain_head_height);
+            let get_chain_head_actor_version =
+                std::sync::Arc::clone(&get_chain_head_actor_version);
+            async move {
                 crate::metrics::init_prometheus(
                     prometheus_listener,
                     db_directory,
                     db,
                     chain_config,
                     get_chain_head_height,
-                    get_chain_head_actor_version,
+                    get_chain_head_actor_version,
                 )
                 .await
                 .context("Failed to initiate prometheus server")
             }
         });
🧹 Nitpick comments (8)
src/shim/state_tree.rs (1)

200-207: Resolve actor bundle metadata efficiently; minor ergonomics

Looks correct and returns a stable reference. Two nits:

  • This does a linear scan over ACTOR_BUNDLES_METADATA on every call. If this ends up hot, consider a secondary map keyed by system actor code Cid for O(1) lookup.
  • The returned reference is effectively 'static; current signature is fine, but adding a brief doc comment on lifetime/source (static registry) would help future maintainers.

If you expect this method to be called per-scrape for metrics, we can add a cache keyed by the resolved system code. Want a small follow-up patch?

src/daemon/mod.rs (2)

219-233: Avoid per-scrape heavy state reconstruction (optional)

Each scrape rebuilds a StateTree and scans bundle metadata. If scrape frequency is high, this can add load. Consider caching the actor version by epoch (or tipset key) with a small TTL, invalidated on head changes.

I can wire this to the existing head-change publisher and only recompute on Apply. Interested?


577-579: Resetting the global Prometheus registry: confirm lifecycle

Overwriting the default registry is fine at process start, but on internal restarts it can drop already-registered custom collectors if any other tasks still hold references. Confirm there’s no concurrent metrics server running when this executes.

If you want a safer pattern, we can guard this behind a “first init” OnceCell or replace it with an explicit “clear” API on the registry.

src/networks/mod.rs (4)

185-224: Validate NV mappings for new heights

Mappings for Teep/Tock/TockFix/GoldenWeek imply NV25–NV27. Ensure these align with lotus/build constants for each network; drift here will misreport network_version.

If helpful, I can add a small test asserting that for each Height the NV matches lotus’ tables.


416-423: Off-by-one boundary: upgrade effective at epoch or epoch+1?

network_height uses epoch > info.epoch, which makes the upgrade effective starting at info.epoch + 1. Your calibnet test expects that, but please confirm this matches canonical behavior across networks (Lotus often treats upgrades as effective at the named epoch). If not universally true, we may need a per-network flag or >=.

I can add cross-check tests against known mainnet/calibnet boundaries to lock this down.


434-444: Revision math depends on enum order; add a guard

Computing revision as height - first_height_with_same_nv relies on Height variants of the same NV being contiguous and ordered. Add a test to assert contiguity so future insertions don’t silently skew revision numbers.

Apply this test addition:

@@
 #[cfg(test)]
 mod tests {
     use super::*;
+    #[test]
+    fn revisions_are_contiguous_per_nv() {
+        use std::collections::HashMap;
+        let mut groups: HashMap<NetworkVersion, Vec<Height>> = HashMap::new();
+        for h in Height::iter() {
+            groups.entry(NetworkVersion::from(h)).or_default().push(h);
+        }
+        for (_nv, hs) in groups {
+            // Ensure all heights for the NV form a single contiguous run in declaration order
+            let idxs: Vec<_> = hs.iter().map(|h| *h as i64).collect();
+            let min = *idxs.iter().min().unwrap();
+            let max = *idxs.iter().max().unwrap();
+            assert_eq!(idxs.len() as i64, max - min + 1, "non-contiguous NV group");
+        }
+    }

47-49: Newest NV constant appears stale

NEWEST_NETWORK_VERSION is V25, but Height includes NV26 and NV27. If this constant is used anywhere, it should be updated to V27.

src/networks/metrics.rs (1)

49-83: Height metrics look correct; consider now() source

Encoding both actual and expected heights is great. Minor: using chrono::Utc::now() is fine; if you already centralize time sources elsewhere (e.g., a Clock trait), consider using it for testability.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bea7fe and d7a681d.

📒 Files selected for processing (7)
  • CHANGELOG.md (1 hunks)
  • docs/docs/users/reference/metrics.md (3 hunks)
  • src/daemon/mod.rs (5 hunks)
  • src/metrics/mod.rs (3 hunks)
  • src/networks/metrics.rs (1 hunks)
  • src/networks/mod.rs (5 hunks)
  • src/shim/state_tree.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/metrics/mod.rs
  • CHANGELOG.md
  • docs/docs/users/reference/metrics.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (3)
src/networks/metrics.rs (1)
src/networks/mod.rs (7)
  • calculate_expected_epoch (590-596)
  • network_height (416-423)
  • epoch (468-481)
  • network_version (427-432)
  • network_version (745-755)
  • from (186-224)
  • network_version_revision (436-444)
src/networks/mod.rs (2)
src/networks/actors_bundle.rs (1)
  • get_actor_bundles_metadata (149-171)
src/shim/version.rs (8)
  • from (86-88)
  • from (92-94)
  • from (98-100)
  • from (104-106)
  • from (110-112)
  • from (116-118)
  • from (122-124)
  • from (128-130)
src/daemon/mod.rs (5)
src/metrics/mod.rs (3)
  • new (155-157)
  • init_prometheus (68-109)
  • default_registry (26-28)
src/networks/metrics.rs (2)
  • new (32-42)
  • new (104-114)
src/state_manager/mod.rs (3)
  • new (181-186)
  • chain_store (296-298)
  • chain_config (244-246)
src/shim/state_tree.rs (1)
  • new_from_root (168-188)
src/shim/state_tree_v0.rs (1)
  • new_from_root (34-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: All lint checks
  • GitHub Check: tests-release
  • GitHub Check: Build MacOS
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: tests
  • GitHub Check: Build Ubuntu
  • GitHub Check: Build forest binaries on Linux AMD64
🔇 Additional comments (11)
src/shim/state_tree.rs (1)

8-11: LGTM: needed imports and re-exports

Brings in ACTOR_BUNDLES_METADATA and ActorBundleMetadata for the new helper; no concerns.

src/daemon/mod.rs (2)

209-218: LGTM: head-height getter uses Weak to avoid GC deadlocks

Good choice to avoid holding strong refs in a long-lived metrics closure.


251-260: LGTM: NetworkHeightCollector registration

Collector wiring passes correct params and the same height getter used by the server; consistent.

src/networks/mod.rs (5)

13-16: LGTM: Enum iteration support

EnumIter/IntoEnumIterator additions are appropriate for revision computation.


32-34: LGTM: public re-exports

Re-exporting ActorBundleMetadata enables shim access; consistent with usage in state_tree.


138-177: LGTM: Height variants extended

New heights look consistent and maintain contiguous grouping for fixes.


425-433: LGTM: network_version derived from height with genesis floor

Clearer and easier to reason about than bespoke epoch checks.


744-755: LGTM: boundary tests for calibnet

Good coverage of the epoch/off-by-one and revision=0→1 transition.

src/networks/metrics.rs (3)

4-11: LGTM: collector scaffolding and DescriptorEncoder usage

Imports and the switch to DescriptorEncoder are consistent with prometheus_client APIs.


85-97: LGTM: version collector fields and ownership

Arcs and ignored Debug fields keep logs clean; no issues.


117-161: LGTM: emits NV, revision, and actor version

Metric names and help strings are clear; conversions to Gauge types look correct.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
src/networks/metrics.rs (5)

53-64: Optimize gauge allocation by reusing instances.

Creating new Gauge instances on every encode call is inefficient. Consider storing the gauges as fields in the collector struct to avoid repeated allocations during metric collection.

 pub struct NetworkHeightCollector<F>
 where
     F: Fn() -> ChainEpoch,
 {
     block_delay_secs: u32,
     genesis_timestamp: u64,
+    network_height: Gauge,
+    expected_network_height: Gauge,
     #[educe(Debug(ignore))]
     get_chain_head_height: Arc<F>,
 }

Then update the constructor and encode method accordingly:

     pub fn new(
         block_delay_secs: u32,
         genesis_timestamp: u64,
         get_chain_head_height: Arc<F>,
     ) -> Self {
         Self {
             block_delay_secs,
             genesis_timestamp,
+            network_height: Default::default(),
+            expected_network_height: Default::default(),
             get_chain_head_height,
         }
     }
-            let network_height: Gauge = Default::default();
             let epoch = (self.get_chain_head_height)();
-            network_height.set(epoch);
+            self.network_height.set(epoch);
             let metric_encoder = encoder.encode_descriptor(
                 "network_height",
                 "The current network height",
                 None,
-                network_height.metric_type(),
+                self.network_height.metric_type(),
             )?;
-            network_height.encode(metric_encoder)?;
+            self.network_height.encode(metric_encoder)?;

127-127: Potential integer overflow when casting network version.

The cast from u32 to gauge value type (likely i64) and then the unnecessary intermediate cast to _ could mask potential issues. Use explicit typing for clarity.

-            nv_gauge.set(u32::from(network_version) as _);
+            nv_gauge.set(i64::from(u32::from(network_version)));

151-151: Remove unnecessary cast for actor version.

The actor version is already u64, and the unnecessary cast to _ reduces code clarity.

-            av_gauge.set(actor_version as _);
+            av_gauge.set(actor_version as i64);

122-161: Consider adding metric labels for better observability.

Network version metrics could benefit from additional labels like the actual version name (e.g., "v21", "v22") to make dashboards more readable without requiring lookups.

Would you like me to help implement labeled metrics that include both the numeric value and a human-readable version string as a label? This would make Grafana dashboards more intuitive.


85-162: Add documentation for the new collector.

The NetworkVersionCollector struct and its public methods lack documentation. Consider adding doc comments to explain the purpose, usage, and the meaning of each metric it exposes.

+/// Collector for network version related metrics.
+/// 
+/// This collector exposes:
+/// - `network_version`: The current network version based on chain epoch
+/// - `network_version_revision`: The revision number within the current network version
+/// - `actor_version`: The actor bundle version at the current chain head
 #[derive(Educe)]
 #[educe(Debug)]
 pub struct NetworkVersionCollector<F1, F2>
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d7a681d and 2be478f.

📒 Files selected for processing (7)
  • CHANGELOG.md (1 hunks)
  • docs/docs/users/reference/metrics.md (3 hunks)
  • src/daemon/mod.rs (5 hunks)
  • src/metrics/mod.rs (3 hunks)
  • src/networks/metrics.rs (1 hunks)
  • src/networks/mod.rs (5 hunks)
  • src/shim/state_tree.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/metrics/mod.rs
  • CHANGELOG.md
  • docs/docs/users/reference/metrics.md
  • src/shim/state_tree.rs
  • src/networks/mod.rs
  • src/daemon/mod.rs
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/networks/metrics.rs (2)
src/networks/mod.rs (8)
  • calculate_expected_epoch (590-596)
  • network_height (416-423)
  • epoch (468-481)
  • chain_config (737-742)
  • network_version (427-432)
  • network_version (745-755)
  • from (186-224)
  • network_version_revision (436-444)
src/rpc/mod.rs (1)
  • chain_config (474-476)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build Ubuntu
  • GitHub Check: Build MacOS
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: All lint checks
🔇 Additional comments (1)
src/networks/metrics.rs (1)

45-83: Handle Weak::upgrade() failures when collecting network metrics

src/networks/metrics.rs (lines 45–83): The collector invokes a closure to read chain height. If that closure upgrades a Weak<RwLock>, upgrade() can fail during collection and cause the metric to record a default/stale value. Ensure the closure handles upgrade failures deterministically (return a safe default, skip emission, and/or log an error). Repository search did not locate the closure implementation; confirm its behavior and update accordingly.

| `peer_failure_total` | Counter | Count | Total number of failed peer requests |
| `full_peers` | Gauge | Count | Number of healthy peers recognized by the node |
| `bad_peers` | Gauge | Count | Number of bad peers recognized by the node |
| `network_height` | Gauge | Count | The current network height |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be good to explain the difference between those metrics in the docs (dedicated section).

head_epoch 3021420
network_height 3021421
expected_network_height 3023095

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized network_height is a duplicate of head_epoch, removed to avoid confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanabi1224 It's not exactly a duplicate, is it? I mean, the values differ, and from what I saw, head_epoch is not available just when the node starts.

Copy link
Contributor Author

@hanabi1224 hanabi1224 Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LesnyRumcajs The metrics are the same, and the difference is due to how the value is updated. I can update head_epoch to use the collector approach which is more reliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/daemon/mod.rs (1)

234-249: Fix: spawned future must be 'static — use async move.

Without move, the async block borrows stack locals (prometheus_listener, db_directory, db, closures), which won’t satisfy JoinSet::spawn’s 'static bound.

Apply this diff:

-        services.spawn({
+        services.spawn({
             let chain_config = ctx.chain_config().clone();
             let get_chain_head_height = get_chain_head_height.clone();
-            async {
+            async move {
                 crate::metrics::init_prometheus(
                     prometheus_listener,
                     db_directory,
                     db,
                     chain_config,
                     get_chain_head_height,
                     get_chain_head_actor_version,
                 )
                 .await
                 .context("Failed to initiate prometheus server")
             }
         });
🧹 Nitpick comments (1)
src/daemon/mod.rs (1)

219-233: Actor-version getter: verify epoch selection and consider lightweight caching.

  • Using heaviest_tipset().parent_state() may be off by one at upgrade boundaries; verify against a known upgrade epoch that the reported value matches “actual” actor bundle version for that tipset.
  • Building a StateTree every scrape can be expensive; consider memoizing by (tipset_epoch or parent_state_cid) to avoid repeated reads between scrapes.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2be478f and 9965521.

📒 Files selected for processing (2)
  • CHANGELOG.md (1 hunks)
  • src/daemon/mod.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • CHANGELOG.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/daemon/mod.rs (5)
src/networks/metrics.rs (2)
  • new (32-42)
  • new (104-114)
src/metrics/mod.rs (2)
  • new (155-157)
  • init_prometheus (68-109)
src/state_manager/mod.rs (3)
  • new (181-186)
  • chain_store (296-298)
  • chain_config (244-246)
src/shim/state_tree.rs (1)
  • new_from_root (168-188)
src/daemon/context.rs (1)
  • chain_config (65-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Check
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: All lint checks
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: Deploy to Cloudflare Pages
🔇 Additional comments (4)
src/daemon/mod.rs (4)

18-21: Import re-org looks good.

Grouping AppContext, DbType, and import_chain_as_forest_car improves clarity.


31-31: StateTree import is appropriate.

Needed for actor-version lookup; no concerns.


209-218: Height getter via Weak is solid.

Avoids GC deadlocks and returns a sane default on None.


251-259: NetworkHeightCollector wiring looks correct.

Passing block_delay_secs, genesis_timestamp, and the height getter matches the new collector API.

@hanabi1224 hanabi1224 force-pushed the hm/metric-network-version branch from 35e1087 to 2bebfc3 Compare September 16, 2025 11:30
@hanabi1224 hanabi1224 changed the title feat: add prometheus metrics network_height and network_version feat: add prometheus metrics network_version and actor_version Sep 16, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/daemon/mod.rs (1)

252-260: network_height metric is missing from the code path registering collectors.

You only register NetworkHeightCollector (which currently exposes expected_network_height only) and NetworkVersionCollector (which emits version, revision, actor version). The PR title promises network_height, but it is not emitted.

Apply one of the following:

  • Option A (prefer): emit network_height from NetworkVersionCollector using the same epoch it already fetches.
  • Option B: extend NetworkHeightCollector to also emit network_height using a head-height getter.

See code suggestions under src/networks/metrics.rs.

src/networks/metrics.rs (1)

31-52: NetworkHeightCollector emits only expected_network_height; add network_height or rename.

Given the PR title and docs expectations, this collector should also emit network_height (derived from current head), or be clearly renamed to ExpectedNetworkHeightCollector. Prefer emitting both.

Minimal change (add network_height here using the head epoch) requires a getter. If you want to avoid changing this type’s constructor, add network_height in NetworkVersionCollector instead (see below).

♻️ Duplicate comments (2)
docs/docs/users/reference/metrics.md (1)

20-22: Missing network_height doc and unclear distinctions vs. head_epoch/expected_network_height.

  • PR title promises network_height, but it's not documented in the table.
  • Past review asked to clarify differences; please add a short section explaining how these three relate.

Apply this doc diff to add network_height and a clarifying section:

@@
-| `expected_network_height`      | Gauge     | Count   | The expected network height based on the current time and the genesis block time             |
+| `expected_network_height`      | Gauge     | Count   | The expected network height based on the current time and the genesis block time             |
+| `network_height`               | Gauge     | Count   | Network height derived from the current chain head (see notes below)                         |
 | `network_version`              | Gauge     | Count   | Network version of the current chain head                                                    |
 | `network_version_revision`     | Gauge     | Count   | Network version revision of the current chain head                                           |
 | `actor_version`                | Gauge     | Count   | Actor version of the current chain head                                                      |
@@
 </details>
 
+<details>
+  <summary>Example `network_height` output</summary>
+```
+# HELP network_height Network height derived from the current chain head
+# TYPE network_height gauge
+network_height 3021421
+```
+</details>
+
+### Notes: head_epoch vs. network_height vs. expected_network_height
+- head_epoch: the epoch of the node’s heaviest tipset (what you actually synced to).
+- network_height: height derived from the current head for operational dashboards; intended to track the “current” network height as observed by this node.
+- expected_network_height: height computed from wall‑clock time, genesis timestamp, and block delay; use this to detect lag (expected − head_epoch).
+

head_epoch 3021420
network_height 3021421
expected_network_height 3023095

src/daemon/mod.rs (1)

209-249: Fix: spawn a 'static future with async move (compile blocker).

JoinSet::spawn requires a 'static future. The current async { ... } block borrows stack locals (prometheus_listener, db_directory, db, chain_config, closures), which won’t compile.

Apply:

-            async {
+            async move {
                 crate::metrics::init_prometheus(
                     prometheus_listener,
                     db_directory,
                     db,
                     chain_config,
                     get_chain_head_height,
                     get_chain_head_actor_version,
                 )
                 .await
                 .context("Failed to initiate prometheus server")
             }
🧹 Nitpick comments (3)
docs/docs/users/reference/metrics.md (1)

294-320: Examples look good; add one for network_height to align with the table.

Without an example, users may miss how it renders or differs from head_epoch. See suggested diff above.

src/daemon/mod.rs (1)

221-237: Heavy work per scrape for actor_version; consider caching.

Constructing a StateTree and reading bundle metadata on every Prometheus scrape is expensive and can skew scrape latency. Prefer computing on head changes and publishing a cached atomic value.

If you want, I can draft a small subscriber that updates an AtomicI64 on HeadChange::Apply and have the getter read it.

src/networks/metrics.rs (1)

36-43: Nit: avoid lossy cast from i64 to u64 for timestamps.

Utc::now().timestamp() returns i64. Casting to u64 is fine in practice, but use try_into().unwrap_or_default() to avoid surprising wrap in hypothetical negative times (tests, mocks).

-        let expected_epoch = calculate_expected_epoch(
-            chrono::Utc::now().timestamp() as u64,
+        let now_secs: u64 = chrono::Utc::now().timestamp().try_into().unwrap_or_default();
+        let expected_epoch = calculate_expected_epoch(
-            self.genesis_timestamp,
+            now_secs,
             self.block_delay_secs,
         );
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9965521 and 2bebfc3.

📒 Files selected for processing (3)
  • docs/docs/users/reference/metrics.md (2 hunks)
  • src/daemon/mod.rs (3 hunks)
  • src/networks/metrics.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (2)
src/networks/metrics.rs (1)
src/networks/mod.rs (6)
  • calculate_expected_epoch (590-596)
  • epoch (468-481)
  • network_version (427-432)
  • network_version (745-755)
  • from (186-224)
  • network_version_revision (436-444)
src/daemon/mod.rs (4)
src/state_manager/mod.rs (3)
  • chain_config (244-246)
  • new (181-186)
  • chain_store (296-298)
src/daemon/context.rs (1)
  • chain_config (65-67)
src/metrics/mod.rs (2)
  • new (155-157)
  • init_prometheus (68-109)
src/shim/state_tree.rs (1)
  • new_from_root (168-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: tests-release
  • GitHub Check: tests
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: All lint checks
🔇 Additional comments (1)
src/networks/metrics.rs (1)

86-131: Emit missing network_height metric in NetworkVersionCollector::encode

Add a Gauge for network_height (derived from the current head) before emitting network_version so the PR/docs match the implementation.

Apply:

     fn encode(&self, mut encoder: DescriptorEncoder) -> Result<(), std::fmt::Error> {
         let epoch = (self.get_chain_head_height)();
+        // network_height: derived from the current chain head
+        {
+            let nh_gauge: Gauge = Default::default();
+            nh_gauge.set(epoch);
+            let metric_encoder = encoder.encode_descriptor(
+                "network_height",
+                "Network height derived from the current chain head",
+                None,
+                nh_gauge.metric_type(),
+            )?;
+            nh_gauge.encode(metric_encoder)?;
+        }
         {
             let network_version = self.chain_config.network_version(epoch);
             let nv_gauge: Gauge = Default::default();
             nv_gauge.set(u32::from(network_version) as _);

If “network_height” should instead mean the protocol upgrade height for the head’s epoch, use:

-            nh_gauge.set(epoch);
+            let height = self
+                .chain_config
+                .network_height(epoch)
+                .map(|h| h as i64)
+                .unwrap_or_default();
+            nh_gauge.set(height);

Confirm intended semantics for network_height:

  • A) current head epoch,
  • B) wall‑clock expected epoch (already exported as expected_network_height),
  • C) protocol upgrade height for the head’s epoch.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (5)
src/chain_sync/metrics.rs (2)

31-33: Typo in metric name: "libp2p_messsage_total" (triple “s”)

Consider correcting the metric name (and planning a deprecation path to avoid breaking dashboards).

-            "libp2p_messsage_total",
+            "libp2p_message_total",

If dashboards depend on the old name, register both for one release cycle and mark the old one deprecated in CHANGELOG.


23-24: Nit: grammar in metric help text

"validate blocks with no cache hit" → "validates blocks without a cache hit".

-        "Duration of routine which validate blocks with no cache hit",
+        "Duration of routine which validates blocks without a cache hit",
src/networks/metrics.rs (3)

45-83: Optional: inject time source for testability

Directly calling chrono::Utc::now() makes this harder to test. Consider injecting a now() closure.

-    F: Fn() -> ChainEpoch + Send + Sync + 'static,
+    F: Fn() -> ChainEpoch + Send + Sync + 'static,
 {
     fn encode(
         &self,
         mut encoder: prometheus_client::encoding::DescriptorEncoder,
     ) -> Result<(), std::fmt::Error> {
-        {
+        {
             let network_height: Gauge = Default::default();
             let epoch = (self.get_chain_head_height)();
             network_height.set(epoch);
             let metric_encoder = encoder.encode_descriptor(
                 "head_epoch",
                 "Latest epoch synchronized to the node",
                 None,
                 network_height.metric_type(),
             )?;
             network_height.encode(metric_encoder)?;
         }
-        {
+        {
             let expected_network_height: Gauge = Default::default();
-            let expected_epoch = calculate_expected_epoch(
-                chrono::Utc::now().timestamp() as u64,
+            // If adopting injection, replace with: (self.get_now)()
+            let expected_epoch = calculate_expected_epoch(
+                chrono::Utc::now().timestamp() as u64,
                 self.genesis_timestamp,
                 self.block_delay_secs,
             );

If desired, I can provide a full diff adding get_now: Arc<dyn Fn() -> u64 + Send + Sync> to the struct/new().


66-79: Naming consistency (epoch vs height)

You expose "head_epoch" but "expected_network_height". Consider "expected_head_epoch" for symmetry.

-                "expected_network_height",
-                "The expected network height based on the current time and the genesis block time",
+                "expected_head_epoch",
+                "The expected head epoch based on current time and genesis time",

If you keep the current name, no blocker—just a consistency note.


149-152: Nit: explicit cast for clarity

Be explicit about the gauge’s integer type.

-            av_gauge.set(actor_version as _);
+            av_gauge.set(actor_version as i64);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2bebfc3 and 821be7d.

📒 Files selected for processing (5)
  • CHANGELOG.md (1 hunks)
  • docs/docs/users/reference/metrics.md (2 hunks)
  • src/chain/store/chain_store.rs (1 hunks)
  • src/chain_sync/metrics.rs (1 hunks)
  • src/networks/metrics.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • CHANGELOG.md
  • docs/docs/users/reference/metrics.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/networks/metrics.rs (1)
src/networks/mod.rs (7)
  • calculate_expected_epoch (590-596)
  • network_height (416-423)
  • epoch (468-481)
  • network_version (427-432)
  • network_version (745-755)
  • from (186-224)
  • network_version_revision (436-444)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: All lint checks
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: Check
  • GitHub Check: Deploy to Cloudflare Pages
🔇 Additional comments (4)
src/networks/metrics.rs (3)

16-26: LGTM: Collector state and closure wiring

Struct design with Arc-wrapped closures is appropriate; Debug(ignore) avoids leaking internals.


28-43: LGTM: constructor API

Constructor parameters are minimal and clear.


117-161: LGTM: network_version, revision, and actor_version collectors

Implementation matches ChainConfig APIs; values are computed at scrape time and encoded correctly.

src/chain/store/chain_store.rs (1)

9-9: LGTM: import reshuffle and HEAD_EPOCH side‑effect removal

Import changes look good and align with removing the HEAD_EPOCH gauge update. The provided rg search produced no output, so I couldn't confirm removal of any residual HEAD_EPOCH.set calls — re-run from the repo root:

rg -nP '\bHEAD_EPOCH.set\s*('
rg -nP '\bHEAD_EPOCH\b'

Comment on lines +6 to 7
metrics::{counter::Counter, family::Family, histogram::Histogram},
};
Copy link
Contributor

@coderabbitai coderabbitai bot Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

🧩 Analysis chain

LGTM: import cleanup matches HEAD_EPOCH removal

No stray Gauge usage here. Please confirm no remaining references to HEAD_EPOCH across the repo.


🏁 Script executed:

#!/bin/bash
# Expect: no results
rg -nP '\bHEAD_EPOCH\b'

Length of output: 188


HEAD_EPOCH references remain — remove or justify
Found in scripts/tests/calibnet_eth_mapping_check.sh:19–20 (HEAD_EPOCH=$(...) and EPOCH=$((HEAD_EPOCH - 1))). Remove or document why this reference must remain.

🤖 Prompt for AI Agents
In src/chain_sync/metrics.rs around lines 6 to 7, the review flags lingering
references to HEAD_EPOCH in scripts/tests/calibnet_eth_mapping_check.sh (lines
19–20); either remove those HEAD_EPOCH references from the script or add a clear
justification. Fix by deleting the HEAD_EPOCH calculation and adjust downstream
logic to use a documented alternative (e.g., explicit EPOCH input, a
configurable env var, or a call to a stable API), or add a comment above the
script lines explaining why HEAD_EPOCH is required, its expected source, and add
validation/fallback logic to avoid silent failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HEAD_EPOCH in the script is extracted from forest-cli info show output which is irrelevant to the metrics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanabi1224 You're absolutely right - thank you for the clarification! The HEAD_EPOCH in the shell script scripts/tests/calibnet_eth_mapping_check.sh is indeed a shell variable extracted from forest-cli info show output, which is completely unrelated to the Rust metrics HEAD_EPOCH constant that was removed from src/chain_sync/metrics.rs.

My analysis incorrectly flagged this as a potential issue when it's actually a legitimate usage in a different context. The script usage is fine and should remain as-is.


✏️ Learnings added
Learnt from: hanabi1224
PR: ChainSafe/forest#6079
File: src/chain_sync/metrics.rs:6-7
Timestamp: 2025-09-16T12:55:26.912Z
Learning: HEAD_EPOCH references in shell scripts (like scripts/tests/calibnet_eth_mapping_check.sh) that extract data from `forest-cli info show` are unrelated to Rust metrics constants with the same name and should not be flagged when metrics cleanup is performed.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: hanabi1224
PR: ChainSafe/forest#5969
File: src/tool/subcommands/snapshot_cmd.rs:412-412
Timestamp: 2025-08-25T13:35:24.230Z
Learning: In src/tool/subcommands/snapshot_cmd.rs, the +1 in `last_epoch = ts.epoch() - epochs as i64 + 1` fixes an off-by-1 bug where specifying --check-stateroots=N would validate N+1 epochs instead of N epochs, causing out-of-bounds errors when the snapshot contains only N recent state roots.

Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.

Comment on lines +122 to 160
fn encode(&self, mut encoder: DescriptorEncoder) -> Result<(), std::fmt::Error> {
let epoch = (self.get_chain_head_height)();
{
let network_version = self.chain_config.network_version(epoch);
let nv_gauge: Gauge = Default::default();
nv_gauge.set(u32::from(network_version) as _);
let metric_encoder = encoder.encode_descriptor(
"network_version",
"Network version of the current chain head",
None,
nv_gauge.metric_type(),
)?;
nv_gauge.encode(metric_encoder)?;
}
{
let network_version_revision = self.chain_config.network_version_revision(epoch);
let nv_gauge: Gauge = Default::default();
nv_gauge.set(network_version_revision);
let metric_encoder = encoder.encode_descriptor(
"network_version_revision",
"Network version revision of the current chain head",
None,
nv_gauge.metric_type(),
)?;
nv_gauge.encode(metric_encoder)?;
}
{
let actor_version = (self.get_chain_head_actor_version)();
let av_gauge: Gauge = Default::default();
av_gauge.set(actor_version as _);
let metric_encoder = encoder.encode_descriptor(
"actor_version",
"Actor version of the current chain head",
None,
av_gauge.metric_type(),
)?;
av_gauge.encode(metric_encoder)?;
}
Ok(())
Copy link
Contributor

@coderabbitai coderabbitai bot Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

🧩 Analysis chain

Operational note: avoid heavy work in actor_version closure

If get_chain_head_actor_version traverses state on each scrape, consider caching per-head and updating on head changes to avoid scrape latency spikes.


🏁 Script executed:

#!/bin/bash
# Verify where the closure is defined and whether it re-walks state each time
rg -nP 'get_chain_head_actor_version|get_actor_bundle_metadata|NetworkVersionCollector::new' -C3

Length of output: 4398


Avoid rebuilding the state on each Prometheus scrape — cache actor version and update on head changes.
get_chain_head_actor_version in src/daemon/mod.rs (lines 219–229) constructs a StateTree and reads actor bundle metadata on every invocation; replace with a cached value (e.g., AtomicU64/Arc) updated on chain-head changes so the metrics closure returns immediately and avoids scrape latency spikes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loading a state tree is not expensive

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hanabi1224, thank you for the clarification! I appreciate you sharing that domain knowledge about StateTree loading performance in the Forest codebase.

    (\   /)
   ( ^.^ )
  o_(")(")
   Learning!

✏️ Learnings added
Learnt from: hanabi1224
PR: ChainSafe/forest#6079
File: src/networks/metrics.rs:122-160
Timestamp: 2025-09-16T12:56:31.565Z
Learning: In the Forest codebase, loading a StateTree is not expensive and does not cause performance concerns for Prometheus metrics collection.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.

LesnyRumcajs
LesnyRumcajs previously approved these changes Sep 16, 2025
Copy link
Member

@LesnyRumcajs LesnyRumcajs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/metrics/mod.rs (1)

85-92: Public API change — fix call site(s) using old signature

src/daemon/mod.rs:238 currently calls init_prometheus(prometheus_listener, db_directory, db); update this call to pass chain_config, get_chain_head_height, and get_chain_head_actor_version (chain_config and get_chain_head_height are in scope at 235–236; add/clone get_chain_head_actor_version as needed).

♻️ Duplicate comments (1)
src/daemon/mod.rs (1)

234-249: Blocker: spawned future must be 'static — use async move.

Without move, the async block borrows stack locals and won’t satisfy JoinSet::spawn’s 'static bound. Also capture get_chain_head_actor_version like height for clarity.

Apply:

-        services.spawn({
+        services.spawn({
             let chain_config = ctx.chain_config().clone();
             let get_chain_head_height = get_chain_head_height.clone();
-            async {
+            let get_chain_head_actor_version = get_chain_head_actor_version.clone();
+            async move {
                 crate::metrics::init_prometheus(
                     prometheus_listener,
                     db_directory,
                     db,
                     chain_config,
                     get_chain_head_height,
                     get_chain_head_actor_version,
                 )
                 .await
                 .context("Failed to initiate prometheus server")
             }
         });
🧹 Nitpick comments (2)
src/daemon/mod.rs (2)

209-218: Height getter: avoid sentinel 0; prefer explicit “unknown”.

Returning 0 on upgrade failure conflates “genesis” with “unknown”. Consider signaling unknown explicitly (e.g., Option) and let the collector decide whether to emit a sample.

Example within current types (minimal churn): add a debug log and a distinct negative epoch for unknown if acceptable in your metric logic.

-            move || {
+            move || {
                 chain_store
                     .upgrade()
                     .map(|cs| cs.heaviest_tipset().epoch())
-                    .unwrap_or_default()
+                    .unwrap_or_else(|| {
+                        tracing::debug!("get_chain_head_height: chain_store dropped; reporting unknown");
+                        0 // consider a distinct sentinel or handle Option upstream
+                    })
             }

220-233: Actor version getter may be heavy on each scrape; cache on head changes.

Building a StateTree and reading bundle metadata per /metrics scrape can block the metrics path. Prefer caching the actor version on head changes and having the getter read an Arc<AtomicU64> (or similar).

If helpful, I can sketch a small subscriber that updates an atomic on HeadChange::Apply and wire that into the getter.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 821be7d and 5e3c574.

📒 Files selected for processing (3)
  • src/chain/store/chain_store.rs (1 hunks)
  • src/daemon/mod.rs (4 hunks)
  • src/metrics/mod.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/chain/store/chain_store.rs
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (2)
src/metrics/mod.rs (1)
src/networks/metrics.rs (2)
  • new (32-42)
  • new (104-114)
src/daemon/mod.rs (3)
src/metrics/mod.rs (2)
  • new (178-180)
  • init_prometheus (85-124)
src/networks/metrics.rs (2)
  • new (32-42)
  • new (104-114)
src/shim/state_tree.rs (1)
  • new_from_root (168-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: tests-release
  • GitHub Check: tests
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: All lint checks
🔇 Additional comments (5)
src/metrics/mod.rs (2)

6-6: Imports LGTM.

The added ChainConfig and ChainEpoch imports are correct and necessary.


107-113: NetworkVersionCollector registration looks good.

Collector is added to the collector registry; no ordering issues with other collectors.

src/daemon/mod.rs (3)

18-21: Import re-org is fine.

No side effects; scoped to daemon module.


31-31: New StateTree import OK.

Needed for actor-version lookup.


251-259: Height collector wiring LGTM.

Reuses the same get_chain_head_height closure; consistent with the new network metrics.

@LesnyRumcajs LesnyRumcajs added this pull request to the merge queue Sep 17, 2025
Merged via the queue into main with commit 940224f Sep 17, 2025
72 of 73 checks passed
@LesnyRumcajs LesnyRumcajs deleted the hm/metric-network-version branch September 17, 2025 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Network version Prometheus metric

4 participants