-
Notifications
You must be signed in to change notification settings - Fork 182
feat: add prometheus metrics network_version and actor_version
#6079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds Prometheus metrics for network_version, network_version_revision, and actor_version; threads Arc‑wrapped Weak-backed closures from the daemon into metrics initialization; introduces NetworkVersionCollector and DescriptorEncoder encoding; makes network version logic height-driven; adds StateTree actor-bundle lookup and NetworkVersion→u32 conversion. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Daemon
participant ChainStore
participant Metrics as metrics::init_prometheus
participant Registry
participant HeightCollector as NetworkHeightCollector
participant VersionCollector as NetworkVersionCollector
participant Prom as Prometheus
Daemon->>ChainStore: create Weak reference
Daemon->>Daemon: define Arc-wrapped closures get_chain_head_height(), get_chain_head_actor_version()
Daemon->>Metrics: call init_prometheus(..., chain_config, get_chain_head_height, get_chain_head_actor_version)
activate Metrics
Metrics->>Registry: register HeightCollector(get_chain_head_height)
Metrics->>Registry: register VersionCollector(chain_config, get_chain_head_height, get_chain_head_actor_version)
deactivate Metrics
Prom->>Registry: scrape /metrics
Registry->>HeightCollector: encode()
HeightCollector->>Daemon: invoke get_chain_head_height()
Registry-->>Prom: export head_epoch, expected_network_height
Registry->>VersionCollector: encode()
VersionCollector->>Daemon: invoke get_chain_head_height()
VersionCollector->>Daemon: invoke get_chain_head_actor_version()
VersionCollector->>Registry: compute network_version, network_version_revision, actor_version
Registry-->>Prom: export network_version, network_version_revision, actor_version
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
CHANGELOG.md (1)
34-36: Capitalize “Prometheus” for consistency with prior entries.Earlier entries use “Prometheus metrics”. Minor wording tweak for consistency.
- - [#6079](https://github.com/ChainSafe/forest/pull/6079) Added prometheus metrics `network_height` and `network_version`. + - [#6079](https://github.com/ChainSafe/forest/pull/6079) Added Prometheus metrics `network_height` and `network_version`.src/shim/version.rs (1)
91-95: Match conversion style used elsewhere in this file.Other impls use
u32::from(...). Align for readability.impl From<NetworkVersion> for u32 { fn from(value: NetworkVersion) -> Self { - value.0.into() + u32::from(value.0) } }src/daemon/mod.rs (1)
207-213: Avoid shadowingchain_store; reuse the same clone for both closures.Small readability tweak; same behavior.
- let chain_store = ctx.state_manager.chain_store().clone(); - let get_chain_head_height = move || chain_store.heaviest_tipset().epoch(); - let chain_store = ctx.state_manager.chain_store().clone(); - let get_chain_head_network_version = move || { - let epoch = chain_store.heaviest_tipset().epoch(); - chain_store.chain_config.network_version(epoch) - }; + let chain_store = ctx.state_manager.chain_store().clone(); + let get_chain_head_height = { + let cs = chain_store.clone(); + move || cs.heaviest_tipset().epoch() + }; + let get_chain_head_network_version = { + let cs = chain_store.clone(); + move || { + let epoch = cs.heaviest_tipset().epoch(); + cs.chain_config.network_version(epoch) + } + };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
CHANGELOG.md(1 hunks)src/daemon/mod.rs(3 hunks)src/metrics/mod.rs(3 hunks)src/networks/metrics.rs(1 hunks)src/shim/version.rs(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (4)
src/shim/version.rs (1)
src/networks/mod.rs (1)
from(184-222)
src/daemon/mod.rs (3)
src/state_manager/mod.rs (1)
chain_store(296-298)src/blocks/tipset.rs (2)
epoch(306-308)epoch(543-545)src/metrics/mod.rs (1)
init_prometheus(68-103)
src/metrics/mod.rs (1)
src/networks/metrics.rs (2)
new(30-36)new(93-97)
src/networks/metrics.rs (2)
src/networks/mod.rs (3)
calculate_expected_epoch(571-577)epoch(449-462)from(184-222)src/shim/version.rs (8)
from(86-88)from(92-94)from(98-100)from(104-106)from(110-112)from(116-118)from(122-124)from(128-130)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Build MacOS
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build Ubuntu
- GitHub Check: All lint checks
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: tests
- GitHub Check: tests-release
🔇 Additional comments (5)
src/metrics/mod.rs (1)
68-74: Wiring for network_version collector looks good.Signature extension and registration are clear and minimally invasive.
Also applies to: 90-93
src/daemon/mod.rs (1)
215-223: init_prometheus call updated correctly.Closure handoff matches the new API.
src/networks/metrics.rs (3)
14-37: Generic NetworkHeightCollector with live getters — LGTM.Encodes
network_heightandexpected_network_heightcorrectly; uses saturating math andChainEpochgauge.Also applies to: 39-77
79-117: New NetworkVersionCollector — LGTM.Numeric encoding via
u32::from(NetworkVersion)is appropriate; descriptor/help strings are clear.
4-9: Resolved —educeis declared in Cargo.toml (line 66) with the Debug feature enabled.
No further action required.
LesnyRumcajs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's update https://docs.forest.chainsafe.io/reference/metrics?
src/daemon/mod.rs
Outdated
| let get_chain_head_network_version = move || { | ||
| let epoch = chain_store.heaviest_tipset().epoch(); | ||
| chain_store.chain_config.network_version(epoch) | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it resolves the issue we had with calibnet nodes not updated? #5141 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember using a string Gauge value breaks the Grafana scrapper. How about adding a network_version_revision to distinguish the same network version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me rephrase; you're checking what the network is supposed to be according to the schedule. What is more useful for monitoring the health of the nodes, is checking the actual network version (though the expected version is also good to have, so consumers can set up an alarm if there's a mismatch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LesnyRumcajs I think with network_version_revision, the issue could be resolved. Please let me know your thoughts.
Watermelon -> v21, rev 0
WatermelonFix -> v21, rev 1
WatermelonFix2 -> v21, rev2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/daemon/mod.rs (1)
207-224: Unify Weak handle, avoid hardcoded V0 fallback, and tighten wording
- Reuse a single
Weakfor both closures to avoid duplicate downgrades.- Prefer a fallback derived from
chain_configinstead of hardcodingNetworkVersion::V0to reduce misleading scrapes whenChainStoreis unavailable.- Minor: tweak comment phrasing “dead lock” → “deadlock”.
Apply this diff:
- // Use `Weak` to not dead lock GC. - let chain_store = Arc::downgrade(ctx.state_manager.chain_store()); - let get_chain_head_height = move || { - chain_store - .upgrade() - .map(|cs| cs.heaviest_tipset().epoch()) - .unwrap_or_default() - }; - // Use `Weak` to not dead lock GC. - let chain_store = Arc::downgrade(ctx.state_manager.chain_store()); - let get_chain_head_network_version = move || { - if let Some(cs) = chain_store.upgrade() { - let epoch = cs.heaviest_tipset().epoch(); - cs.chain_config.network_version(epoch) - } else { - NetworkVersion::V0 - } - }; + // Use a Weak to avoid deadlocking GC and to not keep ChainStore alive. + let chain_store_weak = Arc::downgrade(ctx.state_manager.chain_store()); + let chain_config = ctx.state_manager.chain_config().clone(); + + let get_chain_head_height = { + let chain_store_weak = chain_store_weak.clone(); + move || { + chain_store_weak + .upgrade() + .map(|cs| cs.heaviest_tipset().epoch()) + .unwrap_or_default() + } + }; + + let get_chain_head_network_version = { + let chain_store_weak = chain_store_weak.clone(); + let chain_config = chain_config.clone(); + move || { + chain_store_weak + .upgrade() + .map(|cs| chain_config.network_version(cs.heaviest_tipset().epoch())) + .unwrap_or_else(|| chain_config.network_version(ChainEpoch::default())) + } + };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/daemon/mod.rs(3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/daemon/mod.rs (3)
src/state_manager/mod.rs (1)
chain_store(296-298)src/blocks/tipset.rs (2)
epoch(306-308)epoch(543-545)src/metrics/mod.rs (1)
init_prometheus(68-103)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: tests
- GitHub Check: tests-release
- GitHub Check: All lint checks
🔇 Additional comments (3)
src/daemon/mod.rs (3)
18-21: Import consolidation LGTMGrouped
useimproves readability without behavioral change.
243-244: NetworkHeightCollector wiring LGTMPassing the live head-height getter should make
network_heightaccurate at scrape time.
226-234: Resolved — init_prometheus updated; no remaining 3‑arg call sites foundFound: src/daemon/mod.rs:226 (init_prometheus call); src/metrics/mod.rs:91 (NetworkVersionCollector::new called with the getter). No other init_prometheus / old 3‑arg usages detected.
4899398 to
4a5eb4f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/networks/mod.rs (1)
436-444: Consider caching Height iteration results for performance.The
network_version_revisionfunction iterates through all Heights each time it's called. Since Heights are static and their NetworkVersion mappings don't change, consider caching the first height for each NetworkVersion to avoid repeated iteration in hot paths.This could be optimized with a lazy static lookup table:
+use std::collections::HashMap; + +static FIRST_HEIGHT_BY_VERSION: LazyLock<HashMap<NetworkVersion, Height>> = LazyLock::new(|| { + let mut map = HashMap::new(); + for height in Height::iter() { + let nv = NetworkVersion::from(height); + map.entry(nv).or_insert(height); + } + map +}); pub fn network_version_revision(&self, epoch: ChainEpoch) -> i64 { if let Some(height) = self.network_height(epoch) { let nv = NetworkVersion::from(height); - if let Some(rev0_height) = Height::iter().find(|h| NetworkVersion::from(*h) == nv) { + if let Some(&rev0_height) = FIRST_HEIGHT_BY_VERSION.get(&nv) { return (height as i64) - (rev0_height as i64); } } 0 }src/networks/metrics.rs (1)
53-64: Consider documenting metric units and ranges.The network_height metric would benefit from documentation about its units (epochs) and expected range in the metric description.
let metric_encoder = encoder.encode_descriptor( "network_height", - "The current network height", + "The current network height in epochs (blocks since genesis)", None, network_height.metric_type(), )?;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
docs/docs/users/reference/metrics.md(1 hunks)src/daemon/mod.rs(3 hunks)src/metrics/mod.rs(3 hunks)src/networks/metrics.rs(1 hunks)src/networks/mod.rs(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/docs/users/reference/metrics.md
- src/daemon/mod.rs
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (3)
src/metrics/mod.rs (4)
src/networks/mod.rs (1)
chain_config(737-742)src/state_manager/mod.rs (2)
chain_config(244-246)new(181-186)src/rpc/mod.rs (1)
chain_config(474-476)src/networks/metrics.rs (2)
new(32-42)new(100-105)
src/networks/mod.rs (2)
src/shim/version.rs (8)
from(86-88)from(92-94)from(98-100)from(104-106)from(110-112)from(116-118)from(122-124)from(128-130)src/rpc/methods/chain.rs (1)
calibnet(1251-1256)
src/networks/metrics.rs (1)
src/networks/mod.rs (7)
calculate_expected_epoch(590-596)network_height(416-423)epoch(468-481)network_version(427-432)network_version(745-755)from(186-224)network_version_revision(436-444)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: tests-release
- GitHub Check: tests
- GitHub Check: All lint checks
- GitHub Check: Deploy to Cloudflare Pages
🔇 Additional comments (11)
src/metrics/mod.rs (3)
6-6: LGTM! Clean import additions for network metrics support.The new imports are appropriate for the network metrics functionality being added.
72-73: Good API design using generic closures for decoupling.The function signature update properly extends the API with the required parameters for network metrics. Using
Arc<impl Fn() -> ChainEpoch + Send + Sync + 'static>provides good flexibility and thread-safety for the height getter closure.
91-93: LGTM! Collector registration follows established patterns.The NetworkVersionCollector registration is properly integrated into the existing metrics initialization flow.
src/networks/mod.rs (5)
13-15: LGTM! Required imports for Height iteration support.The strum imports enable the EnumIter functionality needed for the new network version revision logic.
139-177: Comprehensive Height enum expansion for all network upgrades.The Height enum now includes all network upgrade heights through GoldenWeek (V27), properly supporting the full upgrade history. The EnumIter derive enables iteration for the revision calculation logic.
426-432: Good defensive programming with genesis network version fallback.The network version calculation properly handles edge cases by falling back to genesis version and ensuring the result is never below the genesis network version using
.max().
744-755: Excellent test coverage for network version boundaries.The tests properly validate both network version transitions and revision calculations at critical epoch boundaries, ensuring the metrics will report accurate values.
416-423: Resolved: strict>innetwork_heightis intentional.Tests in src/networks/mod.rs assert network_version(1_013_134) == NetworkVersion::V20 and network_version(1_013_134 + 1) == NetworkVersion::V21, confirming upgrades activate on the epoch after the listed upgrade epoch; no change required.
src/networks/metrics.rs (3)
18-26: Good use of generics for flexible height retrieval.The generic design with
F: Fn() -> ChainEpochprovides excellent flexibility for different sources of chain height while maintaining type safety.
127-136: Clear metric for distinguishing same-version upgrades.The network_version_revision metric elegantly handles the case where multiple upgrades don't bump the network version (e.g., WatermelonFix, WatermelonFix2 all at V21), providing valuable operational insight.
117-117: Conversion safe — no action requiredimpl From for u32 delegates to the inner value (
value.0.into()), and NetworkVersion is created from arbitrary u32 viaNetworkVersion_latest::new(...)(see src/shim/version.rs), sou32::from(network_version)is safe for all NetworkVersion values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/daemon/mod.rs (2)
210-215: Head-height getter fallback: confirm semantics and observability.If the
Weakcan’t upgrade, the metric reports height 0. That could silently skewnetwork_height/network_versionduring teardown/restart windows. Consider logging once on fallback or documenting this behavior in the metric HELP so operators know 0 can mean “store unavailable.”
208-209: Nit: tighten the comment.Change “dead lock” → “deadlock” and clarify the reason (avoiding keeping the store alive and blocking GC).
- // Use `Weak` to not dead lock GC. + // Use a Weak to avoid keeping the chain store alive and risking GC deadlock.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/daemon/mod.rs(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/daemon/mod.rs (4)
src/state_manager/mod.rs (3)
chain_store(296-298)new(181-186)chain_config(244-246)src/networks/metrics.rs (2)
new(32-42)new(100-105)src/metrics/mod.rs (3)
new(150-152)init_prometheus(68-104)default_registry(26-28)src/daemon/context.rs (1)
chain_config(65-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build Ubuntu
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build MacOS
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: tests
- GitHub Check: tests-release
- GitHub Check: All lint checks
- GitHub Check: Check
🔇 Additional comments (3)
src/daemon/mod.rs (3)
18-21: LGTM: import consolidation reads clearer.
232-241: LGTM: registering NetworkHeightCollector with live head-height source.Passing the getter through aligns the collector with the node’s current head.
558-560: LGTM: reset the default Prometheus registry before (re)startup.Prevents duplicate collectors across restarts within the same process.
d7a681d to
2be478f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/daemon/mod.rs (1)
234-249: Fix: spawned future must be 'static — useasync moveand clone captured ArcsAs noted in the earlier review, JoinSet::spawn requires a 'static future. The current async block borrows stack locals and will fail to compile or is fragile.
Apply this diff:
services.spawn({ let chain_config = ctx.chain_config().clone(); - let get_chain_head_height = get_chain_head_height.clone(); - async { + let get_chain_head_height = std::sync::Arc::clone(&get_chain_head_height); + let get_chain_head_actor_version = + std::sync::Arc::clone(&get_chain_head_actor_version); + async move { crate::metrics::init_prometheus( prometheus_listener, db_directory, db, chain_config, get_chain_head_height, - get_chain_head_actor_version, + get_chain_head_actor_version, ) .await .context("Failed to initiate prometheus server") } });
🧹 Nitpick comments (8)
src/shim/state_tree.rs (1)
200-207: Resolve actor bundle metadata efficiently; minor ergonomicsLooks correct and returns a stable reference. Two nits:
- This does a linear scan over ACTOR_BUNDLES_METADATA on every call. If this ends up hot, consider a secondary map keyed by system actor code Cid for O(1) lookup.
- The returned reference is effectively 'static; current signature is fine, but adding a brief doc comment on lifetime/source (static registry) would help future maintainers.
If you expect this method to be called per-scrape for metrics, we can add a cache keyed by the resolved system code. Want a small follow-up patch?
src/daemon/mod.rs (2)
219-233: Avoid per-scrape heavy state reconstruction (optional)Each scrape rebuilds a StateTree and scans bundle metadata. If scrape frequency is high, this can add load. Consider caching the actor version by epoch (or tipset key) with a small TTL, invalidated on head changes.
I can wire this to the existing head-change publisher and only recompute on Apply. Interested?
577-579: Resetting the global Prometheus registry: confirm lifecycleOverwriting the default registry is fine at process start, but on internal restarts it can drop already-registered custom collectors if any other tasks still hold references. Confirm there’s no concurrent metrics server running when this executes.
If you want a safer pattern, we can guard this behind a “first init” OnceCell or replace it with an explicit “clear” API on the registry.
src/networks/mod.rs (4)
185-224: Validate NV mappings for new heightsMappings for Teep/Tock/TockFix/GoldenWeek imply NV25–NV27. Ensure these align with lotus/build constants for each network; drift here will misreport network_version.
If helpful, I can add a small test asserting that for each Height the NV matches lotus’ tables.
416-423: Off-by-one boundary: upgrade effective at epoch or epoch+1?network_height uses
epoch > info.epoch, which makes the upgrade effective starting atinfo.epoch + 1. Your calibnet test expects that, but please confirm this matches canonical behavior across networks (Lotus often treats upgrades as effective at the named epoch). If not universally true, we may need a per-network flag or>=.I can add cross-check tests against known mainnet/calibnet boundaries to lock this down.
434-444: Revision math depends on enum order; add a guardComputing revision as
height - first_height_with_same_nvrelies on Height variants of the same NV being contiguous and ordered. Add a test to assert contiguity so future insertions don’t silently skew revision numbers.Apply this test addition:
@@ #[cfg(test)] mod tests { use super::*; + #[test] + fn revisions_are_contiguous_per_nv() { + use std::collections::HashMap; + let mut groups: HashMap<NetworkVersion, Vec<Height>> = HashMap::new(); + for h in Height::iter() { + groups.entry(NetworkVersion::from(h)).or_default().push(h); + } + for (_nv, hs) in groups { + // Ensure all heights for the NV form a single contiguous run in declaration order + let idxs: Vec<_> = hs.iter().map(|h| *h as i64).collect(); + let min = *idxs.iter().min().unwrap(); + let max = *idxs.iter().max().unwrap(); + assert_eq!(idxs.len() as i64, max - min + 1, "non-contiguous NV group"); + } + }
47-49: Newest NV constant appears staleNEWEST_NETWORK_VERSION is V25, but Height includes NV26 and NV27. If this constant is used anywhere, it should be updated to V27.
src/networks/metrics.rs (1)
49-83: Height metrics look correct; consider now() sourceEncoding both actual and expected heights is great. Minor: using
chrono::Utc::now()is fine; if you already centralize time sources elsewhere (e.g., a Clock trait), consider using it for testability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
CHANGELOG.md(1 hunks)docs/docs/users/reference/metrics.md(3 hunks)src/daemon/mod.rs(5 hunks)src/metrics/mod.rs(3 hunks)src/networks/metrics.rs(1 hunks)src/networks/mod.rs(5 hunks)src/shim/state_tree.rs(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/metrics/mod.rs
- CHANGELOG.md
- docs/docs/users/reference/metrics.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (3)
src/networks/metrics.rs (1)
src/networks/mod.rs (7)
calculate_expected_epoch(590-596)network_height(416-423)epoch(468-481)network_version(427-432)network_version(745-755)from(186-224)network_version_revision(436-444)
src/networks/mod.rs (2)
src/networks/actors_bundle.rs (1)
get_actor_bundles_metadata(149-171)src/shim/version.rs (8)
from(86-88)from(92-94)from(98-100)from(104-106)from(110-112)from(116-118)from(122-124)from(128-130)
src/daemon/mod.rs (5)
src/metrics/mod.rs (3)
new(155-157)init_prometheus(68-109)default_registry(26-28)src/networks/metrics.rs (2)
new(32-42)new(104-114)src/state_manager/mod.rs (3)
new(181-186)chain_store(296-298)chain_config(244-246)src/shim/state_tree.rs (1)
new_from_root(168-188)src/shim/state_tree_v0.rs (1)
new_from_root(34-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: All lint checks
- GitHub Check: tests-release
- GitHub Check: Build MacOS
- GitHub Check: cargo-publish-dry-run
- GitHub Check: tests
- GitHub Check: Build Ubuntu
- GitHub Check: Build forest binaries on Linux AMD64
🔇 Additional comments (11)
src/shim/state_tree.rs (1)
8-11: LGTM: needed imports and re-exportsBrings in ACTOR_BUNDLES_METADATA and ActorBundleMetadata for the new helper; no concerns.
src/daemon/mod.rs (2)
209-218: LGTM: head-height getter uses Weak to avoid GC deadlocksGood choice to avoid holding strong refs in a long-lived metrics closure.
251-260: LGTM: NetworkHeightCollector registrationCollector wiring passes correct params and the same height getter used by the server; consistent.
src/networks/mod.rs (5)
13-16: LGTM: Enum iteration supportEnumIter/IntoEnumIterator additions are appropriate for revision computation.
32-34: LGTM: public re-exportsRe-exporting ActorBundleMetadata enables shim access; consistent with usage in state_tree.
138-177: LGTM: Height variants extendedNew heights look consistent and maintain contiguous grouping for fixes.
425-433: LGTM: network_version derived from height with genesis floorClearer and easier to reason about than bespoke epoch checks.
744-755: LGTM: boundary tests for calibnetGood coverage of the epoch/off-by-one and revision=0→1 transition.
src/networks/metrics.rs (3)
4-11: LGTM: collector scaffolding and DescriptorEncoder usageImports and the switch to DescriptorEncoder are consistent with prometheus_client APIs.
85-97: LGTM: version collector fields and ownershipArcs and ignored Debug fields keep logs clean; no issues.
117-161: LGTM: emits NV, revision, and actor versionMetric names and help strings are clear; conversions to Gauge types look correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
src/networks/metrics.rs (5)
53-64: Optimize gauge allocation by reusing instances.Creating new
Gaugeinstances on everyencodecall is inefficient. Consider storing the gauges as fields in the collector struct to avoid repeated allocations during metric collection.pub struct NetworkHeightCollector<F> where F: Fn() -> ChainEpoch, { block_delay_secs: u32, genesis_timestamp: u64, + network_height: Gauge, + expected_network_height: Gauge, #[educe(Debug(ignore))] get_chain_head_height: Arc<F>, }Then update the constructor and encode method accordingly:
pub fn new( block_delay_secs: u32, genesis_timestamp: u64, get_chain_head_height: Arc<F>, ) -> Self { Self { block_delay_secs, genesis_timestamp, + network_height: Default::default(), + expected_network_height: Default::default(), get_chain_head_height, } }- let network_height: Gauge = Default::default(); let epoch = (self.get_chain_head_height)(); - network_height.set(epoch); + self.network_height.set(epoch); let metric_encoder = encoder.encode_descriptor( "network_height", "The current network height", None, - network_height.metric_type(), + self.network_height.metric_type(), )?; - network_height.encode(metric_encoder)?; + self.network_height.encode(metric_encoder)?;
127-127: Potential integer overflow when casting network version.The cast from
u32to gauge value type (likelyi64) and then the unnecessary intermediate cast to_could mask potential issues. Use explicit typing for clarity.- nv_gauge.set(u32::from(network_version) as _); + nv_gauge.set(i64::from(u32::from(network_version)));
151-151: Remove unnecessary cast for actor version.The actor version is already
u64, and the unnecessary cast to_reduces code clarity.- av_gauge.set(actor_version as _); + av_gauge.set(actor_version as i64);
122-161: Consider adding metric labels for better observability.Network version metrics could benefit from additional labels like the actual version name (e.g., "v21", "v22") to make dashboards more readable without requiring lookups.
Would you like me to help implement labeled metrics that include both the numeric value and a human-readable version string as a label? This would make Grafana dashboards more intuitive.
85-162: Add documentation for the new collector.The
NetworkVersionCollectorstruct and its public methods lack documentation. Consider adding doc comments to explain the purpose, usage, and the meaning of each metric it exposes.+/// Collector for network version related metrics. +/// +/// This collector exposes: +/// - `network_version`: The current network version based on chain epoch +/// - `network_version_revision`: The revision number within the current network version +/// - `actor_version`: The actor bundle version at the current chain head #[derive(Educe)] #[educe(Debug)] pub struct NetworkVersionCollector<F1, F2>
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
CHANGELOG.md(1 hunks)docs/docs/users/reference/metrics.md(3 hunks)src/daemon/mod.rs(5 hunks)src/metrics/mod.rs(3 hunks)src/networks/metrics.rs(1 hunks)src/networks/mod.rs(5 hunks)src/shim/state_tree.rs(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- src/metrics/mod.rs
- CHANGELOG.md
- docs/docs/users/reference/metrics.md
- src/shim/state_tree.rs
- src/networks/mod.rs
- src/daemon/mod.rs
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/networks/metrics.rs (2)
src/networks/mod.rs (8)
calculate_expected_epoch(590-596)network_height(416-423)epoch(468-481)chain_config(737-742)network_version(427-432)network_version(745-755)from(186-224)network_version_revision(436-444)src/rpc/mod.rs (1)
chain_config(474-476)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build Ubuntu
- GitHub Check: Build MacOS
- GitHub Check: tests
- GitHub Check: tests-release
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: All lint checks
🔇 Additional comments (1)
src/networks/metrics.rs (1)
45-83: Handle Weak::upgrade() failures when collecting network metricssrc/networks/metrics.rs (lines 45–83): The collector invokes a closure to read chain height. If that closure upgrades a Weak<RwLock>, upgrade() can fail during collection and cause the metric to record a default/stale value. Ensure the closure handles upgrade failures deterministically (return a safe default, skip emission, and/or log an error). Repository search did not locate the closure implementation; confirm its behavior and update accordingly.
docs/docs/users/reference/metrics.md
Outdated
| | `peer_failure_total` | Counter | Count | Total number of failed peer requests | | ||
| | `full_peers` | Gauge | Count | Number of healthy peers recognized by the node | | ||
| | `bad_peers` | Gauge | Count | Number of bad peers recognized by the node | | ||
| | `network_height` | Gauge | Count | The current network height | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be good to explain the difference between those metrics in the docs (dedicated section).
head_epoch 3021420
network_height 3021421
expected_network_height 3023095
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized network_height is a duplicate of head_epoch, removed to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanabi1224 It's not exactly a duplicate, is it? I mean, the values differ, and from what I saw, head_epoch is not available just when the node starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LesnyRumcajs The metrics are the same, and the difference is due to how the value is updated. I can update head_epoch to use the collector approach which is more reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/daemon/mod.rs (1)
234-249: Fix: spawned future must be 'static — use async move.Without
move, the async block borrows stack locals (prometheus_listener,db_directory,db, closures), which won’t satisfyJoinSet::spawn’s'staticbound.Apply this diff:
- services.spawn({ + services.spawn({ let chain_config = ctx.chain_config().clone(); let get_chain_head_height = get_chain_head_height.clone(); - async { + async move { crate::metrics::init_prometheus( prometheus_listener, db_directory, db, chain_config, get_chain_head_height, get_chain_head_actor_version, ) .await .context("Failed to initiate prometheus server") } });
🧹 Nitpick comments (1)
src/daemon/mod.rs (1)
219-233: Actor-version getter: verify epoch selection and consider lightweight caching.
- Using
heaviest_tipset().parent_state()may be off by one at upgrade boundaries; verify against a known upgrade epoch that the reported value matches “actual” actor bundle version for that tipset.- Building a
StateTreeevery scrape can be expensive; consider memoizing by(tipset_epoch or parent_state_cid)to avoid repeated reads between scrapes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
CHANGELOG.md(1 hunks)src/daemon/mod.rs(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- CHANGELOG.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/daemon/mod.rs (5)
src/networks/metrics.rs (2)
new(32-42)new(104-114)src/metrics/mod.rs (2)
new(155-157)init_prometheus(68-109)src/state_manager/mod.rs (3)
new(181-186)chain_store(296-298)chain_config(244-246)src/shim/state_tree.rs (1)
new_from_root(168-188)src/daemon/context.rs (1)
chain_config(65-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Check
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: All lint checks
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: cargo-publish-dry-run
- GitHub Check: tests
- GitHub Check: tests-release
- GitHub Check: Deploy to Cloudflare Pages
🔇 Additional comments (4)
src/daemon/mod.rs (4)
18-21: Import re-org looks good.Grouping
AppContext,DbType, andimport_chain_as_forest_carimproves clarity.
31-31: StateTree import is appropriate.Needed for actor-version lookup; no concerns.
209-218: Height getter via Weak is solid.Avoids GC deadlocks and returns a sane default on None.
251-259: NetworkHeightCollector wiring looks correct.Passing
block_delay_secs,genesis_timestamp, and the height getter matches the new collector API.
35e1087 to
2bebfc3
Compare
network_height and network_versionnetwork_version and actor_version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/daemon/mod.rs (1)
252-260:network_heightmetric is missing from the code path registering collectors.You only register
NetworkHeightCollector(which currently exposesexpected_network_heightonly) andNetworkVersionCollector(which emits version, revision, actor version). The PR title promisesnetwork_height, but it is not emitted.Apply one of the following:
- Option A (prefer): emit
network_heightfromNetworkVersionCollectorusing the sameepochit already fetches.- Option B: extend
NetworkHeightCollectorto also emitnetwork_heightusing a head-height getter.See code suggestions under src/networks/metrics.rs.
src/networks/metrics.rs (1)
31-52:NetworkHeightCollectoremits onlyexpected_network_height; addnetwork_heightor rename.Given the PR title and docs expectations, this collector should also emit
network_height(derived from current head), or be clearly renamed toExpectedNetworkHeightCollector. Prefer emitting both.Minimal change (add
network_heighthere using the head epoch) requires a getter. If you want to avoid changing this type’s constructor, addnetwork_heightinNetworkVersionCollectorinstead (see below).
♻️ Duplicate comments (2)
docs/docs/users/reference/metrics.md (1)
20-22: Missingnetwork_heightdoc and unclear distinctions vs.head_epoch/expected_network_height.
- PR title promises
network_height, but it's not documented in the table.- Past review asked to clarify differences; please add a short section explaining how these three relate.
Apply this doc diff to add
network_heightand a clarifying section:@@ -| `expected_network_height` | Gauge | Count | The expected network height based on the current time and the genesis block time | +| `expected_network_height` | Gauge | Count | The expected network height based on the current time and the genesis block time | +| `network_height` | Gauge | Count | Network height derived from the current chain head (see notes below) | | `network_version` | Gauge | Count | Network version of the current chain head | | `network_version_revision` | Gauge | Count | Network version revision of the current chain head | | `actor_version` | Gauge | Count | Actor version of the current chain head | @@ </details> +<details> + <summary>Example `network_height` output</summary> +``` +# HELP network_height Network height derived from the current chain head +# TYPE network_height gauge +network_height 3021421 +``` +</details> + +### Notes: head_epoch vs. network_height vs. expected_network_height +- head_epoch: the epoch of the node’s heaviest tipset (what you actually synced to). +- network_height: height derived from the current head for operational dashboards; intended to track the “current” network height as observed by this node. +- expected_network_height: height computed from wall‑clock time, genesis timestamp, and block delay; use this to detect lag (expected − head_epoch). +head_epoch 3021420
network_height 3021421
expected_network_height 3023095src/daemon/mod.rs (1)
209-249: Fix: spawn a 'static future withasync move(compile blocker).
JoinSet::spawnrequires a'staticfuture. The currentasync { ... }block borrows stack locals (prometheus_listener,db_directory,db,chain_config, closures), which won’t compile.Apply:
- async { + async move { crate::metrics::init_prometheus( prometheus_listener, db_directory, db, chain_config, get_chain_head_height, get_chain_head_actor_version, ) .await .context("Failed to initiate prometheus server") }
🧹 Nitpick comments (3)
docs/docs/users/reference/metrics.md (1)
294-320: Examples look good; add one fornetwork_heightto align with the table.Without an example, users may miss how it renders or differs from
head_epoch. See suggested diff above.src/daemon/mod.rs (1)
221-237: Heavy work per scrape foractor_version; consider caching.Constructing a
StateTreeand reading bundle metadata on every Prometheus scrape is expensive and can skew scrape latency. Prefer computing on head changes and publishing a cached atomic value.If you want, I can draft a small subscriber that updates an
AtomicI64onHeadChange::Applyand have the getter read it.src/networks/metrics.rs (1)
36-43: Nit: avoid lossy cast fromi64tou64for timestamps.
Utc::now().timestamp()returnsi64. Casting tou64is fine in practice, but usetry_into().unwrap_or_default()to avoid surprising wrap in hypothetical negative times (tests, mocks).- let expected_epoch = calculate_expected_epoch( - chrono::Utc::now().timestamp() as u64, + let now_secs: u64 = chrono::Utc::now().timestamp().try_into().unwrap_or_default(); + let expected_epoch = calculate_expected_epoch( - self.genesis_timestamp, + now_secs, self.block_delay_secs, );
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
docs/docs/users/reference/metrics.md(2 hunks)src/daemon/mod.rs(3 hunks)src/networks/metrics.rs(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (2)
src/networks/metrics.rs (1)
src/networks/mod.rs (6)
calculate_expected_epoch(590-596)epoch(468-481)network_version(427-432)network_version(745-755)from(186-224)network_version_revision(436-444)
src/daemon/mod.rs (4)
src/state_manager/mod.rs (3)
chain_config(244-246)new(181-186)chain_store(296-298)src/daemon/context.rs (1)
chain_config(65-67)src/metrics/mod.rs (2)
new(155-157)init_prometheus(68-109)src/shim/state_tree.rs (1)
new_from_root(168-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: tests-release
- GitHub Check: tests
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: All lint checks
🔇 Additional comments (1)
src/networks/metrics.rs (1)
86-131: Emit missingnetwork_heightmetric in NetworkVersionCollector::encodeAdd a Gauge for
network_height(derived from the current head) before emittingnetwork_versionso the PR/docs match the implementation.Apply:
fn encode(&self, mut encoder: DescriptorEncoder) -> Result<(), std::fmt::Error> { let epoch = (self.get_chain_head_height)(); + // network_height: derived from the current chain head + { + let nh_gauge: Gauge = Default::default(); + nh_gauge.set(epoch); + let metric_encoder = encoder.encode_descriptor( + "network_height", + "Network height derived from the current chain head", + None, + nh_gauge.metric_type(), + )?; + nh_gauge.encode(metric_encoder)?; + } { let network_version = self.chain_config.network_version(epoch); let nv_gauge: Gauge = Default::default(); nv_gauge.set(u32::from(network_version) as _);If “network_height” should instead mean the protocol upgrade height for the head’s epoch, use:
- nh_gauge.set(epoch); + let height = self + .chain_config + .network_height(epoch) + .map(|h| h as i64) + .unwrap_or_default(); + nh_gauge.set(height);Confirm intended semantics for
network_height:
- A) current head epoch,
- B) wall‑clock expected epoch (already exported as
expected_network_height),- C) protocol upgrade height for the head’s epoch.
2bebfc3 to
821be7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (5)
src/chain_sync/metrics.rs (2)
31-33: Typo in metric name: "libp2p_messsage_total" (triple “s”)Consider correcting the metric name (and planning a deprecation path to avoid breaking dashboards).
- "libp2p_messsage_total", + "libp2p_message_total",If dashboards depend on the old name, register both for one release cycle and mark the old one deprecated in CHANGELOG.
23-24: Nit: grammar in metric help text"validate blocks with no cache hit" → "validates blocks without a cache hit".
- "Duration of routine which validate blocks with no cache hit", + "Duration of routine which validates blocks without a cache hit",src/networks/metrics.rs (3)
45-83: Optional: inject time source for testabilityDirectly calling chrono::Utc::now() makes this harder to test. Consider injecting a now() closure.
- F: Fn() -> ChainEpoch + Send + Sync + 'static, + F: Fn() -> ChainEpoch + Send + Sync + 'static, { fn encode( &self, mut encoder: prometheus_client::encoding::DescriptorEncoder, ) -> Result<(), std::fmt::Error> { - { + { let network_height: Gauge = Default::default(); let epoch = (self.get_chain_head_height)(); network_height.set(epoch); let metric_encoder = encoder.encode_descriptor( "head_epoch", "Latest epoch synchronized to the node", None, network_height.metric_type(), )?; network_height.encode(metric_encoder)?; } - { + { let expected_network_height: Gauge = Default::default(); - let expected_epoch = calculate_expected_epoch( - chrono::Utc::now().timestamp() as u64, + // If adopting injection, replace with: (self.get_now)() + let expected_epoch = calculate_expected_epoch( + chrono::Utc::now().timestamp() as u64, self.genesis_timestamp, self.block_delay_secs, );If desired, I can provide a full diff adding
get_now: Arc<dyn Fn() -> u64 + Send + Sync>to the struct/new().
66-79: Naming consistency (epoch vs height)You expose "head_epoch" but "expected_network_height". Consider "expected_head_epoch" for symmetry.
- "expected_network_height", - "The expected network height based on the current time and the genesis block time", + "expected_head_epoch", + "The expected head epoch based on current time and genesis time",If you keep the current name, no blocker—just a consistency note.
149-152: Nit: explicit cast for clarityBe explicit about the gauge’s integer type.
- av_gauge.set(actor_version as _); + av_gauge.set(actor_version as i64);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
CHANGELOG.md(1 hunks)docs/docs/users/reference/metrics.md(2 hunks)src/chain/store/chain_store.rs(1 hunks)src/chain_sync/metrics.rs(1 hunks)src/networks/metrics.rs(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- CHANGELOG.md
- docs/docs/users/reference/metrics.md
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (1)
src/networks/metrics.rs (1)
src/networks/mod.rs (7)
calculate_expected_epoch(590-596)network_height(416-423)epoch(468-481)network_version(427-432)network_version(745-755)from(186-224)network_version_revision(436-444)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: tests
- GitHub Check: tests-release
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: All lint checks
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: Check
- GitHub Check: Deploy to Cloudflare Pages
🔇 Additional comments (4)
src/networks/metrics.rs (3)
16-26: LGTM: Collector state and closure wiringStruct design with Arc-wrapped closures is appropriate; Debug(ignore) avoids leaking internals.
28-43: LGTM: constructor APIConstructor parameters are minimal and clear.
117-161: LGTM: network_version, revision, and actor_version collectorsImplementation matches ChainConfig APIs; values are computed at scrape time and encoded correctly.
src/chain/store/chain_store.rs (1)
9-9: LGTM: import reshuffle and HEAD_EPOCH side‑effect removalImport changes look good and align with removing the HEAD_EPOCH gauge update. The provided rg search produced no output, so I couldn't confirm removal of any residual HEAD_EPOCH.set calls — re-run from the repo root:
rg -nP '\bHEAD_EPOCH.set\s*('
rg -nP '\bHEAD_EPOCH\b'
| metrics::{counter::Counter, family::Family, histogram::Histogram}, | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
LGTM: import cleanup matches HEAD_EPOCH removal
No stray Gauge usage here. Please confirm no remaining references to HEAD_EPOCH across the repo.
🏁 Script executed:
#!/bin/bash
# Expect: no results
rg -nP '\bHEAD_EPOCH\b'Length of output: 188
HEAD_EPOCH references remain — remove or justify
Found in scripts/tests/calibnet_eth_mapping_check.sh:19–20 (HEAD_EPOCH=$(...) and EPOCH=$((HEAD_EPOCH - 1))). Remove or document why this reference must remain.
🤖 Prompt for AI Agents
In src/chain_sync/metrics.rs around lines 6 to 7, the review flags lingering
references to HEAD_EPOCH in scripts/tests/calibnet_eth_mapping_check.sh (lines
19–20); either remove those HEAD_EPOCH references from the script or add a clear
justification. Fix by deleting the HEAD_EPOCH calculation and adjust downstream
logic to use a documented alternative (e.g., explicit EPOCH input, a
configurable env var, or a call to a stable API), or add a comment above the
script lines explaining why HEAD_EPOCH is required, its expected source, and add
validation/fallback logic to avoid silent failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HEAD_EPOCH in the script is extracted from forest-cli info show output which is irrelevant to the metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanabi1224 You're absolutely right - thank you for the clarification! The HEAD_EPOCH in the shell script scripts/tests/calibnet_eth_mapping_check.sh is indeed a shell variable extracted from forest-cli info show output, which is completely unrelated to the Rust metrics HEAD_EPOCH constant that was removed from src/chain_sync/metrics.rs.
My analysis incorrectly flagged this as a potential issue when it's actually a legitimate usage in a different context. The script usage is fine and should remain as-is.
✏️ Learnings added
Learnt from: hanabi1224
PR: ChainSafe/forest#6079
File: src/chain_sync/metrics.rs:6-7
Timestamp: 2025-09-16T12:55:26.912Z
Learning: HEAD_EPOCH references in shell scripts (like scripts/tests/calibnet_eth_mapping_check.sh) that extract data from `forest-cli info show` are unrelated to Rust metrics constants with the same name and should not be flagged when metrics cleanup is performed.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: hanabi1224
PR: ChainSafe/forest#5969
File: src/tool/subcommands/snapshot_cmd.rs:412-412
Timestamp: 2025-08-25T13:35:24.230Z
Learning: In src/tool/subcommands/snapshot_cmd.rs, the +1 in `last_epoch = ts.epoch() - epochs as i64 + 1` fixes an off-by-1 bug where specifying --check-stateroots=N would validate N+1 epochs instead of N epochs, causing out-of-bounds errors when the snapshot contains only N recent state roots.
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
| fn encode(&self, mut encoder: DescriptorEncoder) -> Result<(), std::fmt::Error> { | ||
| let epoch = (self.get_chain_head_height)(); | ||
| { | ||
| let network_version = self.chain_config.network_version(epoch); | ||
| let nv_gauge: Gauge = Default::default(); | ||
| nv_gauge.set(u32::from(network_version) as _); | ||
| let metric_encoder = encoder.encode_descriptor( | ||
| "network_version", | ||
| "Network version of the current chain head", | ||
| None, | ||
| nv_gauge.metric_type(), | ||
| )?; | ||
| nv_gauge.encode(metric_encoder)?; | ||
| } | ||
| { | ||
| let network_version_revision = self.chain_config.network_version_revision(epoch); | ||
| let nv_gauge: Gauge = Default::default(); | ||
| nv_gauge.set(network_version_revision); | ||
| let metric_encoder = encoder.encode_descriptor( | ||
| "network_version_revision", | ||
| "Network version revision of the current chain head", | ||
| None, | ||
| nv_gauge.metric_type(), | ||
| )?; | ||
| nv_gauge.encode(metric_encoder)?; | ||
| } | ||
| { | ||
| let actor_version = (self.get_chain_head_actor_version)(); | ||
| let av_gauge: Gauge = Default::default(); | ||
| av_gauge.set(actor_version as _); | ||
| let metric_encoder = encoder.encode_descriptor( | ||
| "actor_version", | ||
| "Actor version of the current chain head", | ||
| None, | ||
| av_gauge.metric_type(), | ||
| )?; | ||
| av_gauge.encode(metric_encoder)?; | ||
| } | ||
| Ok(()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
Operational note: avoid heavy work in actor_version closure
If get_chain_head_actor_version traverses state on each scrape, consider caching per-head and updating on head changes to avoid scrape latency spikes.
🏁 Script executed:
#!/bin/bash
# Verify where the closure is defined and whether it re-walks state each time
rg -nP 'get_chain_head_actor_version|get_actor_bundle_metadata|NetworkVersionCollector::new' -C3Length of output: 4398
Avoid rebuilding the state on each Prometheus scrape — cache actor version and update on head changes.
get_chain_head_actor_version in src/daemon/mod.rs (lines 219–229) constructs a StateTree and reads actor bundle metadata on every invocation; replace with a cached value (e.g., AtomicU64/Arc) updated on chain-head changes so the metrics closure returns immediately and avoids scrape latency spikes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loading a state tree is not expensive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanabi1224, thank you for the clarification! I appreciate you sharing that domain knowledge about StateTree loading performance in the Forest codebase.
(\ /)
( ^.^ )
o_(")(")
Learning!
✏️ Learnings added
Learnt from: hanabi1224
PR: ChainSafe/forest#6079
File: src/networks/metrics.rs:122-160
Timestamp: 2025-09-16T12:56:31.565Z
Learning: In the Forest codebase, loading a StateTree is not expensive and does not cause performance concerns for Prometheus metrics collection.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
LesnyRumcajs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/metrics/mod.rs (1)
85-92: Public API change — fix call site(s) using old signaturesrc/daemon/mod.rs:238 currently calls init_prometheus(prometheus_listener, db_directory, db); update this call to pass chain_config, get_chain_head_height, and get_chain_head_actor_version (chain_config and get_chain_head_height are in scope at 235–236; add/clone get_chain_head_actor_version as needed).
♻️ Duplicate comments (1)
src/daemon/mod.rs (1)
234-249: Blocker: spawned future must be 'static — useasync move.Without
move, the async block borrows stack locals and won’t satisfyJoinSet::spawn’s'staticbound. Also captureget_chain_head_actor_versionlike height for clarity.Apply:
- services.spawn({ + services.spawn({ let chain_config = ctx.chain_config().clone(); let get_chain_head_height = get_chain_head_height.clone(); - async { + let get_chain_head_actor_version = get_chain_head_actor_version.clone(); + async move { crate::metrics::init_prometheus( prometheus_listener, db_directory, db, chain_config, get_chain_head_height, get_chain_head_actor_version, ) .await .context("Failed to initiate prometheus server") } });
🧹 Nitpick comments (2)
src/daemon/mod.rs (2)
209-218: Height getter: avoid sentinel 0; prefer explicit “unknown”.Returning
0on upgrade failure conflates “genesis” with “unknown”. Consider signaling unknown explicitly (e.g., Option) and let the collector decide whether to emit a sample.Example within current types (minimal churn): add a debug log and a distinct negative epoch for unknown if acceptable in your metric logic.
- move || { + move || { chain_store .upgrade() .map(|cs| cs.heaviest_tipset().epoch()) - .unwrap_or_default() + .unwrap_or_else(|| { + tracing::debug!("get_chain_head_height: chain_store dropped; reporting unknown"); + 0 // consider a distinct sentinel or handle Option upstream + }) }
220-233: Actor version getter may be heavy on each scrape; cache on head changes.Building a StateTree and reading bundle metadata per /metrics scrape can block the metrics path. Prefer caching the actor version on head changes and having the getter read an
Arc<AtomicU64>(or similar).If helpful, I can sketch a small subscriber that updates an atomic on
HeadChange::Applyand wire that into the getter.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/chain/store/chain_store.rs(1 hunks)src/daemon/mod.rs(4 hunks)src/metrics/mod.rs(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/chain/store/chain_store.rs
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
🧬 Code graph analysis (2)
src/metrics/mod.rs (1)
src/networks/metrics.rs (2)
new(32-42)new(104-114)
src/daemon/mod.rs (3)
src/metrics/mod.rs (2)
new(178-180)init_prometheus(85-124)src/networks/metrics.rs (2)
new(32-42)new(104-114)src/shim/state_tree.rs (1)
new_from_root(168-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: tests-release
- GitHub Check: tests
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: All lint checks
🔇 Additional comments (5)
src/metrics/mod.rs (2)
6-6: Imports LGTM.The added
ChainConfigandChainEpochimports are correct and necessary.
107-113: NetworkVersionCollector registration looks good.Collector is added to the collector registry; no ordering issues with other collectors.
src/daemon/mod.rs (3)
18-21: Import re-org is fine.No side effects; scoped to daemon module.
31-31: New StateTree import OK.Needed for actor-version lookup.
251-259: Height collector wiring LGTM.Reuses the same
get_chain_head_heightclosure; consistent with the new network metrics.
Summary of changes
Changes introduced in this pull request:
actor_version,network_versionandnetwork_version_revisionReference issue to close (if applicable)
Closes #5141
Other information and links
Change checklist
Summary by CodeRabbit