System information
Bor client version: v2.7.0 (also affects latest main)
OS & Version: Linux (Kubernetes)
Environment: Polygon Mainnet
Type of node: Sentry
Overview of the problem
All Prometheus histogram metrics using ResettingSample have broken _count and _sum values. They reset to per-interval values on every scrape instead of increasing monotonically, which violates the Prometheus counter type spec and breaks rate(), increase(), and average latency calculations.
https://prometheus.io/docs/concepts/metric_types/#counter
A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
Root cause: ResettingSample.Snapshot() in metrics/resetting_sample.go calls Clear(), which resets count and sum to 0. The Prometheus collector in metrics/prometheus/collector.go emits these reset values as counter type, but they contain only the delta since the last scrape, not cumulative totals.
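The behaviour can be illustrated with a simplified model. The `resettingSample` type below is a hypothetical stand-in written for this report, not the actual Bor implementation; it only mimics the "snapshot, then clear" pattern described above, using the same numbers as the scrape output in this issue.

```go
package main

import "fmt"

// resettingSample is a hypothetical, simplified stand-in for a sample whose
// Snapshot() clears its state. It is NOT the code in metrics/resetting_sample.go.
type resettingSample struct {
	count int64
	sum   int64
}

// Update records one observation.
func (s *resettingSample) Update(v int64) {
	s.count++
	s.sum += v
}

// Snapshot returns the current count and sum, then resets both to zero,
// so the next snapshot only covers observations made after this call.
func (s *resettingSample) Snapshot() (count, sum int64) {
	count, sum = s.count, s.sum
	s.count, s.sum = 0, 0
	return count, sum
}

func main() {
	s := &resettingSample{}

	// 100 requests before the first scrape.
	for i := 0; i < 100; i++ {
		s.Update(50000)
	}
	c1, s1 := s.Snapshot()

	// 50 more requests before the second scrape.
	for i := 0; i < 50; i++ {
		s.Update(50000)
	}
	c2, s2 := s.Snapshot()

	fmt.Println(c1, s1) // 100 5000000
	fmt.Println(c2, s2) // 50 2500000 (a delta, not the cumulative 150 / 7500000)
}
```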
Affected files and metrics:
| File | Affected metrics |
| --- | --- |
| rpc/metrics.go | All rpc_duration_*_count, rpc_duration_*_sum |
| p2p/tracker/tracker.go | P2P tracking metrics |
| eth/protocols/eth/handler.go | eth protocol metrics |
| eth/protocols/snap/handler.go | snap sync metrics |
| eth/protocols/wit/handler.go | witness protocol metrics |
Expected: _count and _sum should be monotonically increasing as required by the Prometheus counter type spec.
Actual: Both reset to interval-only values on every scrape, causing:
- rate() returns incorrect results or no data
- increase() is unreliable
- Average latency calculation (rate(_sum) / rate(_count)) is broken
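To show why the results are wrong rather than merely noisy, here is a rough sketch of how counter-reset handling interacts with per-interval values. Prometheus treats any drop in a counter as a reset and compensates for it. The `rawIncrease` function is a hypothetical simplification of that idea (it ignores the extrapolation that the real increase() performs), and the sample values are made up for illustration.

```go
package main

import "fmt"

// rawIncrease is a simplified model of counter-reset handling: whenever a
// sample is lower than the previous one, the previous value is added back as
// a correction. Extrapolation to the window boundaries is deliberately omitted.
func rawIncrease(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	correction := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			correction += samples[i-1]
		}
	}
	return samples[len(samples)-1] - samples[0] + correction
}

func main() {
	// Three scrapes with 100, 50, and 80 requests in the respective intervals.
	exposed := []float64{100, 50, 80}     // what a resetting sample reports
	cumulative := []float64{100, 150, 230} // what a counter should report

	fmt.Println(rawIncrease(exposed))    // 80
	fmt.Println(rawIncrease(cumulative)) // 130
}
```

In this model the true increase after the first scrape is 130 requests, but the per-interval series yields 80: the step from 50 to 80 is counted as 30 new requests instead of 80.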
Reproduction Steps
- Enable telemetry with metrics = true and prometheus-addr = "0.0.0.0:7071"
- Send RPC requests (e.g. eth_blockNumber)
- Scrape /debug/metrics/prometheus twice, 1 minute apart
- Observe that the rpc_duration_eth_blockNumber_success_count and _sum values decrease on the second scrape
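The comparison in the last step can be automated. The sketch below is a minimal helper, assuming the two scrape bodies have already been fetched and saved as strings; `parseScrape` and `decreased` are hypothetical names, and the parser only handles simple `name value` lines (comments are skipped, and label values containing spaces are not supported).

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseScrape reads "name value" lines from a Prometheus text-format body.
// Comment lines and lines that do not parse are skipped.
func parseScrape(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		out[fields[0]] = v
	}
	return out
}

// decreased returns the sorted names of _count and _sum series whose value
// is lower in cur than in prev.
func decreased(prev, cur map[string]float64) []string {
	var names []string
	for name, old := range prev {
		if !strings.HasSuffix(name, "_count") && !strings.HasSuffix(name, "_sum") {
			continue
		}
		if now, ok := cur[name]; ok && now < old {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	t1 := "rpc_duration_eth_blockNumber_success_count 100\n" +
		"rpc_duration_eth_blockNumber_success_sum 5000000\n"
	t2 := "rpc_duration_eth_blockNumber_success_count 50\n" +
		"rpc_duration_eth_blockNumber_success_sum 2500000\n"

	for _, name := range decreased(parseScrape(t1), parseScrape(t2)) {
		fmt.Println("decreased:", name)
	}
}
```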
Logs / Traces / Output / Error Messages
Scrape at T1 (100 requests since startup)
rpc_duration_eth_blockNumber_success_count 100
rpc_duration_eth_blockNumber_success_sum 5000000
Scrape at T2 (50 requests since T1)
rpc_duration_eth_blockNumber_success_count 50 ← decreased
rpc_duration_eth_blockNumber_success_sum 2500000 ← decreased
Code path:
rpc/metrics.go:46 — creates histogram with ResettingSample
metrics/resetting_sample.go:22 — Snapshot() calls Clear()
metrics/sample.go:165 — Clear() resets count = 0, sum = 0
metrics/prometheus/collector.go:98-107 — emits snapshot values as Prometheus counter
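One possible direction for a fix, offered only as a sketch: keep cumulative totals separately from the per-interval window, so that clearing the window for percentile calculation does not reset count and sum. The `cumulativeSample` and `snapshot` types below are hypothetical and do not match any existing Bor type or interface; the actual fix may need to look quite different.

```go
package main

import (
	"fmt"
	"sync"
)

// snapshot holds the values a collector would read on one scrape.
type snapshot struct {
	Window []int64 // observations since the previous snapshot (for percentiles)
	Count  int64   // cumulative number of observations since startup
	Sum    int64   // cumulative sum of observations since startup
}

// cumulativeSample is a hypothetical sample that resets its window on every
// snapshot but never resets its running totals.
type cumulativeSample struct {
	mu         sync.Mutex
	window     []int64
	totalCount int64
	totalSum   int64
}

// Update records one observation in both the window and the running totals.
func (s *cumulativeSample) Update(v int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.window = append(s.window, v)
	s.totalCount++
	s.totalSum += v
}

// Snapshot returns the current window plus cumulative totals, then clears
// only the window.
func (s *cumulativeSample) Snapshot() snapshot {
	s.mu.Lock()
	defer s.mu.Unlock()
	snap := snapshot{Window: s.window, Count: s.totalCount, Sum: s.totalSum}
	s.window = nil
	return snap
}

func main() {
	s := &cumulativeSample{}
	for i := 0; i < 100; i++ {
		s.Update(50000)
	}
	a := s.Snapshot()
	for i := 0; i < 50; i++ {
		s.Update(50000)
	}
	b := s.Snapshot()

	fmt.Println(a.Count, a.Sum, len(a.Window)) // 100 5000000 100
	fmt.Println(b.Count, b.Sum, len(b.Window)) // 150 7500000 50
}
```

With this shape, _count and _sum would stay monotonic across scrapes while quantiles could still be computed over the most recent interval only.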