Prometheus metrics violate counter monotonicity convention: _count and _sum reset on every scrape due to ResettingSample #2173

@lake-dunamu

System information

Bor client version: v2.7.0 (also affects latest main)

OS & Version: Linux (Kubernetes)

Environment: Polygon Mainnet

Type of node: Sentry

Overview of the problem

All Prometheus histogram metrics backed by ResettingSample expose broken _count and _sum values. Instead of increasing monotonically, both are reset on every scrape, which violates the Prometheus counter type spec and breaks rate(), increase(), and average latency calculations.

https://prometheus.io/docs/concepts/metric_types/#counter

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.

Root cause: ResettingSample.Snapshot() in metrics/resetting_sample.go calls Clear(), which resets count and sum to 0. The Prometheus collector in metrics/prometheus/collector.go then emits these values as counter type, even though they contain only the delta since the last scrape, not cumulative totals.
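The reset-on-snapshot behavior described above can be illustrated with a minimal sketch. The `resettingSample` type and the `observe` helper below are hypothetical stand-ins written for this example; they are not the actual code from metrics/resetting_sample.go, and the observation value of 50000 is chosen only so the numbers match the output reported in this issue.

```go
package main

import "fmt"

// resettingSample is a hypothetical stand-in for a sample whose Snapshot()
// clears its state. It is not the real type from metrics/resetting_sample.go.
type resettingSample struct {
	count int64
	sum   int64
}

// Update records one observation.
func (s *resettingSample) Update(v int64) {
	s.count++
	s.sum += v
}

// Snapshot returns the values accumulated so far and then clears them,
// so the next snapshot only covers observations made after this call.
func (s *resettingSample) Snapshot() (count, sum int64) {
	count, sum = s.count, s.sum
	s.count, s.sum = 0, 0
	return count, sum
}

// observe feeds n observations of value v into the sample.
func observe(s *resettingSample, n int, v int64) {
	for i := 0; i < n; i++ {
		s.Update(v)
	}
}

func main() {
	s := &resettingSample{}

	observe(s, 100, 50000)
	c1, sum1 := s.Snapshot() // first scrape
	observe(s, 50, 50000)
	c2, sum2 := s.Snapshot() // second scrape

	fmt.Println(c1, sum1) // 100 5000000
	fmt.Println(c2, sum2) // 50 2500000
}
```

The second snapshot reports only the 50 observations made since the first one, which is the per-interval delta rather than a cumulative total.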

Affected files and metrics:

| File | Affected metrics |
| --- | --- |
| rpc/metrics.go | All rpc_duration_*_count, rpc_duration_*_sum |
| p2p/tracker/tracker.go | P2P tracking metrics |
| eth/protocols/eth/handler.go | eth protocol metrics |
| eth/protocols/snap/handler.go | snap sync metrics |
| eth/protocols/wit/handler.go | witness protocol metrics |

Expected: _count and _sum should be monotonically increasing as required by the Prometheus counter type spec.
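As an illustration of the expected behavior, the sketch below keeps running totals by adding each interval's delta before the values are exposed. `cumulativeTotals` is a hypothetical helper invented for this example, not an existing Bor type and not a proposed patch. It is also not safe for concurrent use as written.

```go
package main

import "fmt"

// cumulativeTotals is a hypothetical helper, not part of Bor. It keeps
// running totals across scrapes by adding each interval's delta.
type cumulativeTotals struct {
	count int64
	sum   int64
}

// Add folds one interval's count and sum into the running totals and
// returns the cumulative values that would be exposed as _count and _sum.
func (t *cumulativeTotals) Add(deltaCount, deltaSum int64) (int64, int64) {
	t.count += deltaCount
	t.sum += deltaSum
	return t.count, t.sum
}

func main() {
	var totals cumulativeTotals

	c, s := totals.Add(100, 5000000) // delta seen at the first scrape
	fmt.Println(c, s)                // 100 5000000

	c, s = totals.Add(50, 2500000) // delta seen at the second scrape
	fmt.Println(c, s)              // 150 7500000
}
```

With the deltas from this report, the second scrape would expose 150 and 7500000, so neither value ever decreases while the process is running.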

Actual: Both reset to interval-only values on every scrape, causing:

  • rate() returns incorrect results or no data
  • increase() is unreliable
  • Average latency calculation (rate(_sum) / rate(_count)) is broken
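To see why counter functions go wrong, here is a simplified model of reset-aware increase: any drop between consecutive samples is read as a restart from zero. This is a sketch of the general idea only. It omits the extrapolation Prometheus applies at window boundaries, and the third scrape value (80 requests in the interval) is made up for illustration.

```go
package main

import "fmt"

// increase is a simplified model of a counter-aware increase over a series:
// a drop between consecutive samples is treated as a reset to zero.
// Extrapolation to the window boundaries is deliberately left out.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			total += samples[i] // assumed restart from 0
		} else {
			total += samples[i] - samples[i-1]
		}
	}
	return total
}

func main() {
	// 100 requests before the first scrape, then 50, then 80 (hypothetical).
	cumulative := []float64{100, 150, 230} // what a counter should expose
	perScrape := []float64{100, 50, 80}    // what a resetting sample exposes

	fmt.Println(increase(cumulative)) // 130
	fmt.Println(increase(perScrape))  // 80
}
```

In this model the per-scrape series yields 80 instead of the true 130 requests: whenever a delta is larger than the previous one, only the difference between the two deltas is counted.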

Reproduction Steps

  1. Enable telemetry with metrics = true and prometheus-addr = "0.0.0.0:7071"
  2. Send RPC requests (e.g. eth_blockNumber)
  3. Scrape /debug/metrics/prometheus twice, one minute apart
  4. Observe that the rpc_duration_eth_blockNumber_success_count and _sum values decrease on the second scrape

Logs / Traces / Output / Error Messages

Scrape at T1 (100 requests since startup)

rpc_duration_eth_blockNumber_success_count 100
rpc_duration_eth_blockNumber_success_sum 5000000

Scrape at T2 (50 requests since T1)

rpc_duration_eth_blockNumber_success_count 50 ← decreased
rpc_duration_eth_blockNumber_success_sum 2500000 ← decreased
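The comparison of the two scrapes can be automated with a small check like the sketch below. `parseSamples` and `decreased` are hypothetical helpers written for this example. The parser only handles plain "name value" lines as shown above and skips anything else, such as comment lines or lines carrying labels.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSamples reads plain "name value" lines and skips blank lines,
// comment lines, and lines whose second field is not a number.
func parseSamples(body string) map[string]float64 {
	out := make(map[string]float64)
	for _, line := range strings.Split(body, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || strings.HasPrefix(fields[0], "#") {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		out[fields[0]] = v
	}
	return out
}

// decreased returns, in sorted order, the names ending in _count or _sum
// whose value is lower in the second scrape than in the first.
func decreased(first, second map[string]float64) []string {
	var names []string
	for name, before := range first {
		if !strings.HasSuffix(name, "_count") && !strings.HasSuffix(name, "_sum") {
			continue
		}
		if after, ok := second[name]; ok && after < before {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	t1 := "rpc_duration_eth_blockNumber_success_count 100\n" +
		"rpc_duration_eth_blockNumber_success_sum 5000000\n"
	t2 := "rpc_duration_eth_blockNumber_success_count 50\n" +
		"rpc_duration_eth_blockNumber_success_sum 2500000\n"

	for _, name := range decreased(parseSamples(t1), parseSamples(t2)) {
		fmt.Println("decreased:", name)
	}
}
```

Run against the two scrapes above, it flags both the _count and the _sum series.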

Code path:

  1. rpc/metrics.go:46 creates the histogram with a ResettingSample
  2. Snapshot() at metrics/resetting_sample.go:22 calls Clear()
  3. Clear() at metrics/sample.go:165 resets count = 0, sum = 0
  4. metrics/prometheus/collector.go:98-107 emits the snapshot values as a Prometheus counter
