
feat(metrics): align provider metrics with lava_provider_ spec #2239

Merged
nimrod-teich merged 1 commit into main from metrics-provider-only on Mar 25, 2026

Conversation

@NadavLevi (Contributor) commented Mar 18, 2026

User description

  • Add lava_provider_ prefix to all provider metrics
  • Convert latency metrics to HistogramVec (end_to_end, provider, per_function)
  • Merge relay/errored counters into single metric with function label
  • Add apiInterface label to fetch/block metrics and lava_provider_latest_block
  • Remove lava_consumer_QoS and lava_consumer_qos_metrics from provider side
  • Update chain_tracker and rpcprovider callers accordingly
  • Add provider_metrics.md documentation

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Align provider instrumentation with the refreshed lava_provider_* schema by renaming and augmenting observable metrics, converting latency gauges to histograms, merging relay/error counters, and propagating the new labels through the chain tracker and rpcprovider paths. Supplement provider observability guidance with health and analytics improvements plus resilience documentation that highlights cache signals, endpoint health breakdowns, and failsafe-go adoption considerations.

Topic | Details
Developer scripts | Clean the developer setup scripts by dropping the obsolete --add-api-method-metrics flag wherever rpcconsumer is started so that the new provider-side instrumentation is authoritative.
Modified files (7)
  • scripts/pre_setups/init_eth_addons.sh
  • scripts/pre_setups/init_lava_only_with_node.sh
  • scripts/pre_setups/init_lava_only_with_node_protocol_only.sh
  • scripts/pre_setups/init_lava_only_with_node_rate_limit.sh
  • scripts/pre_setups/init_lava_only_with_node_with_python_proxy.sh
  • scripts/pre_setups/init_movement_only_with_node.sh
  • scripts/pre_setups/init_solana_only_with_node.sh
Latest Contributors (2)
User | Commit | Date
ran@lavanet.xyz | added trace verifications | December 29, 2024
shleikes | refactor: PRT - Replac... | December 15, 2024
Provider metrics spec | Modernize provider instrumentation by renaming metrics to the lava_provider_* schema, tagging them with apiInterface/function labels, switching latency gauges to histograms, and updating the documentation, chain tracker bindings, and rpcprovider relay recording to match the new schema.
Modified files (7)
  • docs/metrics/provider_metrics.md
  • protocol/chaintracker/chain_tracker.go
  • protocol/metrics/provider_metrics.go
  • protocol/metrics/provider_metrics_manager.go
  • protocol/metrics/provider_metrics_test.go
  • protocol/rpcprovider/rpcprovider.go
  • protocol/rpcprovider/rpcprovider_server.go
Latest Contributors (2)
User | Commit | Date
NadavLevi | feat(smart-router)!: i... | March 11, 2026
anna@magmadevs.com | feat(provideroptimizer... | February 22, 2026
Health analytics | Improve health/analytics flows by recording cache hits, deduplicating cross-validation provider labels, exposing per-apiInterface breakdown hooks, and ensuring consumer/smartrouter servers instantiate the augmented RelayMetrics payload before emitting cache metrics.
Modified files (6)
  • protocol/metrics/analytics.go
  • protocol/metrics/consumer_metrics_manager.go
  • protocol/metrics/relays_monitor_aggregator.go
  • protocol/metrics/smartrouter_metrics_manager.go
  • protocol/rpcconsumer/rpcconsumer_server.go
  • protocol/rpcsmartrouter/rpcsmartrouter_server.go
Latest Contributors (2)
User | Commit | Date
NadavLevi | chore(metrics): cleanu... | March 23, 2026
anna@magmadevs.com | feat(provideroptimizer... | February 22, 2026
Failsafe analysis | Document the resilience stack migration by outlining the risks, phases, and architectural mismatches when integrating failsafe-go, so stakeholders understand the complexity before introducing policy-based retry/circuit-breaker logic.
Modified files (1)
  • failsafe-go-integration-analysis.md
Latest Contributors (0)

@qodo-code-review


Review Summary by Qodo

Align provider metrics with lava_provider_ spec and convert to histograms

✨ Enhancement


Walkthroughs

Description
• Convert latency metrics from Gauge to HistogramVec for better distribution analysis
• Add apiInterface label to block fetch metrics and latest block tracking
• Merge relay/errored counters into unified metrics with function labels
• Remove consumer QoS metrics from provider side (lava_consumer_QoS)
• Simplify metric initialization and constructor signatures
• Add comprehensive provider metrics documentation with PromQL examples
Diagram
flowchart LR
  A["Latency Metrics<br/>Gauge → Histogram"] -->|Observe instead of Set| B["Better Distribution<br/>Analysis"]
  C["Block Fetch Metrics"] -->|Add apiInterface label| D["Enhanced<br/>Granularity"]
  E["Relay/Error Counters"] -->|Merge with function label| F["Unified Metric<br/>Structure"]
  G["Consumer QoS Metrics"] -->|Remove from provider| H["Cleaner Scope"]
  I["Constructor Params"] -->|Simplify| J["Reduced Complexity"]
  B --> K["Provider Metrics<br/>Aligned"]
  D --> K
  F --> K
  H --> K
  J --> K

File Changes

1. protocol/metrics/provider_metrics.go ✨ Enhancement +19/-56

Convert latency metrics to histograms and simplify structure

• Convert providerLatencyMetric and providerEndToEndLatencyMetric from GaugeVec to HistogramVec
• Convert requestLatencyPerFunctionMetric from MappedLabelsGaugeVec to HistogramVec
• Change latency recording from Set() to Observe() method calls
• Remove totalRequestsPerFunctionMetric and totalErrorsPerFunctionMetric fields
• Remove consumerQoSMetric field and all QoS-related code from AddRelay()
• Merge error tracking: AddFunctionError() now increments both totalRelaysServicedMetric and
 totalErroredMetric
• Simplify NewProviderMetrics() constructor signature by removing 4 parameters
• Remove unnecessary mutex locks from latency methods

protocol/metrics/provider_metrics.go
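The practical difference between `Set()` on a gauge and `Observe()` on a histogram can be sketched without the Prometheus client. This is a minimal stdlib-only model, not the PR's code: the `gauge` and `histogram` types, the bucket bounds, and the sample latencies are all hypothetical, and the histogram only imitates the cumulative "less than or equal" bucket semantics.

```go
package main

import "fmt"

// gauge models a gauge-style metric: Set overwrites the previous value.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

// histogram models cumulative buckets: counts[i] holds the number of
// observations less than or equal to upperBounds[i].
type histogram struct {
	upperBounds []float64 // assumed sorted ascending
	counts      []uint64
	sum         float64
	total       uint64 // plays the role of the implicit +Inf bucket
}

func newHistogram(upperBounds []float64) *histogram {
	return &histogram{upperBounds: upperBounds, counts: make([]uint64, len(upperBounds))}
}

// Observe records one sample in every bucket whose bound covers it.
func (h *histogram) Observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.sum += v
	h.total++
}

func main() {
	latenciesMs := []float64{12, 480, 35, 2200, 8} // illustrative samples
	g := &gauge{}
	h := newHistogram([]float64{10, 50, 500, 5000})
	for _, l := range latenciesMs {
		g.Set(l)
		h.Observe(l)
	}
	fmt.Println("gauge keeps only the last sample:", g.value)
	fmt.Println("histogram cumulative counts:", h.counts, "total:", h.total)
}
```

The gauge ends up holding whichever sample arrived last, while the histogram retains enough of the distribution to estimate quantiles at query time.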


2. protocol/metrics/provider_metrics_manager.go ✨ Enhancement +60/-87

Add apiInterface labels and convert to histogram metrics

• Add apiInterface parameter to block fetch metric methods: SetLatestBlockFetchError(),
 SetSpecificBlockFetchError(), SetLatestBlockFetchSuccess(), SetSpecificBlockFetchSuccess()
• Convert providerLatencyMetric and providerEndToEndLatencyMetric from GaugeVec to HistogramVec
 with predefined buckets
• Create requestLatencyPerFunctionMetric as HistogramVec instead of MappedLabelsGaugeVec
• Remove totalRequestsPerFunctionMetric, totalErrorsPerFunctionMetric, and consumerQoSMetric
 fields
• Rename lava_latest_block to lava_provider_latest_block and add apiInterface label
• Add apiInterface label to all block fetch metrics
• Update SetLatestBlock() to accept apiInterface parameter
• Remove AddApiMethodCallsMetrics constant
• Rename virtual_epoch to lava_provider_virtual_epoch
• Update metric registration and manager initialization

protocol/metrics/provider_metrics_manager.go


3. protocol/chaintracker/chain_tracker.go ✨ Enhancement +4/-4

Pass apiInterface to block fetch metrics

• Add cs.endpoint.ApiInterface parameter to all four block fetch metric calls
• Update SetLatestBlockFetchError(), SetLatestBlockFetchSuccess(),
 SetSpecificBlockFetchError(), and SetSpecificBlockFetchSuccess() invocations

protocol/chaintracker/chain_tracker.go
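Adding `apiInterface` to the label set means each interface of the same spec gets its own series. A rough stdlib-only illustration follows; `blockFetchKey`, `blockFetchCounter`, and the `"LAV1"`, `"rest"`, and `"grpc"` values are hypothetical stand-ins, not names taken from the PR.

```go
package main

import "fmt"

// blockFetchKey is a hypothetical label set: one series per (spec, apiInterface).
type blockFetchKey struct {
	spec         string
	apiInterface string
}

// blockFetchCounter is a stand-in for a labelled counter vector.
type blockFetchCounter struct {
	series map[blockFetchKey]uint64
}

func newBlockFetchCounter() *blockFetchCounter {
	return &blockFetchCounter{series: make(map[blockFetchKey]uint64)}
}

// Inc bumps the series identified by the full label set.
func (c *blockFetchCounter) Inc(spec, apiInterface string) {
	c.series[blockFetchKey{spec, apiInterface}]++
}

// Get returns the current value of one series (zero if never incremented).
func (c *blockFetchCounter) Get(spec, apiInterface string) uint64 {
	return c.series[blockFetchKey{spec, apiInterface}]
}

func main() {
	fetchErrors := newBlockFetchCounter()
	fetchErrors.Inc("LAV1", "rest")
	fetchErrors.Inc("LAV1", "rest")
	fetchErrors.Inc("LAV1", "grpc")
	fmt.Println("rest:", fetchErrors.Get("LAV1", "rest"), "grpc:", fetchErrors.Get("LAV1", "grpc"))
}
```

Without the extra label, both interfaces would fold into a single series and a failing REST endpoint could hide behind a healthy gRPC one.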


4. protocol/rpcprovider/rpcprovider.go ✨ Enhancement +1/-1

Pass apiInterface to SetLatestBlock call

• Add apiInterface parameter to SetLatestBlock() call in the ChainTracker callback
• Update method invocation to pass chain ID, API interface, endpoint address, and block number

protocol/rpcprovider/rpcprovider.go


5. protocol/metrics/provider_metrics_test.go 🧪 Tests +205/-0

Add comprehensive provider metrics test coverage

• Add comprehensive test suite for provider metrics with 13 test functions
• Test relay counting, CU accumulation, error tracking, and in-flight relay management
• Verify histogram observations for latency metrics (function, provider, and end-to-end)
• Test nil-safety for all metric operations
• Use test-specific metric names to prevent counter accumulation across tests

protocol/metrics/provider_metrics_test.go
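The nil-safety these tests exercise usually relies on Go allowing method calls on a nil pointer receiver. A minimal sketch of that idiom, assuming a hypothetical `providerMetrics` type rather than the real struct in `protocol/metrics`:

```go
package main

import "fmt"

// providerMetrics is a hypothetical, stripped-down metrics holder.
type providerMetrics struct {
	relays uint64
}

// AddRelay is a no-op on a nil receiver, so callers need no nil checks
// when metrics are disabled.
func (pm *providerMetrics) AddRelay() {
	if pm == nil {
		return
	}
	pm.relays++
}

// Relays reports the count, treating a nil receiver as zero.
func (pm *providerMetrics) Relays() uint64 {
	if pm == nil {
		return 0
	}
	return pm.relays
}

func main() {
	var disabled *providerMetrics // metrics turned off
	disabled.AddRelay()           // does not panic
	enabled := &providerMetrics{}
	enabled.AddRelay()
	fmt.Println(disabled.Relays(), enabled.Relays())
}
```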


6. docs/metrics/provider_metrics.md 📝 Documentation +163/-0

Add comprehensive provider metrics documentation

• Create new documentation file explaining all provider metrics with detailed descriptions
• Document key concepts: relay, function, CU, load rate, block fetch, frozen/jailed states, virtual
 epoch
• Explain counting invariant showing relationship between total relays and errored relays
• Provide detailed tables for relay serving, latency, node tracking, health, and on-chain status
 metrics
• Include 12 practical PromQL query examples for monitoring and alerting
• Document optional provider_endpoint label behavior and HTTP health endpoints

docs/metrics/provider_metrics.md
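The counting invariant can be illustrated with two per-function counters. This sketch assumes that `AddRelay` is called only for successful relays while `AddFunctionError` bumps both counters, as the walkthrough for `provider_metrics.go` describes; the `relayCounters` type and the `Successful` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// relayCounters is a hypothetical stand-in for the merged relay and error
// counters, keyed by function name.
type relayCounters struct {
	mu      sync.Mutex
	total   map[string]uint64
	errored map[string]uint64
}

func newRelayCounters() *relayCounters {
	return &relayCounters{total: map[string]uint64{}, errored: map[string]uint64{}}
}

// AddRelay records a successfully serviced relay (assumption: success path only).
func (c *relayCounters) AddRelay(function string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total[function]++
}

// AddFunctionError records a failed relay in both counters.
func (c *relayCounters) AddFunctionError(function string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total[function]++
	c.errored[function]++
}

// Successful derives the success count: total minus errored.
func (c *relayCounters) Successful(function string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.total[function] - c.errored[function]
}

func main() {
	c := newRelayCounters()
	c.AddRelay("eth_blockNumber")
	c.AddRelay("eth_blockNumber")
	c.AddFunctionError("eth_blockNumber")
	fmt.Println("total:", c.total["eth_blockNumber"], "errored:", c.errored["eth_blockNumber"],
		"successful:", c.Successful("eth_blockNumber"))
}
```

Because errors are included in the total, the errored count can never exceed it, and an error ratio is simply errored divided by total.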



@qodo-code-review

qodo-code-review Bot commented Mar 18, 2026


Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)



Action required

1. Undefined latency buckets 🐞 Bug ✓ Correctness
Description
Provider histogram metrics are created with Buckets: LatencyBuckets, but LatencyBuckets is not
defined in the metrics package, causing compilation to fail. This blocks building any binary that
imports protocol/metrics.
Code

protocol/metrics/provider_metrics_manager.go[R90-94]

+	requestLatencyPerFunctionMetric := prometheus.NewHistogramVec(prometheus.HistogramOpts{
+		Name:    "lava_provider_request_latency_milliseconds",
+		Help:    "Distribution of relay latency per function in milliseconds.",
+		Buckets: LatencyBuckets,
+	}, []string{"spec", "apiInterface", "function"})
Evidence
NewProviderMetricsManager uses LatencyBuckets for HistogramVec bucket configuration, but the
metrics package has no LatencyBuckets definition (only defaultLatencyBuckets exists elsewhere).
This results in an undefined identifier at compile time.

protocol/metrics/provider_metrics_manager.go[90-94]
protocol/metrics/provider_metrics_manager.go[149-159]
protocol/metrics/smartrouter_metrics_manager.go[19-22]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`protocol/metrics/provider_metrics_manager.go` references `LatencyBuckets` when creating histograms, but `LatencyBuckets` is not defined in the `metrics` package.
## Issue Context
There is already an in-package bucket list (`defaultLatencyBuckets`) used by smart router metrics. Provider metrics should use a defined bucket set as well (ideally shared).
## Fix Focus Areas
- protocol/metrics/provider_metrics_manager.go[90-94]
- protocol/metrics/provider_metrics_manager.go[149-159]
- protocol/metrics/smartrouter_metrics_manager.go[19-22]
## Suggested fix
- Add a package-level exported var/const in `protocol/metrics` (e.g., in a new file `protocol/metrics/latency_buckets.go`):
- `var LatencyBuckets = []float64{1,2,5,10,25,50,100,250,500,1000,2500,5000,10000,30000}`
- Optionally refactor smart router to reuse `LatencyBuckets` instead of maintaining `defaultLatencyBuckets` (to keep docs/code aligned).
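A shared bucket list along the lines of the suggested fix could look like the sketch below. The bucket values are the ones quoted in the suggestion; `strictlyIncreasing` and `bucketFor` are hypothetical helpers added only for illustration. The ordering check reflects my understanding that the Prometheus Go client rejects bounds that are not in increasing order.

```go
package main

import (
	"fmt"
	"sort"
)

// LatencyBuckets mirrors the bucket list proposed in the review (milliseconds).
var LatencyBuckets = []float64{1, 2, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000}

// strictlyIncreasing reports whether every bound is larger than the previous one.
func strictlyIncreasing(bounds []float64) bool {
	for i := 1; i < len(bounds); i++ {
		if bounds[i] <= bounds[i-1] {
			return false
		}
	}
	return true
}

// bucketFor returns the smallest upper bound that is >= latencyMs, or
// ok=false when the value only fits the implicit +Inf bucket.
func bucketFor(bounds []float64, latencyMs float64) (upper float64, ok bool) {
	i := sort.SearchFloat64s(bounds, latencyMs)
	if i == len(bounds) {
		return 0, false
	}
	return bounds[i], true
}

func main() {
	fmt.Println("valid ordering:", strictlyIncreasing(LatencyBuckets))
	upper, ok := bucketFor(LatencyBuckets, 120)
	fmt.Println("120ms lands in bucket:", upper, ok)
}
```

Keeping a single exported slice means the provider and smart router histograms cannot silently drift apart, which is the alignment concern the suggestion raises.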



2. Missing AddApiMethodCallsMetrics 🐞 Bug ✓ Correctness
Description
metrics.AddApiMethodCallsMetrics was removed from the metrics package constants, but rpcconsumer
and rpcsmartrouter still reference it for flag binding and viper lookups. This causes compilation
failures for those binaries.
Code

protocol/metrics/provider_metrics_manager.go[18]

-	AddApiMethodCallsMetrics      = "add-api-method-metrics"
Evidence
The metrics package const block no longer defines AddApiMethodCallsMetrics, yet both rpcconsumer
and rpcsmartrouter still call viper.GetBool(metrics.AddApiMethodCallsMetrics) and register a flag
using that constant, producing an undefined identifier error at build time.

protocol/metrics/provider_metrics_manager.go[16-27]
protocol/rpcconsumer/rpcconsumer.go[624-726]
protocol/rpcsmartrouter/rpcsmartrouter.go[1322-1433]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The exported constant `AddApiMethodCallsMetrics` was removed from the `metrics` package, but it is still referenced by `rpcconsumer` and `rpcsmartrouter`.
## Issue Context
Both binaries use `metrics.AddApiMethodCallsMetrics` as the flag name and for viper reads; removing it breaks compilation.
## Fix Focus Areas
- protocol/metrics/provider_metrics_manager.go[16-27]
- protocol/rpcconsumer/rpcconsumer.go[624-726]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1322-1433]
## Suggested fix
- Re-introduce `AddApiMethodCallsMetrics` as an exported constant in the `metrics` package (e.g., in `provider_metrics_manager.go` const block or a new shared `protocol/metrics/flags.go`).
- Keep the flag name string stable (the previous value in this repo was `"add-api-method-metrics"`).
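Why the shared constant matters can be shown with a small sketch. The review says the binaries use viper for this; the example swaps in the standard library `flag` package so it stays self-contained, and `parseMethodMetricsFlag` is a hypothetical helper, not a function from the repository.

```go
package main

import (
	"flag"
	"fmt"
)

// AddApiMethodCallsMetrics is the flag name quoted in the review.
const AddApiMethodCallsMetrics = "add-api-method-metrics"

// parseMethodMetricsFlag registers and reads the flag through the same
// constant, so a rename or removal surfaces at compile time.
func parseMethodMetricsFlag(args []string) (bool, error) {
	fs := flag.NewFlagSet("rpcconsumer", flag.ContinueOnError)
	enabled := fs.Bool(AddApiMethodCallsMetrics, false, "emit per-API-method call metrics")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	enabled, err := parseMethodMetricsFlag([]string{"--" + AddApiMethodCallsMetrics})
	fmt.Println("enabled:", enabled, "err:", err)
}
```

If the constant is deleted while callers still reference it, every binary that binds the flag fails to build, which is the failure mode reported here.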



3. Missing QoS label constants 🐞 Bug ✓ Correctness
Description
AvailabilityLabel, SyncLabel, and LatencyLabel were removed from the metrics package, but
ConsumerMetricsManager.SetQOSMetrics still uses them when setting QoS metrics labels. This causes
compilation errors in the metrics package itself.
Code

protocol/metrics/provider_metrics.go[L12-14]

-	AvailabilityLabel = "availability"
-	SyncLabel         = "sync/freshness"
-	LatencyLabel      = "latency"
Evidence
provider_metrics.go no longer defines the AvailabilityLabel/SyncLabel/LatencyLabel
constants, but consumer_metrics_manager.go references them directly in the same metrics package.
Go compilation fails with undefined identifiers.

protocol/metrics/provider_metrics.go[11-18]
protocol/metrics/consumer_metrics_manager.go[709-736]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`AvailabilityLabel`, `SyncLabel`, and `LatencyLabel` constants were removed from the `metrics` package, but are still used by consumer-side metrics code in `ConsumerMetricsManager.SetQOSMetrics`.
## Issue Context
These constants are used as label values for the consumer QoS metrics (`qos_metric` label). Removing them breaks compilation.
## Fix Focus Areas
- protocol/metrics/provider_metrics.go[11-18]
- protocol/metrics/consumer_metrics_manager.go[709-736]
## Suggested fix
- Re-add the constants in the `metrics` package (preferably in a consumer-oriented constants file to avoid future provider cleanups removing them):
- `AvailabilityLabel = "availability"`
- `SyncLabel = "sync/freshness"` (note: this differs from `SelectionSyncLabel`)
- `LatencyLabel = "latency"`
- Keep `Selection*Label` constants as-is for selection stats metrics.




Comment thread protocol/metrics/provider_metrics_manager.go
Comment thread protocol/metrics/provider_metrics_manager.go
Comment thread protocol/metrics/provider_metrics.go
@github-actions

github-actions Bot commented Mar 18, 2026


Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit e054c06. ± Comparison against base commit 5efe92d.

♻️ This comment has been updated with latest results.

@codecov

codecov Bot commented Mar 18, 2026


Codecov Report

❌ Patch coverage is 36.60714% with 71 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/metrics/provider_metrics_manager.go 20.68% 45 Missing and 1 partial ⚠️
protocol/rpcsmartrouter/rpcsmartrouter_server.go 25.00% 8 Missing and 1 partial ⚠️
protocol/metrics/relays_monitor_aggregator.go 0.00% 6 Missing ⚠️
protocol/rpcconsumer/rpcconsumer_server.go 73.68% 4 Missing and 1 partial ⚠️
protocol/chaintracker/chain_tracker.go 50.00% 2 Missing ⚠️
protocol/metrics/consumer_metrics_manager.go 0.00% 0 Missing and 1 partial ⚠️
protocol/rpcprovider/rpcprovider.go 0.00% 1 Missing ⚠️
protocol/rpcprovider/rpcprovider_server.go 0.00% 1 Missing ⚠️
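The headline figure can be cross-checked against the per-file rows. In the sketch below the missing and partial counts are copied from the table above; the total of 112 changed lines is not stated in the report and is inferred from the reported percentage, so treat it as an assumption. The `fileGap` type and both helpers are hypothetical.

```go
package main

import "fmt"

// fileGap holds the "Missing" and "partial" counts from one report row.
type fileGap struct {
	missing int
	partial int
}

// uncoveredLines sums missing and partial lines across all rows.
func uncoveredLines(gaps []fileGap) int {
	total := 0
	for _, g := range gaps {
		total += g.missing + g.partial
	}
	return total
}

// patchCoverage returns the covered share of changed lines as a percentage.
func patchCoverage(changedLines, uncovered int) float64 {
	return 100 * float64(changedLines-uncovered) / float64(changedLines)
}

func main() {
	// Rows in report order: provider_metrics_manager.go first, rpcprovider_server.go last.
	gaps := []fileGap{{45, 1}, {8, 1}, {6, 0}, {4, 1}, {2, 0}, {0, 1}, {1, 0}, {1, 0}}
	uncovered := uncoveredLines(gaps)
	// 112 changed lines is inferred, not reported.
	fmt.Printf("uncovered=%d coverage=%.5f%%\n", uncovered, patchCoverage(112, uncovered))
}
```

The rows sum to the 71 uncovered lines quoted in the summary, and 41 covered lines out of an assumed 112 reproduces the 36.60714% patch coverage.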
Flag Coverage Δ
consensus 8.71% <ø> (ø)
protocol 33.93% <36.60%> (+0.47%) ⬆️


Files with missing lines Coverage Δ
protocol/metrics/analytics.go 0.00% <ø> (ø)
protocol/metrics/provider_metrics.go 93.93% <100.00%> (+93.93%) ⬆️
protocol/metrics/smartrouter_metrics_manager.go 21.43% <ø> (ø)
protocol/metrics/consumer_metrics_manager.go 13.89% <0.00%> (ø)
protocol/rpcprovider/rpcprovider.go 8.35% <0.00%> (ø)
protocol/rpcprovider/rpcprovider_server.go 9.55% <0.00%> (ø)
protocol/chaintracker/chain_tracker.go 61.26% <50.00%> (ø)
protocol/rpcconsumer/rpcconsumer_server.go 32.44% <73.68%> (+0.58%) ⬆️
protocol/metrics/relays_monitor_aggregator.go 0.00% <0.00%> (ø)
protocol/rpcsmartrouter/rpcsmartrouter_server.go 13.49% <25.00%> (+0.04%) ⬆️
... and 1 more

... and 2 files with indirect coverage changes


@NadavLevi NadavLevi force-pushed the metrics-provider-only branch from c190119 to 1a9ed3e Compare March 18, 2026 12:04
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch 4 times, most recently from 0b1f54d to 0b65b74 Compare March 18, 2026 12:30
Comment thread protocol/rpcconsumer/rpcconsumer_server.go
Comment thread protocol/rpcconsumer/rpcconsumer_server.go
Comment thread protocol/rpcconsumer/rpcconsumer_server.go Outdated
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch 3 times, most recently from 20130f5 to 2097223 Compare March 18, 2026 13:09
Comment thread protocol/metrics/smartrouter_metrics_manager.go
Comment thread protocol/metrics/consumer_metrics_manager.go
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch 5 times, most recently from 0084636 to 566cae1 Compare March 19, 2026 08:25
Comment thread protocol/metrics/consumer_metrics_manager.go
Comment thread protocol/metrics/consumer_metrics_manager.go
Comment thread protocol/rpcconsumer/rpcconsumer.go
Comment thread protocol/metrics/smartrouter_metrics_manager.go
Comment thread protocol/metrics/smartrouter_metrics_manager.go
Comment thread protocol/metrics/smartrouter_metrics_manager.go
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch from 566cae1 to e333270 Compare March 19, 2026 10:08
Comment thread protocol/metrics/provider_metrics_manager.go
Comment thread protocol/metrics/provider_metrics_manager.go
Comment thread protocol/metrics/smartrouter_metrics_manager.go Outdated
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch from e333270 to 4391ff2 Compare March 19, 2026 10:59
Comment thread protocol/rpcsmartrouter/rpcsmartrouter_server.go
Comment thread protocol/rpcconsumer/rpcconsumer_server.go
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch from 4391ff2 to c8e59d6 Compare March 19, 2026 12:48
Comment thread protocol/metrics/consumer_metrics_manager.go
Comment thread docs/metrics/smartrouter_metrics.md
Comment thread protocol/rpcconsumer/rpcconsumer_server.go
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch 4 times, most recently from 2b2ac61 to b35b1d8 Compare March 23, 2026 11:33
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch 2 times, most recently from 3e3b59b to efc11b1 Compare March 23, 2026 11:47
@NadavLevi NadavLevi requested a review from avitenzer March 23, 2026 11:56
Comment thread protocol/rpcconsumer/rpcconsumer_server.go
Comment thread protocol/metrics/provider_metrics.go
Comment thread protocol/metrics/provider_metrics.go
Comment thread protocol/rpcconsumer/rpcconsumer_server.go

@avitenzer avitenzer left a comment

Collaborator


Review comments

@NadavLevi NadavLevi force-pushed the metrics-provider-only branch from efc11b1 to 5833610 Compare March 24, 2026 09:03
@NadavLevi NadavLevi requested a review from avitenzer March 24, 2026 09:04
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch from 5833610 to 53426c1 Compare March 24, 2026 09:09
- Add lava_provider_ prefix to all provider metrics
- Convert latency metrics to HistogramVec (end_to_end, provider, per_function)
- Merge relay/errored counters into single metric with function label
- Add apiInterface label to fetch/block metrics and lava_provider_latest_block
- Remove lava_consumer_QoS and lava_consumer_qos_metrics from provider side
- Update chain_tracker and rpcprovider callers accordingly
- Add provider_metrics.md documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NadavLevi NadavLevi force-pushed the metrics-provider-only branch from 53426c1 to e054c06 Compare March 25, 2026 12:02
@nimrod-teich nimrod-teich merged commit 04c5507 into main Mar 25, 2026
32 checks passed
@nimrod-teich nimrod-teich deleted the metrics-provider-only branch March 25, 2026 12:25

3 participants