Skip to content

fix(metrics): resolve name collisions, add missing prefixes, fix typo…#2237

Merged
nimrod-teich merged 1 commit into
mainfrom
pr/metrics-setup
Mar 19, 2026
Merged

fix(metrics): resolve name collisions, add missing prefixes, fix typo…#2237
nimrod-teich merged 1 commit into
mainfrom
pr/metrics-setup

Conversation

@NadavLevi

@NadavLevi NadavLevi commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

User description

…, and wire optimizer QoS to smart router

  • Rename lava_latest_block → lava_consumer_latest_block / lava_provider_latest_block
  • Rename virtual_epoch → lava_consumer_virtual_epoch / lava_provider_virtual_epoch
  • Add missing lava_consumer_ prefix to 4 metrics: relay_processing_latency_before/after_provider, requests_per_provider, protocol_errors_per_provider
  • Fix label typo: dissconectReason → disconnectReason
  • Implement StartSelectionStatsUpdater in SmartRouterMetricsManager so optimizer QoS reports are periodically written to lava_rpc_endpoint_selection_score; previously a silent no-op caused all optimizer-sampled scores to be dropped
  • Pass OptimizerQoSClient via SmartRouterMetricsManagerOptions and simplify call site in rpcsmartrouter.go

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Align the consumer/provider metrics naming by adding the lava_consumer_/lava_provider_ prefixes, correcting the disconnectReason label typo, and consistently invoking SetWsSubscriptionDisconnectRequestMetric from websocket managers. Wire the optimizer QoS client into SmartRouterMetricsManager so StartSelectionStatsUpdater periodically streams selection scores into lava_rpc_endpoint_selection_score during smart router startup.

TopicDetails
Metrics cleanup Rename consumer/provider metrics via ConsumerMetricsManager and ProviderMetricsManager, fix the disconnectReason label, and ensure websocket disconnect counters route through SetWsSubscriptionDisconnectRequestMetric consistently.
Modified files (5)
  • protocol/chainlib/consumer_ws_subscription_manager.go
  • protocol/metrics/consumer_metrics_manager.go
  • protocol/metrics/consumer_metrics_manager_inf.go
  • protocol/metrics/provider_metrics_manager.go
  • protocol/rpcsmartrouter/direct_ws_subscription_manager.go
Latest Contributors(2)
UserCommitDate
NadavLevifeat-smart-router-impl...March 11, 2026
anna@magmadevs.comfeat-provideroptimizer...February 22, 2026
Optimizer metrics Publish optimizer QoS scores by wiring OptimizerQoSClient into SmartRouterMetricsManagerOptions, invoking StartSelectionStatsUpdater from the smart router start-up, and streaming periodic reports into lava_rpc_endpoint_selection_score.
Modified files (2)
  • protocol/metrics/smartrouter_metrics_manager.go
  • protocol/rpcsmartrouter/rpcsmartrouter.go
Latest Contributors(2)
UserCommitDate
NadavLevifeat-smart-router-impl...March 11, 2026
nimrod.teich@gmail.comfix-make-relay-retry-l...March 01, 2026
This pull request is reviewed by Baz. Review like a pro on (Baz).

@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Fix metric collisions, add prefixes, wire optimizer QoS to smart router

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Resolve metric name collisions by adding consumer/provider prefixes
  - Rename lava_latest_blocklava_consumer_latest_block / lava_provider_latest_block
  - Rename virtual_epochlava_consumer_virtual_epoch / lava_provider_virtual_epoch
  - Add missing lava_consumer_ prefix to 4 metrics
• Fix label typo: dissconectReasondisconnectReason
• Implement StartSelectionStatsUpdater in SmartRouterMetricsManager
  - Periodically writes optimizer QoS reports to metrics
  - Prevents silent no-op that dropped optimizer-sampled scores
• Remove unused provider selection score gauges from consumer metrics
• Wire OptimizerQoSClient through SmartRouterMetricsManagerOptions
Diagram
flowchart LR
  A["Metric Name Collisions"] -->|"Add prefixes"| B["Consumer/Provider Metrics"]
  C["Unused Score Gauges"] -->|"Remove"| D["Cleaner Consumer Metrics"]
  E["OptimizerQoSClient"] -->|"Pass via Options"| F["SmartRouterMetricsManager"]
  F -->|"Implement Updater"| G["Periodic QoS Score Updates"]
  H["Label Typo"] -->|"Fix"| I["Correct disconnectReason"]
Loading

Grey Divider

File Changes

1. protocol/metrics/consumer_metrics_manager.go 🐞 Bug fix +11/-60

Add metric prefixes, fix typo, remove unused gauges

• Rename lava_latest_block to lava_consumer_latest_block and virtual_epoch to
 lava_consumer_virtual_epoch
• Add missing lava_consumer_ prefix to relay processing latency, requests per provider, and
 protocol errors metrics
• Fix typo in websocket disconnect metric label: dissconectReasondisconnectReason
• Remove unused provider selection score gauges (availability, latency, sync, stake, composite)
• Simplify SetProviderSelected method to only extract selected provider's composite score

protocol/metrics/consumer_metrics_manager.go


2. protocol/metrics/provider_metrics_manager.go 🐞 Bug fix +2/-2

Add provider prefix to block and epoch metrics

• Rename lava_latest_block to lava_provider_latest_block
• Rename virtual_epoch to lava_provider_virtual_epoch

protocol/metrics/provider_metrics_manager.go


3. protocol/metrics/smartrouter_metrics_manager.go ✨ Enhancement +29/-3

Implement optimizer QoS periodic stats updater

• Add optimizerQoSClient field to SmartRouterMetricsManager struct
• Add OptimizerQoSClient parameter to SmartRouterMetricsManagerOptions
• Implement StartSelectionStatsUpdater method with periodic ticker
• Periodically fetch optimizer QoS reports and update endpoint selection score metrics

protocol/metrics/smartrouter_metrics_manager.go


View more (1)
4. protocol/rpcsmartrouter/rpcsmartrouter.go ✨ Enhancement +4/-7

Wire optimizer QoS client to metrics manager

• Pass OptimizerQoSClient to SmartRouterMetricsManager via options
• Simplify call site by removing conditional check before calling StartSelectionStatsUpdater
• Update formatting for consistency in SmartRouterMetricsManagerOptions initialization

protocol/rpcsmartrouter/rpcsmartrouter.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Mar 16, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (1) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Non-positive ticker panics 🐞 Bug ⛯ Reliability
Description
SmartRouterMetricsManager.StartSelectionStatsUpdater calls time.NewTicker(updateInterval) without
validating updateInterval, so a 0/negative OptimizerQosServerSamplingInterval will panic and crash
the smart router. The interval is user-configurable via a CLI flag, so this is a runtime footgun
rather than a theoretical edge case.
Code

protocol/metrics/smartrouter_metrics_manager.go[R822-828]

+func (m *SmartRouterMetricsManager) StartSelectionStatsUpdater(ctx context.Context, updateInterval time.Duration) {
+	if m == nil || m.optimizerQoSClient == nil {
+		return
+	}
+	go func() {
+		ticker := time.NewTicker(updateInterval)
+		defer ticker.Stop()
Evidence
StartSelectionStatsUpdater unconditionally creates a ticker from the passed duration; Go tickers
panic on non-positive durations. The duration comes from a DurationVar flag (user input) and is
passed into StartSelectionStatsUpdater at startup, so misconfiguration leads to an immediate crash
loop.

protocol/metrics/smartrouter_metrics_manager.go[822-829]
protocol/rpcsmartrouter/rpcsmartrouter.go[244-246]
protocol/rpcsmartrouter/rpcsmartrouter.go[1458-1462]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`SmartRouterMetricsManager.StartSelectionStatsUpdater` creates `time.NewTicker(updateInterval)` without checking that `updateInterval > 0`. Since `OptimizerQosServerSamplingInterval` is user-configurable via a flag, setting it to `0` or a negative duration will panic and crash the smart router.

### Issue Context
The smart router calls `StartSelectionStatsUpdater(ctx, metrics.OptimizerQosServerSamplingInterval)` during startup. The interval is set via a `DurationVar` flag, so it can be misconfigured.

### Fix Focus Areas
- protocol/metrics/smartrouter_metrics_manager.go[822-845]
- protocol/rpcsmartrouter/rpcsmartrouter.go[244-246]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1458-1462]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. OptimizerQoSClient uses concrete type 📘 Rule violation ✓ Correctness
Description
SmartRouterMetricsManagerOptions exposes a concrete *ConsumerOptimizerQoSClient dependency,
expanding the exported API surface and preventing interface-based injection. This conflicts with the
guideline to depend on interfaces and minimize exported APIs.
Code

protocol/metrics/smartrouter_metrics_manager.go[R107-112]

// SmartRouterMetricsManagerOptions contains configuration for the metrics manager
type SmartRouterMetricsManagerOptions struct {
-	NetworkAddress  string
-	StartHTTPServer bool // If false, only register metrics (for use alongside ConsumerMetricsManager)
+	NetworkAddress      string
+	StartHTTPServer     bool // If false, only register metrics (for use alongside ConsumerMetricsManager)
+	OptimizerQoSClient *ConsumerOptimizerQoSClient
}
Evidence
Compliance ID 2 requires depending on interfaces and minimizing exports. The PR adds an exported
options field OptimizerQoSClient typed as the concrete *ConsumerOptimizerQoSClient, even though
the manager only needs the GetReportsToSend() behavior.

AGENTS.md
protocol/metrics/smartrouter_metrics_manager.go[107-112]
protocol/metrics/smartrouter_metrics_manager.go[822-841]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`SmartRouterMetricsManagerOptions` adds an exported field `OptimizerQoSClient` typed as `*ConsumerOptimizerQoSClient`. This hard-codes a concrete dependency into the exported API surface, making testing and alternate implementations harder.

## Issue Context
`StartSelectionStatsUpdater` only needs `GetReportsToSend()`, so this dependency can be expressed as a narrow interface (or a function callback) to comply with interface-based dependency guidance and minimize exported API coupling.

## Fix Focus Areas
- protocol/metrics/smartrouter_metrics_manager.go[91-112]
- protocol/metrics/smartrouter_metrics_manager.go[822-845]
- protocol/rpcsmartrouter/rpcsmartrouter.go[233-246]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Provider score metrics removed 🐞 Bug ✧ Quality
Description
ConsumerMetricsManager no longer registers/updates the lava_consumer_provider_*_score gauges, but
the repository monitoring docs still instruct users to query lava_consumer_provider_composite_score.
This will break the documented PromQL queries (and likely dashboards/alerts) immediately after
deployment.
Code

protocol/metrics/consumer_metrics_manager.go[L879-883]

-		pme.providerAvailabilityScoreGauge.WithLabelValues(chainId, scores.ProviderAddress).Set(scores.Availability)
-		pme.providerLatencyScoreGauge.WithLabelValues(chainId, scores.ProviderAddress).Set(scores.Latency)
-		pme.providerSyncScoreGauge.WithLabelValues(chainId, scores.ProviderAddress).Set(scores.Sync)
-		pme.providerStakeScoreGauge.WithLabelValues(chainId, scores.ProviderAddress).Set(scores.Stake)
-		pme.providerCompositeScoreGauge.WithLabelValues(chainId, scores.ProviderAddress).Set(scores.Composite)
Evidence
The metrics manager no longer registers any of the provider score gauges, and SetProviderSelected no
longer emits score-per-provider gauges. Meanwhile, scripts/monitoring/README.md explicitly
references the removed metric name in example queries, creating a guaranteed mismatch between docs
and actual exported metrics.

protocol/metrics/consumer_metrics_manager.go[360-391]
protocol/metrics/consumer_metrics_manager.go[822-886]
scripts/monitoring/README.md[69-76]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Consumer provider score gauge metrics (`lava_consumer_provider_*_score`) have been removed from the consumer metrics manager, but in-repo monitoring documentation still references at least `lava_consumer_provider_composite_score`. This breaks documented PromQL queries and likely any dashboards/alerts based on those metrics.

### Issue Context
The PR removes creation/registration/updating of provider score gauges and changes `SetProviderSelected` behavior accordingly. The monitoring README still recommends querying the removed metric names.

### Fix Focus Areas
- protocol/metrics/consumer_metrics_manager.go[360-437]
- protocol/metrics/consumer_metrics_manager.go[822-886]
- scripts/monitoring/README.md[59-100]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread protocol/metrics/smartrouter_metrics_manager.go
@github-actions

github-actions Bot commented Mar 16, 2026

Copy link
Copy Markdown

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 59f5951. ± Comparison against base commit 28c9192.

♻️ This comment has been updated with latest results.

@codecov

codecov Bot commented Mar 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 18.75000% with 39 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/metrics/smartrouter_metrics_manager.go 0.00% 24 Missing ⚠️
...l/rpcsmartrouter/direct_ws_subscription_manager.go 0.00% 4 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter.go 0.00% 4 Missing ⚠️
...tocol/chainlib/consumer_ws_subscription_manager.go 33.33% 2 Missing ⚠️
protocol/metrics/consumer_metrics_manager.go 80.00% 2 Missing ⚠️
protocol/metrics/provider_metrics_manager.go 0.00% 2 Missing ⚠️
protocol/metrics/consumer_metrics_manager_inf.go 0.00% 1 Missing ⚠️
Flag Coverage Δ
consensus 8.71% <ø> (ø)
protocol 33.56% <18.75%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/metrics/consumer_metrics_manager_inf.go 9.37% <0.00%> (ø)
...tocol/chainlib/consumer_ws_subscription_manager.go 60.20% <33.33%> (ø)
protocol/metrics/consumer_metrics_manager.go 39.72% <80.00%> (-2.41%) ⬇️
protocol/metrics/provider_metrics_manager.go 0.00% <0.00%> (ø)
...l/rpcsmartrouter/direct_ws_subscription_manager.go 32.71% <0.00%> (ø)
protocol/rpcsmartrouter/rpcsmartrouter.go 2.94% <0.00%> (ø)
protocol/metrics/smartrouter_metrics_manager.go 0.00% <0.00%> (ø)

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@NadavLevi NadavLevi force-pushed the pr/metrics-setup branch 2 times, most recently from e64f292 to 6898996 Compare March 17, 2026 09:27
Comment thread protocol/metrics/smartrouter_metrics_manager.go

@avitenzer avitenzer left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Label rename dissconectReason → disconnectReason

Comment thread protocol/metrics/consumer_metrics_manager.go
Comment thread protocol/metrics/consumer_metrics_manager.go
Comment thread protocol/metrics/consumer_metrics_manager.go
…, and wire optimizer QoS to smart router

- Rename lava_latest_block → lava_consumer_latest_block / lava_provider_latest_block
- Rename virtual_epoch → lava_consumer_virtual_epoch / lava_provider_virtual_epoch
- Add missing lava_consumer_ prefix to 4 metrics:
  relay_processing_latency_before/after_provider, requests_per_provider, protocol_errors_per_provider
- Fix label typo: dissconectReason → disconnectReason
- Implement StartSelectionStatsUpdater in SmartRouterMetricsManager so optimizer
  QoS reports are periodically written to lava_rpc_endpoint_selection_score;
  previously a silent no-op caused all optimizer-sampled scores to be dropped
- Pass OptimizerQoSClient via SmartRouterMetricsManagerOptions and simplify
  call site in rpcsmartrouter.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread protocol/metrics/consumer_metrics_manager.go
@nimrod-teich nimrod-teich merged commit 81d0277 into main Mar 19, 2026
32 checks passed
@nimrod-teich nimrod-teich deleted the pr/metrics-setup branch March 19, 2026 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants