fix: backup providers issues by NadavLevi · Pull Request #2266 · lavanet/lava

NadavLevi · 2026-04-12T19:34:45Z

Description

Closes: #XXXX

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

read the contribution guide
included the correct type prefix in the PR title, you can find examples of the prefixes below:
confirmed ! in the type prefix if API or client breaking change
targeted the main branch
provided a link to the relevant issue or specification
reviewed "Files changed" and left comments if necessary
included the necessary unit and integration tests
updated the relevant documentation or specification, including comments for documenting Go code
confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

confirmed the correct type prefix in the PR title
confirmed all author checklist items have been addressed
reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

…oss epochs Backup providers could never be blocked: blockProvider was a no-op for them since they are not in validAddresses, and the backup selection path never checked any blocked list. Additionally, the epoch-transition re-blocking and health-probe logic only covered the main pairing, leaving previously-blocked backup providers with a clean slate. - Add blockedBackupProviders map to ConsumerSessionManager - blockProvider now adds backup providers to blockedBackupProviders when they are not found in validAddresses - getValidConsumerSessionsWithProviderFromBackupProviderList skips providers in blockedBackupProviders - UpdateAllProviders merges blockedBackupProviders into previousEpochBlockedProviders and re-blocks them in the new epoch - checkAndUnblockHealthyReBlockedProviders handles backup providers in both probe passes, including cross-role transitions (normal→backup) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

qodo-code-review · 2026-04-12T19:35:07Z

Review Summary by Qodo

Fix backup provider blocking, metrics fan-out, and Solana chain tracker initialization

🐞 Bug fix ✨ Enhancement

Walkthroughs

Description

• Track and persist blocked backup provider state across epochs
  - Add blockedBackupProviders map to track backup providers blocked this epoch
  - Merge blocked backups into previousEpochBlockedProviders at epoch transition
  - Re-block backup providers that were blocked in previous epoch if still in backup list
  - Skip blocked backups in getValidConsumerSessionsWithProviderFromBackupProviderList
• Handle backup providers in reconnection and health-check recovery paths
  - Update GenerateReconnectCallback to unblock backups via blockedBackupProviders
  - Extend checkAndUnblockHealthyReBlockedProviders to probe and unblock backup providers
  - Add comprehensive test coverage for backup provider blocking lifecycle
• Fix metrics fan-out for shared endpoint URLs across multiple providers
  - Change urlToProviderName map to urlToProviderNames slice to support multiple providers per URL
  - Update SetEndpointLatestBlock and RecordBlockFetch to emit metrics for all providers on shared
  URL
  - Add deduplication logic in RegisterEndpoint to prevent duplicate provider entries
• Fix endpoint health metric reset at epoch transitions for backups
  - Call SetEndpointOverallHealth during updateEpoch to reset Prometheus gauge alongside struct
  reset
  - Ensure backup providers with stuck unhealthy metrics recover at epoch boundary
• Enable Solana chain tracker per-endpoint metrics via CustomMessage implementation
  - Implement CustomMessage in EndpointChainFetcher to support SVMChainTracker getLatestBlockhash
  calls
  - Force blocksToSave=1 for Solana-family chains to avoid SVMChainTracker slot-cache misses
  - Include backup providers in GetAllDirectRPCEndpoints for startup ChainTracker initialization

Diagram

flowchart LR
  A["Backup Provider Blocked"] --> B["Added to blockedBackupProviders"]
  B --> C["Skipped in Selection"]
  B --> D["Merged to previousEpochBlockedProviders"]
  D --> E["Re-blocked in New Epoch"]
  E --> F["Comprehensive Probe"]
  F --> G["Unblocked if Healthy"]
  
  H["Shared Endpoint URL"] --> I["Multiple Providers Registered"]
  I --> J["urlToProviderNames Map"]
  J --> K["Fan-out Metrics to All Providers"]
  
  L["Epoch Transition"] --> M["Reset Endpoint Health Struct"]
  M --> N["Reset Prometheus Gauge"]
  N --> O["Backup Metrics Recover"]
  
  P["Solana Chain Tracker"] --> Q["CustomMessage Implementation"]
  Q --> R["Per-endpoint Metrics Enabled"]
  R --> S["Backup URLs in Startup Init"]

File Changes

1. protocol/lavasession/consumer_session_manager.go 🐞 Bug fix +107/-42

Track and persist blocked backup provider state

protocol/lavasession/consumer_session_manager.go

2. protocol/lavasession/consumer_session_manager_test.go 🧪 Tests +377/-0

Comprehensive test coverage for backup provider blocking

protocol/lavasession/consumer_session_manager_test.go

3. protocol/metrics/smartrouter_metrics_manager.go 🐞 Bug fix +63/-38

Fix metrics fan-out for shared endpoint URLs

protocol/metrics/smartrouter_metrics_manager.go

View more (14)

4. protocol/metrics/cache_metrics_test.go 🧪 Tests +1/-1

Update test helper for URL to provider names mapping

protocol/metrics/cache_metrics_test.go

5. protocol/metrics/cross_validation_metrics_test.go 🧪 Tests +1/-1

Update test helper for URL to provider names mapping

protocol/metrics/cross_validation_metrics_test.go

6. protocol/metrics/incident_consistency_metrics_test.go 🧪 Tests +1/-1

Update test helper for URL to provider names mapping

protocol/metrics/incident_consistency_metrics_test.go

7. protocol/metrics/incident_error_metrics_test.go 🧪 Tests +1/-1

Update test helper for URL to provider names mapping

protocol/metrics/incident_error_metrics_test.go

8. protocol/metrics/incident_hedge_metrics_test.go 🧪 Tests +1/-1

Update test helper for URL to provider names mapping

protocol/metrics/incident_hedge_metrics_test.go

9. protocol/metrics/incident_retry_metrics_test.go 🧪 Tests +1/-1

Update test helper for URL to provider names mapping

protocol/metrics/incident_retry_metrics_test.go

10. protocol/metrics/request_group_metrics_test.go 🧪 Tests +1/-1

Update test helper for URL to provider names mapping

protocol/metrics/request_group_metrics_test.go

11. protocol/metrics/url_fanout_test.go 🧪 Tests +137/-0

New test suite for URL fan-out metrics behavior

protocol/metrics/url_fanout_test.go

12. protocol/rpcsmartrouter/endpoint_chain_fetcher.go ✨ Enhancement +13/-4

Implement CustomMessage for Solana chain tracker support

protocol/rpcsmartrouter/endpoint_chain_fetcher.go

13. protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go 🧪 Tests +99/-0

Test CustomMessage and Solana chain tracker initialization

protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go

14. protocol/rpcsmartrouter/endpoint_chain_tracker_manager.go 🐞 Bug fix +17/-0

Force blocksToSave=1 for Solana-family chains

protocol/rpcsmartrouter/endpoint_chain_tracker_manager.go

15. protocol/rpcsmartrouter/endpoint_chain_tracker_test.go 🧪 Tests +2/-1

Update mock connection interface for Solana support

protocol/rpcsmartrouter/endpoint_chain_tracker_test.go

16. protocol/rpcsmartrouter/rpcsmartrouter.go 🐞 Bug fix +25/-0

Reset endpoint health metrics at epoch transitions

protocol/rpcsmartrouter/rpcsmartrouter.go

17. protocol/rpcsmartrouter/rpcsmartrouter_test.go 🧪 Tests +129/-0

Test health metric reset for primary and backup providers

protocol/rpcsmartrouter/rpcsmartrouter_test.go

qodo-code-review · 2026-04-12T19:35:08Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. ~~HTTP probe unblocks down backups~~ ☑ 🐞 ☼

Description

checkAndUnblockHealthyReBlockedProviders unblocks re-blocked backup providers when probeProvider
succeeds, but for StaticProvider endpoints probeProvider only consults DirectRPCConnection.IsHealthy
and HTTPDirectRPCConnection.IsHealthy always returns true, so unreachable HTTP/HTTPS backups can be
unblocked immediately at epoch transition.

Code

protocol/lavasession/consumer_session_manager.go[R1964-1984]

+	for blockedAddr, info := range providersNeedingComprehensiveProbe {
utils.LavaFormatDebug("Attempting comprehensive probe with endpoint reconnection",
utils.Attribute{Key: "provider", Value: blockedAddr},
+			utils.Attribute{Key: "isBackup", Value: info.isBackup},
utils.Attribute{Key: "epoch", Value: epoch},
utils.LogAttr("GUID", ctx),
)
-		// Probe with tryReconnect=TRUE - retry disabled endpoints
-		_, providerAddress, err := csm.probeProvider(
-			ctx,
-			cswp,
-			epoch,
-			true, // tryReconnect=TRUE: Comprehensive probe, retries disabled endpoints
-		)
+		_, providerAddress, err := csm.probeProvider(ctx, info.cswp, epoch, true)
if err == nil {
-			// Comprehensive probe succeeded! Provider recovered, unblock immediately
utils.LavaFormatInfo("Re-blocked provider's comprehensive probe succeeded, immediately unblocking",
utils.Attribute{Key: "provider", Value: providerAddress},
+				utils.Attribute{Key: "isBackup", Value: info.isBackup},
utils.Attribute{Key: "epoch", Value: epoch},
utils.LogAttr("GUID", ctx),
)
-			csm.validateAndReturnBlockedProviderToValidAddressesList(providerAddress)
+			if info.isBackup {
+				csm.lock.Lock()
+				delete(csm.blockedBackupProviders, providerAddress)
+				csm.lock.Unlock()

Evidence
Backups are routed into the comprehensive probe path, which calls probeProvider and, on success,
deletes the provider from blockedBackupProviders. For StaticProvider entries, probeProvider
delegates to probeDirectRPCEndpoints, which treats conn.IsHealthy() as the health signal. For
HTTP/HTTPS direct connections, IsHealthy is hardcoded to true, so the comprehensive probe will
report “healthy” regardless of actual reachability and unblock the backup.
protocol/lavasession/consumer_session_manager.go[1888-1989]
protocol/lavasession/consumer_session_manager.go[499-507]
protocol/lavasession/consumer_session_manager.go[591-665]
protocol/lavasession/direct_rpc_connection.go[342-344]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Backup providers that are **StaticProvider** and use **HTTP/HTTPS** direct connections can be unblocked during epoch transition even when the upstream is still unreachable, because the "probe" path ultimately relies on `HTTPDirectRPCConnection.IsHealthy()` which always returns `true`.
### Issue Context
`checkAndUnblockHealthyReBlockedProviders` runs a comprehensive probe for backups and unblocks them when `probeProvider` returns `nil` error. For static providers, `probeProvider` delegates to `probeDirectRPCEndpoints`, which uses `conn.IsHealthy()` as its only health signal. For HTTP/HTTPS, `IsHealthy()` is currently a constant `true`.
### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[499-507]
- protocol/lavasession/consumer_session_manager.go[591-665]
- protocol/lavasession/direct_rpc_connection.go[342-344]
### Suggested fix
- Make the comprehensive probe for **static HTTP/HTTPS** actually reflect endpoint reachability, e.g.:
- Option A (preferred): add real health-state tracking to `HTTPDirectRPCConnection` (store an atomic healthy flag updated on request failures/successes) and have `IsHealthy()` return that.
- Option B: in `probeDirectRPCEndpoints`, treat an endpoint as healthy only if the endpoint itself is enabled/not refused (e.g. consult `endpoint.Enabled` / `endpoint.ConnectionRefusals`) and/or perform a lightweight request with a tight timeout.
- Ensure `checkAndUnblockHealthyReBlockedProviders` does not delete from `blockedBackupProviders` unless the probe is based on a meaningful health signal for HTTP/HTTPS.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. ~~Reconnect leaves primary blocked~~ ☑ 🐞 ≡

Description

GenerateReconnectCallback treats any provider found in backupProviders as a backup and only deletes
blockedBackupProviders, so if a provider address exists in both pairing and backupProviders and was
blocked as a primary, a successful reconnect will not remove it from
currentlyBlockedProviderAddresses/restore it to validAddresses.

Code

protocol/lavasession/consumer_session_manager.go[R1871-1879]

+			csm.lock.Lock()
+			if _, isBackup := csm.backupProviders[providerAddress]; isBackup {
+				// Backup providers are tracked in blockedBackupProviders, not currentlyBlockedProviderAddresses.
+				delete(csm.blockedBackupProviders, providerAddress)
+				csm.lock.Unlock()
+			} else {
+				csm.lock.Unlock()
+				csm.validateAndReturnBlockedProviderToValidAddressesList(providerAddress)
+			}

Evidence

The reconnect callback chooses the unblock path by checking only
csm.backupProviders[providerAddress]. If true, it skips
validateAndReturnBlockedProviderToValidAddressesList, which is the routine that actually removes
an address from currentlyBlockedProviderAddresses and appends it back into validAddresses.
UpdateAllProviders populates pairing and backupProviders from independent lists without enforcing
disjointness, so the overlap scenario is possible and would strand the provider in the
blocked-primary list.

protocol/lavasession/consumer_session_manager.go[1865-1879]
protocol/lavasession/consumer_session_manager.go[1704-1715]
protocol/lavasession/consumer_session_manager.go[233-273]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`GenerateReconnectCallback` decides whether a provider is a backup by checking `csm.backupProviders` only. If a provider address exists in both `csm.pairing` and `csm.backupProviders` and it was blocked as a primary (in `currentlyBlockedProviderAddresses`), a successful reconnect will **not** restore it to `validAddresses`.
### Issue Context
- Primary unblocking is done by `validateAndReturnBlockedProviderToValidAddressesList*`, which operates on `currentlyBlockedProviderAddresses`.
- The current reconnect callback bypasses that path whenever the address exists in `backupProviders`.
- `UpdateAllProviders` builds `pairing` and `backupProviders` from separate inputs without dedup/exclusion.
### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[1865-1879]
- protocol/lavasession/consumer_session_manager.go[1704-1715]
- protocol/lavasession/consumer_session_manager.go[233-273]
### Suggested fix
Inside the reconnect callback, decide the unblock behavior using **pairing-first** logic, or unblock both structures when overlap exists:
- If `providerAddress` exists in `csm.pairing` (or is present in `currentlyBlockedProviderAddresses`), call `validateAndReturnBlockedProviderToValidAddressesList(providerAddress)`.
- Separately, if it exists in `csm.backupProviders`, delete it from `blockedBackupProviders`.
This makes the callback correct even when pairing/backup sets overlap.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. ~~Nil listenEndpoint dereference~~ ☑ 🐞 ☼

Description

RPCSmartRouter.updateEpoch dereferences server.listenEndpoint.ChainID/ApiInterface without checking
listenEndpoint for nil, which can panic if an RPCSmartRouterServer is registered without a
listenEndpoint.

Code

protocol/rpcsmartrouter/rpcsmartrouter.go[R1614-1617]

+		if server, exists := rpsr.rpcServers[chainKey]; exists && server != nil {
+			epochMetrics = server.smartRouterEndpointMetrics
+			epochChainID = server.listenEndpoint.ChainID
+			epochApiInterface = server.listenEndpoint.ApiInterface

Evidence
updateEpoch checks that server != nil but not server.listenEndpoint != nil before accessing
fields. RPCSmartRouterServer defines listenEndpoint as a pointer, so a nil value will panic at epoch
transitions.
protocol/rpcsmartrouter/rpcsmartrouter.go[1612-1618]
protocol/rpcsmartrouter/rpcsmartrouter_server.go[48-53]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`RPCSmartRouter.updateEpoch` reads `server.listenEndpoint.ChainID` and `.ApiInterface` without verifying `server.listenEndpoint` is non-nil, which can cause a panic.
### Issue Context
`RPCSmartRouterServer.listenEndpoint` is a `*lavasession.RPCEndpoint` pointer.
### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[1612-1618]
- protocol/rpcsmartrouter/rpcsmartrouter_server.go[48-53]
### Suggested fix
Add a nil check:
- Only read ChainID/ApiInterface when `server.listenEndpoint != nil`.
- Otherwise leave `epochMetrics` nil (or log/debug) so the epoch transition still proceeds without metrics reset rather than crashing.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ The new review experience is currently in Beta. Learn more

codecov · 2026-04-12T19:40:30Z

Codecov Report

❌ Patch coverage is 70.81081% with 54 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
protocol/rpcconsumer/rpcconsumer_server.go	0.00%	20 Missing ⚠️
protocol/common/error_classifier.go	67.74%	10 Missing ⚠️
protocol/metrics/smartrouter_metrics_manager.go	77.41%	6 Missing and 1 partial ⚠️
protocol/rpcsmartrouter/direct_rpc_relay.go	22.22%	6 Missing and 1 partial ⚠️
protocol/lavasession/consumer_session_manager.go	92.64%	3 Missing and 2 partials ⚠️
protocol/lavasession/direct_rpc_connection.go	60.00%	4 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter_server.go	0.00%	1 Missing ⚠️

Flag	Coverage Δ
consensus	`8.98% <67.74%> (+0.01%)`	⬆️
protocol	`35.18% <70.81%> (+0.47%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
protocol/common/endpoints.go	`0.00% <ø> (ø)`
protocol/relaycore/relay_processor.go	`57.02% <100.00%> (+0.89%)`	⬆️
protocol/rpcsmartrouter/endpoint_chain_fetcher.go	`8.88% <100.00%> (+8.88%)`	⬆️
...l/rpcsmartrouter/endpoint_chain_tracker_manager.go	`50.90% <100.00%> (+0.60%)`	⬆️
protocol/rpcsmartrouter/rpcsmartrouter.go	`8.26% <100.00%> (+1.44%)`	⬆️
protocol/rpcsmartrouter/rpcsmartrouter_server.go	`13.35% <0.00%> (-0.01%)`	⬇️
protocol/lavasession/direct_rpc_connection.go	`27.63% <60.00%> (+5.54%)`	⬆️
protocol/lavasession/consumer_session_manager.go	`67.81% <92.64%> (+6.00%)`	⬆️
protocol/metrics/smartrouter_metrics_manager.go	`27.43% <77.41%> (+5.99%)`	⬆️
protocol/rpcsmartrouter/direct_rpc_relay.go	`52.89% <22.22%> (-0.72%)`	⬇️
... and 2 more

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-04-12T19:55:01Z

Test Results

0 tests ±0 0 ✅ ±0 0s ⏱️ ±0s
0 suites ±0 0 💤 ±0
7 files ±0 0 ❌ ±0

Results for commit c9f19da. ± Comparison against base commit bec08b8.

♻️ This comment has been updated with latest results.

Ports the GenerateReconnectCallback change from #2265 on top of #2257: when the periodic reconnect probe succeeds for a blocked backup provider, remove it from blockedBackupProviders instead of calling validateAndReturnBlockedProviderToValidAddressesList (a no-op for backups, which are not in validAddresses). Without this, periodic reconnection could never recover a blocked backup outside the epoch-transition path. Also expands the test suite: - TestCheckAndUnblock_BackupRoutedToComprehensiveProbe covers the !isBackup && !IsReported guard: confirms backups never take the immediate-unblock branch (which would not touch blockedBackupProviders and would silently stall the recovery). - TestCheckAndUnblock_BackupUnblockedWhenHealthy covers the positive path — a blocked backup with a reachable endpoint is unblocked via comprehensive probe at epoch transition. - TestGenerateReconnectCallback_BackupProviderUnblocked covers the new backup branch added by this commit. - TestGenerateReconnectCallback_NonBackupUsesValidAddressesPath guards against regressing the existing non-backup reconnect flow. Co-authored-by: avitenzer <avitenzer@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR #2256 resets endpoint.ResetHealth() for both primary and backup providers on epoch transition, but only the in-memory struct was cleared — the lava_rpc_endpoint_overall_health Prometheus gauge stayed stuck at 0 (unhealthy) forever. The metric is only toggled on state transitions from relay outcomes (MarkUnhealthy/ResetHealth in the hot path), and backups rarely receive the successful relay needed to push the gauge back to 1. Net effect: operators see backups at 0% uptime on dashboards even though the router considers them healthy again after the epoch reset, which can mask real regressions and (worse) cause failover logic that reads the gauge to refuse routing to a provider that is in fact available. Emits SetEndpointOverallHealth(..., true) alongside the struct reset for both the primary and backup provider loops in updateEpoch, mirrored via rpsr.rpcServers[chainKey].smartRouterEndpointMetrics. Also adds TestUpdateEpoch_ResetsHealthMetric which seeds the gauge to 0 (via SetEndpointOverallHealth) for a primary and a backup, runs updateEpoch, then reads lava_rpc_endpoint_overall_health back from prometheus.DefaultGatherer and asserts both have transitioned to 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ResetEndpointMetrics had no callers and, despite its name, set endpoint_overall_health to 0 instead of 1 — a footgun if ever wired up. The two actual consumers (relay hot path and epoch transition) already call SetEndpointOverallHealth with an explicit bool, so there's no need for a helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Two independent gaps combined to leave backup providers showing N/A latest-block on the dashboard: 1. GetAllDirectRPCEndpoints (consumer_session_manager.go) iterated only csm.pairing, so the smart router's initializeChainTrackers startup path skipped every backup. A ChainTracker for a dedicated-URL backup (e.g. base.lava.build with no primary sharing that URL) was only ever created lazily via the relay hot path — which backups rarely reach. Extended the function to iterate csm.backupProviders as well. 2. urlToProviderName was a map[string]string, so multiple providers configured on the same upstream URL collapsed to a single name (last writer wins). The ChainTracker's OnNewBlock callback keys emissions by URL and resolves to a provider name via this map — so every URL-keyed metric emission reached exactly one provider's label, leaving its peers stuck at zero on dashboards. Changed the map to map[string][]string with append+dedup in RegisterEndpoint, added resolveProviderNames returning all providers for a URL, and updated the two URL-keyed emitters (SetEndpointLatestBlock, RecordBlockFetch) to iterate and emit per provider. resolveProviderName is kept for callers that already pass a provider name (relay hot path) — it now returns the first name. Tests: - TestGetAllDirectRPCEndpoints_IncludesBackupProviders — primary and backup both appear in the startup iteration. - TestResolveProviderNames_FansOutAcrossSharedURL — both providers registered at the same URL resolve together. - TestResolveProviderNames_DeduplicatesSameProvider — repeated (URL, providerName) registrations (typical for providers whose node-urls lists the same URL twice) don't duplicate entries. - TestResolveProviderNames_FallsBackToInput — unknown inputs (provider names) round-trip as a single-element slice for backward compatibility with callers that pass names directly. - TestSetEndpointLatestBlock_FansOutAcrossSharedURL — one URL-keyed call populates the gauge for every provider sharing the URL. - TestRecordBlockFetch_FansOutAcrossSharedURL — same guarantee for the chain-tracker success/fail counters. The test files in protocol/metrics that constructed SmartRouterMetricsManager literals were updated to the new field name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…for Solana SVMChainTracker (protocol/chaintracker/svm_chain_tracker.go:55) fetches latest block via CustomMessage with a getLatestBlockhash JSON-RPC body — it needs the slot, block hash, and block height returned together, which has no equivalent in the generic FetchLatestBlockNum path. EndpointChainFetcher.CustomMessage was a stub returning "CustomMessage not supported for EndpointChainFetcher". On every Solana-family chain this caused the per-endpoint ChainTracker to fail its first fetch — no OnNewBlock callbacks ever fired, so the metric fan-out added earlier in this branch had nothing to fan out and every per-endpoint metric for backups remained at N/A. Delegate to the existing sendRawRequest helper: for POST (the SVM case) the `data` argument is the JSON-RPC body and sendRawRequest posts it through the direct RPC connection; for GET the `data` argument is already used as the URL suffix per existing REST convention, so the same delegation preserves GET semantics. Tests: - TestEndpointChainFetcher_CustomMessage_POSTDelegatesToConnection exercises the SVMChainTracker code path — asserts the upstream response body flows back through CustomMessage unchanged. - TestEndpointChainFetcher_CustomMessage_PropagatesUnhealthyConnection asserts the connection-health check still fails closed rather than silently returning empty data. Also fixes a latent compile-break in the package-shared mockDirectRPCConnection: its GetNodeUrl signature was returning interface{} while the lavasession.DirectRPCConnection interface requires *common.NodeUrl. The existing tracker tests never passed the mock through the interface, so it compiled on its own — but any new test using it as a DirectRPCConnection (including this one) hit the mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ckers With CustomMessage now working for EndpointChainFetcher, the SVM ChainTracker path made it past its initial getLatestBlockhash call on Solana — only to die seconds later with "ChainTracker stopped with error: slot not found in cache". Root cause is a pre-existing limitation in SVMChainTracker: fetchLatestBlockNumInner populates the blockNum→slot cache for the latest block only (svm_chain_tracker.go:70), but the generic ChainTracker init loop (chain_tracker.go readHashes) iterates latestBlock down to latestBlock - blocksToSave + 1 calling FetchBlockHashByNum for each. On SVM every lookup after the first hits the cache-miss path and returns "slot not found in cache", killing the tracker permanently (StartAndServe returns the error and the goroutine exits). For per-endpoint tracking we don't need history — each tracker watches a single URL, so there's no cross-endpoint fork detection to do; the manager only uses GetLatestBlockNum via ValidateEndpointSync and the OnNewBlock callback for per-provider metric emission. Forcing blocksToSave=1 on Solana-family chains (SOLANA, SOLANAT, KOII, KOIIT) sidesteps the SVMChainTracker limitation entirely without losing any capability the manager actually uses. EVM chains keep their caller-requested value. Test: TestEndpointChainTrackerManager_ForcesBlocksToSave1ForSolana asserts the override fires for every SVM family id and leaves non-SVM chains unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

HTTPDirectRPCConnection.IsHealthy was hard-coded to return true, with a comment deferring health tracking to "the endpoint/QoS layer". But the comprehensive probe path used by checkAndUnblockHealthyReBlockedProviders (probeProvider → probeDirectRPCEndpoints for static providers) only consults conn.IsHealthy() — so for static HTTP/HTTPS backups the probe was a guaranteed pass regardless of reachability, and a blocked backup with an unreachable upstream would be silently unblocked at every epoch transition. Two complementary changes close this: 1. HTTPDirectRPCConnection grows an atomic healthy flag (initial true), flipped to false by transport-level failures in SendRequest and DoHTTPRequest (dial / TLS / body-read errors), flipped back to true on any successful exchange — including HTTP 4xx/5xx, which are *application* errors, not transport failures. Flipping on 4xx/5xx would make rate-limited endpoints flap unhealthy and starve the dashboard on routine server errors; the test TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy guards that carve-out explicitly. 2. probeDirectRPCEndpoints now also gates on endpoint.Enabled. This belt-and-suspenders with IsHealthy(): the endpoint struct accumulates relay-path failures via MarkUnhealthy → ConnectionRefusals ≥ MaxConsecutiveConnectionAttempts → Enabled=false. Reading that state here means a backup that the hot path has already declared dead won't be optimistically revived by a probe that happens to catch a lucky moment in IsHealthy's lifecycle. Known gap (flagged in the IsHealthy comment): a never-hit backup with an unreachable upstream still reports healthy on the very first epoch because nothing has exercised either signal yet. Closing that cold-start case requires an active probe request (e.g. chain-specific no-op via the ChainParser), which is a meaningfully bigger change and belongs in a separate follow-up. Tests: - TestHTTPDirectRPCConnection_IsHealthy_StartsTrue — documents the optimistic-initialization behavior. - TestHTTPDirectRPCConnection_IsHealthy_FlipsOnDialFailure — the core behavior: a transport error must flip IsHealthy to false. - TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy — guards against overreach on application-level HTTP errors. - TestHTTPDirectRPCConnection_IsHealthy_RecoversAfterFailure — after upstream recovery, a successful exchange must restore IsHealthy=true. - TestProbeDirectRPCEndpoints_RespectsDisabledEndpoint — the endpoint.Enabled gate blocks a disabled endpoint from passing the probe even when IsHealthy happens to say true. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The JSON decoder was reading r.Body unbounded, letting a caller stream an arbitrarily large payload into an endpoint whose only legitimate content is {"offset_seconds": N}. Wrap the body in http.MaxBytesReader(w, r.Body, 1024) before decoding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Node errors classified to a LavaError with Retryable=false (e.g. CHAIN_EXECUTION_REVERTED, CHAIN_OUT_OF_GAS, CHAIN_DOUBLE_SPEND) were retried because the state machine only short-circuited on IsUnsupportedMethod / IsUserError subcategories and ignored the registry's Retryable flag. Populate RelayResult.IsNonRetryable from the classifier on both consumer and smart-router paths and make HasNonRetryableUserFacingErrors rest solely on that flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pull-request-size Bot added the size/XXL label Apr 12, 2026

github-actions Bot added C:protocol Team:Protocol labels Apr 12, 2026

qodo-code-review Bot reviewed Apr 12, 2026

View reviewed changes

Comment thread protocol/lavasession/consumer_session_manager.go

NadavLevi and others added 7 commits April 13, 2026 09:41

NadavLevi force-pushed the fix/backup-providers-combined branch from 520c33c to e03cc62 Compare April 13, 2026 06:43

avitenzer reviewed Apr 13, 2026

View reviewed changes

Comment thread protocol/rpcsmartrouter/rpcsmartrouter.go

NadavLevi requested a review from avitenzer April 13, 2026 09:42

avitenzer previously approved these changes Apr 13, 2026

View reviewed changes

NadavLevi dismissed avitenzer’s stale review via 186d251 April 13, 2026 13:04

NadavLevi requested a review from avitenzer April 13, 2026 13:04

avitenzer previously approved these changes Apr 13, 2026

View reviewed changes

NadavLevi dismissed avitenzer’s stale review via c9f19da April 14, 2026 07:29

NadavLevi force-pushed the fix/backup-providers-combined branch from 186d251 to c9f19da Compare April 14, 2026 07:29

nimrod-teich approved these changes Apr 14, 2026

View reviewed changes

nimrod-teich merged commit 5741773 into main Apr 14, 2026
48 of 49 checks passed

nimrod-teich deleted the fix/backup-providers-combined branch April 14, 2026 08:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: backup providers issues#2266

fix: backup providers issues#2266
nimrod-teich merged 10 commits into
mainfrom
fix/backup-providers-combined

NadavLevi commented Apr 12, 2026

Uh oh!

qodo-code-review Bot commented Apr 12, 2026

Uh oh!

qodo-code-review Bot commented Apr 12, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

github-actions Bot commented Apr 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

NadavLevi commented Apr 12, 2026

Description

Author Checklist

Reviewers Checklist

Uh oh!

qodo-code-review Bot commented Apr 12, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

codecov Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

github-actions Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

qodo-code-review Bot commented Apr 12, 2026 •

edited

Loading

codecov Bot commented Apr 12, 2026 •

edited

Loading

github-actions Bot commented Apr 12, 2026 •

edited

Loading