fix: Re-block backup by avitenzer · Pull Request #2265 · lavanet/lava

avitenzer · 2026-04-12T15:12:26Z

fix(lavasession): restore blocked backup providers to active pool on recovery

Backup providers that failed and were blocked would never recover — they had no re-blocking tracking across epoch transitions and were silently skipped by all recovery paths (checkAndUnblockHealthyReBlockedProviders,
GenerateReconnectCallback) which only operated on the regular pairing list.

Adds blockedBackupProviders set to mirror the regular provider blocked list, wired into all four lifecycle stages: block on failure, filter during selection, re-block at epoch transition, and unblock on successful probe.

qodo-code-review · 2026-04-12T15:12:44Z

Review Summary by Qodo

Fix backup provider re-blocking across epoch transitions

🐞 Bug fix

Walkthroughs

Description

• Add blockedBackupProviders tracking to prevent blocked backup providers from recovering silently
• Preserve blocked backup providers across epoch transitions for health-checking
• Filter blocked backup providers during selection to prevent stale provider usage
• Unblock backup providers on successful probe in reconnect and health-check callbacks

Diagram

flowchart LR
  A["Backup Provider Fails"] --> B["Add to blockedBackupProviders"]
  B --> C["Preserve at Epoch Transition"]
  C --> D["Re-block in New Epoch"]
  D --> E["Filter During Selection"]
  E --> F["Health Check Probe"]
  F --> G["Unblock on Success"]

File Changes

1. protocol/lavasession/consumer_session_manager.go 🐞 Bug fix +77/-7

Implement backup provider re-blocking lifecycle management

• Added blockedBackupProviders map field to track blocked backup providers separately from regular
 providers
• Preserve blocked backup providers across epoch transitions in UpdateAllProviders method
• Re-block backup providers that were blocked in previous epoch if they still exist in new backup
 list
• Filter blocked backup providers during selection in
 getValidConsumerSessionsWithProviderFromBackup
• Track backup provider blocking in blockProvider method when provider not found in valid
 addresses
• Unblock backup providers on successful probe in GenerateReconnectCallback method
• Update checkAndUnblockHealthyReBlockedProviders to handle both regular and backup providers with
 separate unblocking logic

protocol/lavasession/consumer_session_manager.go

qodo-code-review · 2026-04-12T15:12:45Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 🖥 UI issues (0) 🎨 UX Issues (0)

🐞\ ≡ Correctness (1) ☼ Reliability (2)

1. Backups unblocked without probe 🐞 ≡

Description

checkAndUnblockHealthyReBlockedProviders unblocks backup providers when they are not in
reportedProviders, but UpdateAllProviders only probes the main pairingList and resets
reportedProviders. This means re-blocked backup providers can be removed from blockedBackupProviders
at epoch transition without any actual health check.

Code

protocol/lavasession/consumer_session_manager.go[R1911-1927]

+	// First pass: Identify which re-blocked providers had successful probes.
+	// This covers both regular providers (in csm.pairing) and backup providers (in csm.backupProviders).
	providersNeedingComprehensiveProbe := make(map[string]*ConsumerSessionsWithProvider)

	for blockedAddr := range csm.previousEpochBlockedProviders {
+		// Look up in regular pairing first, then fall back to backup providers.
		cswp, exists := csm.pairing[blockedAddr]
+		isBackup := false
+		if !exists {
+			cswp, exists = csm.backupProviders[blockedAddr]
+			isBackup = exists
+		}
		if !exists {
-			continue // Provider not in current pairing
+			continue // Provider not in current pairing or backup list
		}

		// Check if provider is in reported providers

Evidence
UpdateAllProviders resets reportedProviders, then probes only pairingList (not backupProviderList)
before running checkAndUnblockHealthyReBlockedProviders. The unblocking logic for both regular and
backup providers uses reportedProviders.IsReported(...) as a proxy for probe failure/success, but
backup providers aren’t part of that probe pass, so they will typically appear “healthy” (not
reported) immediately after the reset and get unblocked.
protocol/lavasession/consumer_session_manager.go[153-163]
protocol/lavasession/consumer_session_manager.go[219-223]
protocol/lavasession/consumer_session_manager.go[454-473]
protocol/lavasession/consumer_session_manager.go[1909-1959]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`checkAndUnblockHealthyReBlockedProviders` uses `reportedProviders.IsReported(addr)` to infer whether a provider’s initial probe failed this epoch. However, `UpdateAllProviders` only calls `probeProviders(ctx, pairingList, epoch)` and also resets `reportedProviders`, so backup providers (tracked separately) often have no probe signal and are treated as healthy, causing them to be unblocked without verification.

### Issue Context
Backup providers are held in `csm.backupProviders` and are explicitly not part of `validAddresses` selection. The epoch-transition flow should keep previously-blocked backups blocked until they *actually* pass a probe (or at least until a probe attempt has been made).

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[153-163]
- protocol/lavasession/consumer_session_manager.go[219-223]
- protocol/lavasession/consumer_session_manager.go[454-473]
- protocol/lavasession/consumer_session_manager.go[1909-1959]

### Suggested fix
Implement one of the following:
1) Probe backup providers as part of the epoch transition:
  - Call `csm.probeProviders(ctx, backupProviderList, epoch)` in the deferred goroutine (or probe only the re-blocked backup subset).
  - Ensure probe failures for backups are recorded in `reportedProviders` (via existing `probeProvider -> blockProvider -> ReportProvider` path).

2) Or, make `checkAndUnblockHealthyReBlockedProviders` explicitly probe backup providers (and/or all re-blocked providers) instead of relying on `reportedProviders` as a proxy:
  - For `isBackup==true`, run `probeProvider(..., tryReconnect=false)` and only `delete(csm.blockedBackupProviders, addr)` on success; otherwise keep it blocked (and optionally report it).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Second-chance never unblocks backups 🐞 ☼

Description

When allowSecondChance=true, blockProvider’s delayed second-chance path calls
validateAndReturnBlockedProviderToValidAddressesList, which only unblocks providers tracked in
currentlyBlockedProviderAddresses. Backup providers are blocked via blockedBackupProviders, so they
can remain blocked even after the second-chance timer fires.

Code

protocol/lavasession/consumer_session_manager.go[R1561-1567]

+			// Backup providers are never in validAddresses; track them in a separate blocked set.
+			if _, isBackup := csm.backupProviders[address]; isBackup {
+				if csm.blockedBackupProviders == nil {
+					csm.blockedBackupProviders = make(map[string]struct{})
+				}
+				csm.blockedBackupProviders[address] = struct{}{}
+				utils.LavaFormatInfo("Blocked backup provider", utils.LogAttr("address", address), utils.LogAttr("GUID", ctx))

Evidence
blockProvider adds backup addresses to blockedBackupProviders when removeAddressFromValidAddresses
fails. But the second-chance goroutine always calls
validateAndReturnBlockedProviderToValidAddressesList(address), and that function only moves
addresses between currentlyBlockedProviderAddresses and validAddresses, never touching
blockedBackupProviders. OnSessionFailure can set allowSecondChance=true and calls blockProvider with
reconnectCallback=nil, so there’s no alternate callback path to clear blockedBackupProviders for
this flow.
protocol/lavasession/consumer_session_manager.go[1538-1550]
protocol/lavasession/consumer_session_manager.go[1525-1572]
protocol/lavasession/consumer_session_manager.go[1645-1662]
protocol/lavasession/consumer_session_manager.go[1680-1683]
protocol/lavasession/consumer_session_manager.go[1701-1723]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The second-chance recovery path unblocks only regular providers. When a backup provider is blocked into `blockedBackupProviders`, the delayed second-chance goroutine doesn’t remove it, so the provider can remain blocked despite the second-chance mechanism.

### Issue Context
`allowSecondChance` can be enabled by `OnSessionFailure` and calls `blockProvider(..., allowSecondChance=true, reconnectCallback=nil, ...)`, so the only intended “automatic” recovery here is the second-chance goroutine.

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[1538-1550]
- protocol/lavasession/consumer_session_manager.go[1525-1572]
- protocol/lavasession/consumer_session_manager.go[1680-1683]
- protocol/lavasession/consumer_session_manager.go[1701-1723]

### Suggested fix
In the second-chance goroutine, decide whether the address is blocked as a backup and clear the appropriate structure:
- Under `csm.lock`, if `address` exists in `blockedBackupProviders`, `delete(csm.blockedBackupProviders, address)`.
- Else call `validateAndReturnBlockedProviderToValidAddressesList(address)` (existing behavior).

Optionally, also consider preventing `allowSecondChance` for backups if that behavior is not desired, but then ensure there is a separate recovery path for backups within the same epoch.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Reconnect misclassifies overlap 🐞 ☼

Description

GenerateReconnectCallback chooses the unblock path based on whether the address is present in
csm.backupProviders at callback time. Because UpdateAllProviders builds pairing and backupProviders
from separate lists without preventing overlap, a regular blocked provider could be treated as
backup and never returned to validAddresses.

Code

protocol/lavasession/consumer_session_manager.go[R1869-1877]

+			csm.lock.Lock()
+			if _, isBackup := csm.backupProviders[providerAddress]; isBackup {
+				// Backup providers are tracked in blockedBackupProviders, not currentlyBlockedProviderAddresses
+				delete(csm.blockedBackupProviders, providerAddress)
+				csm.lock.Unlock()
+			} else {
+				csm.lock.Unlock()
+				csm.validateAndReturnBlockedProviderToValidAddressesList(providerAddress)
+			}

Evidence
UpdateAllProviders populates csm.pairing and csm.backupProviders independently.
GenerateReconnectCallback uses only current membership in csm.backupProviders to decide whether to
delete from blockedBackupProviders vs returning to validAddresses; if an address exists in both
maps, it can take the wrong unblock path for a regular block.
protocol/lavasession/consumer_session_manager.go[231-237]
protocol/lavasession/consumer_session_manager.go[269-273]
protocol/lavasession/consumer_session_manager.go[1863-1880]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The reconnect callback decides backup-vs-regular based on `csm.backupProviders[providerAddress]` at the time the callback runs. If provider lists overlap or change (including same-epoch updates), a regular provider can be misclassified and not returned to `validAddresses`.

### Issue Context
The unblock decision should be based on *how the provider is currently blocked* (in `currentlyBlockedProviderAddresses` vs `blockedBackupProviders`), not on list membership.

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[1863-1880]
- protocol/lavasession/consumer_session_manager.go[231-237]
- protocol/lavasession/consumer_session_manager.go[269-273]

### Suggested fix
In `GenerateReconnectCallback` (and similarly in the comprehensive-unblock path), decide the unblock operation by checking:
- If `providerAddress` exists in `blockedBackupProviders`, delete it there.
- Else, call `validateAndReturnBlockedProviderToValidAddressesList(providerAddress)`.

Optionally add a guard in `UpdateAllProviders` to prevent the same address from being inserted into both `pairing` and `backupProviders` unless explicitly intended.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ The new review experience is currently in Beta. Learn more

qodo-code-review · 2026-04-12T15:17:18Z

+	// First pass: Identify which re-blocked providers had successful probes.
+	// This covers both regular providers (in csm.pairing) and backup providers (in csm.backupProviders).
 	providersNeedingComprehensiveProbe := make(map[string]*ConsumerSessionsWithProvider)

 	for blockedAddr := range csm.previousEpochBlockedProviders {
+		// Look up in regular pairing first, then fall back to backup providers.
 		cswp, exists := csm.pairing[blockedAddr]
+		isBackup := false
+		if !exists {
+			cswp, exists = csm.backupProviders[blockedAddr]
+			isBackup = exists
+		}
 		if !exists {
-			continue // Provider not in current pairing
+			continue // Provider not in current pairing or backup list
 		}

 		// Check if provider is in reported providers


1. Backups unblocked without probe 🐞 Bug ≡ Correctness

checkAndUnblockHealthyReBlockedProviders unblocks backup providers when they are not in reportedProviders, but UpdateAllProviders only probes the main pairingList and resets reportedProviders. This means re-blocked backup providers can be removed from blockedBackupProviders at epoch transition without any actual health check.

Agent Prompt

### Issue description `checkAndUnblockHealthyReBlockedProviders` uses `reportedProviders.IsReported(addr)` to infer whether a provider’s initial probe failed this epoch. However, `UpdateAllProviders` only calls `probeProviders(ctx, pairingList, epoch)` and also resets `reportedProviders`, so backup providers (tracked separately) often have no probe signal and are treated as healthy, causing them to be unblocked without verification. ### Issue Context Backup providers are held in `csm.backupProviders` and are explicitly not part of `validAddresses` selection. The epoch-transition flow should keep previously-blocked backups blocked until they *actually* pass a probe (or at least until a probe attempt has been made). ### Fix Focus Areas - protocol/lavasession/consumer_session_manager.go[153-163] - protocol/lavasession/consumer_session_manager.go[219-223] - protocol/lavasession/consumer_session_manager.go[454-473] - protocol/lavasession/consumer_session_manager.go[1909-1959] ### Suggested fix Implement one of the following: 1) Probe backup providers as part of the epoch transition: - Call `csm.probeProviders(ctx, backupProviderList, epoch)` in the deferred goroutine (or probe only the re-blocked backup subset). - Ensure probe failures for backups are recorded in `reportedProviders` (via existing `probeProvider -> blockProvider -> ReportProvider` path). 2) Or, make `checkAndUnblockHealthyReBlockedProviders` explicitly probe backup providers (and/or all re-blocked providers) instead of relying on `reportedProviders` as a proxy: - For `isBackup==true`, run `probeProvider(..., tryReconnect=false)` and only `delete(csm.blockedBackupProviders, addr)` on success; otherwise keep it blocked (and optionally report it).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

codecov · 2026-04-12T15:18:18Z

Codecov Report

❌ Patch coverage is 18.00000% with 41 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
protocol/lavasession/consumer_session_manager.go	18.00%	37 Missing and 4 partials ⚠️

Flag	Coverage Δ
consensus	`8.96% <ø> (ø)`
protocol	`34.67% <18.00%> (-0.05%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
protocol/lavasession/consumer_session_manager.go	`60.01% <18.00%> (-1.80%)`	⬇️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-04-12T15:32:48Z

Test Results

0 tests ±0 0 ✅ ±0 0s ⏱️ ±0s
0 suites ±0 0 💤 ±0
7 files ±0 0 ❌ ±0

Results for commit 9bc1aa2. ± Comparison against base commit bec08b8.

Ports the GenerateReconnectCallback change from #2265 on top of #2257: when the periodic reconnect probe succeeds for a blocked backup provider, remove it from blockedBackupProviders instead of calling validateAndReturnBlockedProviderToValidAddressesList (a no-op for backups, which are not in validAddresses). Without this, periodic reconnection could never recover a blocked backup outside the epoch-transition path. Also expands the test suite: - TestCheckAndUnblock_BackupRoutedToComprehensiveProbe covers the !isBackup && !IsReported guard: confirms backups never take the immediate-unblock branch (which would not touch blockedBackupProviders and would silently stall the recovery). - TestCheckAndUnblock_BackupUnblockedWhenHealthy covers the positive path — a blocked backup with a reachable endpoint is unblocked via comprehensive probe at epoch transition. - TestGenerateReconnectCallback_BackupProviderUnblocked covers the new backup branch added by this commit. - TestGenerateReconnectCallback_NonBackupUsesValidAddressesPath guards against regressing the existing non-backup reconnect flow. Co-authored-by: avitenzer <avitenzer@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(lavasession): track and persist blocked backup provider state across epochs Backup providers could never be blocked: blockProvider was a no-op for them since they are not in validAddresses, and the backup selection path never checked any blocked list. Additionally, the epoch-transition re-blocking and health-probe logic only covered the main pairing, leaving previously-blocked backup providers with a clean slate. - Add blockedBackupProviders map to ConsumerSessionManager - blockProvider now adds backup providers to blockedBackupProviders when they are not found in validAddresses - getValidConsumerSessionsWithProviderFromBackupProviderList skips providers in blockedBackupProviders - UpdateAllProviders merges blockedBackupProviders into previousEpochBlockedProviders and re-blocks them in the new epoch - checkAndUnblockHealthyReBlockedProviders handles backup providers in both probe passes, including cross-role transitions (normal→backup) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(lavasession): handle backup providers in GenerateReconnectCallback Ports the GenerateReconnectCallback change from #2265 on top of #2257: when the periodic reconnect probe succeeds for a blocked backup provider, remove it from blockedBackupProviders instead of calling validateAndReturnBlockedProviderToValidAddressesList (a no-op for backups, which are not in validAddresses). Without this, periodic reconnection could never recover a blocked backup outside the epoch-transition path. Also expands the test suite: - TestCheckAndUnblock_BackupRoutedToComprehensiveProbe covers the !isBackup && !IsReported guard: confirms backups never take the immediate-unblock branch (which would not touch blockedBackupProviders and would silently stall the recovery). - TestCheckAndUnblock_BackupUnblockedWhenHealthy covers the positive path — a blocked backup with a reachable endpoint is unblocked via comprehensive probe at epoch transition. - TestGenerateReconnectCallback_BackupProviderUnblocked covers the new backup branch added by this commit. - TestGenerateReconnectCallback_NonBackupUsesValidAddressesPath guards against regressing the existing non-backup reconnect flow. Co-authored-by: avitenzer <avitenzer@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): reset endpoint health metric on epoch transition PR #2256 resets endpoint.ResetHealth() for both primary and backup providers on epoch transition, but only the in-memory struct was cleared — the lava_rpc_endpoint_overall_health Prometheus gauge stayed stuck at 0 (unhealthy) forever. The metric is only toggled on state transitions from relay outcomes (MarkUnhealthy/ResetHealth in the hot path), and backups rarely receive the successful relay needed to push the gauge back to 1. Net effect: operators see backups at 0% uptime on dashboards even though the router considers them healthy again after the epoch reset, which can mask real regressions and (worse) cause failover logic that reads the gauge to refuse routing to a provider that is in fact available. Emits SetEndpointOverallHealth(..., true) alongside the struct reset for both the primary and backup provider loops in updateEpoch, mirrored via rpsr.rpcServers[chainKey].smartRouterEndpointMetrics. Also adds TestUpdateEpoch_ResetsHealthMetric which seeds the gauge to 0 (via SetEndpointOverallHealth) for a primary and a backup, runs updateEpoch, then reads lava_rpc_endpoint_overall_health back from prometheus.DefaultGatherer and asserts both have transitioned to 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore(metrics): remove unused ResetEndpointMetrics ResetEndpointMetrics had no callers and, despite its name, set endpoint_overall_health to 0 instead of 1 — a footgun if ever wired up. The two actual consumers (relay hot path and epoch transition) already call SetEndpointOverallHealth with an explicit bool, so there's no need for a helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: populate per-endpoint metrics for backup providers Two independent gaps combined to leave backup providers showing N/A latest-block on the dashboard: 1. GetAllDirectRPCEndpoints (consumer_session_manager.go) iterated only csm.pairing, so the smart router's initializeChainTrackers startup path skipped every backup. A ChainTracker for a dedicated-URL backup (e.g. base.lava.build with no primary sharing that URL) was only ever created lazily via the relay hot path — which backups rarely reach. Extended the function to iterate csm.backupProviders as well. 2. urlToProviderName was a map[string]string, so multiple providers configured on the same upstream URL collapsed to a single name (last writer wins). The ChainTracker's OnNewBlock callback keys emissions by URL and resolves to a provider name via this map — so every URL-keyed metric emission reached exactly one provider's label, leaving its peers stuck at zero on dashboards. Changed the map to map[string][]string with append+dedup in RegisterEndpoint, added resolveProviderNames returning all providers for a URL, and updated the two URL-keyed emitters (SetEndpointLatestBlock, RecordBlockFetch) to iterate and emit per provider. resolveProviderName is kept for callers that already pass a provider name (relay hot path) — it now returns the first name. Tests: - TestGetAllDirectRPCEndpoints_IncludesBackupProviders — primary and backup both appear in the startup iteration. - TestResolveProviderNames_FansOutAcrossSharedURL — both providers registered at the same URL resolve together. - TestResolveProviderNames_DeduplicatesSameProvider — repeated (URL, providerName) registrations (typical for providers whose node-urls lists the same URL twice) don't duplicate entries. - TestResolveProviderNames_FallsBackToInput — unknown inputs (provider names) round-trip as a single-element slice for backward compatibility with callers that pass names directly. - TestSetEndpointLatestBlock_FansOutAcrossSharedURL — one URL-keyed call populates the gauge for every provider sharing the URL. - TestRecordBlockFetch_FansOutAcrossSharedURL — same guarantee for the chain-tracker success/fail counters. The test files in protocol/metrics that constructed SmartRouterMetricsManager literals were updated to the new field name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): implement CustomMessage on EndpointChainFetcher for Solana SVMChainTracker (protocol/chaintracker/svm_chain_tracker.go:55) fetches latest block via CustomMessage with a getLatestBlockhash JSON-RPC body — it needs the slot, block hash, and block height returned together, which has no equivalent in the generic FetchLatestBlockNum path. EndpointChainFetcher.CustomMessage was a stub returning "CustomMessage not supported for EndpointChainFetcher". On every Solana-family chain this caused the per-endpoint ChainTracker to fail its first fetch — no OnNewBlock callbacks ever fired, so the metric fan-out added earlier in this branch had nothing to fan out and every per-endpoint metric for backups remained at N/A. Delegate to the existing sendRawRequest helper: for POST (the SVM case) the `data` argument is the JSON-RPC body and sendRawRequest posts it through the direct RPC connection; for GET the `data` argument is already used as the URL suffix per existing REST convention, so the same delegation preserves GET semantics. Tests: - TestEndpointChainFetcher_CustomMessage_POSTDelegatesToConnection exercises the SVMChainTracker code path — asserts the upstream response body flows back through CustomMessage unchanged. - TestEndpointChainFetcher_CustomMessage_PropagatesUnhealthyConnection asserts the connection-health check still fails closed rather than silently returning empty data. Also fixes a latent compile-break in the package-shared mockDirectRPCConnection: its GetNodeUrl signature was returning interface{} while the lavasession.DirectRPCConnection interface requires *common.NodeUrl. The existing tracker tests never passed the mock through the interface, so it compiled on its own — but any new test using it as a DirectRPCConnection (including this one) hit the mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): force blocksToSave=1 for Solana per-endpoint trackers With CustomMessage now working for EndpointChainFetcher, the SVM ChainTracker path made it past its initial getLatestBlockhash call on Solana — only to die seconds later with "ChainTracker stopped with error: slot not found in cache". Root cause is a pre-existing limitation in SVMChainTracker: fetchLatestBlockNumInner populates the blockNum→slot cache for the latest block only (svm_chain_tracker.go:70), but the generic ChainTracker init loop (chain_tracker.go readHashes) iterates latestBlock down to latestBlock - blocksToSave + 1 calling FetchBlockHashByNum for each. On SVM every lookup after the first hits the cache-miss path and returns "slot not found in cache", killing the tracker permanently (StartAndServe returns the error and the goroutine exits). For per-endpoint tracking we don't need history — each tracker watches a single URL, so there's no cross-endpoint fork detection to do; the manager only uses GetLatestBlockNum via ValidateEndpointSync and the OnNewBlock callback for per-provider metric emission. Forcing blocksToSave=1 on Solana-family chains (SOLANA, SOLANAT, KOII, KOIIT) sidesteps the SVMChainTracker limitation entirely without losing any capability the manager actually uses. EVM chains keep their caller-requested value. Test: TestEndpointChainTrackerManager_ForcesBlocksToSave1ForSolana asserts the override fires for every SVM family id and leaves non-SVM chains unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(lavasession): meaningful IsHealthy for HTTPDirectRPCConnection HTTPDirectRPCConnection.IsHealthy was hard-coded to return true, with a comment deferring health tracking to "the endpoint/QoS layer". But the comprehensive probe path used by checkAndUnblockHealthyReBlockedProviders (probeProvider → probeDirectRPCEndpoints for static providers) only consults conn.IsHealthy() — so for static HTTP/HTTPS backups the probe was a guaranteed pass regardless of reachability, and a blocked backup with an unreachable upstream would be silently unblocked at every epoch transition. Two complementary changes close this: 1. HTTPDirectRPCConnection grows an atomic healthy flag (initial true), flipped to false by transport-level failures in SendRequest and DoHTTPRequest (dial / TLS / body-read errors), flipped back to true on any successful exchange — including HTTP 4xx/5xx, which are *application* errors, not transport failures. Flipping on 4xx/5xx would make rate-limited endpoints flap unhealthy and starve the dashboard on routine server errors; the test TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy guards that carve-out explicitly. 2. probeDirectRPCEndpoints now also gates on endpoint.Enabled. This belt-and-suspenders with IsHealthy(): the endpoint struct accumulates relay-path failures via MarkUnhealthy → ConnectionRefusals ≥ MaxConsecutiveConnectionAttempts → Enabled=false. Reading that state here means a backup that the hot path has already declared dead won't be optimistically revived by a probe that happens to catch a lucky moment in IsHealthy's lifecycle. Known gap (flagged in the IsHealthy comment): a never-hit backup with an unreachable upstream still reports healthy on the very first epoch because nothing has exercised either signal yet. Closing that cold-start case requires an active probe request (e.g. chain-specific no-op via the ChainParser), which is a meaningfully bigger change and belongs in a separate follow-up. Tests: - TestHTTPDirectRPCConnection_IsHealthy_StartsTrue — documents the optimistic-initialization behavior. - TestHTTPDirectRPCConnection_IsHealthy_FlipsOnDialFailure — the core behavior: a transport error must flip IsHealthy to false. - TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy — guards against overreach on application-level HTTP errors. - TestHTTPDirectRPCConnection_IsHealthy_RecoversAfterFailure — after upstream recovery, a successful exchange must restore IsHealthy=true. - TestProbeDirectRPCEndpoints_RespectsDisabledEndpoint — the endpoint.Enabled gate blocks a disabled endpoint from passing the probe even when IsHealthy happens to say true. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): cap /debug/time-warp body at 1 KiB The JSON decoder was reading r.Body unbounded, letting a caller stream an arbitrarily large payload into an endpoint whose only legitimate content is {"offset_seconds": N}. Wrap the body in http.MaxBytesReader(w, r.Body, 1024) before decoding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(errors): honor Retryable flag on node-error retry decision Node errors classified to a LavaError with Retryable=false (e.g. CHAIN_EXECUTION_REVERTED, CHAIN_OUT_OF_GAS, CHAIN_DOUBLE_SPEND) were retried because the state machine only short-circuited on IsUnsupportedMethod / IsUserError subcategories and ignored the registry's Retryable flag. Populate RelayResult.IsNonRetryable from the classifier on both consumer and smart-router paths and make HasNonRetryableUserFacingErrors rest solely on that flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: avitenzer <avitenzer@users.noreply.github.com>

Fix: Re-block backup

9bc1aa2

pull-request-size Bot added the size/M label Apr 12, 2026

github-actions Bot added C:protocol Team:Protocol labels Apr 12, 2026

qodo-code-review Bot reviewed Apr 12, 2026

View reviewed changes

NadavLevi closed this Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Re-block backup#2265

fix: Re-block backup#2265
avitenzer wants to merge 1 commit into
mainfrom
fix/backup-providers

avitenzer commented Apr 12, 2026

Uh oh!

qodo-code-review Bot commented Apr 12, 2026

Uh oh!

qodo-code-review Bot commented Apr 12, 2026 •

edited

Loading

Uh oh!

qodo-code-review Bot Apr 12, 2026

Uh oh!

codecov Bot commented Apr 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

avitenzer commented Apr 12, 2026

Uh oh!

qodo-code-review Bot commented Apr 12, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

qodo-code-review Bot Apr 12, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Apr 12, 2026

Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

qodo-code-review Bot commented Apr 12, 2026 •

edited

Loading

codecov Bot commented Apr 12, 2026 •

edited

Loading