fix(lavasession): track and persist blocked backup provider state acr… by NadavLevi · Pull Request #2257 · lavanet/lava

NadavLevi · 2026-03-31T10:15:36Z

User description

…oss epochs

Backup providers could never be blocked: blockProvider was a no-op for them since they are not in validAddresses, and the backup selection path never checked any blocked list. Additionally, the epoch-transition re-blocking and health-probe logic only covered the main pairing, leaving previously-blocked backup providers with a clean slate.

Add blockedBackupProviders map to ConsumerSessionManager
blockProvider now adds backup providers to blockedBackupProviders when they are not found in validAddresses
getValidConsumerSessionsWithProviderFromBackupProviderList skips providers in blockedBackupProviders
UpdateAllProviders merges blockedBackupProviders into previousEpochBlockedProviders and re-blocks them in the new epoch
checkAndUnblockHealthyReBlockedProviders handles backup providers in both probe passes, including cross-role transitions (normal→backup)

Description

Closes: #XXXX

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

read the contribution guide
included the correct type prefix in the PR title, you can find examples of the prefixes below:
confirmed ! in the type prefix if API or client breaking change
targeted the main branch
provided a link to the relevant issue or specification
reviewed "Files changed" and left comments if necessary
included the necessary unit and integration tests
updated the relevant documentation or specification, including comments for documenting Go code
confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

confirmed the correct type prefix in the PR title
confirmed all author checklist items have been addressed
reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Track backup providers blocked during failures by adding a blockedBackupProviders set and wiring it into blocking, selection, and epoch updates so their state persists across epochs. Handle health probes for re-blocked backups by extending checkAndUnblockHealthyReBlockedProviders to treat backup entries alongside paired providers and reblock or unblock them appropriately.

Topic

Details

Backup provider blocking

Track blocked backup providers in the blocking, selection, and epoch-update flow so they stay excluded and re-persisted like normal providers, and cover the new behavior with targeted tests.

Modified files (2)

protocol/lavasession/consumer_session_manager.go
protocol/lavasession/consumer_session_manager_test.go

Latest Contributors(2)

User	Commit	Date
NadavLevi	fix(lavasession): prev...	March 26, 2026
anna@magmadevs.com	feat(provideroptimizer...	February 22, 2026

Probe reblocked

Extend the reblocking probe flow to surface backup providers during both probe passes so health checks can unblock them when healthy or keep them blocked when failures persist.

Modified files (1)

protocol/lavasession/consumer_session_manager.go

Latest Contributors(2)

User	Commit	Date
NadavLevi	fix(lavasession): prev...	March 26, 2026
anna@magmadevs.com	feat(provideroptimizer...	February 22, 2026

This pull request is reviewed by Baz. Review like a pro on (Baz).

qodo-code-review · 2026-03-31T10:15:53Z

ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Review Summary by Qodo

Track and persist blocked backup provider state across epochs

🐞 Bug fix

Walkthroughs

Description

• Add blockedBackupProviders map to track backup providers blocked this epoch
• blockProvider now blocks backup providers when not in validAddresses
• getValidConsumerSessionsWithProviderFromBackupProviderList filters blocked backups
• UpdateAllProviders persists blocked backup providers across epoch transitions
• checkAndUnblockHealthyReBlockedProviders handles backup providers in health probes

Diagram

flowchart LR
  A["blockProvider called"] --> B{Provider in validAddresses?}
  B -->|No| C["Check if backup provider"]
  C -->|Yes| D["Add to blockedBackupProviders"]
  C -->|No| E["Log debug message"]
  B -->|Yes| F["Block normally"]
  D --> G["getValidConsumerSessionsWithProviderFromBackupProviderList"]
  G --> H["Skip blocked backups"]
  H --> I["Return eligible backups"]
  J["UpdateAllProviders"] --> K["Merge blockedBackupProviders into previousEpochBlockedProviders"]
  K --> L["Re-block backups in new epoch"]
  M["checkAndUnblockHealthyReBlockedProviders"] --> N["Check both normal and backup providers"]
  N --> O{Probe succeeded?}
  O -->|Yes| P["Unblock from appropriate list"]
  O -->|No| Q["Try comprehensive probe"]

File Changes

1. protocol/lavasession/consumer_session_manager.go 🐞 Bug fix +71/-26

Implement backup provider blocking and health probe tracking

• Added blockedBackupProviders map field to track backup providers blocked this epoch
• Modified blockProvider to add backup providers to blockedBackupProviders when not in
 validAddresses
• Updated getValidConsumerSessionsWithProviderFromBackupProviderList to skip providers in
 blockedBackupProviders
• Enhanced UpdateAllProviders to merge blockedBackupProviders into
 previousEpochBlockedProviders and re-block them in new epoch
• Refactored checkAndUnblockHealthyReBlockedProviders to handle both normal and backup providers
 with new reBlockedProviderInfo struct for cross-role transition support

protocol/lavasession/consumer_session_manager.go

2. protocol/lavasession/consumer_session_manager_test.go 🧪 Tests +130/-0

Add comprehensive tests for blocked backup provider behavior

• Added createBackupProviderList helper function to create test backup provider lists
• Added TestBlockedBackupProviderIsTracked to verify blockProvider adds backup providers to
 blockedBackupProviders
• Added TestBlockedBackupProviderFilteredFromSelection to verify blocked backups are excluded from
 selection
• Added TestBlockedBackupProviderPersistedAcrossEpoch to verify blocked backups persist across
 epoch transitions
• Added TestNormalProviderBlockedAsBackupInNextEpoch to verify cross-role transition handling

protocol/lavasession/consumer_session_manager_test.go

qodo-code-review · 2026-03-31T10:15:54Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (1) 📎 Requirement gaps (0)

1. Backups unblocked without probe 🐞 Bug ≡ Correctness

Description

checkAndUnblockHealthyReBlockedProviders deletes a re-blocked backup provider from
blockedBackupProviders whenever it is not present in reportedProviders, even though
UpdateAllProviders only probes the main pairingList and resets reportedProviders each update. This
makes previously-blocked backup providers appear “probe succeeded” and get unblocked at epoch
transition without ever being probed/reported.

Code

protocol/lavasession/consumer_session_manager.go[R1903-1930]

	for blockedAddr := range csm.previousEpochBlockedProviders {
		cswp, exists := csm.pairing[blockedAddr]
+		isBackup := false
+		if !exists {
+			cswp, exists = csm.backupProviders[blockedAddr]
+			isBackup = true
+		}
		if !exists {
-			continue // Provider not in current pairing
+			continue // Provider not in current pairing or backup list
		}

		// Check if provider is in reported providers
-		// If probe FAILED in this epoch, provider would be in reportedProviders (line 1208)
+		// If probe FAILED in this epoch, provider would be in reportedProviders
		// If probe SUCCEEDED in this epoch, provider would NOT be in reportedProviders
		if !csm.reportedProviders.IsReported(blockedAddr) {
			// Probe succeeded! Provider is healthy, immediately unblock
			utils.LavaFormatInfo("Re-blocked provider's probe succeeded, immediately unblocking",
				utils.Attribute{Key: "provider", Value: blockedAddr},
+				utils.Attribute{Key: "isBackup", Value: isBackup},
				utils.Attribute{Key: "epoch", Value: epoch},
				utils.LogAttr("GUID", ctx),
			)
-			csm.validateAndReturnBlockedProviderToValidAddressesListLocked(blockedAddr)
-
-			// Clean up: Remove from reported providers if it was there from previous epoch
-			// This prevents periodic reconnection attempts from trying this provider again
+			if isBackup {
+				delete(csm.blockedBackupProviders, blockedAddr)
+			} else {
+				csm.validateAndReturnBlockedProviderToValidAddressesListLocked(blockedAddr)
+			}
			csm.reportedProviders.RemoveReport(blockedAddr)

Evidence
UpdateAllProviders’ deferred post-update goroutine probes only the pairingList parameter and then
calls checkAndUnblockHealthyReBlockedProviders; it does not probe backupProviderList. In the same
UpdateAllProviders call, reportedProviders is Reset(), so backup providers will typically not be
reported in the new epoch; checkAndUnblockHealthyReBlockedProviders interprets “not reported” as
“probe succeeded” and deletes backup entries from blockedBackupProviders.
protocol/lavasession/consumer_session_manager.go[154-164]
protocol/lavasession/consumer_session_manager.go[209-222]
protocol/lavasession/consumer_session_manager.go[451-469]
protocol/lavasession/consumer_session_manager.go[1903-1930]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`checkAndUnblockHealthyReBlockedProviders` currently treats `!reportedProviders.IsReported(addr)` as a successful probe and unblocks re-blocked backup providers by deleting them from `blockedBackupProviders`. But `UpdateAllProviders` only probes `pairingList` (not `backupProviderList`) and also resets `reportedProviders`, so backup providers can be unblocked without any probe.

### Issue Context
This undermines the PR goal of persisting backup-provider blocked state across epochs: re-blocking happens, but the async re-block check can immediately clear it.

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[154-164]
- protocol/lavasession/consumer_session_manager.go[451-469]
- protocol/lavasession/consumer_session_manager.go[1903-1970]

### Suggested change
Make the re-block health decision based on an actual probe outcome for backup providers (or for all re-blocked providers), not on `reportedProviders` membership.
Options:
1) In the deferred goroutine in `UpdateAllProviders`, probe `backupProviderList` too (or probe the subset in `previousEpochBlockedProviders` that are backups) before running `checkAndUnblockHealthyReBlockedProviders`.
2) Refactor `checkAndUnblockHealthyReBlockedProviders` to explicitly call `probeProvider(..., tryReconnect=false)` for each re-blocked provider (including backups) and only unblock on `err == nil`; run the comprehensive probe (`tryReconnect=true`) on failures.

Ensure the logic cannot conclude “probe succeeded” for a provider that was never probed in this epoch.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Nil map write panic 🐞 Bug ☼ Reliability

Description

blockProvider writes to csm.blockedBackupProviders when blocking a backup provider that is not in
validAddresses, but NewConsumerSessionManager does not initialize blockedBackupProviders. If
blockProvider is called before the first successful UpdateAllProviders (or after an early-return
error path), this will panic with “assignment to entry in nil map”.

Code

protocol/lavasession/consumer_session_manager.go[R1557-1561]

+			// Address not in validAddresses — check if it's a backup provider and block it there
+			if _, isBackup := csm.backupProviders[address]; isBackup {
+				csm.blockedBackupProviders[address] = struct{}{}
+				utils.LavaFormatInfo("🔒 BLOCKING BACKUP PROVIDER", utils.LogAttr("address", address), utils.LogAttr("GUID", ctx))
+			} else {

Evidence
The new backup-provider blocking path assigns into blockedBackupProviders. The constructor
NewConsumerSessionManager does not initialize blockedBackupProviders (or other maps), so the field
is nil until UpdateAllProviders sets it, making early calls unsafe.
protocol/lavasession/consumer_session_manager.go[1554-1563]
protocol/lavasession/consumer_session_manager.go[1982-2001]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`blockedBackupProviders` can be nil until `UpdateAllProviders` runs, but `blockProvider` now writes to it for backup providers. Writing to a nil map panics.

### Issue Context
Even if the “normal” lifecycle calls `UpdateAllProviders` first, defensive initialization avoids crashes during startup, error paths, or unexpected call ordering.

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[1982-2001]
- protocol/lavasession/consumer_session_manager.go[1554-1563]

### Suggested change
Initialize `blockedBackupProviders: make(map[string]struct{})` in `NewConsumerSessionManager`.
Additionally (optional hardening): in `blockProvider`, before writing, guard with `if csm.blockedBackupProviders == nil { csm.blockedBackupProviders = make(map[string]struct{}) }` under the existing `csm.lock`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. New tests lack underscores 📘 Rule violation ⚙ Maintainability

Description

New unit tests are named without the required TestComponent_Scenario underscore format, reducing
consistency and discoverability. This violates the standardized test naming requirement for
new/modified tests.

Code

protocol/lavasession/consumer_session_manager_test.go[R1686-1796]

+// TestBlockedBackupProviderIsTracked verifies that calling blockProvider for a backup provider
+// adds it to blockedBackupProviders (not silently dropped because it's not in validAddresses).
+func TestBlockedBackupProviderIsTracked(t *testing.T) {
+	csm := CreateConsumerSessionManager()
+	backupList := createBackupProviderList(grpcListener)
+	err := csm.UpdateAllProviders(firstEpochHeight, nil, backupList)
+	require.NoError(t, err)
+
+	backupAddr := backupList[0].PublicLavaAddress
+
+	// Block the backup provider
+	err = csm.blockProvider(context.Background(), backupAddr, false, firstEpochHeight, 0, 0, false, nil, nil)
+	require.NoError(t, err)
+
+	csm.lock.RLock()
+	_, blocked := csm.blockedBackupProviders[backupAddr]
+	csm.lock.RUnlock()
+
+	require.True(t, blocked, "backup provider should be in blockedBackupProviders after blockProvider call")
+}
+
+// TestBlockedBackupProviderFilteredFromSelection verifies that a blocked backup provider is
+// not returned by getValidConsumerSessionsWithProviderFromBackupProviderList.
+func TestBlockedBackupProviderFilteredFromSelection(t *testing.T) {
+	csm := CreateConsumerSessionManager()
+	backupList := createBackupProviderList(grpcListener)
+	err := csm.UpdateAllProviders(firstEpochHeight, nil, backupList)
+	require.NoError(t, err)
+
+	backupAddr := backupList[0].PublicLavaAddress
+
+	// Directly mark it blocked
+	csm.lock.Lock()
+	csm.blockedBackupProviders[backupAddr] = struct{}{}
+	csm.lock.Unlock()
+
+	// Attempting to get backup sessions should fail — no eligible backup providers
+	ignored := &ignoredProviders{providers: make(map[string]struct{}), currentEpoch: firstEpochHeight}
+	_, err = csm.getValidConsumerSessionsWithProviderFromBackupProviderList(
+		context.Background(), ignored, 1, servicedBlockNumber, "", nil, 0, 0, NewUsedProviders(nil),
+	)
+	require.Error(t, err, "blocked backup provider should not be selectable")
+}
+
+// TestBlockedBackupProviderPersistedAcrossEpoch verifies that a backup provider blocked in
+// epoch N is re-blocked in epoch N+1 via previousEpochBlockedProviders.
+func TestBlockedBackupProviderPersistedAcrossEpoch(t *testing.T) {
+	csm := CreateConsumerSessionManager()
+	backupList := createBackupProviderList(grpcListener)
+	err := csm.UpdateAllProviders(firstEpochHeight, nil, backupList)
+	require.NoError(t, err)
+
+	backupAddr := backupList[0].PublicLavaAddress
+
+	// Block it in the first epoch
+	err = csm.blockProvider(context.Background(), backupAddr, false, firstEpochHeight, 0, 0, false, nil, nil)
+	require.NoError(t, err)
+
+	csm.lock.RLock()
+	_, blocked := csm.blockedBackupProviders[backupAddr]
+	csm.lock.RUnlock()
+	require.True(t, blocked, "backup provider should be blocked before epoch transition")
+
+	// Transition to next epoch with the same backup provider
+	err = csm.UpdateAllProviders(secondEpochHeight, nil, backupList)
+	require.NoError(t, err)
+
+	// Should still be blocked in the new epoch
+	csm.lock.RLock()
+	_, stillBlocked := csm.blockedBackupProviders[backupAddr]
+	csm.lock.RUnlock()
+	require.True(t, stillBlocked, "backup provider should remain blocked after epoch transition")
+}
+
+// TestNormalProviderBlockedAsBackupInNextEpoch verifies that a provider blocked as a normal
+// provider in epoch N is re-blocked as a backup provider in epoch N+1 if it moves to the backup list.
+func TestNormalProviderBlockedAsBackupInNextEpoch(t *testing.T) {
+	csm := CreateConsumerSessionManager()
+	pairingList := createPairingList("", true)
+	err := csm.UpdateAllProviders(firstEpochHeight, pairingList, nil)
+	require.NoError(t, err)
+
+	// Block a normal provider
+	normalAddr := pairingList[0].PublicLavaAddress
+	err = csm.blockProvider(context.Background(), normalAddr, false, firstEpochHeight, 0, 0, false, nil, nil)
+	require.NoError(t, err)
+
+	csm.lock.RLock()
+	_, inBlockedList := func() (int, bool) {
+		for _, addr := range csm.currentlyBlockedProviderAddresses {
+			if addr == normalAddr {
+				return 0, true
+			}
+		}
+		return 0, false
+	}()
+	csm.lock.RUnlock()
+	require.True(t, inBlockedList, "normal provider should be in currentlyBlockedProviderAddresses")
+
+	// In next epoch the same provider appears only in the backup list
+	backupList := map[uint64]*ConsumerSessionsWithProvider{
+		0: NewConsumerSessionWithProvider(normalAddr, pairingList[0].Endpoints, 1000, secondEpochHeight, sdk.NewCoin("ulava", sdk.NewInt(100))),
+	}
+	err = csm.UpdateAllProviders(secondEpochHeight, nil, backupList)
+	require.NoError(t, err)
+
+	csm.lock.RLock()
+	_, blockedAsBackup := csm.blockedBackupProviders[normalAddr]
+	csm.lock.RUnlock()
+	require.True(t, blockedAsBackup, "previously-blocked normal provider should be re-blocked as backup in new epoch")
+}

Evidence

PR Compliance ID 5 requires new/modified tests to follow the TestComponent_Scenario naming
convention. The newly-added tests (TestBlockedBackupProviderIsTracked,
TestBlockedBackupProviderFilteredFromSelection, etc.) do not include an underscore separator.

AGENTS.md
protocol/lavasession/consumer_session_manager_test.go[1686-1796]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New tests added in `consumer_session_manager_test.go` do not follow the required `TestComponent_Scenario` naming convention (underscore-separated).

## Issue Context
Compliance requires standardized test naming for new/modified tests to improve consistency and discoverability.

## Fix Focus Areas
- protocol/lavasession/consumer_session_manager_test.go[1686-1796]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

4. Racy epoch-block tests 🐞 Bug ☼ Reliability

Description

The new tests assert blockedBackupProviders immediately after UpdateAllProviders, but
UpdateAllProviders runs a deferred sleep and then asynchronously probes/unblocks providers, mutating
blockedBackupProviders after the call returns. This makes the tests timing-dependent and can hide
the real post-transition state.

Code

protocol/lavasession/consumer_session_manager_test.go[R1730-1758]

+// TestBlockedBackupProviderPersistedAcrossEpoch verifies that a backup provider blocked in
+// epoch N is re-blocked in epoch N+1 via previousEpochBlockedProviders.
+func TestBlockedBackupProviderPersistedAcrossEpoch(t *testing.T) {
+	csm := CreateConsumerSessionManager()
+	backupList := createBackupProviderList(grpcListener)
+	err := csm.UpdateAllProviders(firstEpochHeight, nil, backupList)
+	require.NoError(t, err)
+
+	backupAddr := backupList[0].PublicLavaAddress
+
+	// Block it in the first epoch
+	err = csm.blockProvider(context.Background(), backupAddr, false, firstEpochHeight, 0, 0, false, nil, nil)
+	require.NoError(t, err)
+
+	csm.lock.RLock()
+	_, blocked := csm.blockedBackupProviders[backupAddr]
+	csm.lock.RUnlock()
+	require.True(t, blocked, "backup provider should be blocked before epoch transition")
+
+	// Transition to next epoch with the same backup provider
+	err = csm.UpdateAllProviders(secondEpochHeight, nil, backupList)
+	require.NoError(t, err)
+
+	// Should still be blocked in the new epoch
+	csm.lock.RLock()
+	_, stillBlocked := csm.blockedBackupProviders[backupAddr]
+	csm.lock.RUnlock()
+	require.True(t, stillBlocked, "backup provider should remain blocked after epoch transition")
+}

Evidence
UpdateAllProviders defers a sleep and then launches an async goroutine that calls
checkAndUnblockHealthyReBlockedProviders, which can delete from blockedBackupProviders. The tests
check blockedBackupProviders state right after UpdateAllProviders without synchronizing with that
goroutine.
protocol/lavasession/consumer_session_manager.go[154-164]
protocol/lavasession/consumer_session_manager_test.go[1730-1758]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Tests read `blockedBackupProviders` immediately after `UpdateAllProviders`, but `UpdateAllProviders` triggers async probe/unblock work after a deferred sleep. Assertions can be invalid depending on scheduling.

### Issue Context
These tests should either:
- assert the *pre-async* state deterministically (requires a hook/flag to disable async goroutines in tests), or
- wait for the async epoch-transition work to complete and then assert the expected *post-async* state.

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager.go[154-164]
- protocol/lavasession/consumer_session_manager_test.go[1686-1796]

### Suggested change
Introduce a test-only mechanism to control epoch-transition goroutines (e.g., inject a sleeper/hook, or a `DisableAsyncProbes` flag), or have tests wait on a deterministic signal (e.g., poll until `previousEpochBlockedProviders` is cleared, since `checkAndUnblockHealthyReBlockedProviders` clears it in a deferred cleanup) before asserting final blocked/unblocked state.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ The new review experience is currently in Beta. Learn more

codecov · 2026-03-31T10:21:52Z

Codecov Report

❌ Patch coverage is 89.36170% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
protocol/lavasession/consumer_session_manager.go	89.36%	3 Missing and 2 partials ⚠️

Flag	Coverage Δ
consensus	`8.71% <ø> (ø)`
protocol	`34.14% <89.36%> (+0.12%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
protocol/lavasession/consumer_session_manager.go	`65.00% <89.36%> (+3.19%)`	⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-03-31T10:35:18Z

Test Results

0 tests ±0 0 ✅ ±0 0s ⏱️ ±0s
0 suites ±0 0 💤 ±0
7 files ±0 0 ❌ ±0

Results for commit 630c2a9. ± Comparison against base commit 6b5b690.

♻️ This comment has been updated with latest results.

…oss epochs Backup providers could never be blocked: blockProvider was a no-op for them since they are not in validAddresses, and the backup selection path never checked any blocked list. Additionally, the epoch-transition re-blocking and health-probe logic only covered the main pairing, leaving previously-blocked backup providers with a clean slate. - Add blockedBackupProviders map to ConsumerSessionManager - blockProvider now adds backup providers to blockedBackupProviders when they are not found in validAddresses - getValidConsumerSessionsWithProviderFromBackupProviderList skips providers in blockedBackupProviders - UpdateAllProviders merges blockedBackupProviders into previousEpochBlockedProviders and re-blocks them in the new epoch - checkAndUnblockHealthyReBlockedProviders handles backup providers in both probe passes, including cross-role transitions (normal→backup) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Ports the GenerateReconnectCallback change from #2265 on top of #2257: when the periodic reconnect probe succeeds for a blocked backup provider, remove it from blockedBackupProviders instead of calling validateAndReturnBlockedProviderToValidAddressesList (a no-op for backups, which are not in validAddresses). Without this, periodic reconnection could never recover a blocked backup outside the epoch-transition path. Also expands the test suite: - TestCheckAndUnblock_BackupRoutedToComprehensiveProbe covers the !isBackup && !IsReported guard: confirms backups never take the immediate-unblock branch (which would not touch blockedBackupProviders and would silently stall the recovery). - TestCheckAndUnblock_BackupUnblockedWhenHealthy covers the positive path — a blocked backup with a reachable endpoint is unblocked via comprehensive probe at epoch transition. - TestGenerateReconnectCallback_BackupProviderUnblocked covers the new backup branch added by this commit. - TestGenerateReconnectCallback_NonBackupUsesValidAddressesPath guards against regressing the existing non-backup reconnect flow. Co-authored-by: avitenzer <avitenzer@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(lavasession): track and persist blocked backup provider state across epochs Backup providers could never be blocked: blockProvider was a no-op for them since they are not in validAddresses, and the backup selection path never checked any blocked list. Additionally, the epoch-transition re-blocking and health-probe logic only covered the main pairing, leaving previously-blocked backup providers with a clean slate. - Add blockedBackupProviders map to ConsumerSessionManager - blockProvider now adds backup providers to blockedBackupProviders when they are not found in validAddresses - getValidConsumerSessionsWithProviderFromBackupProviderList skips providers in blockedBackupProviders - UpdateAllProviders merges blockedBackupProviders into previousEpochBlockedProviders and re-blocks them in the new epoch - checkAndUnblockHealthyReBlockedProviders handles backup providers in both probe passes, including cross-role transitions (normal→backup) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(lavasession): handle backup providers in GenerateReconnectCallback Ports the GenerateReconnectCallback change from #2265 on top of #2257: when the periodic reconnect probe succeeds for a blocked backup provider, remove it from blockedBackupProviders instead of calling validateAndReturnBlockedProviderToValidAddressesList (a no-op for backups, which are not in validAddresses). Without this, periodic reconnection could never recover a blocked backup outside the epoch-transition path. Also expands the test suite: - TestCheckAndUnblock_BackupRoutedToComprehensiveProbe covers the !isBackup && !IsReported guard: confirms backups never take the immediate-unblock branch (which would not touch blockedBackupProviders and would silently stall the recovery). - TestCheckAndUnblock_BackupUnblockedWhenHealthy covers the positive path — a blocked backup with a reachable endpoint is unblocked via comprehensive probe at epoch transition. - TestGenerateReconnectCallback_BackupProviderUnblocked covers the new backup branch added by this commit. - TestGenerateReconnectCallback_NonBackupUsesValidAddressesPath guards against regressing the existing non-backup reconnect flow. Co-authored-by: avitenzer <avitenzer@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): reset endpoint health metric on epoch transition PR #2256 resets endpoint.ResetHealth() for both primary and backup providers on epoch transition, but only the in-memory struct was cleared — the lava_rpc_endpoint_overall_health Prometheus gauge stayed stuck at 0 (unhealthy) forever. The metric is only toggled on state transitions from relay outcomes (MarkUnhealthy/ResetHealth in the hot path), and backups rarely receive the successful relay needed to push the gauge back to 1. Net effect: operators see backups at 0% uptime on dashboards even though the router considers them healthy again after the epoch reset, which can mask real regressions and (worse) cause failover logic that reads the gauge to refuse routing to a provider that is in fact available. Emits SetEndpointOverallHealth(..., true) alongside the struct reset for both the primary and backup provider loops in updateEpoch, mirrored via rpsr.rpcServers[chainKey].smartRouterEndpointMetrics. Also adds TestUpdateEpoch_ResetsHealthMetric which seeds the gauge to 0 (via SetEndpointOverallHealth) for a primary and a backup, runs updateEpoch, then reads lava_rpc_endpoint_overall_health back from prometheus.DefaultGatherer and asserts both have transitioned to 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore(metrics): remove unused ResetEndpointMetrics ResetEndpointMetrics had no callers and, despite its name, set endpoint_overall_health to 0 instead of 1 — a footgun if ever wired up. The two actual consumers (relay hot path and epoch transition) already call SetEndpointOverallHealth with an explicit bool, so there's no need for a helper. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: populate per-endpoint metrics for backup providers Two independent gaps combined to leave backup providers showing N/A latest-block on the dashboard: 1. GetAllDirectRPCEndpoints (consumer_session_manager.go) iterated only csm.pairing, so the smart router's initializeChainTrackers startup path skipped every backup. A ChainTracker for a dedicated-URL backup (e.g. base.lava.build with no primary sharing that URL) was only ever created lazily via the relay hot path — which backups rarely reach. Extended the function to iterate csm.backupProviders as well. 2. urlToProviderName was a map[string]string, so multiple providers configured on the same upstream URL collapsed to a single name (last writer wins). The ChainTracker's OnNewBlock callback keys emissions by URL and resolves to a provider name via this map — so every URL-keyed metric emission reached exactly one provider's label, leaving its peers stuck at zero on dashboards. Changed the map to map[string][]string with append+dedup in RegisterEndpoint, added resolveProviderNames returning all providers for a URL, and updated the two URL-keyed emitters (SetEndpointLatestBlock, RecordBlockFetch) to iterate and emit per provider. resolveProviderName is kept for callers that already pass a provider name (relay hot path) — it now returns the first name. Tests: - TestGetAllDirectRPCEndpoints_IncludesBackupProviders — primary and backup both appear in the startup iteration. - TestResolveProviderNames_FansOutAcrossSharedURL — both providers registered at the same URL resolve together. - TestResolveProviderNames_DeduplicatesSameProvider — repeated (URL, providerName) registrations (typical for providers whose node-urls lists the same URL twice) don't duplicate entries. - TestResolveProviderNames_FallsBackToInput — unknown inputs (provider names) round-trip as a single-element slice for backward compatibility with callers that pass names directly. - TestSetEndpointLatestBlock_FansOutAcrossSharedURL — one URL-keyed call populates the gauge for every provider sharing the URL. - TestRecordBlockFetch_FansOutAcrossSharedURL — same guarantee for the chain-tracker success/fail counters. The test files in protocol/metrics that constructed SmartRouterMetricsManager literals were updated to the new field name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): implement CustomMessage on EndpointChainFetcher for Solana SVMChainTracker (protocol/chaintracker/svm_chain_tracker.go:55) fetches latest block via CustomMessage with a getLatestBlockhash JSON-RPC body — it needs the slot, block hash, and block height returned together, which has no equivalent in the generic FetchLatestBlockNum path. EndpointChainFetcher.CustomMessage was a stub returning "CustomMessage not supported for EndpointChainFetcher". On every Solana-family chain this caused the per-endpoint ChainTracker to fail its first fetch — no OnNewBlock callbacks ever fired, so the metric fan-out added earlier in this branch had nothing to fan out and every per-endpoint metric for backups remained at N/A. Delegate to the existing sendRawRequest helper: for POST (the SVM case) the `data` argument is the JSON-RPC body and sendRawRequest posts it through the direct RPC connection; for GET the `data` argument is already used as the URL suffix per existing REST convention, so the same delegation preserves GET semantics. Tests: - TestEndpointChainFetcher_CustomMessage_POSTDelegatesToConnection exercises the SVMChainTracker code path — asserts the upstream response body flows back through CustomMessage unchanged. - TestEndpointChainFetcher_CustomMessage_PropagatesUnhealthyConnection asserts the connection-health check still fails closed rather than silently returning empty data. Also fixes a latent compile-break in the package-shared mockDirectRPCConnection: its GetNodeUrl signature was returning interface{} while the lavasession.DirectRPCConnection interface requires *common.NodeUrl. The existing tracker tests never passed the mock through the interface, so it compiled on its own — but any new test using it as a DirectRPCConnection (including this one) hit the mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): force blocksToSave=1 for Solana per-endpoint trackers With CustomMessage now working for EndpointChainFetcher, the SVM ChainTracker path made it past its initial getLatestBlockhash call on Solana — only to die seconds later with "ChainTracker stopped with error: slot not found in cache". Root cause is a pre-existing limitation in SVMChainTracker: fetchLatestBlockNumInner populates the blockNum→slot cache for the latest block only (svm_chain_tracker.go:70), but the generic ChainTracker init loop (chain_tracker.go readHashes) iterates latestBlock down to latestBlock - blocksToSave + 1 calling FetchBlockHashByNum for each. On SVM every lookup after the first hits the cache-miss path and returns "slot not found in cache", killing the tracker permanently (StartAndServe returns the error and the goroutine exits). For per-endpoint tracking we don't need history — each tracker watches a single URL, so there's no cross-endpoint fork detection to do; the manager only uses GetLatestBlockNum via ValidateEndpointSync and the OnNewBlock callback for per-provider metric emission. Forcing blocksToSave=1 on Solana-family chains (SOLANA, SOLANAT, KOII, KOIIT) sidesteps the SVMChainTracker limitation entirely without losing any capability the manager actually uses. EVM chains keep their caller-requested value. Test: TestEndpointChainTrackerManager_ForcesBlocksToSave1ForSolana asserts the override fires for every SVM family id and leaves non-SVM chains unchanged. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(lavasession): meaningful IsHealthy for HTTPDirectRPCConnection HTTPDirectRPCConnection.IsHealthy was hard-coded to return true, with a comment deferring health tracking to "the endpoint/QoS layer". But the comprehensive probe path used by checkAndUnblockHealthyReBlockedProviders (probeProvider → probeDirectRPCEndpoints for static providers) only consults conn.IsHealthy() — so for static HTTP/HTTPS backups the probe was a guaranteed pass regardless of reachability, and a blocked backup with an unreachable upstream would be silently unblocked at every epoch transition. Two complementary changes close this: 1. HTTPDirectRPCConnection grows an atomic healthy flag (initial true), flipped to false by transport-level failures in SendRequest and DoHTTPRequest (dial / TLS / body-read errors), flipped back to true on any successful exchange — including HTTP 4xx/5xx, which are *application* errors, not transport failures. Flipping on 4xx/5xx would make rate-limited endpoints flap unhealthy and starve the dashboard on routine server errors; the test TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy guards that carve-out explicitly. 2. probeDirectRPCEndpoints now also gates on endpoint.Enabled. This belt-and-suspenders with IsHealthy(): the endpoint struct accumulates relay-path failures via MarkUnhealthy → ConnectionRefusals ≥ MaxConsecutiveConnectionAttempts → Enabled=false. Reading that state here means a backup that the hot path has already declared dead won't be optimistically revived by a probe that happens to catch a lucky moment in IsHealthy's lifecycle. Known gap (flagged in the IsHealthy comment): a never-hit backup with an unreachable upstream still reports healthy on the very first epoch because nothing has exercised either signal yet. Closing that cold-start case requires an active probe request (e.g. chain-specific no-op via the ChainParser), which is a meaningfully bigger change and belongs in a separate follow-up. Tests: - TestHTTPDirectRPCConnection_IsHealthy_StartsTrue — documents the optimistic-initialization behavior. - TestHTTPDirectRPCConnection_IsHealthy_FlipsOnDialFailure — the core behavior: a transport error must flip IsHealthy to false. - TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy — guards against overreach on application-level HTTP errors. - TestHTTPDirectRPCConnection_IsHealthy_RecoversAfterFailure — after upstream recovery, a successful exchange must restore IsHealthy=true. - TestProbeDirectRPCEndpoints_RespectsDisabledEndpoint — the endpoint.Enabled gate blocks a disabled endpoint from passing the probe even when IsHealthy happens to say true. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(rpcsmartrouter): cap /debug/time-warp body at 1 KiB The JSON decoder was reading r.Body unbounded, letting a caller stream an arbitrarily large payload into an endpoint whose only legitimate content is {"offset_seconds": N}. Wrap the body in http.MaxBytesReader(w, r.Body, 1024) before decoding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(errors): honor Retryable flag on node-error retry decision Node errors classified to a LavaError with Retryable=false (e.g. CHAIN_EXECUTION_REVERTED, CHAIN_OUT_OF_GAS, CHAIN_DOUBLE_SPEND) were retried because the state machine only short-circuited on IsUnsupportedMethod / IsUserError subcategories and ignored the registry's Retryable flag. Populate RelayResult.IsNonRetryable from the classifier on both consumer and smart-router paths and make HasNonRetryableUserFacingErrors rest solely on that flag. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: avitenzer <avitenzer@users.noreply.github.com>

pull-request-size Bot added the size/L label Mar 31, 2026

github-actions Bot added C:protocol Team:Protocol labels Mar 31, 2026

qodo-code-review Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread protocol/lavasession/consumer_session_manager.go

Comment thread protocol/lavasession/consumer_session_manager.go Outdated

baz-reviewer Bot approved these changes Mar 31, 2026

View reviewed changes

NadavLevi force-pushed the fix/backup-providers-health-decentralized branch from e97b576 to 32924f2 Compare March 31, 2026 12:44

NadavLevi force-pushed the fix/backup-providers-health-decentralized branch from 32924f2 to 630c2a9 Compare March 31, 2026 13:11

NadavLevi requested review from Tomelia1999, avitenzer and nimrod-teich March 31, 2026 14:34

NadavLevi closed this Apr 5, 2026

NadavLevi deleted the fix/backup-providers-health-decentralized branch April 12, 2026 07:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lavasession): track and persist blocked backup provider state acr…#2257

fix(lavasession): track and persist blocked backup provider state acr…#2257
NadavLevi wants to merge 1 commit into
mainfrom
fix/backup-providers-health-decentralized

NadavLevi commented Mar 31, 2026 •

edited by baz-reviewer Bot

Loading

Uh oh!

qodo-code-review Bot commented Mar 31, 2026

Uh oh!

qodo-code-review Bot commented Mar 31, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Mar 31, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Mar 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NadavLevi commented Mar 31, 2026 • edited by baz-reviewer Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Description

Author Checklist

Reviewers Checklist

Generated description

Uh oh!

qodo-code-review Bot commented Mar 31, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NadavLevi commented Mar 31, 2026 •

edited by baz-reviewer Bot

Loading

qodo-code-review Bot commented Mar 31, 2026 •

edited

Loading

codecov Bot commented Mar 31, 2026 •

edited

Loading

github-actions Bot commented Mar 31, 2026 •

edited

Loading