fix(rpcsmartrouter): reset endpoint health on epoch transition by NadavLevi · Pull Request #2256 · lavanet/lava

NadavLevi · 2026-03-31T10:08:00Z

User description

Disabled endpoints (Enabled=false, ConnectionRefusals>=5) were reused across epoch boundaries without being reset, causing a deadlock where they could never receive the successful relay needed to trigger ResetHealth. Reset all endpoint health state before handing endpoints to fresh sessions.
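The deadlock described above can be illustrated with a minimal sketch. The `Endpoint` type, the `maxConsecutiveRefusals` constant, `pickEndpoint`, and `onEpochTransition` are simplified, hypothetical stand-ins, not the real `lavasession` or `rpcsmartrouter` code; only the `Enabled` and `ConnectionRefusals` fields, the threshold of 5, and the `ResetHealth` name come from the description.

```go
package main

import "fmt"

// maxConsecutiveRefusals mirrors the ">=5" threshold from the description;
// the constant name is hypothetical.
const maxConsecutiveRefusals = 5

// Endpoint is a simplified stand-in for the real endpoint type.
type Endpoint struct {
	URL                string
	Enabled            bool
	ConnectionRefusals int
}

// MarkUnhealthy records a refused connection and disables the endpoint
// once the threshold is reached (assumed behavior for this sketch).
func (e *Endpoint) MarkUnhealthy() {
	e.ConnectionRefusals++
	if e.ConnectionRefusals >= maxConsecutiveRefusals {
		e.Enabled = false
	}
}

// ResetHealth clears the failure state. In the relay path it only runs
// after a successful relay.
func (e *Endpoint) ResetHealth() {
	e.ConnectionRefusals = 0
	e.Enabled = true
}

// pickEndpoint returns the first enabled endpoint, or nil if none is usable.
// Disabled endpoints are skipped, so they never get the chance to succeed.
func pickEndpoint(endpoints []*Endpoint) *Endpoint {
	for _, e := range endpoints {
		if e.Enabled {
			return e
		}
	}
	return nil
}

// onEpochTransition resets every endpoint before it is handed to a fresh session.
func onEpochTransition(endpoints []*Endpoint) {
	for _, e := range endpoints {
		e.ResetHealth()
	}
}

func main() {
	ep := &Endpoint{URL: "https://node.example", Enabled: true}
	for i := 0; i < maxConsecutiveRefusals; i++ {
		ep.MarkUnhealthy()
	}
	endpoints := []*Endpoint{ep}
	fmt.Println("selectable before reset:", pickEndpoint(endpoints) != nil)
	onEpochTransition(endpoints)
	fmt.Println("selectable after reset:", pickEndpoint(endpoints) != nil)
}
```

Without the epoch reset, the only way out of the disabled state is a successful relay, which a skipped endpoint can never receive.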

Description

Closes: #XXXX

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

read the contribution guide
included the correct type prefix in the PR title, you can find examples of the prefixes below:
confirmed ! in the type prefix if API or client breaking change
targeted the main branch
provided a link to the relevant issue or specification
reviewed "Files changed" and left comments if necessary
included the necessary unit and integration tests
updated the relevant documentation or specification, including comments for documenting Go code
confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

confirmed the correct type prefix in the PR title
confirmed all author checklist items have been addressed
reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Reset endpoint health state in RPCSmartRouter.updateEpoch so providers and backups enter each new epoch with clean connection counters. New test coverage validates that epoch transitions re-enable disabled endpoints by asserting that their Enabled flag and ConnectionRefusals counter are reset.

Topic

Details

Endpoint reset

Reset endpoint health state before constructing fresh sessions so disabled endpoints can be re-enabled across epochs.

Modified files (1)

protocol/rpcsmartrouter/rpcsmartrouter.go

Latest Contributors(2)

User	Commit	Date
NadavLevi	chore(metrics): cleanu...	March 23, 2026
nimrod.teich@gmail.com	fix(smart-router): ext...	March 23, 2026

Reset tests

Verify disabled provider and backup endpoints are re-enabled and refusal counts drop to zero after epoch transition.

Modified files (1)

protocol/rpcsmartrouter/rpcsmartrouter_test.go

Latest Contributors(2)

User	Commit	Date
anna@magmadevs.com	feat(provideroptimizer...	February 22, 2026
nimrod.teich@gmail.com	fix: Hiesen tests fail...	December 31, 2025

qodo-code-review · 2026-03-31T10:08:17Z

Review Summary by Qodo

Reset endpoint health on epoch transition to prevent deadlock

🐞 Bug fix

Walkthroughs

Description

• Reset endpoint health state on epoch transitions to prevent deadlock
• Disabled endpoints with connection failures now get fresh start each epoch
• Applies health reset to both provider and backup sessions
• Added comprehensive test verifying endpoint re-enablement after epoch transition

Diagram

flowchart LR
  A["Epoch Transition"] --> B["updateEpoch Called"]
  B --> C["Reset Endpoint Health"]
  C --> D["Provider Sessions"]
  C --> E["Backup Sessions"]
  D --> F["Fresh Sessions Created"]
  E --> F
  F --> G["Endpoints Re-enabled"]

File Changes

1. protocol/rpcsmartrouter/rpcsmartrouter.go 🐞 Bug fix +11/-3

Reset endpoint health on epoch transition

• Added endpoint.ResetHealth() call for all endpoints in provider sessions before creating fresh
 sessions
• Added endpoint.ResetHealth() call for all endpoints in backup sessions before creating fresh
 sessions
• Clarified comments explaining why endpoint health reset is necessary to prevent permanent
 disabling

protocol/rpcsmartrouter/rpcsmartrouter.go

2. protocol/rpcsmartrouter/rpcsmartrouter_test.go 🧪 Tests +64/-0

Add test for endpoint health reset on epoch transition

• Added new test TestUpdateEpoch_ResetsDisabledEndpoints to verify endpoint health reset behavior
• Test creates disabled endpoints with connection refusals and verifies they are re-enabled after
 epoch transition
• Test covers both provider and backup endpoint scenarios
• Validates that Enabled flag is set to true and ConnectionRefusals counter is reset to zero

protocol/rpcsmartrouter/rpcsmartrouter_test.go

qodo-code-review · 2026-03-31T10:08:18Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

1. ResetHealth not thread-safe 🐞 Bug ☼ Reliability

Description

updateEpoch now calls Endpoint.ResetHealth on epoch transitions, but ResetHealth/MarkUnhealthy write
Enabled/ConnectionRefusals without acquiring Endpoint.mu, while other goroutines read/write these
fields under that mutex. This breaks synchronization and can cause data races and inconsistent
endpoint health/selection under concurrent relay load.

Code

protocol/rpcsmartrouter/rpcsmartrouter.go[R1491-1493]
+			for _, endpoint := range oldSession.Endpoints {
+				endpoint.ResetHealth()
+			}

Evidence
The epoch timer runs updateEpoch periodically in the background, so ResetHealth can run concurrently
with relays. Endpoint.mu is explicitly used elsewhere to protect Enabled/ConnectionRefusals (with
RLock/Lock), but ResetHealth/MarkUnhealthy mutate those same fields without locking, so concurrent
access becomes racy/undefined.
protocol/rpcsmartrouter/rpcsmartrouter.go[282-293]
protocol/rpcsmartrouter/rpcsmartrouter.go[1469-1542]
protocol/rpcsmartrouter/rpcsmartrouter.go[1488-1499]
protocol/lavasession/consumer_types.go[188-205]
protocol/lavasession/consumer_types.go[234-255]
protocol/lavasession/consumer_types.go[654-662]
protocol/lavasession/consumer_types.go[696-801]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Endpoint.ResetHealth()` (and `Endpoint.MarkUnhealthy()`) mutate `Enabled` and `ConnectionRefusals` without taking `Endpoint.mu`, but other code assumes `Endpoint.mu` synchronizes these fields. With the new `updateEpoch()` calls, this creates a concrete data race between the epoch timer goroutine and in-flight relays/connection selection.

### Issue Context
- `updateEpoch()` now calls `endpoint.ResetHealth()` for all endpoints on every epoch transition.
- `fetchEndpointConnectionFromConsumerSessionWithProvider()` uses `endpoint.mu` (RLock/Lock) around reads/writes of `Enabled` and `ConnectionRefusals`.
- `ResetHealth()` and `MarkUnhealthy()` currently bypass the mutex entirely.

### Fix Focus Areas
- protocol/lavasession/consumer_types.go[234-255]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1488-1516]

### Suggested fix
1. Update `(*Endpoint).ResetHealth()` to:
  - `e.mu.Lock()` / `defer e.mu.Unlock()`
  - mutate `ConnectionRefusals` / `Enabled` under the lock
2. Update `(*Endpoint).MarkUnhealthy()` similarly.
3. (Optional) ensure any direct field reads of `Enabled`/`ConnectionRefusals` in rpcsmartrouter code use the mutex or are replaced with helper accessors if needed.
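A minimal sketch of the suggested fix, assuming a simplified `Endpoint` with an `RWMutex` named `mu`. The `IsEnabled` accessor and the `maxConsecutiveRefusals` constant are hypothetical names introduced for illustration, and the body of `MarkUnhealthy` is an assumption about its behavior rather than the real implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// maxConsecutiveRefusals is a hypothetical name for the disable threshold.
const maxConsecutiveRefusals = 5

// Endpoint is a simplified stand-in; mu guards Enabled and ConnectionRefusals.
type Endpoint struct {
	mu                 sync.RWMutex
	Enabled            bool
	ConnectionRefusals int
}

// MarkUnhealthy mutates health state under the write lock.
func (e *Endpoint) MarkUnhealthy() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.ConnectionRefusals++
	if e.ConnectionRefusals >= maxConsecutiveRefusals {
		e.Enabled = false
	}
}

// ResetHealth mutates health state under the write lock.
func (e *Endpoint) ResetHealth() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.ConnectionRefusals = 0
	e.Enabled = true
}

// IsEnabled is a hypothetical accessor so callers never read the field
// without holding the lock.
func (e *Endpoint) IsEnabled() bool {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.Enabled
}

func main() {
	ep := &Endpoint{Enabled: true}
	var wg sync.WaitGroup
	// Simulate relay goroutines reporting failures while the epoch timer resets.
	for i := 0; i < 50; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); ep.MarkUnhealthy() }()
		go func() { defer wg.Done(); ep.ResetHealth() }()
	}
	wg.Wait()
	ep.ResetHealth()
	fmt.Println("enabled after final reset:", ep.IsEnabled())
}
```

Running a program like this with `go run -race` is one way to check that no unsynchronized access to the two fields remains.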

2. Epoch reset log spam 🐞 Bug ✧ Quality

Description

updateEpoch resets every endpoint each epoch and ResetHealth emits an Info log unconditionally, so a
large deployment or short epoch duration can produce high-volume logs and misleading “re-enabled”
messages for already-healthy endpoints. This adds avoidable overhead and reduces signal-to-noise in
production logs.

Code

protocol/rpcsmartrouter/rpcsmartrouter.go[R1488-1493]

+			// Reset endpoint health so disabled endpoints get a fresh start each epoch.
+			// Without this, an endpoint disabled by ConnectionRefusals stays disabled
+			// forever since it can never receive the successful relay needed to trigger ResetHealth.
+			for _, endpoint := range oldSession.Endpoints {
+				endpoint.ResetHealth()
+			}

Evidence
updateEpoch is registered as a periodic epoch callback, and now calls ResetHealth for each endpoint.
ResetHealth always logs at Info level, so this change can multiply Info logs by (endpoints ×
epochs), regardless of whether an endpoint was actually disabled/unhealthy.
protocol/rpcsmartrouter/rpcsmartrouter.go[282-293]
protocol/rpcsmartrouter/rpcsmartrouter.go[1469-1530]
protocol/lavasession/consumer_types.go[247-255]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`updateEpoch()` now calls `ResetHealth()` for every endpoint each epoch, and `ResetHealth()` logs at Info level every time. This can flood logs and produces misleading messages when no state changed.

### Issue Context
Epoch transitions can be frequent (configurable), and the number of endpoints can be large.

### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[1488-1516]
- protocol/lavasession/consumer_types.go[247-255]

### Suggested fix
Implement one of:
- Add a guard in `ResetHealth()` (under `e.mu`) to no-op without logging when `ConnectionRefusals == 0 && Enabled == true`.
- Or, in `updateEpoch()`, only call `ResetHealth()` for endpoints that are currently disabled or have `ConnectionRefusals > 0`.
- Optionally downgrade the log to Debug, or log a single aggregated message per epoch with counts reset.
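The first and third options can be combined, as in the sketch below. It assumes a simplified `Endpoint`; having `ResetHealth` return a `bool` and the helper `resetEndpointsForEpoch` are hypothetical choices for illustration, not the project's actual API.

```go
package main

import (
	"log"
	"sync"
)

// Endpoint is a simplified stand-in for the real endpoint type.
type Endpoint struct {
	mu                 sync.Mutex
	Enabled            bool
	ConnectionRefusals int
}

// ResetHealth clears failure state and reports whether anything changed.
// Returning a bool is an assumption made for this sketch.
func (e *Endpoint) ResetHealth() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.Enabled && e.ConnectionRefusals == 0 {
		return false // already healthy: no-op, nothing to log
	}
	e.ConnectionRefusals = 0
	e.Enabled = true
	return true
}

// resetEndpointsForEpoch resets all endpoints, emits at most one aggregated
// log line per epoch, and returns how many endpoints actually changed.
func resetEndpointsForEpoch(epoch uint64, endpoints []*Endpoint) int {
	changed := 0
	for _, e := range endpoints {
		if e.ResetHealth() {
			changed++
		}
	}
	if changed > 0 {
		log.Printf("epoch %d: reset health on %d of %d endpoints", epoch, changed, len(endpoints))
	}
	return changed
}

func main() {
	endpoints := []*Endpoint{
		{Enabled: true},
		{Enabled: false, ConnectionRefusals: 5},
		{Enabled: true, ConnectionRefusals: 2},
	}
	resetEndpointsForEpoch(1, endpoints) // logs once for the two changed endpoints
	resetEndpointsForEpoch(2, endpoints) // nothing changed, so nothing is logged
}
```

With this shape, log volume scales with the number of epochs in which something changed rather than with endpoints × epochs.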

codecov · 2026-03-31T10:13:55Z

Codecov Report

❌ Patch coverage is 43.75000% with 18 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
protocol/lavasession/consumer_types.go	36.36%	14 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter_server.go	0.00%	4 Missing ⚠️

Flag	Coverage Δ
consensus	`8.70% <ø> (ø)`
protocol	`34.05% <43.75%> (+0.11%)`	⬆️

Files with missing lines	Coverage Δ
protocol/rpcsmartrouter/rpcsmartrouter.go	`5.03% <100.00%> (+2.04%)`	⬆️
protocol/rpcsmartrouter/rpcsmartrouter_server.go	`13.53% <0.00%> (ø)`
protocol/lavasession/consumer_types.go	`74.10% <36.36%> (+2.03%)`	⬆️

... and 1 file with indirect coverage changes

github-actions · 2026-03-31T10:27:46Z

Test Results

7 files ±0	0 suites ±0	0s ⏱️ ±0s
0 tests ±0	0 ✅ ±0	0 💤 ±0	0 ❌ ±0

Results for commit e7a79ee. ± Comparison against base commit f8670a7.

Disabled endpoints (Enabled=false, ConnectionRefusals>=5) were reused across epoch boundaries without being reset, causing a deadlock where they could never receive the successful relay needed to trigger ResetHealth. Reset all endpoint health state before handing endpoints to fresh sessions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

PR #2256 resets endpoint.ResetHealth() for both primary and backup providers on epoch transition, but only the in-memory struct was cleared — the lava_rpc_endpoint_overall_health Prometheus gauge stayed stuck at 0 (unhealthy) forever. The metric is only toggled on state transitions from relay outcomes (MarkUnhealthy/ResetHealth in the hot path), and backups rarely receive the successful relay needed to push the gauge back to 1. Net effect: operators see backups at 0% uptime on dashboards even though the router considers them healthy again after the epoch reset, which can mask real regressions and (worse) cause failover logic that reads the gauge to refuse routing to a provider that is in fact available. Emits SetEndpointOverallHealth(..., true) alongside the struct reset for both the primary and backup provider loops in updateEpoch, mirrored via rpsr.rpcServers[chainKey].smartRouterEndpointMetrics. Also adds TestUpdateEpoch_ResetsHealthMetric which seeds the gauge to 0 (via SetEndpointOverallHealth) for a primary and a backup, runs updateEpoch, then reads lava_rpc_endpoint_overall_health back from prometheus.DefaultGatherer and asserts both have transitioned to 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
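The idea of emitting the metric alongside the struct reset can be sketched as follows. Only the name `SetEndpointOverallHealth` comes from the commit message; the `healthMetrics` interface, its parameter list, the `fakeGauge` type, and `resetEndpointsAndMetrics` are hypothetical, and an in-memory map stands in for the Prometheus gauge so the example stays dependency-free.

```go
package main

import "fmt"

// healthMetrics is a hypothetical stand-in for the metrics manager.
// The parameter list is assumed, not taken from the real code.
type healthMetrics interface {
	SetEndpointOverallHealth(chainID, endpointURL string, healthy bool)
}

// fakeGauge keeps the last value per endpoint, the way a gauge would.
type fakeGauge map[string]float64

func (g fakeGauge) SetEndpointOverallHealth(chainID, endpointURL string, healthy bool) {
	value := 0.0
	if healthy {
		value = 1.0
	}
	g[chainID+"|"+endpointURL] = value
}

// Endpoint is a simplified stand-in for the real endpoint type.
type Endpoint struct {
	URL                string
	Enabled            bool
	ConnectionRefusals int
}

// ResetHealth clears the in-memory failure state.
func (e *Endpoint) ResetHealth() {
	e.ConnectionRefusals = 0
	e.Enabled = true
}

// resetEndpointsAndMetrics resets the struct and mirrors the new state to the
// gauge, so the dashboard and the router agree after an epoch transition.
func resetEndpointsAndMetrics(chainID string, endpoints []*Endpoint, m healthMetrics) {
	for _, e := range endpoints {
		e.ResetHealth()
		if m != nil {
			m.SetEndpointOverallHealth(chainID, e.URL, true)
		}
	}
}

func main() {
	gauge := fakeGauge{}
	backup := &Endpoint{URL: "https://backup.example", ConnectionRefusals: 5}
	// Seed the gauge to 0, as a relay failure in the hot path would.
	gauge.SetEndpointOverallHealth("CHAIN", backup.URL, false)
	resetEndpointsAndMetrics("CHAIN", []*Endpoint{backup}, gauge)
	fmt.Println("gauge after epoch reset:", gauge["CHAIN|"+backup.URL])
}
```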

* fix(lavasession): track and persist blocked backup provider state across epochs

Backup providers could never be blocked: blockProvider was a no-op for them since they are not in validAddresses, and the backup selection path never checked any blocked list. Additionally, the epoch-transition re-blocking and health-probe logic only covered the main pairing, leaving previously-blocked backup providers with a clean slate.

- Add blockedBackupProviders map to ConsumerSessionManager
- blockProvider now adds backup providers to blockedBackupProviders when they are not found in validAddresses
- getValidConsumerSessionsWithProviderFromBackupProviderList skips providers in blockedBackupProviders
- UpdateAllProviders merges blockedBackupProviders into previousEpochBlockedProviders and re-blocks them in the new epoch
- checkAndUnblockHealthyReBlockedProviders handles backup providers in both probe passes, including cross-role transitions (normal→backup)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lavasession): handle backup providers in GenerateReconnectCallback

Ports the GenerateReconnectCallback change from #2265 on top of #2257: when the periodic reconnect probe succeeds for a blocked backup provider, remove it from blockedBackupProviders instead of calling validateAndReturnBlockedProviderToValidAddressesList (a no-op for backups, which are not in validAddresses). Without this, periodic reconnection could never recover a blocked backup outside the epoch-transition path.

Also expands the test suite:

- TestCheckAndUnblock_BackupRoutedToComprehensiveProbe covers the !isBackup && !IsReported guard: confirms backups never take the immediate-unblock branch (which would not touch blockedBackupProviders and would silently stall the recovery).
- TestCheckAndUnblock_BackupUnblockedWhenHealthy covers the positive path: a blocked backup with a reachable endpoint is unblocked via comprehensive probe at epoch transition.
- TestGenerateReconnectCallback_BackupProviderUnblocked covers the new backup branch added by this commit.
- TestGenerateReconnectCallback_NonBackupUsesValidAddressesPath guards against regressing the existing non-backup reconnect flow.

Co-authored-by: avitenzer <avitenzer@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): reset endpoint health metric on epoch transition

PR #2256 resets endpoint.ResetHealth() for both primary and backup providers on epoch transition, but only the in-memory struct was cleared; the lava_rpc_endpoint_overall_health Prometheus gauge stayed stuck at 0 (unhealthy) forever. The metric is only toggled on state transitions from relay outcomes (MarkUnhealthy/ResetHealth in the hot path), and backups rarely receive the successful relay needed to push the gauge back to 1.

Net effect: operators see backups at 0% uptime on dashboards even though the router considers them healthy again after the epoch reset, which can mask real regressions and (worse) cause failover logic that reads the gauge to refuse routing to a provider that is in fact available.

Emits SetEndpointOverallHealth(..., true) alongside the struct reset for both the primary and backup provider loops in updateEpoch, mirrored via rpsr.rpcServers[chainKey].smartRouterEndpointMetrics. Also adds TestUpdateEpoch_ResetsHealthMetric which seeds the gauge to 0 (via SetEndpointOverallHealth) for a primary and a backup, runs updateEpoch, then reads lava_rpc_endpoint_overall_health back from prometheus.DefaultGatherer and asserts both have transitioned to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(metrics): remove unused ResetEndpointMetrics

ResetEndpointMetrics had no callers and, despite its name, set endpoint_overall_health to 0 instead of 1, a footgun if ever wired up. The two actual consumers (relay hot path and epoch transition) already call SetEndpointOverallHealth with an explicit bool, so there's no need for a helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: populate per-endpoint metrics for backup providers

Two independent gaps combined to leave backup providers showing N/A latest-block on the dashboard:

1. GetAllDirectRPCEndpoints (consumer_session_manager.go) iterated only csm.pairing, so the smart router's initializeChainTrackers startup path skipped every backup. A ChainTracker for a dedicated-URL backup (e.g. base.lava.build with no primary sharing that URL) was only ever created lazily via the relay hot path, which backups rarely reach. Extended the function to iterate csm.backupProviders as well.
2. urlToProviderName was a map[string]string, so multiple providers configured on the same upstream URL collapsed to a single name (last writer wins). The ChainTracker's OnNewBlock callback keys emissions by URL and resolves to a provider name via this map, so every URL-keyed metric emission reached exactly one provider's label, leaving its peers stuck at zero on dashboards. Changed the map to map[string][]string with append+dedup in RegisterEndpoint, added resolveProviderNames returning all providers for a URL, and updated the two URL-keyed emitters (SetEndpointLatestBlock, RecordBlockFetch) to iterate and emit per provider. resolveProviderName is kept for callers that already pass a provider name (relay hot path); it now returns the first name.

Tests:

- TestGetAllDirectRPCEndpoints_IncludesBackupProviders: primary and backup both appear in the startup iteration.
- TestResolveProviderNames_FansOutAcrossSharedURL: both providers registered at the same URL resolve together.
- TestResolveProviderNames_DeduplicatesSameProvider: repeated (URL, providerName) registrations (typical for providers whose node-urls lists the same URL twice) don't duplicate entries.
- TestResolveProviderNames_FallsBackToInput: unknown inputs (provider names) round-trip as a single-element slice for backward compatibility with callers that pass names directly.
- TestSetEndpointLatestBlock_FansOutAcrossSharedURL: one URL-keyed call populates the gauge for every provider sharing the URL.
- TestRecordBlockFetch_FansOutAcrossSharedURL: same guarantee for the chain-tracker success/fail counters.

The test files in protocol/metrics that constructed SmartRouterMetricsManager literals were updated to the new field name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): implement CustomMessage on EndpointChainFetcher for Solana

SVMChainTracker (protocol/chaintracker/svm_chain_tracker.go:55) fetches latest block via CustomMessage with a getLatestBlockhash JSON-RPC body; it needs the slot, block hash, and block height returned together, which has no equivalent in the generic FetchLatestBlockNum path. EndpointChainFetcher.CustomMessage was a stub returning "CustomMessage not supported for EndpointChainFetcher". On every Solana-family chain this caused the per-endpoint ChainTracker to fail its first fetch: no OnNewBlock callbacks ever fired, so the metric fan-out added earlier in this branch had nothing to fan out and every per-endpoint metric for backups remained at N/A.

Delegate to the existing sendRawRequest helper: for POST (the SVM case) the `data` argument is the JSON-RPC body and sendRawRequest posts it through the direct RPC connection; for GET the `data` argument is already used as the URL suffix per existing REST convention, so the same delegation preserves GET semantics.

Tests:

- TestEndpointChainFetcher_CustomMessage_POSTDelegatesToConnection exercises the SVMChainTracker code path and asserts the upstream response body flows back through CustomMessage unchanged.
- TestEndpointChainFetcher_CustomMessage_PropagatesUnhealthyConnection asserts the connection-health check still fails closed rather than silently returning empty data.

Also fixes a latent compile-break in the package-shared mockDirectRPCConnection: its GetNodeUrl signature was returning interface{} while the lavasession.DirectRPCConnection interface requires *common.NodeUrl. The existing tracker tests never passed the mock through the interface, so it compiled on its own, but any new test using it as a DirectRPCConnection (including this one) hit the mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): force blocksToSave=1 for Solana per-endpoint trackers

With CustomMessage now working for EndpointChainFetcher, the SVM ChainTracker path made it past its initial getLatestBlockhash call on Solana, only to die seconds later with "ChainTracker stopped with error: slot not found in cache".

Root cause is a pre-existing limitation in SVMChainTracker: fetchLatestBlockNumInner populates the blockNum→slot cache for the latest block only (svm_chain_tracker.go:70), but the generic ChainTracker init loop (chain_tracker.go readHashes) iterates latestBlock down to latestBlock - blocksToSave + 1 calling FetchBlockHashByNum for each. On SVM every lookup after the first hits the cache-miss path and returns "slot not found in cache", killing the tracker permanently (StartAndServe returns the error and the goroutine exits).

For per-endpoint tracking we don't need history: each tracker watches a single URL, so there's no cross-endpoint fork detection to do; the manager only uses GetLatestBlockNum via ValidateEndpointSync and the OnNewBlock callback for per-provider metric emission. Forcing blocksToSave=1 on Solana-family chains (SOLANA, SOLANAT, KOII, KOIIT) sidesteps the SVMChainTracker limitation entirely without losing any capability the manager actually uses. EVM chains keep their caller-requested value.

Test: TestEndpointChainTrackerManager_ForcesBlocksToSave1ForSolana asserts the override fires for every SVM family id and leaves non-SVM chains unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(lavasession): meaningful IsHealthy for HTTPDirectRPCConnection

HTTPDirectRPCConnection.IsHealthy was hard-coded to return true, with a comment deferring health tracking to "the endpoint/QoS layer". But the comprehensive probe path used by checkAndUnblockHealthyReBlockedProviders (probeProvider → probeDirectRPCEndpoints for static providers) only consults conn.IsHealthy(), so for static HTTP/HTTPS backups the probe was a guaranteed pass regardless of reachability, and a blocked backup with an unreachable upstream would be silently unblocked at every epoch transition.

Two complementary changes close this:

1. HTTPDirectRPCConnection grows an atomic healthy flag (initial true), flipped to false by transport-level failures in SendRequest and DoHTTPRequest (dial / TLS / body-read errors), flipped back to true on any successful exchange, including HTTP 4xx/5xx, which are *application* errors, not transport failures. Flipping on 4xx/5xx would make rate-limited endpoints flap unhealthy and starve the dashboard on routine server errors; the test TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy guards that carve-out explicitly.
2. probeDirectRPCEndpoints now also gates on endpoint.Enabled. This belt-and-suspenders with IsHealthy(): the endpoint struct accumulates relay-path failures via MarkUnhealthy → ConnectionRefusals ≥ MaxConsecutiveConnectionAttempts → Enabled=false. Reading that state here means a backup that the hot path has already declared dead won't be optimistically revived by a probe that happens to catch a lucky moment in IsHealthy's lifecycle.

Known gap (flagged in the IsHealthy comment): a never-hit backup with an unreachable upstream still reports healthy on the very first epoch because nothing has exercised either signal yet. Closing that cold-start case requires an active probe request (e.g. chain-specific no-op via the ChainParser), which is a meaningfully bigger change and belongs in a separate follow-up.

Tests:

- TestHTTPDirectRPCConnection_IsHealthy_StartsTrue: documents the optimistic-initialization behavior.
- TestHTTPDirectRPCConnection_IsHealthy_FlipsOnDialFailure: the core behavior; a transport error must flip IsHealthy to false.
- TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy: guards against overreach on application-level HTTP errors.
- TestHTTPDirectRPCConnection_IsHealthy_RecoversAfterFailure: after upstream recovery, a successful exchange must restore IsHealthy=true.
- TestProbeDirectRPCEndpoints_RespectsDisabledEndpoint: the endpoint.Enabled gate blocks a disabled endpoint from passing the probe even when IsHealthy happens to say true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): cap /debug/time-warp body at 1 KiB

The JSON decoder was reading r.Body unbounded, letting a caller stream an arbitrarily large payload into an endpoint whose only legitimate content is {"offset_seconds": N}. Wrap the body in http.MaxBytesReader(w, r.Body, 1024) before decoding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(errors): honor Retryable flag on node-error retry decision

Node errors classified to a LavaError with Retryable=false (e.g. CHAIN_EXECUTION_REVERTED, CHAIN_OUT_OF_GAS, CHAIN_DOUBLE_SPEND) were retried because the state machine only short-circuited on IsUnsupportedMethod / IsUserError subcategories and ignored the registry's Retryable flag. Populate RelayResult.IsNonRetryable from the classifier on both consumer and smart-router paths and make HasNonRetryableUserFacingErrors rest solely on that flag.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: avitenzer <avitenzer@users.noreply.github.com>
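The body cap described in the /debug/time-warp commit above can be sketched with the standard library alone. The handler name `timeWarpHandler`, the `timeWarpRequest` struct, the response format, and the `post` and `oversizedBody` helpers are hypothetical; only `http.MaxBytesReader(w, r.Body, 1024)` and the `{"offset_seconds": N}` payload shape come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// maxTimeWarpBody is the 1 KiB cap named in the commit message.
const maxTimeWarpBody = 1024

// timeWarpRequest models the only payload the endpoint is meant to accept.
type timeWarpRequest struct {
	OffsetSeconds int64 `json:"offset_seconds"`
}

// timeWarpHandler is a hypothetical handler that caps the body before decoding.
func timeWarpHandler(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxTimeWarpBody)
	var req timeWarpRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		// Covers both malformed JSON and bodies that hit the cap mid-value.
		http.Error(w, "invalid or oversized body", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "offset=%d", req.OffsetSeconds)
}

// post sends body to the handler in-process and returns the status code.
func post(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/debug/time-warp", strings.NewReader(body))
	rec := httptest.NewRecorder()
	timeWarpHandler(rec, req)
	return rec.Code
}

// oversizedBody builds a JSON object whose single value exceeds the cap.
func oversizedBody() string {
	return `{"pad": "` + strings.Repeat("x", 2*maxTimeWarpBody) + `"}`
}

func main() {
	fmt.Println("small body status:", post(`{"offset_seconds": 60}`))
	fmt.Println("oversized body status:", post(oversizedBody()))
}
```

Because the decoder stops at the first read error, a JSON value that does not fit within the cap is rejected instead of being buffered in full.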

pull-request-size Bot added the size/M label Mar 31, 2026

github-actions Bot added C:protocol Team:Protocol labels Mar 31, 2026

baz-reviewer Bot reviewed Mar 31, 2026

Comment thread protocol/rpcsmartrouter/rpcsmartrouter.go

qodo-code-review Bot reviewed Mar 31, 2026

Comment thread protocol/rpcsmartrouter/rpcsmartrouter.go

baz-reviewer Bot approved these changes Mar 31, 2026

NadavLevi force-pushed the fix/backup-providers-health branch 2 times, most recently from df4c12c to b18dcd0 Compare March 31, 2026 13:05

pull-request-size Bot added size/L and removed size/M labels Mar 31, 2026

NadavLevi requested review from Tomelia1999, avitenzer and nimrod-teich March 31, 2026 14:34

Tomelia1999 previously approved these changes Apr 5, 2026

avitenzer reviewed Apr 5, 2026

Comment thread protocol/rpcsmartrouter/rpcsmartrouter_test.go

Comment thread protocol/lavasession/consumer_types.go

NadavLevi dismissed Tomelia1999’s stale review via 187538d April 5, 2026 18:31

NadavLevi force-pushed the fix/backup-providers-health branch from b18dcd0 to 187538d Compare April 5, 2026 18:31

NadavLevi requested a review from avitenzer April 5, 2026 18:34

avitenzer reviewed Apr 6, 2026

Comment thread protocol/lavasession/consumer_types.go

NadavLevi force-pushed the fix/backup-providers-health branch from 187538d to a739fff Compare April 6, 2026 13:53

NadavLevi force-pushed the fix/backup-providers-health branch from a739fff to e7a79ee Compare April 6, 2026 13:56

avitenzer approved these changes Apr 6, 2026

Harraken merged commit 1559d6b into main Apr 6, 2026
30 checks passed

Harraken deleted the fix/backup-providers-health branch April 6, 2026 14:31
