
fix(rpcsmartrouter): reset endpoint health on epoch transition#2256

Merged
Harraken merged 1 commit into
mainfrom
fix/backup-providers-health
Apr 6, 2026

Conversation

@NadavLevi NadavLevi commented Mar 31, 2026

Contributor

User description

Disabled endpoints (Enabled=false, ConnectionRefusals>=5) were reused across epoch boundaries without being reset, causing a deadlock where they could never receive the successful relay needed to trigger ResetHealth. Reset all endpoint health state before handing endpoints to fresh sessions.
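The deadlock described above can be reproduced in a few lines. This is a minimal sketch, not the router's actual code: the `Endpoint` type is a simplified stand-in, `relay` and `onEpoch` are hypothetical helpers, and the threshold of 5 is taken from the `ConnectionRefusals>=5` condition quoted in the description.

```go
package main

import "fmt"

// maxRefusals mirrors the ConnectionRefusals>=5 threshold from the description.
const maxRefusals = 5

// Endpoint is a simplified stand-in for the router's endpoint type.
type Endpoint struct {
	Enabled            bool
	ConnectionRefusals int
}

// MarkUnhealthy records a refusal and disables the endpoint at the threshold.
func (e *Endpoint) MarkUnhealthy() {
	e.ConnectionRefusals++
	if e.ConnectionRefusals >= maxRefusals {
		e.Enabled = false
	}
}

// ResetHealth clears the refusal counter and re-enables the endpoint.
func (e *Endpoint) ResetHealth() {
	e.ConnectionRefusals = 0
	e.Enabled = true
}

// relay (hypothetical) models the hot path: a disabled endpoint is never
// selected, so it can never see the success that would call ResetHealth.
func relay(e *Endpoint, upstreamOK bool) bool {
	if !e.Enabled {
		return false
	}
	if upstreamOK {
		e.ResetHealth()
		return true
	}
	e.MarkUnhealthy()
	return false
}

// onEpoch (hypothetical) resets every endpoint before it is reused.
func onEpoch(endpoints []*Endpoint) {
	for _, e := range endpoints {
		e.ResetHealth()
	}
}

func main() {
	e := &Endpoint{Enabled: true}
	for i := 0; i < maxRefusals; i++ {
		relay(e, false)
	}
	// The upstream has recovered, but the endpoint is still skipped.
	fmt.Println("before epoch:", e.Enabled, relay(e, true))
	onEpoch([]*Endpoint{e})
	fmt.Println("after epoch:", e.Enabled, relay(e, true))
}
```

Without the `onEpoch` call, the only path back to `Enabled=true` runs through a relay that the endpoint is no longer eligible to serve.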

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Reset endpoint health state in RPCSmartRouter.updateEpoch so providers and backups enter each new epoch with clean connection counters. Add test coverage asserting that an epoch transition re-enables disabled endpoints: their Enabled flag is true and their ConnectionRefusals counter is zero.

Topic: Endpoint reset
Details: Reset endpoint health state before constructing fresh sessions so disabled endpoints can be re-enabled across epochs.
Modified files (1)
  • protocol/rpcsmartrouter/rpcsmartrouter.go
Latest Contributors (2) (user, commit, date)
  • NadavLevi, chore(metrics): cleanu..., March 23, 2026
  • nimrod.teich@gmail.com, fix(smart-router): ext..., March 23, 2026

Topic: Reset tests
Details: Verify disabled provider and backup endpoints are re-enabled and refusal counts drop to zero after epoch transition.
Modified files (1)
  • protocol/rpcsmartrouter/rpcsmartrouter_test.go
Latest Contributors (2) (user, commit, date)
  • anna@magmadevs.com, feat(provideroptimizer..., February 22, 2026
  • nimrod.teich@gmail.com, fix: Hiesen tests fail..., December 31, 2025

@qodo-code-review

Review Summary by Qodo

Reset endpoint health on epoch transition to prevent deadlock

🐞 Bug fix

Walkthroughs

Description
• Reset endpoint health state on epoch transitions to prevent deadlock
• Disabled endpoints with connection failures now get fresh start each epoch
• Applies health reset to both provider and backup sessions
• Added comprehensive test verifying endpoint re-enablement after epoch transition
Diagram
flowchart LR
  A["Epoch Transition"] --> B["updateEpoch Called"]
  B --> C["Reset Endpoint Health"]
  C --> D["Provider Sessions"]
  C --> E["Backup Sessions"]
  D --> F["Fresh Sessions Created"]
  E --> F
  F --> G["Endpoints Re-enabled"]

File Changes

1. protocol/rpcsmartrouter/rpcsmartrouter.go 🐞 Bug fix +11/-3

Reset endpoint health on epoch transition

• Added endpoint.ResetHealth() call for all endpoints in provider sessions before creating fresh
 sessions
• Added endpoint.ResetHealth() call for all endpoints in backup sessions before creating fresh
 sessions
• Clarified comments explaining why endpoint health reset is necessary to prevent permanent
 disabling

protocol/rpcsmartrouter/rpcsmartrouter.go


2. protocol/rpcsmartrouter/rpcsmartrouter_test.go 🧪 Tests +64/-0

Add test for endpoint health reset on epoch transition

• Added new test TestUpdateEpoch_ResetsDisabledEndpoints to verify endpoint health reset behavior
• Test creates disabled endpoints with connection refusals and verifies they are re-enabled after
 epoch transition
• Test covers both provider and backup endpoint scenarios
• Validates that Enabled flag is set to true and ConnectionRefusals counter is reset to zero

protocol/rpcsmartrouter/rpcsmartrouter_test.go


qodo-code-review Bot commented Mar 31, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)

Action required

1. ResetHealth not thread-safe 🐞 Bug ☼ Reliability
Description
updateEpoch now calls Endpoint.ResetHealth on epoch transitions, but ResetHealth/MarkUnhealthy write
Enabled/ConnectionRefusals without acquiring Endpoint.mu, while other goroutines read/write these
fields under that mutex. This breaks synchronization and can cause data races and inconsistent
endpoint health/selection under concurrent relay load.
Code

protocol/rpcsmartrouter/rpcsmartrouter.go[R1491-1493]

+			for _, endpoint := range oldSession.Endpoints {
+				endpoint.ResetHealth()
+			}
Evidence
The epoch timer runs updateEpoch periodically in the background, so ResetHealth can run concurrently
with relays. Endpoint.mu is explicitly used elsewhere to protect Enabled/ConnectionRefusals (with
RLock/Lock), but ResetHealth/MarkUnhealthy mutate those same fields without locking, so concurrent
access becomes racy/undefined.

protocol/rpcsmartrouter/rpcsmartrouter.go[282-293]
protocol/rpcsmartrouter/rpcsmartrouter.go[1469-1542]
protocol/rpcsmartrouter/rpcsmartrouter.go[1488-1499]
protocol/lavasession/consumer_types.go[188-205]
protocol/lavasession/consumer_types.go[234-255]
protocol/lavasession/consumer_types.go[654-662]
protocol/lavasession/consumer_types.go[696-801]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Endpoint.ResetHealth()` (and `Endpoint.MarkUnhealthy()`) mutate `Enabled` and `ConnectionRefusals` without taking `Endpoint.mu`, but other code assumes `Endpoint.mu` synchronizes these fields. With the new `updateEpoch()` calls, this creates a concrete data race between the epoch timer goroutine and in-flight relays/connection selection.

### Issue Context
- `updateEpoch()` now calls `endpoint.ResetHealth()` for all endpoints on every epoch transition.
- `fetchEndpointConnectionFromConsumerSessionWithProvider()` uses `endpoint.mu` (RLock/Lock) around reads/writes of `Enabled` and `ConnectionRefusals`.
- `ResetHealth()` and `MarkUnhealthy()` currently bypass the mutex entirely.

### Fix Focus Areas
- protocol/lavasession/consumer_types.go[234-255]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1488-1516]

### Suggested fix
1. Update `(*Endpoint).ResetHealth()` to:
  - `e.mu.Lock()` / `defer e.mu.Unlock()`
  - mutate `ConnectionRefusals` / `Enabled` under the lock
2. Update `(*Endpoint).MarkUnhealthy()` similarly.
3. (Optional) ensure any direct field reads of `Enabled`/`ConnectionRefusals` in rpcsmartrouter code use the mutex or are replaced with helper accessors if needed.
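
A minimal sketch of the suggested fix, assuming `Endpoint.mu` is a `sync.RWMutex` guarding both fields. The struct is a simplified stand-in, and `IsEnabled` and the `maxRefusals` constant are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// maxRefusals is a hypothetical name for the disable threshold.
const maxRefusals = 5

// Endpoint is a simplified stand-in; mu guards Enabled and ConnectionRefusals.
type Endpoint struct {
	mu                 sync.RWMutex
	Enabled            bool
	ConnectionRefusals int
}

// ResetHealth mutates both fields under the write lock.
func (e *Endpoint) ResetHealth() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.ConnectionRefusals = 0
	e.Enabled = true
}

// MarkUnhealthy takes the same lock, so it cannot interleave with ResetHealth.
func (e *Endpoint) MarkUnhealthy() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.ConnectionRefusals++
	if e.ConnectionRefusals >= maxRefusals {
		e.Enabled = false
	}
}

// IsEnabled is a hypothetical read accessor that takes the read lock.
func (e *Endpoint) IsEnabled() bool {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.Enabled
}

func main() {
	e := &Endpoint{Enabled: true}
	var wg sync.WaitGroup
	// Simulate relays marking failures while an epoch timer resets health.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				e.MarkUnhealthy()
				_ = e.IsEnabled()
			}
		}()
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		for j := 0; j < 100; j++ {
			e.ResetHealth()
		}
	}()
	wg.Wait()
	e.ResetHealth()
	fmt.Println("enabled after final reset:", e.IsEnabled())
}
```

Running a program like this with `go run -race` is one way to check that no unsynchronized access remains.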


Advisory comments

2. Epoch reset log spam 🐞 Bug ✧ Quality
Description
updateEpoch resets every endpoint each epoch and ResetHealth emits an Info log unconditionally, so a
large deployment or short epoch duration can produce high-volume logs and misleading “re-enabled”
messages for already-healthy endpoints. This adds avoidable overhead and reduces signal-to-noise in
production logs.
Code

protocol/rpcsmartrouter/rpcsmartrouter.go[R1488-1493]

+			// Reset endpoint health so disabled endpoints get a fresh start each epoch.
+			// Without this, an endpoint disabled by ConnectionRefusals stays disabled
+			// forever since it can never receive the successful relay needed to trigger ResetHealth.
+			for _, endpoint := range oldSession.Endpoints {
+				endpoint.ResetHealth()
+			}
Evidence
updateEpoch is registered as a periodic epoch callback, and now calls ResetHealth for each endpoint.
ResetHealth always logs at Info level, so this change can multiply Info logs by (endpoints ×
epochs), regardless of whether an endpoint was actually disabled/unhealthy.

protocol/rpcsmartrouter/rpcsmartrouter.go[282-293]
protocol/rpcsmartrouter/rpcsmartrouter.go[1469-1530]
protocol/lavasession/consumer_types.go[247-255]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`updateEpoch()` now calls `ResetHealth()` for every endpoint each epoch, and `ResetHealth()` logs at Info level every time. This can flood logs and produces misleading messages when no state changed.

### Issue Context
Epoch transitions can be frequent (configurable), and the number of endpoints can be large.

### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[1488-1516]
- protocol/lavasession/consumer_types.go[247-255]

### Suggested fix
Implement one of:
- Add a guard in `ResetHealth()` (under `e.mu`) to no-op without logging when `ConnectionRefusals == 0 && Enabled == true`.
- Or, in `updateEpoch()`, only call `ResetHealth()` for endpoints that are currently disabled or have `ConnectionRefusals > 0`.
- Optionally downgrade the log to Debug, or log a single aggregated message per epoch with counts reset.
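
One way to combine the first and third options is sketched below, under the assumption that `ResetHealth` may be changed to report whether it altered anything. The boolean return value and the `resetAll` helper are hypothetical, not part of the existing code.

```go
package main

import (
	"log"
	"sync"
)

// Endpoint is a simplified stand-in for the router's endpoint type.
type Endpoint struct {
	mu                 sync.Mutex
	Enabled            bool
	ConnectionRefusals int
}

// ResetHealth is a no-op for an already-healthy endpoint and reports
// whether any state actually changed (hypothetical return value).
func (e *Endpoint) ResetHealth() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.Enabled && e.ConnectionRefusals == 0 {
		return false
	}
	e.Enabled = true
	e.ConnectionRefusals = 0
	return true
}

// resetAll (hypothetical) resets every endpoint and emits at most one
// aggregated log line per call instead of one line per endpoint.
func resetAll(endpoints []*Endpoint) int {
	changed := 0
	for _, e := range endpoints {
		if e.ResetHealth() {
			changed++
		}
	}
	if changed > 0 {
		log.Printf("epoch transition: reset health on %d of %d endpoints", changed, len(endpoints))
	}
	return changed
}

func main() {
	endpoints := []*Endpoint{
		{Enabled: true},
		{Enabled: false, ConnectionRefusals: 5},
	}
	resetAll(endpoints) // logs once: one endpoint changed
	resetAll(endpoints) // logs nothing: all endpoints already healthy
}
```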

Comment thread protocol/rpcsmartrouter/rpcsmartrouter.go
codecov Bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 43.75000% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| protocol/lavasession/consumer_types.go | 36.36% | 14 Missing ⚠️ |
| protocol/rpcsmartrouter/rpcsmartrouter_server.go | 0.00% | 4 Missing ⚠️ |

| Flag | Coverage Δ |
|---|---|
| consensus | 8.70% <ø> (ø) |
| protocol | 34.05% <43.75%> (+0.11%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| protocol/rpcsmartrouter/rpcsmartrouter.go | 5.03% <100.00%> (+2.04%) ⬆️ |
| protocol/rpcsmartrouter/rpcsmartrouter_server.go | 13.53% <0.00%> (ø) |
| protocol/lavasession/consumer_types.go | 74.10% <36.36%> (+2.03%) ⬆️ |

... and 1 file with indirect coverage changes

Comment thread protocol/rpcsmartrouter/rpcsmartrouter.go
github-actions Bot commented Mar 31, 2026

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit e7a79ee. ± Comparison against base commit f8670a7.

♻️ This comment has been updated with latest results.

@NadavLevi NadavLevi force-pushed the fix/backup-providers-health branch 2 times, most recently from df4c12c to b18dcd0 Compare March 31, 2026 13:05
@pull-request-size pull-request-size Bot added size/L and removed size/M labels Mar 31, 2026
Tomelia1999 previously approved these changes Apr 5, 2026
Comment thread protocol/rpcsmartrouter/rpcsmartrouter_test.go
Comment thread protocol/lavasession/consumer_types.go
Comment thread protocol/lavasession/consumer_types.go
@NadavLevi NadavLevi force-pushed the fix/backup-providers-health branch from 187538d to a739fff Compare April 6, 2026 13:53
Disabled endpoints (Enabled=false, ConnectionRefusals>=5) were reused
across epoch boundaries without being reset, causing a deadlock where
they could never receive the successful relay needed to trigger ResetHealth.
Reset all endpoint health state before handing endpoints to fresh sessions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NadavLevi NadavLevi force-pushed the fix/backup-providers-health branch from a739fff to e7a79ee Compare April 6, 2026 13:56
@Harraken Harraken merged commit 1559d6b into main Apr 6, 2026
30 checks passed
@Harraken Harraken deleted the fix/backup-providers-health branch April 6, 2026 14:31
NadavLevi added a commit that referenced this pull request Apr 12, 2026
PR #2256 calls endpoint.ResetHealth() for both primary and backup
providers on epoch transition, but only the in-memory struct was
cleared — the lava_rpc_endpoint_overall_health Prometheus gauge stayed
stuck at 0 (unhealthy) forever. The metric is only toggled on state
transitions from relay outcomes (MarkUnhealthy/ResetHealth in the
hot path), and backups rarely receive the successful relay needed to
push the gauge back to 1. Net effect: operators see backups at 0%
uptime on dashboards even though the router considers them healthy
again after the epoch reset, which can mask real regressions and
(worse) cause failover logic that reads the gauge to refuse routing
to a provider that is in fact available.

Emits SetEndpointOverallHealth(..., true) alongside the struct reset
for both the primary and backup provider loops in updateEpoch, mirrored
via rpsr.rpcServers[chainKey].smartRouterEndpointMetrics.

Also adds TestUpdateEpoch_ResetsHealthMetric which seeds the gauge to
0 (via SetEndpointOverallHealth) for a primary and a backup, runs
updateEpoch, then reads lava_rpc_endpoint_overall_health back from
prometheus.DefaultGatherer and asserts both have transitioned to 1.
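
The pairing of struct reset and metric emission can be sketched without the Prometheus client. Everything here is a hypothetical stand-in: `healthGauge` replaces the real `lava_rpc_endpoint_overall_health` gauge, and `set` is not the actual `SetEndpointOverallHealth` signature, which this message elides.

```go
package main

import "fmt"

// healthGauge is an in-memory stand-in for the overall-health gauge:
// 1 means healthy, 0 means unhealthy.
type healthGauge map[string]float64

// set is a hypothetical setter taking an explicit bool.
func (g healthGauge) set(provider string, healthy bool) {
	if healthy {
		g[provider] = 1
		return
	}
	g[provider] = 0
}

// endpoint is a simplified stand-in carrying the in-memory health state.
type endpoint struct {
	provider string
	enabled  bool
	refusals int
}

// resetOnEpoch clears the struct state and mirrors it to the gauge in the
// same loop, so dashboards cannot drift from what the router believes.
func resetOnEpoch(endpoints []*endpoint, g healthGauge) {
	for _, e := range endpoints {
		e.enabled = true
		e.refusals = 0
		g.set(e.provider, true)
	}
}

func main() {
	g := healthGauge{}
	primary := &endpoint{provider: "primary", refusals: 5}
	backup := &endpoint{provider: "backup", refusals: 5}
	g.set("primary", false)
	g.set("backup", false)

	resetOnEpoch([]*endpoint{primary, backup}, g)
	fmt.Println(g["primary"], g["backup"])
}
```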

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NadavLevi added a commit that referenced this pull request Apr 13, 2026
nimrod-teich pushed a commit that referenced this pull request Apr 14, 2026
* fix(lavasession): track and persist blocked backup provider state across epochs

Backup providers could never be blocked: blockProvider was a no-op for
them since they are not in validAddresses, and the backup selection path
never checked any blocked list. Additionally, the epoch-transition
re-blocking and health-probe logic only covered the main pairing,
leaving previously-blocked backup providers with a clean slate.

- Add blockedBackupProviders map to ConsumerSessionManager
- blockProvider now adds backup providers to blockedBackupProviders
  when they are not found in validAddresses
- getValidConsumerSessionsWithProviderFromBackupProviderList skips
  providers in blockedBackupProviders
- UpdateAllProviders merges blockedBackupProviders into
  previousEpochBlockedProviders and re-blocks them in the new epoch
- checkAndUnblockHealthyReBlockedProviders handles backup providers
  in both probe passes, including cross-role transitions (normal→backup)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(lavasession): handle backup providers in GenerateReconnectCallback

Ports the GenerateReconnectCallback change from #2265 on top of #2257:
when the periodic reconnect probe succeeds for a blocked backup provider,
remove it from blockedBackupProviders instead of calling
validateAndReturnBlockedProviderToValidAddressesList (a no-op for backups,
which are not in validAddresses). Without this, periodic reconnection
could never recover a blocked backup outside the epoch-transition path.

Also expands the test suite:
- TestCheckAndUnblock_BackupRoutedToComprehensiveProbe covers the
  !isBackup && !IsReported guard: confirms backups never take the
  immediate-unblock branch (which would not touch blockedBackupProviders
  and would silently stall the recovery).
- TestCheckAndUnblock_BackupUnblockedWhenHealthy covers the positive
  path — a blocked backup with a reachable endpoint is unblocked via
  comprehensive probe at epoch transition.
- TestGenerateReconnectCallback_BackupProviderUnblocked covers the
  new backup branch added by this commit.
- TestGenerateReconnectCallback_NonBackupUsesValidAddressesPath guards
  against regressing the existing non-backup reconnect flow.

Co-authored-by: avitenzer <avitenzer@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): reset endpoint health metric on epoch transition

PR #2256 calls endpoint.ResetHealth() for both primary and backup
providers on epoch transition, but only the in-memory struct was
cleared — the lava_rpc_endpoint_overall_health Prometheus gauge stayed
stuck at 0 (unhealthy) forever. The metric is only toggled on state
transitions from relay outcomes (MarkUnhealthy/ResetHealth in the
hot path), and backups rarely receive the successful relay needed to
push the gauge back to 1. Net effect: operators see backups at 0%
uptime on dashboards even though the router considers them healthy
again after the epoch reset, which can mask real regressions and
(worse) cause failover logic that reads the gauge to refuse routing
to a provider that is in fact available.

Emits SetEndpointOverallHealth(..., true) alongside the struct reset
for both the primary and backup provider loops in updateEpoch, mirrored
via rpsr.rpcServers[chainKey].smartRouterEndpointMetrics.

Also adds TestUpdateEpoch_ResetsHealthMetric which seeds the gauge to
0 (via SetEndpointOverallHealth) for a primary and a backup, runs
updateEpoch, then reads lava_rpc_endpoint_overall_health back from
prometheus.DefaultGatherer and asserts both have transitioned to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore(metrics): remove unused ResetEndpointMetrics

ResetEndpointMetrics had no callers and, despite its name, set
endpoint_overall_health to 0 instead of 1 — a footgun if ever wired up.
The two actual consumers (relay hot path and epoch transition) already
call SetEndpointOverallHealth with an explicit bool, so there's no need
for a helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: populate per-endpoint metrics for backup providers

Two independent gaps combined to leave backup providers showing N/A
latest-block on the dashboard:

1. GetAllDirectRPCEndpoints (consumer_session_manager.go) iterated only
   csm.pairing, so the smart router's initializeChainTrackers startup
   path skipped every backup. A ChainTracker for a dedicated-URL backup
   (e.g. base.lava.build with no primary sharing that URL) was only
   ever created lazily via the relay hot path — which backups rarely
   reach. Extended the function to iterate csm.backupProviders as well.

2. urlToProviderName was a map[string]string, so multiple providers
   configured on the same upstream URL collapsed to a single name (last
   writer wins). The ChainTracker's OnNewBlock callback keys emissions
   by URL and resolves to a provider name via this map — so every
   URL-keyed metric emission reached exactly one provider's label,
   leaving its peers stuck at zero on dashboards.

   Changed the map to map[string][]string with append+dedup in
   RegisterEndpoint, added resolveProviderNames returning all providers
   for a URL, and updated the two URL-keyed emitters
   (SetEndpointLatestBlock, RecordBlockFetch) to iterate and emit per
   provider. resolveProviderName is kept for callers that already pass
   a provider name (relay hot path) — it now returns the first name.
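
The append+dedup registration and the fan-out lookup can be sketched as follows. The `registry` type and the lowercase method names are hypothetical; only the `map[string][]string` shape, the dedup rule, and the fall-back-to-input behavior are taken from this message. Locking is omitted for brevity.

```go
package main

import "fmt"

// registry is a hypothetical holder for the URL-to-provider-names map.
type registry struct {
	urlToProviderNames map[string][]string
}

func newRegistry() *registry {
	return &registry{urlToProviderNames: make(map[string][]string)}
}

// register appends provider under url unless that pair is already present.
func (r *registry) register(url, provider string) {
	for _, existing := range r.urlToProviderNames[url] {
		if existing == provider {
			return
		}
	}
	r.urlToProviderNames[url] = append(r.urlToProviderNames[url], provider)
}

// resolveProviderNames returns every provider registered for a URL. Unknown
// keys (for example an input that is already a provider name) round-trip as
// a single-element slice.
func (r *registry) resolveProviderNames(key string) []string {
	if names := r.urlToProviderNames[key]; len(names) > 0 {
		return names
	}
	return []string{key}
}

func main() {
	r := newRegistry()
	r.register("https://rpc.example", "primary-a")
	r.register("https://rpc.example", "backup-b")
	r.register("https://rpc.example", "primary-a") // duplicate, ignored

	// A URL-keyed emitter would loop over every name returned here.
	for _, name := range r.resolveProviderNames("https://rpc.example") {
		fmt.Println("emit latest-block metric for", name)
	}
	fmt.Println(r.resolveProviderNames("primary-a"))
}
```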

Tests:
- TestGetAllDirectRPCEndpoints_IncludesBackupProviders — primary and
  backup both appear in the startup iteration.
- TestResolveProviderNames_FansOutAcrossSharedURL — both providers
  registered at the same URL resolve together.
- TestResolveProviderNames_DeduplicatesSameProvider — repeated
  (URL, providerName) registrations (typical for providers whose
  node-urls lists the same URL twice) don't duplicate entries.
- TestResolveProviderNames_FallsBackToInput — unknown inputs
  (provider names) round-trip as a single-element slice for backward
  compatibility with callers that pass names directly.
- TestSetEndpointLatestBlock_FansOutAcrossSharedURL — one URL-keyed
  call populates the gauge for every provider sharing the URL.
- TestRecordBlockFetch_FansOutAcrossSharedURL — same guarantee for
  the chain-tracker success/fail counters.

The test files in protocol/metrics that constructed SmartRouterMetricsManager
literals were updated to the new field name.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): implement CustomMessage on EndpointChainFetcher for Solana

SVMChainTracker (protocol/chaintracker/svm_chain_tracker.go:55) fetches
latest block via CustomMessage with a getLatestBlockhash JSON-RPC body
— it needs the slot, block hash, and block height returned together,
which has no equivalent in the generic FetchLatestBlockNum path.

EndpointChainFetcher.CustomMessage was a stub returning
"CustomMessage not supported for EndpointChainFetcher". On every
Solana-family chain this caused the per-endpoint ChainTracker to fail
its first fetch — no OnNewBlock callbacks ever fired, so the metric
fan-out added earlier in this branch had nothing to fan out and every
per-endpoint metric for backups remained at N/A.

Delegate to the existing sendRawRequest helper: for POST (the SVM
case) the `data` argument is the JSON-RPC body and sendRawRequest
posts it through the direct RPC connection; for GET the `data`
argument is already used as the URL suffix per existing REST
convention, so the same delegation preserves GET semantics.

Tests:
- TestEndpointChainFetcher_CustomMessage_POSTDelegatesToConnection
  exercises the SVMChainTracker code path — asserts the upstream
  response body flows back through CustomMessage unchanged.
- TestEndpointChainFetcher_CustomMessage_PropagatesUnhealthyConnection
  asserts the connection-health check still fails closed rather than
  silently returning empty data.

Also fixes a latent compile-break in the package-shared
mockDirectRPCConnection: its GetNodeUrl signature was returning
interface{} while the lavasession.DirectRPCConnection interface
requires *common.NodeUrl. The existing tracker tests never passed the
mock through the interface, so it compiled on its own — but any new
test using it as a DirectRPCConnection (including this one) hit the
mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): force blocksToSave=1 for Solana per-endpoint trackers

With CustomMessage now working for EndpointChainFetcher, the SVM
ChainTracker path made it past its initial getLatestBlockhash call on
Solana — only to die seconds later with "ChainTracker stopped with
error: slot not found in cache".

Root cause is a pre-existing limitation in SVMChainTracker:
fetchLatestBlockNumInner populates the blockNum→slot cache for the
latest block only (svm_chain_tracker.go:70), but the generic
ChainTracker init loop (chain_tracker.go readHashes) iterates
latestBlock down to latestBlock - blocksToSave + 1 calling
FetchBlockHashByNum for each. On SVM every lookup after the first
hits the cache-miss path and returns "slot not found in cache",
killing the tracker permanently (StartAndServe returns the error
and the goroutine exits).

For per-endpoint tracking we don't need history — each tracker watches
a single URL, so there's no cross-endpoint fork detection to do; the
manager only uses GetLatestBlockNum via ValidateEndpointSync and the
OnNewBlock callback for per-provider metric emission. Forcing
blocksToSave=1 on Solana-family chains (SOLANA, SOLANAT, KOII, KOIIT)
sidesteps the SVMChainTracker limitation entirely without losing any
capability the manager actually uses. EVM chains keep their
caller-requested value.
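
The override itself is small. A sketch, assuming chain IDs are plain strings: `effectiveBlocksToSave` and `svmFamily` are hypothetical names, while the four chain IDs are the ones listed above.

```go
package main

import "fmt"

// svmFamily lists the Solana-family chain IDs named in this commit.
var svmFamily = map[string]bool{
	"SOLANA":  true,
	"SOLANAT": true,
	"KOII":    true,
	"KOIIT":   true,
}

// effectiveBlocksToSave (hypothetical) forces a history depth of 1 for SVM
// chains and leaves every other chain's requested value untouched.
func effectiveBlocksToSave(chainID string, requested uint64) uint64 {
	if svmFamily[chainID] {
		return 1
	}
	return requested
}

func main() {
	for _, id := range []string{"SOLANA", "KOIIT", "ETH1"} {
		fmt.Println(id, effectiveBlocksToSave(id, 10))
	}
}
```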

Test: TestEndpointChainTrackerManager_ForcesBlocksToSave1ForSolana
asserts the override fires for every SVM family id and leaves
non-SVM chains unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(lavasession): meaningful IsHealthy for HTTPDirectRPCConnection

HTTPDirectRPCConnection.IsHealthy was hard-coded to return true, with
a comment deferring health tracking to "the endpoint/QoS layer". But
the comprehensive probe path used by checkAndUnblockHealthyReBlockedProviders
(probeProvider → probeDirectRPCEndpoints for static providers) only
consults conn.IsHealthy() — so for static HTTP/HTTPS backups the probe
was a guaranteed pass regardless of reachability, and a blocked backup
with an unreachable upstream would be silently unblocked at every
epoch transition.

Two complementary changes close this:

1. HTTPDirectRPCConnection grows an atomic healthy flag (initial true),
   flipped to false by transport-level failures in SendRequest and
   DoHTTPRequest (dial / TLS / body-read errors), flipped back to true
   on any successful exchange — including HTTP 4xx/5xx, which are
   *application* errors, not transport failures. Flipping on 4xx/5xx
   would make rate-limited endpoints flap unhealthy and starve the
   dashboard on routine server errors; the test
   TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy guards that
   carve-out explicitly.

2. probeDirectRPCEndpoints now also gates on endpoint.Enabled. This is
   belt-and-suspenders with IsHealthy(): the endpoint struct accumulates
   relay-path failures via MarkUnhealthy → ConnectionRefusals ≥
   MaxConsecutiveConnectionAttempts → Enabled=false. Reading that state
   here means a backup that the hot path has already declared dead won't
   be optimistically revived by a probe that happens to catch a lucky
   moment in IsHealthy's lifecycle.
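
The transport-versus-application distinction in item 1 can be sketched with a stubbed `http.RoundTripper`, so no network is needed. The `httpConn` type, `stubTransport`, and `errDial` are hypothetical; the real `HTTPDirectRPCConnection` has a different shape.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"sync/atomic"
)

// errDial simulates a transport-level failure such as a refused connection.
var errDial = errors.New("dial tcp: connection refused")

// stubTransport fails with err when set, else answers with the given status.
type stubTransport struct {
	status int
	err    error
}

func (s stubTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	if s.err != nil {
		return nil, s.err
	}
	return &http.Response{
		StatusCode: s.status,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader("{}")),
		Request:    r,
	}, nil
}

// httpConn is a simplified connection with an atomic healthy flag.
type httpConn struct {
	client  *http.Client
	url     string
	healthy atomic.Bool
}

func newHTTPConn(url string, rt http.RoundTripper) *httpConn {
	c := &httpConn{client: &http.Client{Transport: rt}, url: url}
	c.healthy.Store(true) // optimistic initialization
	return c
}

func (c *httpConn) IsHealthy() bool { return c.healthy.Load() }

// get flips the flag to false only on transport or body-read errors. Any
// completed exchange, including 4xx/5xx, marks the connection healthy.
func (c *httpConn) get() (int, error) {
	resp, err := c.client.Get(c.url)
	if err != nil {
		c.healthy.Store(false)
		return 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		c.healthy.Store(false)
		return 0, err
	}
	c.healthy.Store(true)
	return resp.StatusCode, nil
}

func main() {
	c := newHTTPConn("http://backup.invalid", stubTransport{status: http.StatusTooManyRequests})
	status, _ := c.get()
	fmt.Println("rate limited:", status, "healthy:", c.IsHealthy())

	c.client.Transport = stubTransport{err: errDial}
	_, err := c.get()
	fmt.Println("dial failed:", err != nil, "healthy:", c.IsHealthy())
}
```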

Known gap (flagged in the IsHealthy comment): a never-hit backup with
an unreachable upstream still reports healthy on the very first epoch
because nothing has exercised either signal yet. Closing that cold-start
case requires an active probe request (e.g. chain-specific no-op via the
ChainParser), which is a meaningfully bigger change and belongs in a
separate follow-up.

Tests:
- TestHTTPDirectRPCConnection_IsHealthy_StartsTrue — documents the
  optimistic-initialization behavior.
- TestHTTPDirectRPCConnection_IsHealthy_FlipsOnDialFailure — the core
  behavior: a transport error must flip IsHealthy to false.
- TestHTTPDirectRPCConnection_IsHealthy_Stays4xxHealthy — guards against
  overreach on application-level HTTP errors.
- TestHTTPDirectRPCConnection_IsHealthy_RecoversAfterFailure — after
  upstream recovery, a successful exchange must restore IsHealthy=true.
- TestProbeDirectRPCEndpoints_RespectsDisabledEndpoint — the
  endpoint.Enabled gate blocks a disabled endpoint from passing the
  probe even when IsHealthy happens to say true.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(rpcsmartrouter): cap /debug/time-warp body at 1 KiB

The JSON decoder was reading r.Body unbounded, letting a caller stream an
arbitrarily large payload into an endpoint whose only legitimate content
is {"offset_seconds": N}. Wrap the body in http.MaxBytesReader(w, r.Body,
1024) before decoding.
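
A sketch of the capped decode using the standard library's `http.MaxBytesReader`. The handler name, the response text, and the `post` and `oversizedBody` helpers are hypothetical; only the 1024-byte cap and the `offset_seconds` field come from this message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// timeWarpRequest models the only legitimate payload.
type timeWarpRequest struct {
	OffsetSeconds int64 `json:"offset_seconds"`
}

// timeWarpHandler (hypothetical) caps the body at 1 KiB before decoding.
func timeWarpHandler(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, 1024)
	var req timeWarpRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		// Reached for malformed JSON and for bodies past the cap.
		http.Error(w, "invalid or oversized body", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "offset=%d", req.OffsetSeconds)
}

// post sends body to the handler in-process and returns the status code.
func post(body string) int {
	req := httptest.NewRequest(http.MethodPost, "/debug/time-warp", strings.NewReader(body))
	rec := httptest.NewRecorder()
	timeWarpHandler(rec, req)
	return rec.Code
}

// oversizedBody builds a syntactically valid JSON object larger than 1 KiB.
func oversizedBody() string {
	return `{"offset_seconds": 60, "pad": "` + strings.Repeat("x", 4096) + `"}`
}

func main() {
	fmt.Println("small body:", post(`{"offset_seconds": 60}`))
	fmt.Println("large body:", post(oversizedBody()))
}
```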

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(errors): honor Retryable flag on node-error retry decision

Node errors classified to a LavaError with Retryable=false (e.g.
CHAIN_EXECUTION_REVERTED, CHAIN_OUT_OF_GAS, CHAIN_DOUBLE_SPEND) were
retried because the state machine only short-circuited on
IsUnsupportedMethod / IsUserError subcategories and ignored the
registry's Retryable flag. Populate RelayResult.IsNonRetryable from the
classifier on both consumer and smart-router paths and make
HasNonRetryableUserFacingErrors rest solely on that flag.
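
The flag-driven decision can be sketched as below. The types are simplified stand-ins, `classify` and `hasNonRetryable` are hypothetical helpers, and `NODE_TIMEOUT` is an invented retryable entry added only for contrast. Treating unknown codes as retryable is also an assumption of this sketch.

```go
package main

import "fmt"

// LavaError is a simplified stand-in for a registry entry.
type LavaError struct {
	Name      string
	Retryable bool
}

// registry holds the three non-retryable codes named in this commit plus one
// hypothetical retryable code.
var registry = map[string]LavaError{
	"CHAIN_EXECUTION_REVERTED": {Name: "CHAIN_EXECUTION_REVERTED", Retryable: false},
	"CHAIN_OUT_OF_GAS":         {Name: "CHAIN_OUT_OF_GAS", Retryable: false},
	"CHAIN_DOUBLE_SPEND":       {Name: "CHAIN_DOUBLE_SPEND", Retryable: false},
	"NODE_TIMEOUT":             {Name: "NODE_TIMEOUT", Retryable: true}, // hypothetical
}

// RelayResult carries the classifier's verdict to the retry state machine.
type RelayResult struct {
	IsNonRetryable bool
}

// classify populates IsNonRetryable from the registry's Retryable flag.
// Unknown codes are treated as retryable here (assumption).
func classify(code string) RelayResult {
	entry, known := registry[code]
	return RelayResult{IsNonRetryable: known && !entry.Retryable}
}

// hasNonRetryable rests solely on the flag, with no subcategory checks.
func hasNonRetryable(results []RelayResult) bool {
	for _, r := range results {
		if r.IsNonRetryable {
			return true
		}
	}
	return false
}

func main() {
	reverted := []RelayResult{classify("CHAIN_EXECUTION_REVERTED")}
	timedOut := []RelayResult{classify("NODE_TIMEOUT")}
	fmt.Println("retry reverted call:", !hasNonRetryable(reverted))
	fmt.Println("retry timed-out call:", !hasNonRetryable(timedOut))
}
```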

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: avitenzer <avitenzer@users.noreply.github.com>