fix(smart-router): do not panic when 1 static provider fails, trigger a background gorou…#2262
Conversation
Review Summary by QodoGraceful handling of failed static providers with background retry
WalkthroughsDescription• Prevent panic when static providers fail verification at startup • Exclude failed providers from session registration, allowing partial startup • Launch background retry goroutine to re-validate failed providers every 3 minutes • Add thread-safe copy-on-write merging when recovering providers during retries • Implement graceful degradation: endpoint serves with healthy providers while retrying failed ones Diagramflowchart LR
A["Provider Validation<br/>Phase 1"] --> B{"All Failed?"}
B -->|Yes| C["Return Error<br/>Block Startup"]
B -->|No| D["Filter Failed<br/>Providers"]
D --> E["Register Healthy<br/>Providers"]
E --> F["Launch Retry<br/>Goroutine"]
F --> G["Background Loop<br/>Every 3min"]
G --> H{"Provider<br/>Recovered?"}
H -->|Yes| I["Merge into Active<br/>Sessions"]
H -->|No| J["Keep Retrying"]
I --> K["Update Session<br/>Manager"]
File Changes1. protocol/rpcsmartrouter/rpcsmartrouter.go
|
Code Review by Qodo
|
| // Collect ALL WebSocket-capable endpoints from static providers for direct subscriptions | ||
| // WebSocket URLs are identified by ws:// or wss:// prefix | ||
| var wsEndpoints []*common.NodeUrl | ||
| for _, provider := range relevantStaticProviderList { | ||
| for i := range provider.NodeUrls { | ||
| url := strings.ToLower(provider.NodeUrls[i].Url) | ||
| if strings.HasPrefix(url, "ws://") || strings.HasPrefix(url, "wss://") { | ||
| wsEndpoints = append(wsEndpoints, &provider.NodeUrls[i]) | ||
| utils.LavaFormatInfo("Found WebSocket endpoint for direct subscriptions", | ||
| utils.LogAttr("url", provider.NodeUrls[i].Url), | ||
| utils.LogAttr("provider", provider.Name), | ||
| utils.LogAttr("chainID", provider.ChainID), | ||
| ) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Create DirectWSSubscriptionManager if WebSocket endpoints are available | ||
| // Otherwise fall back to provider-based subscription manager | ||
| if len(wsEndpoints) > 0 { | ||
| directWSManager := NewDirectWSSubscriptionManager( | ||
| smartRouterMetricsManager, | ||
| spectypes.APIInterfaceJsonRPC, // WebSocket subscriptions use JSON-RPC | ||
| rpcEndpoint.ChainID, | ||
| rpcEndpoint.ApiInterface, | ||
| wsEndpoints, | ||
| optimizer, // Pass optimizer for endpoint selection | ||
| nil, // Use default WebSocket config (configurable via CLI flags later) | ||
| ) | ||
| // Start background cleanup goroutine | ||
| directWSManager.Start(ctx) | ||
| wsSubscriptionManager = directWSManager | ||
| utils.LavaFormatInfo("Using DirectWSSubscriptionManager for direct WebSocket subscriptions", | ||
| utils.LogAttr("chainID", rpcEndpoint.ChainID), | ||
| utils.LogAttr("apiInterface", rpcEndpoint.ApiInterface), | ||
| utils.LogAttr("wsEndpointCount", len(wsEndpoints)), | ||
| utils.LogAttr("optimizerEnabled", optimizer != nil), | ||
| ) | ||
| } else { | ||
| // No WebSocket endpoints configured - use NoOp manager that returns clear errors | ||
| // Smart router does NOT fall back to provider-based subscriptions (per implementation plan) | ||
| // Provider-based subscriptions are only for rpcconsumer, not rpcsmartrouter | ||
| wsSubscriptionManager = NewNoOpWSSubscriptionManager(rpcEndpoint.ChainID, rpcEndpoint.ApiInterface) | ||
| utils.LavaFormatInfo("No WebSocket endpoints configured for direct subscriptions", | ||
| utils.LogAttr("chainID", rpcEndpoint.ChainID), | ||
| utils.LogAttr("apiInterface", rpcEndpoint.ApiInterface), | ||
| utils.LogAttr("hint", "Add ws:// or wss:// URLs to static-providers-list to enable subscriptions"), | ||
| ) | ||
| } | ||
|
|
||
| // Create gRPC streaming subscription manager for gRPC server-streaming methods | ||
| // This supports Cosmos Event Streaming, Solana Geyser, and other gRPC streaming protocols | ||
| var grpcEndpoints []*common.NodeUrl | ||
| if rpcEndpoint.ApiInterface == spectypes.APIInterfaceGrpc { | ||
| // Collect gRPC endpoints from static providers | ||
| for _, provider := range relevantStaticProviderList { | ||
| if provider.ApiInterface == spectypes.APIInterfaceGrpc { | ||
| for i := range provider.NodeUrls { | ||
| grpcEndpoints = append(grpcEndpoints, &provider.NodeUrls[i]) | ||
| utils.LavaFormatInfo("Found gRPC endpoint for streaming subscriptions", | ||
| utils.LogAttr("url", provider.NodeUrls[i].Url), | ||
| utils.LogAttr("provider", provider.Name), | ||
| utils.LogAttr("chainID", provider.ChainID), | ||
| ) | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
2. Subscriptions use failed providers 🐞 Bug ≡ Correctness
CreateSmartRouterEndpoint filters static providers into healthyStaticProviders for session registration, but WebSocket and gRPC subscription endpoints are still collected from relevantStaticProviderList (unfiltered). This allows providers that failed validation (and were excluded from routing) to still be used for subscriptions, causing avoidable subscription failures against dead nodes.
Agent Prompt
### Issue description
WS/gRPC subscription endpoint collection uses the unfiltered static provider list, so endpoints that failed validation can still be used for subscriptions.
### Issue Context
Providers are filtered into `healthyStaticProviders` for session registration, but subscription endpoint collection still iterates `relevantStaticProviderList`.
### Fix Focus Areas
- protocol/rpcsmartrouter/rpcsmartrouter.go[821-849]
- protocol/rpcsmartrouter/rpcsmartrouter.go[921-988]
### Suggested fix
- Change the WS endpoint collection loop to iterate `healthyStaticProviders`.
- Change the gRPC endpoint collection loop to iterate `healthyStaticProviders` (and keep the `ApiInterface == grpc` check).
- (Optional) If you want subscriptions to automatically include recovered providers, consider rebuilding/updating the subscription managers when `retryFailedStaticProviders` re-registers recovered providers, or keep subscription endpoints tied strictly to currently-registered providers.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
known but won't get handled on this PR, dedicated ticket will be opened
There was a problem hiding this comment.
There was a problem hiding this comment.
Commit 15d987c addressed this comment by collecting WebSocket and gRPC subscription endpoints from healthyStaticProviders instead of the unfiltered list, ensuring failed providers are excluded from subscriptions.
There was a problem hiding this comment.
▎ Issue: The condition len(failedStaticNames) >= validatedCount is semantically ambiguous based on whether validatedCount tracks all attempted providers or only successful ones. Based on the companion log healthy = validatedCount -
▎ len(failedStaticNames), the intent appears to be "total attempted", but the variable name says "validated" (successful). This can cause the endpoint to be killed when there are still healthy providers.
▎
▎ Suggestion:
▎ // Rename the loop counter and fix the check
▎ totalAttemptedCount := 0 // was: validatedCount
▎ // (increment on each iteration, not just on success)
▎
▎ healthyCount := totalAttemptedCount - len(failedStaticNames)
▎ if healthyCount == 0 {
▎ err := utils.LavaFormatError("all static providers failed — cannot serve endpoint", nil, ...)
▎ errCh <- err
▎ return err
▎ }
There was a problem hiding this comment.
logic is good, no changes needed.
renamed the param validatedCount -> totalAttemptedCount
There was a problem hiding this comment.
Commit 15d987c addressed this comment by renaming validatedCount to totalAttemptedCount, incrementing it for every provider attempt, and deriving healthyCount for failure detection so the endpoint only errors when all attempted providers fail while still logging healthy counts appropriately.
There was a problem hiding this comment.
WS endpoint collection loop (~line 960 in new file)
▎ Issue: relevantStaticProviderList is used to collect WebSocket and gRPC endpoints, but this is the unfiltered list including failed providers. This is a regression from the old panic-before-reaching-this-line behavior.
▎
▎ Suggestion: Change both loops to use healthyStaticProviders. Even if MAG-1569 tracks the full fix, using the unfiltered list here is strictly worse than before:
▎ for _, provider := range healthyStaticProviders { // not relevantStaticProviderList
There was a problem hiding this comment.
Commit 15d987c addressed this comment by switching the WebSocket and gRPC endpoint collection loops to iterate over healthyStaticProviders, ensuring only validated providers contribute endpoints instead of the unfiltered list.
There was a problem hiding this comment.
▎ Issue: A hung provider will block the retry loop indefinitely for all other failed providers sharing the same goroutine.
▎
▎ Suggestion:
▎ attemptCtx, attemptCancel := context.WithTimeout(ctx, 30*time.Second)
▎ defer attemptCancel()
There was a problem hiding this comment.
Commit 15d987c addressed this comment by wrapping each verification and retry attempt in a 30-second context.WithTimeout so that a hung provider can no longer block the shared goroutine loop, ensuring stalled connections time out instead of hanging indefinitely.
There was a problem hiding this comment.
Issue: UpdateAllProviders is called outside rpsr.mu from both updateEpoch and retryFailedStaticProviders. There is no ordering guarantee between concurrent calls, so retry can call UpdateAllProviders(oldEpoch, mergedSessions, …) after
updateEpoch has called UpdateAllProviders(newEpoch, freshSessions, …), leaving the session manager pointed at sessions whose PairingEpoch is one behind the current epoch.
Suggestion: Either (a) hold rpsr.mu across the UpdateAllProviders call, or (b) read currentEpoch := rpsr.epochTimer.GetCurrentEpoch() immediately before calling UpdateAllProviders (still under the lock) and skip the call if the epoch
has already advanced past it.
There was a problem hiding this comment.
Skipping the fix though: csm.UpdateAllProviders already rejects stale epochs (epoch <
previousEpoch), and the merged sessions are written to rpsr.providerSessions before the unlock, so the next updateEpoch
tick propagates the recoveries. Net effect is one logged error, no correctness loss.
…tine that re verifies the fails providers every 3 minutes
…to prevent stalls
15d987c to
3ebd3d7
Compare
Codecov Report❌ Patch coverage is
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1 file with indirect coverage changes 🚀 New features to boost your workflow:
|
…tine that re verifies the fails providers every 3 minutes
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!in the type prefix if API or client breaking changemainbranchReviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...