feat(smart-router): implement direct RPC mode for RPCSmartRouter#2231
Review Summary by Qodo
feat(smart-router): implement direct RPC mode for RPCSmartRouter with chain tracking and subscription management
Walkthroughs
Description
• Implement direct RPC mode for RPCSmartRouter, enabling standalone routing to RPC nodes without blockchain state tracking for lower-latency relay operations
• Add direct RPC connection handling with support for HTTP, WebSocket, gRPC, and REST protocols across EVM and Tendermint chains
• Implement session management using composition pattern with DirectRPCConnection interfaces for endpoint tracking and block synchronization
• Add WebSocket subscription management with multi-client deduplication, unique router IDs, and upstream connection pooling with backoff
• Implement gRPC streaming subscription management with dynamic message handling, connection pooling, and reflection-based proxy support
• Add per-endpoint ChainTracker manager for continuous block height polling across multiple direct RPC endpoints
• Implement consistency validation with configurable block lag thresholds for pre-request endpoint health checking
• Add cache integration with read/write support and stateful relay bypass
• Implement batch request handling using original JSON bytes instead of chainMessage serialization
• Add IP forwarding from client requests to upstream nodes with error mapping from node errors to protocol-compatible responses
• Refactor configuration keys from static-providers/backup-providers to direct-rpc/backup-direct-rpc with backward compatibility
• Add static provider validation before chain tracker setup with multi-URL group validation
• Fix REST URL path joining to append absolute paths instead of replacing base path (preserves gateway prefixes)
• Fix PANIC when nodeError leads to availabilityDegrader
• Reduce debug logging noise from chain tracker polls by filtering internal block polling operations
• Add comprehensive unit tests for gRPC subscription manager, WebSocket config, error mapper, batch requests, and session management
• Add REST and direct RPC integration tests with mock RPC and REST servers for local development
• Add smart router initialization scripts for Ethereum, Lava, gRPC, and Tendermint RPC endpoints
Diagram
flowchart LR
Client["Client Request"]
Router["RPCSmartRouter<br/>Direct RPC Mode"]
ChainTracker["EndpointChainTrackerManager<br/>Block Height Polling"]
Consistency["Consistency Validation<br/>Pre-Request Check"]
WSMgr["DirectWSSubscriptionManager<br/>Multi-Client Dedup"]
GRPCMgr["DirectGRPCSubscriptionManager<br/>Stream Pooling"]
Endpoints["Direct RPC Endpoints<br/>HTTP/WS/gRPC/REST"]
Cache["Cache Layer<br/>Read/Write"]
Client --> Router
Router --> ChainTracker
Router --> Consistency
Router --> WSMgr
Router --> GRPCMgr
Router --> Cache
Consistency --> Endpoints
WSMgr --> Endpoints
GRPCMgr --> Endpoints
Cache --> Endpoints
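One fix listed in the description is that REST paths are now appended to the configured base path instead of replacing it. A minimal sketch of that idea follows, assuming a hypothetical helper `joinRESTPath` and example URLs; it is not the PR's implementation. The contrast is with reference resolution in the style of RFC 3986 (for example `url.ResolveReference`), where an absolute path such as `/cosmos/...` replaces the base path and so drops any gateway prefix.

```go
package main

import (
	"net/url"
	"strings"
)

// joinRESTPath appends reqPath to the path of base, keeping any prefix that
// base already carries (for example a gateway route such as /lava/rest).
// Hypothetical helper for illustration only.
func joinRESTPath(base, reqPath string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	// Separate an optional query string so it is not escaped into the path.
	p, q, _ := strings.Cut(reqPath, "?")
	// Normalise the slashes at the seam so exactly one separates the parts.
	u.Path = strings.TrimRight(u.Path, "/") + "/" + strings.TrimLeft(p, "/")
	if q != "" {
		if u.RawQuery != "" {
			u.RawQuery += "&" + q
		} else {
			u.RawQuery = q
		}
	}
	return u.String(), nil
}
```

With a base of `https://gw.example.com/lava/rest` and a request path of `/cosmos/base/tendermint/v1beta1/blocks/latest`, the prefix `/lava/rest` is preserved in the result.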
File Changes
1. protocol/rpcsmartrouter/rpcsmartrouter_server.go
Code Review by Qodo
1. ETH_RPC_URL_2 hardcoded API key
	return nil, MapDirectRPCError(err, d.directConnection.GetProtocol())
}

statusCode := response.StatusCode
responseData := response.Body

// Handle HTTP error status codes
2. err.Error() logged unredacted 📘 Rule violation ⛨ Security
The direct RPC relay logs the raw error string, which may include full upstream URLs and embedded API keys/tokens. This can leak secrets into logs and violates secure logging requirements.
Agent Prompt
## Issue description
Raw `err.Error()` is logged and may contain full upstream URLs and embedded secrets.
## Issue Context
Even debug logs must not contain secrets; direct RPC endpoints commonly embed tokens in URL paths or query params.
## Fix Focus Areas
- protocol/rpcsmartrouter/direct_rpc_relay.go[292-306]
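One way to address this is to scrub URLs from the error text before it reaches the logger. The sketch below is an illustration under stated assumptions: `RedactString`, `RedactError`, and the URL pattern are hypothetical names that do not come from the PR, and the regular expression is deliberately simple. It keeps the scheme and host for debugging and drops userinfo, path, query, and fragment, which is where tokens are usually embedded.

```go
package main

import (
	"net/url"
	"regexp"
)

// urlPattern is a deliberately simple matcher for URLs embedded in free text.
var urlPattern = regexp.MustCompile(`(?i)\b(?:https?|wss?|grpc)://[^\s"']+`)

// redactURL keeps only scheme and host, dropping userinfo, path, query
// and fragment, where API keys are commonly embedded.
func redactURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" {
		return "[redacted-url]"
	}
	return u.Scheme + "://" + u.Host + "/[redacted]"
}

// RedactString replaces every URL found in s with its redacted form.
func RedactString(s string) string {
	return urlPattern.ReplaceAllStringFunc(s, redactURL)
}

// RedactError returns a log-safe rendering of err ("" for a nil error).
func RedactError(err error) string {
	if err == nil {
		return ""
	}
	return RedactString(err.Error())
}
```

A call site would then log `RedactError(err)` instead of `err.Error()`. Bare `host:port` pairs such as those in `dial tcp` messages are not touched by this sketch and would need separate handling if they are considered sensitive.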
if isConnectionRefused(err) {
	return fmt.Errorf("RPC endpoint unavailable (connection refused): %w", err)
}

if isTimeout(err) {
	return fmt.Errorf("RPC request timeout: %w", err)
}

// Protocol-specific error handling
switch protocol {
case lavasession.DirectRPCProtocolHTTP, lavasession.DirectRPCProtocolHTTPS:
	return mapHTTPError(err)
case lavasession.DirectRPCProtocolGRPC:
3. MapDirectRPCError exposes internal errors 📘 Rule violation ⛨ Security
The direct-RPC error mapping wraps and returns the underlying error (%w), which can propagate internal network/endpoint details to clients. User-facing errors should be generic, with detailed causes kept only in internal logs.
Agent Prompt
## Issue description
`MapDirectRPCError` returns errors that include underlying internal error details.
## Issue Context
User-facing errors should not expose internal details (endpoint addresses, low-level network errors). Details should be logged internally.
## Fix Focus Areas
- protocol/rpcsmartrouter/error_mapper.go[19-45]
- protocol/rpcsmartrouter/direct_rpc_relay.go[299-305]
// MarkUnhealthy increments connection refusals and disables endpoint if threshold exceeded
func (e *Endpoint) MarkUnhealthy() {
	e.ConnectionRefusals++
	if e.ConnectionRefusals >= MaxConsecutiveConnectionAttempts {
		e.Enabled = false
		utils.LavaFormatWarning("disabled unhealthy endpoint", nil,
			utils.LogAttr("endpoint", e.NetworkAddress),
			utils.LogAttr("refusals", e.ConnectionRefusals),
			utils.LogAttr("is_direct_rpc", e.IsDirectRPC()),
		)
	}
}

// ResetHealth resets connection refusals and re-enables endpoint
func (e *Endpoint) ResetHealth() {
	e.ConnectionRefusals = 0
	e.Enabled = true
	utils.LavaFormatInfo("re-enabled healthy endpoint",
		utils.LogAttr("endpoint", e.NetworkAddress),
		utils.LogAttr("is_direct_rpc", e.IsDirectRPC()),
	)
}
5. Endpoint health data races 🐞 Bug ⛯ Reliability
Endpoint health tracking mutates ConnectionRefusals/Enabled and writes LastBlockUpdate without synchronization, despite documenting that Endpoint.mu protects these fields. These are called/updated from concurrent request goroutines and tracker callbacks, risking undefined behavior and incorrect routing decisions under load.
Agent Prompt
### Issue description
`Endpoint.MarkUnhealthy` / `ResetHealth` and block tracking (`LastBlockUpdate`) perform unsynchronized reads/writes to shared fields that are used concurrently by smart-router request handlers and tracker callbacks, creating Go data races.
### Issue Context
- `Endpoint.mu` is documented as protecting `ConnectionRefusals` and `Enabled`, but these methods mutate without locking.
- `LastBlockUpdate` is a `time.Time` written from multiple goroutines without any lock/atomic.
### Fix Focus Areas
- protocol/lavasession/consumer_types.go[188-255]
- protocol/rpcsmartrouter/rpcsmartrouter_server.go[1766-1801]
- protocol/rpcsmartrouter/endpoint_chain_tracker_manager.go[160-172]
### Suggested changes
- Wrap `MarkUnhealthy` and `ResetHealth` bodies with `e.mu.Lock()/Unlock()` (or switch `ConnectionRefusals` to `atomic.Uint64` and `Enabled` to `atomic.Bool`).
- For `LastBlockUpdate`, either:
  - Guard all reads/writes with `e.mu`, or
  - Replace with `atomic.Int64` holding `time.Now().UnixNano()`.
- Update call sites that read `ConnectionRefusals`/`Enabled`/`LastBlockUpdate` to use the same locking/atomic approach (e.g., `if targetEndpoint != nil && targetEndpoint.ConnectionRefusals > 0` should not read without synchronization if you keep mutex-based protection).
b88ca1d to 1fa4b02
422807d to adb116a
adb116a to 9b54ecb
f01f20f to ddf9b85
7647fc1 to e973ab7
Review Summary by Qodo
Implement direct RPC mode for RPCSmartRouter with endpoint health checking and session management
Walkthroughs
Description
• Implement direct RPC mode for RPCSmartRouter enabling standalone operation without blockchain state tracking, supporting HTTP, WebSocket, gRPC, and REST protocols across EVM and Tendermint chains
• Add per-endpoint ChainTracker management with continuous block height polling and consistency validation via EndpointChainTrackerManager for pre-request health checking
• Implement DirectWSSubscriptionManager for WebSocket subscription management with multi-client deduplication, unique router IDs per client, upstream connection pooling, and backoff logic
• Implement DirectGRPCSubscriptionManager for gRPC streaming with dynamic message handling, connection pooling, and reflection-based proxy support
• Refactor configuration keys from static-providers/backup-providers to direct-rpc/backup-direct-rpc with backward compatibility and comprehensive static provider validation
• Add sendRelayToDirectEndpoints() and relayInnerDirect() methods for parallel direct RPC relay handling with HTTP status code classification and per-endpoint metrics
• Integrate cache write support for successful direct RPC responses with proper finalization and status code validation
• Implement IP forwarding from client requests to upstream nodes via gRPC metadata and REST headers
• Improve REST message error handling with explicit 5xx and 429 status code classification and error message extraction
• Add new ConsistencyPreValidationError (699) error code for endpoints failing pre-request consistency validation
• Refactor ConsumerWebSocketManager to use generic WSSubscriptionManager interface supporting both provider-relay and direct RPC modes
• Add comprehensive unit tests for WebSocket subscription manager, gRPC subscription manager, error mapper, batch requests, and session management
• Add integration tests for REST and direct RPC endpoints with mock RPC and REST servers for local development
• Remove subscription-related test code and provider-relay specific logic from RPCSmartRouterServer
Diagram
flowchart LR
Client["Client Request"]
Router["RPCSmartRouter"]
Validation["Consistency Validation<br/>filterEndpointsByConsistency"]
ChainTracker["EndpointChainTrackerManager<br/>Per-endpoint Block Polling"]
DirectRelay["Direct RPC Relay<br/>relayInnerDirect"]
SubMgr["Subscription Managers<br/>WS/gRPC"]
Cache["Cache Integration<br/>Read/Write"]
Response["Response to Client"]
Client --> Router
Router --> Validation
Validation --> ChainTracker
ChainTracker --> DirectRelay
DirectRelay --> SubMgr
DirectRelay --> Cache
SubMgr --> Response
Cache --> Response
File Changes
1. protocol/rpcsmartrouter/rpcsmartrouter_server.go
Code Review by Qodo
1. TestDefaultWebsocketConfig naming
// Proper error classification (don't treat all 4xx as node errors)
var isNodeError bool
switch {
case response.StatusCode >= 500:
	isNodeError = true // Server error
case response.StatusCode == 429:
	isNodeError = false // Rate limit (not node issue)
case response.StatusCode >= 400:
	isNodeError = false // Client error
default:
	isNodeError = false // Success
}

// Let the chain message parse domain-specific REST errors (e.g. Cosmos tx errors on HTTP 200).
// NOTE: This should NOT be treated as "node error" by default; it is typically a request/application error.
hasError, errorMessage := chainMessage.CheckResponseError(response.Body, response.StatusCode)
if hasError && errorMessage != "" {
	utils.LavaFormatDebug("REST response contains error",
		utils.LogAttr("endpoint", d.endpointName),
		utils.LogAttr("error", errorMessage),
	)
}

// Convert response headers to metadata
responseMetadata := convertHTTPHeadersToMetadata(response.Headers)

// Build result (include body even for 4xx/5xx!)
providerAddress := d.endpointName
if providerAddress == "" {
	providerAddress = sanitizeEndpointURL(d.directConnection.GetURL())
}

result := &common.RelayResult{
	Reply: &pairingtypes.RelayReply{
		Data:     response.Body,    // Include body even for errors!
		Metadata: responseMetadata, // Include headers
	},
	Finalized:  true,
	StatusCode: response.StatusCode,
	ProviderInfo: common.ProviderInfo{
		ProviderAddress: providerAddress,
	},
	IsNodeError: isNodeError, // Correct transport-level classification
}
4. REST node-error mismatch 🐞 Bug ✓ Correctness
DirectRPCRelaySender.sendRESTRelay sets RelayResult.IsNodeError from HTTP status only, but the system’s authoritative node-error classification is ProtocolMessage.CheckResponseError; this creates split-brain behavior where relaycore treats responses (e.g., HTTP 429 / Cosmos tx errors on HTTP 200) as node errors while the returned RelayResult may still claim IsNodeError=false. As a result, RPCSmartRouterServer may omit the node-error header and apply success-path behaviors while relaycore is tracking node errors / retry logic.
Agent Prompt
### Issue description
`DirectRPCRelaySender.sendRESTRelay()` computes `isNodeError` purely from HTTP status codes (including treating HTTP 429 as non-node-error), but relaycore classifies node errors using `ProtocolMessage.CheckResponseError`. For REST, `RestMessage.CheckResponseError()` treats 429 (and Cosmos tx_response.code != 0 on HTTP 200) as node errors, so the smart router can end up with relaycore tracking a node error while `RelayResult.IsNodeError` remains false.
This causes inconsistent behavior: node-error headers and any IsNodeError-based logic in `RPCSmartRouterServer` diverge from relaycore’s node-error handling/retry decisions.
### Issue Context
- REST direct relay currently logs `hasError` but does not use it to set `IsNodeError`.
- relaycore’s ResultsManager treats `hasError==true` as a node error.
### Fix Focus Areas
- Align REST direct-relay `IsNodeError` with `chainMessage.CheckResponseError` (or change one side so both agree).
- Ensure smart-router headers/metrics reflect relaycore’s node-error decision.
- Update/adjust REST integration tests if their 429 expectation changes.
- file/path references:
- protocol/rpcsmartrouter/direct_rpc_relay.go[476-519]
- protocol/chainlib/chainproxy/rpcInterfaceMessages/restMessage.go[52-71]
- protocol/relaycore/results_manager.go[109-145]
- protocol/rpcsmartrouter/rpcsmartrouter_server.go[1870-1875]
- protocol/rpcsmartrouter/rpcsmartrouter_server.go[2184-2191]
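One way to remove the split is to derive `IsNodeError` from a single helper that consults the same check relaycore uses. The sketch below assumes a hypothetical `responseErrorChecker` interface that mirrors the `CheckResponseError(body, statusCode)` call visible in the diff, plus a `checkerFunc` adapter for tests; neither name comes from the PR. Treating an unflagged 5xx as a node error is a further assumption of this sketch, not a statement about relaycore.

```go
package main

// responseErrorChecker is a hypothetical stand-in for the chain message's
// CheckResponseError method as it is called in sendRESTRelay.
type responseErrorChecker interface {
	CheckResponseError(body []byte, statusCode int) (hasError bool, errorMessage string)
}

// checkerFunc adapts a plain function to responseErrorChecker (useful in tests).
type checkerFunc func(body []byte, statusCode int) (bool, string)

func (f checkerFunc) CheckResponseError(body []byte, statusCode int) (bool, string) {
	return f(body, statusCode)
}

// classifyNodeError makes the checker the source of truth, so the relay
// result and the results manager cannot disagree about a response.
// A 5xx status that the checker did not flag is still reported as a node
// error (assumption of this sketch).
func classifyNodeError(checker responseErrorChecker, body []byte, statusCode int) (isNodeError bool, message string) {
	if hasError, msg := checker.CheckResponseError(body, statusCode); hasError {
		return true, msg
	}
	return statusCode >= 500, ""
}
```

With this shape, a response the checker flags (such as the HTTP 429 case described above) yields `IsNodeError=true` in the relay result as well, so the node-error header and retry accounting stay consistent. The alternative named in the focus areas, changing the checker side instead, would need the same single-helper discipline.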
e973ab7 to 569b6fc
Introduces direct RPC mode for RPCSmartRouter, bypassing the Lava
provider relay path and routing requests directly to configured RPC
endpoints. Key changes:
- Direct RPC mode: RPCSmartRouter can now connect and relay directly
to RPC endpoints (HTTP/WS/gRPC) with provider selection via the
existing optimizer (QoS, latency, sync, stake weights)
- Backup provider selection: endpoints are probed and verified at
setup; QoS-based backup selection for failover
- Heavy request handling: skip latestBlock unmarshalling and response
unmarshalling above 1MB to avoid memory pressure
- Endpoint health tracking: mark endpoints unhealthy on 5xx/connection
errors; emit health state-transition metrics only on actual changes
- Metrics overhaul (SmartRouterMetricsManager):
  - Per-endpoint and router-scoped Prometheus metrics
  - Remove endpoint URLs from metric labels to avoid cardinality explosion
  - Node error recovery metrics
  - Router end-to-end latency now reflects true client-visible latency
    (measured from SendParsedRelay entry to result return, capturing
    provider selection overhead) rather than network-hop only
  - Per-endpoint latency retains network-only measurement
- Block tracking: set endpoint latest block from relay response in
addition to ChainTracker
- Init script: remove legacy provider configs, adjust for direct RPC mode
- Spec: add health verification to AVAXP spec
- Relay core: use max(blockLagForQosSync*2, blockDistanceToFinalization)
for EndpointLagThreshold
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
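The `EndpointLagThreshold` rule from the commit message can be written out as a small helper. This is a sketch under assumptions: the parameter names follow the commit message, but the `int64` types and the `withinLag` check (how the threshold might be compared against an endpoint's block height) are hypothetical and not taken from the PR.

```go
package main

// endpointLagThreshold returns max(blockLagForQosSync*2, blockDistanceToFinalization),
// as described in the commit message. Types are assumed.
func endpointLagThreshold(blockLagForQosSync, blockDistanceToFinalization int64) int64 {
	doubled := blockLagForQosSync * 2
	if doubled > blockDistanceToFinalization {
		return doubled
	}
	return blockDistanceToFinalization
}

// withinLag is a hypothetical consumer of the threshold: an endpoint passes
// when it trails the latest known block by at most threshold blocks.
func withinLag(endpointBlock, latestBlock, threshold int64) bool {
	return latestBlock-endpointBlock <= threshold
}
```

For example, with `blockLagForQosSync = 5` and `blockDistanceToFinalization = 7` the threshold is 10, so an endpoint at block 990 passes against a latest block of 1000 while one at block 989 does not.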
569b6fc to 85b48d8
Add a standalone smart router that connects directly to RPC nodes without requiring blockchain state tracking, enabling lower-latency relay routing with built-in provider health checking and session management.
Core features:
Operational improvements:
Testing and tooling:
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
! in the type prefix if API or client breaking change
main branch
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...