refactor(state machine & retry): State machine and retry mechanism refactor#2278
Conversation
Replace duplicated ConsumerRelayStateMachine and SmartRouterRelayStateMachine with a single UnifiedRelayStateMachine in relaycore, configured via StateMachineConfig. Behavioral differences (circuit breaker, timeout priority, unsupported method check) are controlled by config flags instead of separate implementations. Package-level files become thin wrappers with type aliases and config constructors.
Ensures timeout log messages include consecutiveBatchErrors for observability parity between Consumer and SmartRouter timeout paths.
Phase 2 of the state machine refactor: extract all retry decision logic into a pure Go policy engine (relaypolicy package). - Create relaypolicy package with Policy.Decide() for post-relay decisions and Policy.OnSendRelayResult() for pre-relay (batch send) decisions - Add ClassifyError() and DecideEligibility() functions for worker-side error classification and provider eligibility - Add GetResultsSummary() to RelayProcessor as pure data aggregation - Simplify HasRequiredNodeResults() to remove shouldRetryRelay() call; it now returns false unconditionally when no successes, letting the state machine's policy.Decide() handle retry logic - Modify UnifiedRelayStateMachine to accept RelayPolicyInf and use it for all retry decisions instead of inline retryCondition/shouldRetry - Add policy unit tests covering mode checks, error tolerance, epoch mismatch, hash errors, circuit breaker, and batch send retry
- Fix SmartRouter regression: send Done immediately when policy.Decide() returns Stop in gotResults path (matching old shouldRetryRelay=false behavior). Add IsTickerHedge flag so ticker hedges skip error tolerance checks (matching old retryCondition behavior). - Wire archive mutation: stateTransition now reads DecisionOutput.Mutation and applies it via applyMutation(), falling back to UpgradeToArchiveIfNeeded only when no mutation is present. - Wire ClassifyNodeError into SmartRouter worker (rpcsmartrouter_server.go), matching the consumer worker wiring. Both now use relaypolicy.ClassifyNodeError. - Document eligibility behavior equivalence between relaypolicy.DecideEligibility and used_providers.shouldRetryWithThisError (circular import prevents direct call). - Remove dead config flags: EnableUnsupportedMethodCheck from StateMachineConfig, EnableUnsupportedCheck from PolicyConfig. - Add focused tests for IsTickerHedge and archive mutation in policy_test.go. - Update design doc: fix stale ClassifyError/EnableUnsupportedCheck references, mark old architecture section as historical, update implementation status table.
…assification - Remove ClassifyProtocolError() — unused, protocol error classification already happens in GetResultsSummary() - Remove IsRetryable and IsSolanaNonRetryable from ErrorClassification — dead fields, only IsUnsupportedMethod is consumed - Revert SmartRouter worker classification addition to preserve main behavior (SmartRouter never sets IsUnsupportedMethod)
Replace separate HasUnsupportedMethod/HasUserError checks in policy.Decide() with a single HasNonRetryableNodeError flag, matching main's HasNonRetryableUserFacingErrors() behavior. Move eligibility decision logic to common package to break circular import between relaypolicy and lavasession. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tream removal Align with main's removal of SubCategoryUserError (21fbf75). User-input errors now charge normal CU since responses are not cached. Only IsUnsupportedMethod retains the zero-CU carve-out. IsNonRetryable remains the umbrella retry gate covering all non-retryable error categories. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ches - Delete shouldRetryRelay, HasNonRetryableUserFacingErrors, HasUnsupportedMethodErrors, isBatchRequest from relay_processor.go (all superseded by GetResultsSummary + policy.Decide) - Extract addArchiveExtension, removeArchiveExtension, cacheBlockHashes helpers from UpgradeToArchiveIfNeeded; applyMutation now calls helpers directly instead of passing fake batch numbers - Migrate tests to exercise GetResultsSummary instead of deleted methods - Add archive mutation early-bail tests (upgraded + 2 errors) - Fix doc: DecideEligibility signature, shouldRetryWithThisError removal, remove all "not yet wired" references, update E2E scenario traces
stateTransition() falls back to legacy UpgradeToArchiveIfNeeded() when policy.Decide() returns a zero-value mutation (epoch-mismatch retries, early stops). Both paths produce the same results today but must stay in sync. Added explanatory note in the architecture doc and a clarifying comment at the fallback site.
… state machine Stop path regression: when policy.Decide() returned Stop on the gotResults path, the unified machine returned immediately, losing in-flight relays that might succeed slightly later. Restored the main branch behavior: validateReturnCondition(nil) + continue listening via readResultsFromProcessor(). Observability regression: ticker-driven hedges no longer incremented analytics.HedgeCount. Restored the increment so hedge incident metrics remain accurate.
- Restore StatusCode extraction from provider trailer in consumer (dropped during rebase — provider sends HTTP status via gRPC trailer metadata, consumer needs it for REST node error detection) - Fix provider test mode to set StatusCode trailer on gRPC response (test mode bypasses chain proxy where this is normally set) - Fix TestModeChainRouter.ExtensionsSupported to return true for empty extensions (was always returning false, blocking all test mode relays) - Update three-provider test script with proper proposal sequencing and test response config for retry policy testing
- TestModeChainRouter.ExtensionsSupported now returns true for all extensions (not just empty), enabling archive migration testing - Update archive test script with proper proposal sequencing - Add abci_query test response with execution reverted error for non-retryable (non-unsupported) error classification testing
- Add protocol/common/eligibility.go (was untracked, causing CI build failure: common.DecideEligibility undefined) - Replace TestShouldRetryWithThisError with TestDecideEligibility (old test referenced deleted shouldRetryWithThisError method)
Add upper/lower bounds check before converting int to uint32 in grpcCodeMatcher.Matches to prevent silent truncation on negative or oversized values. Fixes CodeQL high severity finding.
There was a problem hiding this comment.
Issue: Dead code. DecideEligibility, EligibilityAction, EligibilityResult, MarkUnwanted, AllowRetry re-exports have zero callers — used_providers.go calls common.DecideEligibility directly.
Suggestion: Delete this file in this PR. Landing unused abstraction creates a discoverability trap for the next reader.
There was a problem hiding this comment.
Commit 87ea930 addressed this comment by wiring UsedProviders.RemoveUsed through the new common.DecideEligibility helper, so the previously unused eligibility definitions now have an actual caller and are not dead code anymore.
There was a problem hiding this comment.
I've changed the code, so policy.DecideEligibility will be called
There was a problem hiding this comment.
Issue: The gate mutation.ArchiveAction != ArchiveNoChange silently drops a CacheHashes: true signal when no archive change is needed.
Suggestion:
if mutation != nil && (mutation.ArchiveAction != ArchiveNoChange || mutation.CacheHashes) {
upgradedProtocolMessage = sm.applyMutation(protocolMessage, archiveStatus, *mutation)
} else {
upgradedProtocolMessage = UpgradeToArchiveIfNeeded(...)
}
Or, document in MutationOutput that CacheHashes is only valid when paired with an archive action.
There was a problem hiding this comment.
Commit 87ea930 addressed this comment by updating UnifiedRelayStateMachine.stateTransition to invoke applyMutation whenever CacheHashes is true (even if ArchiveAction stays ArchiveNoChange) and by keeping cacheBlockHashes inside applyMutation. This ensures CacheHashes signals are honored instead of silently falling back to UpgradeToArchiveIfNeeded.
There was a problem hiding this comment.
Issue: Type assertion to *RelayProcessor falls back to empty ResultsSummary{}, which the policy treats as "no errors → retry." A test double or future implementation will silently break the policy.
Suggestion: Promote GetResultsSummary() to ResultsCheckerInf:
type ResultsCheckerInf interface {
WaitForResults(ctx context.Context) error
HasRequiredNodeResults(tries int) (bool, int)
GetResultsSummary() ResultsSummary
GetCrossValidationParams() *common.CrossValidationParams
}
There was a problem hiding this comment.
Commit faa4470 addressed this comment by adding GetResultsSummary to ResultsCheckerInf and letting the unified relay state machine call that interface method directly instead of asserting to *RelayProcessor, which removes the fallback that treated missing summaries as empty results.
There was a problem hiding this comment.
Add direct UnifiedRelayStateMachine tests, especially for EnableTimeoutPriority branches and the circuit-breaker / batch-error counter interaction.
There was a problem hiding this comment.
Commit faa4470 addressed this comment by adding TestConsumerStateMachineBatchErrorCounterResetsOnSuccess in consumer_relay_state_machine_test.go to exercise OnSendRelayResult and assert that the consecutive batch-error counter resets on success, ensuring the circuit-breaker interaction is covered by the new test.
The gate only checked ArchiveAction != ArchiveNoChange, which would silently drop a CacheHashes-only mutation. Include CacheHashes in the condition so applyMutation is called whenever the policy returns any non-empty mutation.
avitenzer
left a comment
There was a problem hiding this comment.
Issue 1 — Behavior regression: Archive extension is stripped where legacy preserved it (severity: normal)
File: protocol/relaypolicy/policy.go:85-90 (with downstream effect via protocol/relaycore/unified_relay_state_machine.go:135 → applyMutation).
The divergence. On the isUpgraded && NodeErrors >= 2 early-bail branch:
- Legacy UpgradeToArchiveIfNeeded() (relay_state.go:258-261) caches block hashes and returns the protocol message unchanged — Archive extension stays attached, archiveStatus.isArchive stays true.
- New decideMutation() returns {ArchiveAction: RemoveArchive, CacheHashes: true}. applyMutation() then strips the Archive extension via removeArchiveExtension() and stores archiveStatus.isArchive = false.
Why the safeguard misses it. The fallback gate at unified_relay_state_machine.go:135 is mutation.ArchiveAction != ArchiveNoChange. RemoveArchive is not NoChange, so the fallback never engages — the new path takes over and diverges. The inline comment
("Both paths must stay in sync until the fallback is eliminated") and the design-doc invariant at 14_STATE_MACHINE_ARCHITECTURE_AND_REFACTOR.md:822 are both violated.
Reproduction (Stateless, MaxRetries=10, RelayRetryLimit=2):
- Attempt 0 → node error → policy adds archive → isArchive=true, isUpgraded=true.
- Attempt 1 (with archive) → second node error → NodeErrors=2. Legacy: caches hashes, archive stays attached. New: caches hashes and removes archive extension, isArchive=false.
- Attempt 2 retry: legacy still sends archive on the wire; new code sends a non-archive request and reports GetIsArchive()==false to downstream consumers.
Impact. On-wire request shape changes; GetIsArchive() reads diverge from legacy for any caller that consults it (eligibility, consistency, logging). The PR's own policy_test.go:248-280 tests (upgraded with 2+ errors removes archive and caches hashes)
codify the new behavior — but the assertions contradict what the legacy code did, which is exactly the regression.
Suggested fix. Either:
- Return {ArchiveAction: ArchiveNoChange, CacheHashes: true} on this branch so the fallback at unified_relay_state_machine.go:135 routes through UpgradeToArchiveIfNeeded() (clean parity), and loosen the gate to mutation.ArchiveAction !=
ArchiveNoChange || mutation.CacheHashes so the cache signal isn't dropped (this is also the gate fix from the first review pass — both fixes are needed together). - Or decouple CacheHashes from ArchiveAction in applyMutation and gate ArchiveRemove strictly on attempt == 2, then update policy_test.go:248-280 to assert legacy-parity behavior.
Issue 2 — Observability regression: Circuit-breaker trip warning is silently dropped (severity: nit)
File: protocol/relaycore/unified_relay_state_machine.go:307-315 (root cause in protocol/relaypolicy/policy.go:117-121).
What's lost. Pre-refactor SmartRouter emitted a distinctive warning when consecutivePairingErrors >= 2:
Circuit breaker triggered: All providers exhausted, stopping retries
consecutivePairingErrors=2 batchNumber=N timesSaved="~8 seconds of futile retries avoided"
Post-refactor, OnSendRelayResult() returns SendStop silently; the only log on the SendStop branch is Failed Sending First Message, gated on BatchNumber()==0 && GetConsecutiveBatchErrors()==SendRelayAttempts+1.
Why the guard never fires for CB trips. The CB threshold (default 2) is reached before consecutiveBatchErrors reaches SendRelayAttempts+1 (default 4). With SendRelayAttempts=3, CircuitBreakerThreshold=2, two consecutive PairingListEmptyErrors yield
consecutiveBatchErrors=2 != 4 → guard false → no log. Additionally, CB can trip on BatchNumber > 0, which fails the other half of the guard.
Impact. Functional behavior is preserved — validateReturnCondition still fires, the error still propagates. But operators previously distinguished "provider pool exhausted" from "ordinary batch-retry exhaustion" by grepping for the timesSaved string.
That signal is gone.
Suggested fix. Cleanest option: extend SendResult in relaycore/interfaces.go with a SendStopCircuitBreaker variant; have policy.go:117-121 return it when the CB threshold is hit; add a branch in unified_relay_state_machine.go:310 that emits the
historical warning with consecutivePairingErrors, batchNumber, and timesSaved attributes. This preserves the "policy is pure decision logic" invariant the refactor otherwise upholds. (Logging directly inside policy.go works but couples the policy to
logging, which the refactor avoids.)
…ce, add tests - Delete relaypolicy/eligibility.go (zero callers — used_providers.go calls common.DecideEligibility directly) - Promote GetResultsSummary() to ResultsCheckerInf interface, removing the type assertion fallback that silently returned empty summaries for non-RelayProcessor implementations - Add SendNodeErrorWithRetryable to relaycoretest for cross-package use - Add TestConsumerStateMachineNonRetryableStopsImmediately: verifies policy stops on first attempt for non-retryable node errors - Add TestConsumerStateMachineBatchErrorCounterResetsOnSuccess: verifies OnSendRelayResult counter reset logic
Add EligibilityFunc callback to UsedProviders, breaking the circular dependency chain lavasession → relaypolicy → relaycore → lavasession. - UsedProviders.eligibilityFunc defaults to common.DecideEligibility - Consumer and SmartRouter inject relaypolicy.DecideEligibility via SetEligibilityFunc after creating UsedProviders - Restore relaypolicy/eligibility.go as the canonical entry point - RemoveUsed now calls the injected function instead of common directly
When SendStop is triggered by provider pool exhaustion (circuit breaker), log the distinctive warning that operators use to distinguish systemic pairing issues from transient batch errors. Matches main's behavior.
issue 1- No change needed — the branch behavior (remove archive when it fails) is more correct than main's |
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
!in the type prefix if API or client breaking changemainbranchReviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...