Skip to content

refactor(state machine & retry): State machine and retry mechanism refactor#2278

Merged
nimrod-teich merged 19 commits into
mainfrom
state_machine_changes
Apr 19, 2026
Merged

refactor(state machine & retry): State machine and retry mechanism refactor#2278
nimrod-teich merged 19 commits into
mainfrom
state_machine_changes

Conversation

@AnnaR-prog

Copy link
Copy Markdown
Contributor

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

AnnaR-prog and others added 12 commits April 18, 2026 20:16
Replace duplicated ConsumerRelayStateMachine and SmartRouterRelayStateMachine
with a single UnifiedRelayStateMachine in relaycore, configured via
StateMachineConfig. Behavioral differences (circuit breaker, timeout priority,
unsupported method check) are controlled by config flags instead of separate
implementations. Package-level files become thin wrappers with type aliases
and config constructors.
Ensures timeout log messages include consecutiveBatchErrors for
observability parity between Consumer and SmartRouter timeout paths.
Phase 2 of the state machine refactor: extract all retry decision logic
into a pure Go policy engine (relaypolicy package).

- Create relaypolicy package with Policy.Decide() for post-relay decisions
  and Policy.OnSendRelayResult() for pre-relay (batch send) decisions
- Add ClassifyError() and DecideEligibility() functions for worker-side
  error classification and provider eligibility
- Add GetResultsSummary() to RelayProcessor as pure data aggregation
- Simplify HasRequiredNodeResults() to remove shouldRetryRelay() call;
  it now returns false unconditionally when no successes, letting the
  state machine's policy.Decide() handle retry logic
- Modify UnifiedRelayStateMachine to accept RelayPolicyInf and use it
  for all retry decisions instead of inline retryCondition/shouldRetry
- Add policy unit tests covering mode checks, error tolerance, epoch
  mismatch, hash errors, circuit breaker, and batch send retry
- Fix SmartRouter regression: send Done immediately when policy.Decide()
  returns Stop in gotResults path (matching old shouldRetryRelay=false
  behavior). Add IsTickerHedge flag so ticker hedges skip error tolerance
  checks (matching old retryCondition behavior).
- Wire archive mutation: stateTransition now reads DecisionOutput.Mutation
  and applies it via applyMutation(), falling back to UpgradeToArchiveIfNeeded
  only when no mutation is present.
- Wire ClassifyNodeError into SmartRouter worker (rpcsmartrouter_server.go),
  matching the consumer worker wiring. Both now use relaypolicy.ClassifyNodeError.
- Document eligibility behavior equivalence between relaypolicy.DecideEligibility
  and used_providers.shouldRetryWithThisError (circular import prevents direct call).
- Remove dead config flags: EnableUnsupportedMethodCheck from StateMachineConfig,
  EnableUnsupportedCheck from PolicyConfig.
- Add focused tests for IsTickerHedge and archive mutation in policy_test.go.
- Update design doc: fix stale ClassifyError/EnableUnsupportedCheck references,
  mark old architecture section as historical, update implementation status table.
…assification

- Remove ClassifyProtocolError() — unused, protocol error classification
  already happens in GetResultsSummary()
- Remove IsRetryable and IsSolanaNonRetryable from ErrorClassification —
  dead fields, only IsUnsupportedMethod is consumed
- Revert SmartRouter worker classification addition to preserve main
  behavior (SmartRouter never sets IsUnsupportedMethod)
Replace separate HasUnsupportedMethod/HasUserError checks in policy.Decide()
with a single HasNonRetryableNodeError flag, matching main's
HasNonRetryableUserFacingErrors() behavior. Move eligibility decision logic
to common package to break circular import between relaypolicy and lavasession.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tream removal

Align with main's removal of SubCategoryUserError (21fbf75). User-input
errors now charge normal CU since responses are not cached. Only
IsUnsupportedMethod retains the zero-CU carve-out. IsNonRetryable remains
the umbrella retry gate covering all non-retryable error categories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ches

- Delete shouldRetryRelay, HasNonRetryableUserFacingErrors,
  HasUnsupportedMethodErrors, isBatchRequest from relay_processor.go
  (all superseded by GetResultsSummary + policy.Decide)
- Extract addArchiveExtension, removeArchiveExtension, cacheBlockHashes
  helpers from UpgradeToArchiveIfNeeded; applyMutation now calls helpers
  directly instead of passing fake batch numbers
- Migrate tests to exercise GetResultsSummary instead of deleted methods
- Add archive mutation early-bail tests (upgraded + 2 errors)
- Fix doc: DecideEligibility signature, shouldRetryWithThisError removal,
  remove all "not yet wired" references, update E2E scenario traces
stateTransition() falls back to legacy UpgradeToArchiveIfNeeded()
when policy.Decide() returns a zero-value mutation (epoch-mismatch
retries, early stops). Both paths produce the same results today
but must stay in sync. Added explanatory note in the architecture
doc and a clarifying comment at the fallback site.
… state machine

Stop path regression: when policy.Decide() returned Stop on the
gotResults path, the unified machine returned immediately, losing
in-flight relays that might succeed slightly later. Restored the
main branch behavior: validateReturnCondition(nil) + continue
listening via readResultsFromProcessor().

Observability regression: ticker-driven hedges no longer incremented
analytics.HedgeCount. Restored the increment so hedge incident
metrics remain accurate.
- Restore StatusCode extraction from provider trailer in consumer
  (dropped during rebase — provider sends HTTP status via gRPC trailer
  metadata, consumer needs it for REST node error detection)
- Fix provider test mode to set StatusCode trailer on gRPC response
  (test mode bypasses chain proxy where this is normally set)
- Fix TestModeChainRouter.ExtensionsSupported to return true for empty
  extensions (was always returning false, blocking all test mode relays)
- Update three-provider test script with proper proposal sequencing
  and test response config for retry policy testing
- TestModeChainRouter.ExtensionsSupported now returns true for all
  extensions (not just empty), enabling archive migration testing
- Update archive test script with proper proposal sequencing
- Add abci_query test response with execution reverted error for
  non-retryable (non-unsupported) error classification testing
@AnnaR-prog AnnaR-prog changed the title State machine and retry mechanism refactor: refactor(state machine & retry): State machine and retry mechanism refactor Apr 18, 2026
@codecov

codecov Bot commented Apr 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 29.78304% with 356 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/relaycore/unified_relay_state_machine.go 0.00% 242 Missing ⚠️
protocol/relaycore/relay_state.go 0.00% 55 Missing ⚠️
protocol/relaypolicy/classify.go 0.00% 13 Missing ⚠️
protocol/rpcconsumer/rpcconsumer_server.go 7.14% 13 Missing ⚠️
protocol/relaycore/relay_processor.go 75.00% 6 Missing and 2 partials ⚠️
protocol/common/eligibility.go 0.00% 6 Missing ⚠️
protocol/lavasession/used_providers.go 50.00% 6 Missing ⚠️
protocol/relaypolicy/policy.go 94.64% 3 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter_server.go 0.00% 3 Missing ⚠️
protocol/common/error_registry.go 0.00% 1 Missing and 1 partial ⚠️
... and 3 more
Flag Coverage Δ
consensus 8.96% <0.00%> (-0.01%) ⬇️
protocol 34.83% <29.78%> (-0.49%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/relaycore/interfaces.go 0.00% <ø> (ø)
protocol/relaycoretest/testing_helpers.go 99.08% <100.00%> (+0.16%) ⬆️
...otocol/rpcconsumer/consumer_relay_state_machine.go 100.00% <100.00%> (+23.41%) ⬆️
.../rpcsmartrouter/smartrouter_relay_state_machine.go 100.00% <100.00%> (+26.06%) ⬆️
protocol/chainlib/chainlib.go 38.59% <0.00%> (ø)
protocol/common/error_registry.go 79.48% <0.00%> (-1.04%) ⬇️
protocol/relaypolicy/eligibility.go 0.00% <0.00%> (ø)
protocol/rpcprovider/rpcprovider_server.go 9.61% <0.00%> (-0.03%) ⬇️
protocol/relaypolicy/policy.go 94.64% <94.64%> (ø)
protocol/rpcsmartrouter/rpcsmartrouter_server.go 13.33% <0.00%> (-0.02%) ⬇️
... and 7 more

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

github-actions Bot commented Apr 18, 2026

Copy link
Copy Markdown

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 104e382. ± Comparison against base commit 5921aa2.

♻️ This comment has been updated with latest results.

- Add protocol/common/eligibility.go (was untracked, causing CI build
  failure: common.DecideEligibility undefined)
- Replace TestShouldRetryWithThisError with TestDecideEligibility
  (old test referenced deleted shouldRetryWithThisError method)
Add upper/lower bounds check before converting int to uint32 in
grpcCodeMatcher.Matches to prevent silent truncation on negative
or oversized values. Fixes CodeQL high severity finding.
@AnnaR-prog AnnaR-prog requested a review from avitenzer April 18, 2026 18:01

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Dead code. DecideEligibility, EligibilityAction, EligibilityResult, MarkUnwanted, AllowRetry re-exports have zero callers — used_providers.go calls common.DecideEligibility directly.
Suggestion: Delete this file in this PR. Landing unused abstraction creates a discoverability trap for the next reader.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 87ea930 addressed this comment by wiring UsedProviders.RemoveUsed through the new common.DecideEligibility helper, so the previously unused eligibility definitions now have an actual caller and are not dead code anymore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the code, so policy.DecideEligibility will be called

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The gate mutation.ArchiveAction != ArchiveNoChange silently drops a CacheHashes: true signal when no archive change is needed.
Suggestion:
if mutation != nil && (mutation.ArchiveAction != ArchiveNoChange || mutation.CacheHashes) {
upgradedProtocolMessage = sm.applyMutation(protocolMessage, archiveStatus, *mutation)
} else {
upgradedProtocolMessage = UpgradeToArchiveIfNeeded(...)
}
Or, document in MutationOutput that CacheHashes is only valid when paired with an archive action.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit 87ea930 addressed this comment by updating UnifiedRelayStateMachine.stateTransition to invoke applyMutation whenever CacheHashes is true (even if ArchiveAction stays ArchiveNoChange) and by keeping cacheBlockHashes inside applyMutation. This ensures CacheHashes signals are honored instead of silently falling back to UpgradeToArchiveIfNeeded.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Type assertion to *RelayProcessor falls back to empty ResultsSummary{}, which the policy treats as "no errors → retry." A test double or future implementation will silently break the policy.
Suggestion: Promote GetResultsSummary() to ResultsCheckerInf:
type ResultsCheckerInf interface {
WaitForResults(ctx context.Context) error
HasRequiredNodeResults(tries int) (bool, int)
GetResultsSummary() ResultsSummary
GetCrossValidationParams() *common.CrossValidationParams
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit faa4470 addressed this comment by adding GetResultsSummary to ResultsCheckerInf and letting the unified relay state machine call that interface method directly instead of asserting to *RelayProcessor, which removes the fallback that treated missing summaries as empty results.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add direct UnifiedRelayStateMachine tests, especially for EnableTimeoutPriority branches and the circuit-breaker / batch-error counter interaction.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit faa4470 addressed this comment by adding TestConsumerStateMachineBatchErrorCounterResetsOnSuccess in consumer_relay_state_machine_test.go to exercise OnSendRelayResult and assert that the consecutive batch-error counter resets on success, ensuring the circuit-breaker interaction is covered by the new test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

The gate only checked ArchiveAction != ArchiveNoChange, which would
silently drop a CacheHashes-only mutation. Include CacheHashes in the
condition so applyMutation is called whenever the policy returns any
non-empty mutation.

@avitenzer avitenzer left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue 1 — Behavior regression: Archive extension is stripped where legacy preserved it (severity: normal)

File: protocol/relaypolicy/policy.go:85-90 (with downstream effect via protocol/relaycore/unified_relay_state_machine.go:135 → applyMutation).

The divergence. On the isUpgraded && NodeErrors >= 2 early-bail branch:

  • Legacy UpgradeToArchiveIfNeeded() (relay_state.go:258-261) caches block hashes and returns the protocol message unchanged — Archive extension stays attached, archiveStatus.isArchive stays true.
  • New decideMutation() returns {ArchiveAction: RemoveArchive, CacheHashes: true}. applyMutation() then strips the Archive extension via removeArchiveExtension() and stores archiveStatus.isArchive = false.

Why the safeguard misses it. The fallback gate at unified_relay_state_machine.go:135 is mutation.ArchiveAction != ArchiveNoChange. RemoveArchive is not NoChange, so the fallback never engages — the new path takes over and diverges. The inline comment
("Both paths must stay in sync until the fallback is eliminated") and the design-doc invariant at 14_STATE_MACHINE_ARCHITECTURE_AND_REFACTOR.md:822 are both violated.

Reproduction (Stateless, MaxRetries=10, RelayRetryLimit=2):

  1. Attempt 0 → node error → policy adds archive → isArchive=true, isUpgraded=true.
  2. Attempt 1 (with archive) → second node error → NodeErrors=2. Legacy: caches hashes, archive stays attached. New: caches hashes and removes archive extension, isArchive=false.
  3. Attempt 2 retry: legacy still sends archive on the wire; new code sends a non-archive request and reports GetIsArchive()==false to downstream consumers.

Impact. On-wire request shape changes; GetIsArchive() reads diverge from legacy for any caller that consults it (eligibility, consistency, logging). The PR's own policy_test.go:248-280 tests (upgraded with 2+ errors removes archive and caches hashes)
codify the new behavior — but the assertions contradict what the legacy code did, which is exactly the regression.

Suggested fix. Either:

  • Return {ArchiveAction: ArchiveNoChange, CacheHashes: true} on this branch so the fallback at unified_relay_state_machine.go:135 routes through UpgradeToArchiveIfNeeded() (clean parity), and loosen the gate to mutation.ArchiveAction !=
    ArchiveNoChange || mutation.CacheHashes so the cache signal isn't dropped (this is also the gate fix from the first review pass — both fixes are needed together).
  • Or decouple CacheHashes from ArchiveAction in applyMutation and gate ArchiveRemove strictly on attempt == 2, then update policy_test.go:248-280 to assert legacy-parity behavior.

Issue 2 — Observability regression: Circuit-breaker trip warning is silently dropped (severity: nit)

File: protocol/relaycore/unified_relay_state_machine.go:307-315 (root cause in protocol/relaypolicy/policy.go:117-121).

What's lost. Pre-refactor SmartRouter emitted a distinctive warning when consecutivePairingErrors >= 2:
Circuit breaker triggered: All providers exhausted, stopping retries
consecutivePairingErrors=2 batchNumber=N timesSaved="~8 seconds of futile retries avoided"
Post-refactor, OnSendRelayResult() returns SendStop silently; the only log on the SendStop branch is Failed Sending First Message, gated on BatchNumber()==0 && GetConsecutiveBatchErrors()==SendRelayAttempts+1.

Why the guard never fires for CB trips. The CB threshold (default 2) is reached before consecutiveBatchErrors reaches SendRelayAttempts+1 (default 4). With SendRelayAttempts=3, CircuitBreakerThreshold=2, two consecutive PairingListEmptyErrors yield
consecutiveBatchErrors=2 != 4 → guard false → no log. Additionally, CB can trip on BatchNumber > 0, which fails the other half of the guard.

Impact. Functional behavior is preserved — validateReturnCondition still fires, the error still propagates. But operators previously distinguished "provider pool exhausted" from "ordinary batch-retry exhaustion" by grepping for the timesSaved string.
That signal is gone.

Suggested fix. Cleanest option: extend SendResult in relaycore/interfaces.go with a SendStopCircuitBreaker variant; have policy.go:117-121 return it when the CB threshold is hit; add a branch in unified_relay_state_machine.go:310 that emits the
historical warning with consecutivePairingErrors, batchNumber, and timesSaved attributes. This preserves the "policy is pure decision logic" invariant the refactor otherwise upholds. (Logging directly inside policy.go works but couples the policy to
logging, which the refactor avoids.)

…ce, add tests

- Delete relaypolicy/eligibility.go (zero callers — used_providers.go
  calls common.DecideEligibility directly)
- Promote GetResultsSummary() to ResultsCheckerInf interface, removing
  the type assertion fallback that silently returned empty summaries
  for non-RelayProcessor implementations
- Add SendNodeErrorWithRetryable to relaycoretest for cross-package use
- Add TestConsumerStateMachineNonRetryableStopsImmediately: verifies
  policy stops on first attempt for non-retryable node errors
- Add TestConsumerStateMachineBatchErrorCounterResetsOnSuccess: verifies
  OnSendRelayResult counter reset logic
Add EligibilityFunc callback to UsedProviders, breaking the circular
dependency chain lavasession → relaypolicy → relaycore → lavasession.

- UsedProviders.eligibilityFunc defaults to common.DecideEligibility
- Consumer and SmartRouter inject relaypolicy.DecideEligibility via
  SetEligibilityFunc after creating UsedProviders
- Restore relaypolicy/eligibility.go as the canonical entry point
- RemoveUsed now calls the injected function instead of common directly
When SendStop is triggered by provider pool exhaustion (circuit breaker),
log the distinctive warning that operators use to distinguish systemic
pairing issues from transient batch errors. Matches main's behavior.
@AnnaR-prog

Copy link
Copy Markdown
Contributor Author

Issue 1 — Behavior regression: Archive extension is stripped where legacy preserved it (severity: normal)

File: protocol/relaypolicy/policy.go:85-90 (with downstream effect via protocol/relaycore/unified_relay_state_machine.go:135 → applyMutation).

The divergence. On the isUpgraded && NodeErrors >= 2 early-bail branch:

  • Legacy UpgradeToArchiveIfNeeded() (relay_state.go:258-261) caches block hashes and returns the protocol message unchanged — Archive extension stays attached, archiveStatus.isArchive stays true.
  • New decideMutation() returns {ArchiveAction: RemoveArchive, CacheHashes: true}. applyMutation() then strips the Archive extension via removeArchiveExtension() and stores archiveStatus.isArchive = false.

Why the safeguard misses it. The fallback gate at unified_relay_state_machine.go:135 is mutation.ArchiveAction != ArchiveNoChange. RemoveArchive is not NoChange, so the fallback never engages — the new path takes over and diverges. The inline comment ("Both paths must stay in sync until the fallback is eliminated") and the design-doc invariant at 14_STATE_MACHINE_ARCHITECTURE_AND_REFACTOR.md:822 are both violated.

Reproduction (Stateless, MaxRetries=10, RelayRetryLimit=2):

  1. Attempt 0 → node error → policy adds archive → isArchive=true, isUpgraded=true.
  2. Attempt 1 (with archive) → second node error → NodeErrors=2. Legacy: caches hashes, archive stays attached. New: caches hashes and removes archive extension, isArchive=false.
  3. Attempt 2 retry: legacy still sends archive on the wire; new code sends a non-archive request and reports GetIsArchive()==false to downstream consumers.

Impact. On-wire request shape changes; GetIsArchive() reads diverge from legacy for any caller that consults it (eligibility, consistency, logging). The PR's own policy_test.go:248-280 tests (upgraded with 2+ errors removes archive and caches hashes) codify the new behavior — but the assertions contradict what the legacy code did, which is exactly the regression.

Suggested fix. Either:

  • Return {ArchiveAction: ArchiveNoChange, CacheHashes: true} on this branch so the fallback at unified_relay_state_machine.go:135 routes through UpgradeToArchiveIfNeeded() (clean parity), and loosen the gate to mutation.ArchiveAction !=
    ArchiveNoChange || mutation.CacheHashes so the cache signal isn't dropped (this is also the gate fix from the first review pass — both fixes are needed together).
  • Or decouple CacheHashes from ArchiveAction in applyMutation and gate ArchiveRemove strictly on attempt == 2, then update policy_test.go:248-280 to assert legacy-parity behavior.

Issue 2 — Observability regression: Circuit-breaker trip warning is silently dropped (severity: nit)

File: protocol/relaycore/unified_relay_state_machine.go:307-315 (root cause in protocol/relaypolicy/policy.go:117-121).

What's lost. Pre-refactor SmartRouter emitted a distinctive warning when consecutivePairingErrors >= 2: Circuit breaker triggered: All providers exhausted, stopping retries consecutivePairingErrors=2 batchNumber=N timesSaved="~8 seconds of futile retries avoided" Post-refactor, OnSendRelayResult() returns SendStop silently; the only log on the SendStop branch is Failed Sending First Message, gated on BatchNumber()==0 && GetConsecutiveBatchErrors()==SendRelayAttempts+1.

Why the guard never fires for CB trips. The CB threshold (default 2) is reached before consecutiveBatchErrors reaches SendRelayAttempts+1 (default 4). With SendRelayAttempts=3, CircuitBreakerThreshold=2, two consecutive PairingListEmptyErrors yield consecutiveBatchErrors=2 != 4 → guard false → no log. Additionally, CB can trip on BatchNumber > 0, which fails the other half of the guard.

Impact. Functional behavior is preserved — validateReturnCondition still fires, the error still propagates. But operators previously distinguished "provider pool exhausted" from "ordinary batch-retry exhaustion" by grepping for the timesSaved string. That signal is gone.

Suggested fix. Cleanest option: extend SendResult in relaycore/interfaces.go with a SendStopCircuitBreaker variant; have policy.go:117-121 return it when the CB threshold is hit; add a branch in unified_relay_state_machine.go:310 that emits the historical warning with consecutivePairingErrors, batchNumber, and timesSaved attributes. This preserves the "policy is pure decision logic" invariant the refactor otherwise upholds. (Logging directly inside policy.go works but couples the policy to logging, which the refactor avoids.)

Issue 1 — Behavior regression: Archive extension is stripped where legacy preserved it (severity: normal)

File: protocol/relaypolicy/policy.go:85-90 (with downstream effect via protocol/relaycore/unified_relay_state_machine.go:135 → applyMutation).

The divergence. On the isUpgraded && NodeErrors >= 2 early-bail branch:

  • Legacy UpgradeToArchiveIfNeeded() (relay_state.go:258-261) caches block hashes and returns the protocol message unchanged — Archive extension stays attached, archiveStatus.isArchive stays true.
  • New decideMutation() returns {ArchiveAction: RemoveArchive, CacheHashes: true}. applyMutation() then strips the Archive extension via removeArchiveExtension() and stores archiveStatus.isArchive = false.

Why the safeguard misses it. The fallback gate at unified_relay_state_machine.go:135 is mutation.ArchiveAction != ArchiveNoChange. RemoveArchive is not NoChange, so the fallback never engages — the new path takes over and diverges. The inline comment ("Both paths must stay in sync until the fallback is eliminated") and the design-doc invariant at 14_STATE_MACHINE_ARCHITECTURE_AND_REFACTOR.md:822 are both violated.

Reproduction (Stateless, MaxRetries=10, RelayRetryLimit=2):

  1. Attempt 0 → node error → policy adds archive → isArchive=true, isUpgraded=true.
  2. Attempt 1 (with archive) → second node error → NodeErrors=2. Legacy: caches hashes, archive stays attached. New: caches hashes and removes archive extension, isArchive=false.
  3. Attempt 2 retry: legacy still sends archive on the wire; new code sends a non-archive request and reports GetIsArchive()==false to downstream consumers.

Impact. On-wire request shape changes; GetIsArchive() reads diverge from legacy for any caller that consults it (eligibility, consistency, logging). The PR's own policy_test.go:248-280 tests (upgraded with 2+ errors removes archive and caches hashes) codify the new behavior — but the assertions contradict what the legacy code did, which is exactly the regression.

Suggested fix. Either:

  • Return {ArchiveAction: ArchiveNoChange, CacheHashes: true} on this branch so the fallback at unified_relay_state_machine.go:135 routes through UpgradeToArchiveIfNeeded() (clean parity), and loosen the gate to mutation.ArchiveAction !=
    ArchiveNoChange || mutation.CacheHashes so the cache signal isn't dropped (this is also the gate fix from the first review pass — both fixes are needed together).
  • Or decouple CacheHashes from ArchiveAction in applyMutation and gate ArchiveRemove strictly on attempt == 2, then update policy_test.go:248-280 to assert legacy-parity behavior.

Issue 2 — Observability regression: Circuit-breaker trip warning is silently dropped (severity: nit)

File: protocol/relaycore/unified_relay_state_machine.go:307-315 (root cause in protocol/relaypolicy/policy.go:117-121).

What's lost. Pre-refactor SmartRouter emitted a distinctive warning when consecutivePairingErrors >= 2: Circuit breaker triggered: All providers exhausted, stopping retries consecutivePairingErrors=2 batchNumber=N timesSaved="~8 seconds of futile retries avoided" Post-refactor, OnSendRelayResult() returns SendStop silently; the only log on the SendStop branch is Failed Sending First Message, gated on BatchNumber()==0 && GetConsecutiveBatchErrors()==SendRelayAttempts+1.

Why the guard never fires for CB trips. The CB threshold (default 2) is reached before consecutiveBatchErrors reaches SendRelayAttempts+1 (default 4). With SendRelayAttempts=3, CircuitBreakerThreshold=2, two consecutive PairingListEmptyErrors yield consecutiveBatchErrors=2 != 4 → guard false → no log. Additionally, CB can trip on BatchNumber > 0, which fails the other half of the guard.

Impact. Functional behavior is preserved — validateReturnCondition still fires, the error still propagates. But operators previously distinguished "provider pool exhausted" from "ordinary batch-retry exhaustion" by grepping for the timesSaved string. That signal is gone.

Suggested fix. Cleanest option: extend SendResult in relaycore/interfaces.go with a SendStopCircuitBreaker variant; have policy.go:117-121 return it when the CB threshold is hit; add a branch in unified_relay_state_machine.go:310 that emits the historical warning with consecutivePairingErrors, batchNumber, and timesSaved attributes. This preserves the "policy is pure decision logic" invariant the refactor otherwise upholds. (Logging directly inside policy.go works but couples the policy to logging, which the refactor avoids.)

issue 1- No change needed — the branch behavior (remove archive when it fails) is more correct than main's
issue 2 - Fixed — restored the "All providers exhausted" warning when circuit breaker triggers

@nimrod-teich nimrod-teich merged commit 0ea2279 into main Apr 19, 2026
30 checks passed
@nimrod-teich nimrod-teich deleted the state_machine_changes branch April 19, 2026 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants