Skip to content

fix: make relay retry limit flag work on consumer/smart router#2232

Merged
nimrod-teich merged 2 commits into
mainfrom
fix/relay-retry-count-dead-code
Mar 1, 2026
Merged

fix: make relay retry limit flag work on consumer/smart router#2232
nimrod-teich merged 2 commits into
mainfrom
fix/relay-retry-count-dead-code

Conversation

@nimrod-teich

@nimrod-teich nimrod-teich commented Feb 25, 2026

Copy link
Copy Markdown
Contributor

The --set-retry-count-on-node-error flag (now --set-relay-retry-limit)
was dead code on the consumer/smart router side. The shouldRetryRelay()
function was unreachable inside HasRequiredNodeResults() due to a
needsMoreRetries guard that was always true when reached. This meant
consumers always retried up to the hardcoded MaximumNumberOfTickerRelayRetries=10
regardless of the flag value.

  • Remove dead needsMoreRetries guard in HasRequiredNodeResults
  • Move selection mode check into shouldRetryRelay for consolidated retry policy
  • Use single RelayRetryLimit flag for both node and protocol error limits
  • Add unit tests for relay processor retry limits (node + protocol errors)
  • Add integration tests for consumer and smart router state machines

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Note

Medium Risk
Changes core retry/stop conditions for relay processing and state machines, which can alter request behavior under error conditions; risk is mitigated by extensive new unit and integration tests covering node/protocol/mixed errors and header reporting.

Overview
Fixes consumer/smart-router relay retry behavior by replacing the node-error-only retry knob with a unified RelayRetryLimit that caps total retries across node + protocol errors (with epoch-mismatch still always retrying), and by making HasRequiredNodeResults actually consult the consolidated shouldRetryRelay policy.

Updates the consumer and smart router CLIs to use --set-relay-retry-limit, and adjusts the Lava-Retries header to report actual retries as total attempts - 1 rather than raw error counts. Adds targeted tests for relay-processor retry limiting (node/protocol/mixed), end-to-end state machine stopping behavior, and retry header correctness.

Written by Cursor Bugbot for commit 108fa70. This will update automatically on new commits. Configure here.

@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Fix relay retry dead code and add protocol error retry limit

🐞 Bug fix ✨ Enhancement 🧪 Tests

Grey Divider

Walkthroughs

Description
• Fix dead code in HasRequiredNodeResults that prevented --set-retry-count-on-node-error flag
  from working
• Add new --set-retry-count-on-protocol-error flag for protocol error retry limits
• Remove unreachable needsMoreRetries guard and call shouldRetryRelay directly
• Add comprehensive tests for both node and protocol error retry limits
Diagram
flowchart LR
  A["HasRequiredNodeResults<br/>with dead guard"] -->|Remove guard| B["Call shouldRetryRelay<br/>directly"]
  B -->|Check node errors| C["RelayCountOnNodeError<br/>limit"]
  B -->|Check protocol errors| D["RelayCountOnProtocolError<br/>limit"]
  C -->|Exceeded| E["Stop retrying<br/>return error"]
  D -->|Exceeded| E
  C -->|Under limit| F["Continue retrying"]
  D -->|Under limit| F
Loading

Grey Divider

File Changes

1. protocol/common/cobra_common.go ⚙️ Configuration changes +2/-1

Add protocol error retry limit flag constant

• Add new flag constant SetRelayCountOnProtocolErrorFlag for protocol error retry configuration
• Align flag name formatting with existing code style

protocol/common/cobra_common.go


2. protocol/relaycore/selection.go ✨ Enhancement +1/-0

Add protocol error retry count global variable

• Add new global variable RelayCountOnProtocolError initialized to 2
• Provides configurable limit for protocol error retries

protocol/relaycore/selection.go


3. protocol/relaycore/relay_processor.go 🐞 Bug fix +26/-21

Fix dead code and implement protocol error retry logic

• Add protocolErrors parameter to shouldRetryRelay method signature
• Remove dead needsMoreRetries guard in HasRequiredNodeResults for Stateless selection
• Call shouldRetryRelay directly instead of inside unreachable conditional branch
• Add protocol error limit check logic in shouldRetryRelay method
• Update debug logging to include protocol error count

protocol/relaycore/relay_processor.go


View more (5)
4. protocol/relaycore/relay_processor_test.go 🧪 Tests +168/-0

Add relay processor retry limit unit tests

• Add TestHasRequiredNodeResults_RelayCountOnNodeError with 3 test cases covering under-limit,
 over-limit, and disabled scenarios
• Add TestHasRequiredNodeResults_RelayCountOnProtocolError with 3 test cases for protocol error
 retry behavior
• Tests verify that retry limits are properly enforced at the relay processor level

protocol/relaycore/relay_processor_test.go


5. protocol/rpcconsumer/rpcconsumer.go ⚙️ Configuration changes +1/-0

Add protocol error retry flag to consumer CLI

• Add CLI flag binding for SetRelayCountOnProtocolErrorFlag with default value of 2
• Connects flag to relaycore.RelayCountOnProtocolError global variable

protocol/rpcconsumer/rpcconsumer.go


6. protocol/rpcconsumer/consumer_relay_state_machine_test.go 🧪 Tests +157/-0

Add consumer state machine retry limit integration tests

• Add TestConsumerStateMachineNodeErrorRetryLimit with 2 test cases verifying state machine stops
 retrying when node error limit is reached
• Add TestConsumerStateMachineProtocolErrorRetryLimit with 2 test cases verifying state machine
 stops retrying when protocol error limit is reached
• Tests verify end-to-end behavior through full state machine

protocol/rpcconsumer/consumer_relay_state_machine_test.go


7. protocol/rpcsmartrouter/rpcsmartrouter.go ⚙️ Configuration changes +1/-0

Add protocol error retry flag to smart router CLI

• Add CLI flag binding for SetRelayCountOnProtocolErrorFlag with default value of 2
• Connects flag to relaycore.RelayCountOnProtocolError global variable

protocol/rpcsmartrouter/rpcsmartrouter.go


8. protocol/rpcsmartrouter/smartrouter_relay_state_machine_test.go 🧪 Tests +157/-0

Add smart router state machine retry limit integration tests

• Add TestSmartRouterStateMachineNodeErrorRetryLimit with 2 test cases verifying state machine
 stops retrying when node error limit is reached
• Add TestSmartRouterStateMachineProtocolErrorRetryLimit with 2 test cases verifying state machine
 stops retrying when protocol error limit is reached
• Tests verify end-to-end behavior through full smart router state machine

protocol/rpcsmartrouter/smartrouter_relay_state_machine_test.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Feb 25, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (1) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Protocol flag overridden 🐞 Bug ✓ Correctness
Description
shouldRetryRelay() will retry even for protocol-only failures whenever RelayCountOnNodeError>0
because the node-error retry branch does not require any node errors; this can make
--set-retry-count-on-protocol-error ineffective unless node-error retries are also disabled.
Code

protocol/relaycore/relay_processor.go[R287-291]

+func (rp *RelayProcessor) shouldRetryRelay(resultsCount int, hashErr error, nodeErrors int, specialNodeErrors int, protocolErrors int) bool {
+	utils.LavaFormatDebug("shouldRetryRelay called", utils.LogAttr("GUID", rp.guid), utils.LogAttr("resultsCount", resultsCount), utils.LogAttr("hashErr", hashErr), utils.LogAttr("nodeErrors", nodeErrors), utils.LogAttr("specialNodeErrors", specialNodeErrors), utils.LogAttr("protocolErrors", protocolErrors))

	// Never retry if we detect unsupported method errors
	if rp.HasUnsupportedMethodErrors() {
Evidence
Protocol errors are tracked separately from node errors. For protocol-only failures, nodeErrors==0
while protocolErrors>0, yet the node-error branch still returns true because it only checks
nodeErrors<=RelayCountOnNodeError (0 is always <= when enabled). This prevents the protocol-error
branch (and its limit/disable flag) from being the deciding factor.

protocol/relaycore/relay_processor.go[286-334]
protocol/relaycore/results_manager.go[150-167]
protocol/relaycore/results_manager.go[223-232]
protocol/relaycore/results_manager.go[45-58]
protocol/rpcconsumer/rpcconsumer.go[752-754]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`shouldRetryRelay()` currently retries based on `RelayCountOnNodeError` even when there are **no node errors** (e.g., protocol-only failures), which can override/ignore the protocol-error retry limit/disable flag.

### Issue Context
Protocol errors are counted separately (`protocolResponseErrors`), so protocol-only failures commonly have `nodeErrors==0` and `protocolErrors&gt;0`. The node-error retry branch returns `true` for `nodeErrors==0` when enabled.

### Fix Focus Areas
- protocol/relaycore/relay_processor.go[286-335]
- protocol/relaycore/results_manager.go[150-167]
- protocol/relaycore/relay_processor_test.go[2278-2444]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. RelayCountOnProtocolError unvalidated flag 📘 Rule violation ⛨ Security
Description
The new --set-retry-count-on-protocol-error CLI flag is accepted without any bounds/validation,
allowing extremely large values that can cause excessive retry loops and resource usage. External
inputs should be validated to prevent misuse and denial-of-service style behavior.
Code

protocol/rpcconsumer/rpcconsumer.go[753]

+	cmdRPCConsumer.Flags().IntVar(&relaycore.RelayCountOnProtocolError, common.SetRelayCountOnProtocolErrorFlag, 2, "set the number of retries attempt on protocol errors")
Evidence
Compliance requires validation/sanitization of external inputs. The PR introduces a new
externally-controlled retry count via CLI flags and assigns it directly to a global without
enforcing non-negative/reasonable maximum values.

Rule 6: Generic: Security-First Input Validation and Data Handling
protocol/rpcconsumer/rpcconsumer.go[752-754]
protocol/rpcsmartrouter/rpcsmartrouter.go[1055-1057]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new `--set-retry-count-on-protocol-error` flag is a user-controlled external input that is assigned directly to `relaycore.RelayCountOnProtocolError` without validation. Very large values can cause excessive retries and resource usage.

## Issue Context
The retry count is read from CLI flags in both consumer and smart router commands. Add a validation step after flag parsing (e.g., `PreRunE`/`PersistentPreRunE`, or immediately after flags are parsed) to enforce a safe range (e.g., `0 &lt;= value &lt;= MaxRetryLimit`). Consider applying the same validation to `RelayCountOnNodeError` for consistency.

## Fix Focus Areas
- protocol/rpcconsumer/rpcconsumer.go[752-754]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1055-1057]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Epoch mismatch bypasses limits 🐞 Bug ⛯ Reliability
Description
Epoch-mismatch protocol errors currently force a retry whenever resultsCount==0, bypassing both
RelayCountOnNodeError and RelayCountOnProtocolError limits (including when users set them to 0).
Code

protocol/relaycore/relay_processor.go[R300-306]

	// Check if we have epoch mismatch errors that warrant retry
-	_, _, protocolErrors := rp.GetResultsData()
+	_, _, protocolErrorResults := rp.GetResultsData()
	hasEpochMismatchError := false
-	for _, protocolError := range protocolErrors {
+	for _, protocolError := range protocolErrorResults {
		if lavasession.EpochMismatchError.Is(protocolError.GetError()) {
			hasEpochMismatchError = true
			break
Evidence
The epoch mismatch check returns true immediately before evaluating either retry limit, so it can
keep retrying regardless of configured limits intended to stop retries (especially relevant now that
protocol error retries are explicitly configurable).

protocol/relaycore/relay_processor.go[300-318]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Epoch mismatch errors currently short-circuit `shouldRetryRelay()` to retry unconditionally when there are no successes, bypassing the configured retry limits.

### Issue Context
This is especially relevant now that protocol error retry behavior is exposed via `--set-retry-count-on-protocol-error`.

### Fix Focus Areas
- protocol/relaycore/relay_processor.go[300-331]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread protocol/relaycore/relay_processor.go

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

This is the final PR Bugbot will review for you during this billing cycle

Your free Bugbot reviews will reset on March 23

Details

You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

Comment thread protocol/relaycore/relay_processor.go
@nimrod-teich nimrod-teich force-pushed the fix/relay-retry-count-dead-code branch from 5e79a93 to 38b193b Compare February 25, 2026 10:54
@github-actions

github-actions Bot commented Feb 25, 2026

Copy link
Copy Markdown

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 108fa70. ± Comparison against base commit 51e3a64.

♻️ This comment has been updated with latest results.

@nimrod-teich nimrod-teich force-pushed the fix/relay-retry-count-dead-code branch from 38b193b to 9369f95 Compare February 25, 2026 11:18
@nimrod-teich nimrod-teich changed the title fix(relay): fix dead code in HasRequiredNodeResults and add protocol error retry limit fix: make relay retry limit flag work on consumer/smart router Feb 25, 2026
@nimrod-teich nimrod-teich reopened this Feb 25, 2026
@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Fix relay retry limit flag to work on consumer/smart router

🐞 Bug fix 🧪 Tests

Grey Divider

Walkthroughs

Description
• Fix relay retry limit flag by removing unreachable code in HasRequiredNodeResults
  - Dead needsMoreRetries guard prevented shouldRetryRelay from being called
  - Consumers/smart routers always retried up to hardcoded limit regardless of flag
• Consolidate retry policy with single RelayRetryLimit flag for all error types
  - Applies independently to both node errors and protocol errors
  - Replaces separate RelayCountOnNodeError flag with unified configuration
• Move selection mode check into shouldRetryRelay for cleaner retry logic
  - Stateful mode always retries when no successful results
  - CrossValidation mode never retries in this function
• Add comprehensive unit and integration tests for retry limit behavior
  - Test node errors, protocol errors, and mixed error scenarios
  - Verify consumer and smart router state machines respect retry limits
Diagram
flowchart LR
  A["HasRequiredNodeResults<br/>called"] --> B["Remove dead<br/>needsMoreRetries guard"]
  B --> C["Call shouldRetryRelay<br/>with all error types"]
  C --> D["Check selection mode<br/>Stateful/CrossValidation"]
  D --> E["Check RelayRetryLimit<br/>against error counts"]
  E --> F["Return retry decision<br/>to state machine"]
Loading

Grey Divider

File Changes

1. protocol/common/cobra_common.go ⚙️ Configuration changes +3/-0

Add new relay retry limit CLI flag constant

• Add new SetRelayRetryLimitFlag constant for CLI flag
• Update documentation to clarify flag controls both node and protocol errors
• Keep old SetRelayCountOnNodeErrorFlag for backward compatibility reference

protocol/common/cobra_common.go


2. protocol/relaycore/selection.go ✨ Enhancement +1/-1

Rename relay retry limit global variable

• Rename global variable from RelayCountOnNodeError to RelayRetryLimit
• Clarify that single limit applies to all error types independently

protocol/relaycore/selection.go


3. protocol/relaycore/relay_processor.go 🐞 Bug fix +41/-32

Fix relay retry logic and consolidate retry policy

• Remove dead needsMoreRetries guard that prevented retry logic execution
• Add protocolErrors parameter to shouldRetryRelay function
• Move selection mode checks (Stateful/CrossValidation) into shouldRetryRelay
• Implement independent limit checking for node and protocol errors
• Simplify HasRequiredNodeResults to always call shouldRetryRelay for Stateless mode

protocol/relaycore/relay_processor.go


View more (5)
4. protocol/relaycore/relay_processor_test.go 🧪 Tests +261/-0

Add unit tests for relay retry limit behavior

• Add TestHasRequiredNodeResults_RelayRetryLimit for node error retry limits
• Add TestHasRequiredNodeResults_RelayRetryLimitProtocolError for protocol error limits
• Add TestHasRequiredNodeResults_RelayRetryLimitMixed for mixed error scenarios
• Test cases cover under-limit, over-limit, and disabled (0) limit conditions

protocol/relaycore/relay_processor_test.go


5. protocol/rpcconsumer/rpcconsumer.go ⚙️ Configuration changes +1/-1

Wire new relay retry limit flag in consumer CLI

• Update CLI flag binding from SetRelayCountOnNodeErrorFlag to SetRelayRetryLimitFlag
• Bind to RelayRetryLimit variable instead of RelayCountOnNodeError
• Update flag description to clarify it applies to both error types

protocol/rpcconsumer/rpcconsumer.go


6. protocol/rpcconsumer/consumer_relay_state_machine_test.go 🧪 Tests +117/-4

Add consumer state machine retry limit integration tests

• Update TestConsumerStateMachineHappyFlow to use RelayRetryLimit variable
• Add TestConsumerStateMachineRetryLimit integration test with multiple scenarios
• Test node errors, protocol errors, and mixed errors with various limits
• Verify state machine stops retrying when limit is exceeded

protocol/rpcconsumer/consumer_relay_state_machine_test.go


7. protocol/rpcsmartrouter/rpcsmartrouter.go ⚙️ Configuration changes +1/-1

Wire new relay retry limit flag in smart router CLI

• Update CLI flag binding from SetRelayCountOnNodeErrorFlag to SetRelayRetryLimitFlag
• Bind to RelayRetryLimit variable instead of RelayCountOnNodeError
• Update flag description to clarify it applies to both error types

protocol/rpcsmartrouter/rpcsmartrouter.go


8. protocol/rpcsmartrouter/smartrouter_relay_state_machine_test.go 🧪 Tests +118/-0

Add smart router state machine retry limit integration tests

• Update TestProcessingContextStillValidAllowsRetries to set high RelayRetryLimit
• Add TestSmartRouterStateMachineRetryLimit integration test with multiple scenarios
• Test node errors, protocol errors, and mixed errors with various limits
• Verify state machine stops retrying when limit is exceeded

protocol/rpcsmartrouter/smartrouter_relay_state_machine_test.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Feb 25, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (5) 📘 Rule violations (2) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Legacy retry flag removed 🐞 Bug ⛯ Reliability ⭐ New
Description
rpcconsumer and rpcsmartrouter no longer register the previous --set-retry-count-on-node-error flag,
so existing scripts/configs will fail at startup with an unknown flag. This is a breaking CLI change
unless an alias/deprecation path is provided.
Code

protocol/rpcconsumer/rpcconsumer.go[752]

+	cmdRPCConsumer.Flags().IntVar(&relaycore.RelayRetryLimit, common.SetRelayRetryLimitFlag, 2, "set the number of retries on relay errors (node errors and protocol errors)")
Evidence
The consumer/smart-router commands bind only the new flag name to the retry-limit variable, while
the old flag constant still exists (and is still used by rpcprovider), indicating the old consumer
flag is now orphaned rather than supported/aliased.

protocol/rpcconsumer/rpcconsumer.go[748-753]
protocol/rpcsmartrouter/rpcsmartrouter.go[1051-1056]
protocol/common/cobra_common.go[31-36]
protocol/rpcprovider/rpcprovider.go[1304-1306]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`rpcconsumer` and `rpcsmartrouter` no longer accept the legacy `--set-retry-count-on-node-error` flag; they only register `--set-relay-retry-limit`. This can break existing deployments at startup.

### Issue Context
The old flag constant still exists and is used by `rpcprovider`, which increases the chance users keep passing it to consumer/smart-router binaries.

### Fix Focus Areas
- protocol/rpcconsumer/rpcconsumer.go[748-753]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1051-1056]
- protocol/common/cobra_common.go[31-36]

### Suggested fix
- Register **both** flags on consumer and smart-router:
 - `--set-relay-retry-limit` (preferred)
 - `--set-retry-count-on-node-error` (deprecated alias)
- Bind both to `relaycore.RelayRetryLimit`.
- Mark the legacy flag as deprecated (Cobra supports `MarkDeprecated`) and/or hidden.
- Update the legacy flag help text to indicate it’s deprecated and aliases the new behavior.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Protocol flag overridden 🐞 Bug ✓ Correctness
Description
shouldRetryRelay() will retry even for protocol-only failures whenever RelayCountOnNodeError>0
because the node-error retry branch does not require any node errors; this can make
--set-retry-count-on-protocol-error ineffective unless node-error retries are also disabled.
Code

protocol/relaycore/relay_processor.go[R287-291]

+func (rp *RelayProcessor) shouldRetryRelay(resultsCount int, hashErr error, nodeErrors int, specialNodeErrors int, protocolErrors int) bool {
+	utils.LavaFormatDebug("shouldRetryRelay called", utils.LogAttr("GUID", rp.guid), utils.LogAttr("resultsCount", resultsCount), utils.LogAttr("hashErr", hashErr), utils.LogAttr("nodeErrors", nodeErrors), utils.LogAttr("specialNodeErrors", specialNodeErrors), utils.LogAttr("protocolErrors", protocolErrors))

  // Never retry if we detect unsupported method errors
  if rp.HasUnsupportedMethodErrors() {
Evidence
Protocol errors are tracked separately from node errors. For protocol-only failures, nodeErrors==0
while protocolErrors>0, yet the node-error branch still returns true because it only checks
nodeErrors<=RelayCountOnNodeError (0 is always <= when enabled). This prevents the protocol-error
branch (and its limit/disable flag) from being the deciding factor.

protocol/relaycore/relay_processor.go[286-334]
protocol/relaycore/results_manager.go[150-167]
protocol/relaycore/results_manager.go[223-232]
protocol/relaycore/results_manager.go[45-58]
protocol/rpcconsumer/rpcconsumer.go[752-754]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`shouldRetryRelay()` currently retries based on `RelayCountOnNodeError` even when there are **no node errors** (e.g., protocol-only failures), which can override/ignore the protocol-error retry limit/disable flag.
### Issue Context
Protocol errors are counted separately (`protocolResponseErrors`), so protocol-only failures commonly have `nodeErrors==0` and `protocolErrors&amp;gt;0`. The node-error retry branch returns `true` for `nodeErrors==0` when enabled.
### Fix Focus Areas
- protocol/relaycore/relay_processor.go[286-335]
- protocol/relaycore/results_manager.go[150-167]
- protocol/relaycore/relay_processor_test.go[2278-2444]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. relayCountOnError misnamed 📘 Rule violation ✓ Correctness ⭐ New
Description
The new retry-limit tests use relayCountOnError to represent RelayRetryLimit, which is
misleading and reduces readability/maintainability. This violates the requirement for
self-documenting identifier names.
Code

protocol/relaycore/relay_processor_test.go[R2279-2284]

+	tests := []struct {
+		name               string
+		relayCountOnError  int
+		nodeErrorsToSend   int
+		expectedDone       bool // true = stop retrying, false = keep retrying
+		expectedNodeErrors int
Evidence
Compliance requires identifiers to clearly express intent; the test struct field name
relayCountOnError no longer matches the new unified RelayRetryLimit concept and is used to set
RelayRetryLimit, making the code misleading.

Rule 2: Generic: Meaningful Naming and Self-Documenting Code
protocol/relaycore/relay_processor_test.go[2279-2284]
protocol/relaycore/relay_processor_test.go[2312-2314]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New tests introduced for `RelayRetryLimit` still use the identifier `relayCountOnError`, which is misleading because it no longer represents node-error-only retry counts.

## Issue Context
The PR consolidates retry behavior under `RelayRetryLimit` for both node and protocol errors; test identifiers should reflect that to remain self-documenting.

## Fix Focus Areas
- protocol/relaycore/relay_processor_test.go[2279-2284]
- protocol/relaycore/relay_processor_test.go[2312-2314]
- protocol/relaycore/relay_processor_test.go[2362-2367]
- protocol/relaycore/relay_processor_test.go[2394-2397]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Retry-limit semantics unclear 🐞 Bug ✓ Correctness ⭐ New
Description
RelayRetryLimit is enforced per error category (node vs protocol vs special-node), which can permit
more total retry attempts than the configured value under mixed errors. The CLI/help text reads like
a single global retry limit, which can mislead operators and complicate incident response.
Code

protocol/relaycore/relay_processor.go[R330-347]

+	// RelayRetryLimit controls the retry limit for all error types.
+	// Each error type that has occurred is checked independently against the limit.
+	if RelayRetryLimit > 0 && resultsCount == 0 && hashErr == nil {
+		hasNodeErrors := nodeErrors > 0 || specialNodeErrors > 0
+		hasProtocolErrors := protocolErrors > 0
+
+		if !hasNodeErrors && !hasProtocolErrors {
+			return false
+		}
+
+		// Check each error type that has occurred against the limit.
+		// An error type that hasn't occurred doesn't block retries.
+		nodeErrorsWithinLimit := !hasNodeErrors || (nodeErrors <= RelayRetryLimit && specialNodeErrors <= RelayRetryLimit)
+		protocolErrorsWithinLimit := !hasProtocolErrors || protocolErrors <= RelayRetryLimit
+
+		if nodeErrorsWithinLimit && protocolErrorsWithinLimit {
			return true
		}
Evidence
The retry decision checks node-error counts and protocol-error counts independently against the same
limit, and only stops retrying when any category exceeds the limit. Meanwhile, the CLI describes it
as the number of retries on 'relay errors' (without clarifying it is per-category), which can be
interpreted as a global budget.

protocol/relaycore/relay_processor.go[320-347]
protocol/rpcconsumer/rpcconsumer.go[751-753]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`RelayRetryLimit` is enforced per error category (node vs protocol vs special-node). Under mixed errors, total attempts can exceed the configured value, but the CLI flag description reads like a single global retry budget.

### Issue Context
Example: with `RelayRetryLimit=2`, a sequence of alternating node/protocol errors can produce more than 2 total retries before stopping (it stops only when one category exceeds 2).

### Fix Focus Areas
- protocol/relaycore/relay_processor.go[320-347]
- protocol/rpcconsumer/rpcconsumer.go[751-753]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1054-1056]

### Suggested fix options
1) **Documentation fix (lowest risk):**
  - Update CLI help strings to say: &quot;per error type&quot; / &quot;applies independently to node and protocol errors&quot;.
  - Update the comment bullet list in `shouldRetryRelay` (it currently says &quot;only node errors&quot;).

2) **Behavior fix (if global limit is desired):**
  - Track a combined retry counter (e.g., totalErrors = nodeErrors+specialNodeErrors+protocolErrors) and compare that single value to `RelayRetryLimit`.
  - Adjust/extend tests to cover the chosen semantics.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Aggressive test timeouts 🐞 Bug ⛯ Reliability ⭐ New
Description
New retry-limit unit tests use 10ms/100ms context timeouts in loops, which can become flaky under CI
load and cause intermittent failures unrelated to correctness. This reduces confidence in the suite
and may create noisy failures.
Code

protocol/relaycore/relay_processor_test.go[R2339-2351]

+				ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond*10)
+				canUse := usedProviders.TryLockSelection(ctx)
+				require.NoError(t, ctx.Err())
+				require.Nil(t, canUse)
+				cancel()
+				consumerSessionsMap := lavasession.ConsumerSessionsMap{provider: &lavasession.SessionInfo{}}
+				usedProviders.AddUsed(consumerSessionsMap, nil)
+
+				go SendNodeError(relayProcessor, provider, 0)
+				ctx, cancel = context.WithTimeout(context.Background(), time.Millisecond*100)
+				err = relayProcessor.WaitForResults(ctx)
+				cancel()
+				require.NoError(t, err)
Evidence
The tests repeatedly assert operations complete within extremely short deadlines. On slower
machines, goroutine scheduling and HTTP/mock setup can exceed these windows, leading to sporadic
ctx.Err() failures.

protocol/relaycore/relay_processor_test.go[2338-2352]
protocol/relaycore/relay_processor_test.go[2418-2432]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
New unit tests use very small timeouts (10ms/100ms) repeatedly in loops, which can make CI flaky.

### Issue Context
These tests validate retry-limit logic; failures due to scheduling/timeouts would be noise and reduce confidence.

### Fix Focus Areas
- protocol/relaycore/relay_processor_test.go[2338-2352]
- protocol/relaycore/relay_processor_test.go[2418-2432]
- protocol/relaycore/relay_processor_test.go[2511-2529]

### Suggested fix
- Increase `context.WithTimeout` durations (e.g., 250ms-1s for TryLockSelection; 1s-2s for WaitForResults).
- Consider using `require.Eventually` (with a reasonable total timeout) for conditions that depend on goroutines.
- Keep the assertions (don’t remove them), just make them robust under CI variability.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (2)
6. RelayCountOnProtocolError unvalidated flag 📘 Rule violation ⛨ Security
Description
The new --set-retry-count-on-protocol-error CLI flag is accepted without any bounds/validation,
allowing extremely large values that can cause excessive retry loops and resource usage. External
inputs should be validated to prevent misuse and denial-of-service style behavior.
Code

protocol/rpcconsumer/rpcconsumer.go[753]

+	cmdRPCConsumer.Flags().IntVar(&relaycore.RelayCountOnProtocolError, common.SetRelayCountOnProtocolErrorFlag, 2, "set the number of retries attempt on protocol errors")
Evidence
Compliance requires validation/sanitization of external inputs. The PR introduces a new
externally-controlled retry count via CLI flags and assigns it directly to a global without
enforcing non-negative/reasonable maximum values.

Rule 6: Generic: Security-First Input Validation and Data Handling
protocol/rpcconsumer/rpcconsumer.go[752-754]
protocol/rpcsmartrouter/rpcsmartrouter.go[1055-1057]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new `--set-retry-count-on-protocol-error` flag is a user-controlled external input that is assigned directly to `relaycore.RelayCountOnProtocolError` without validation. Very large values can cause excessive retries and resource usage.
## Issue Context
The retry count is read from CLI flags in both consumer and smart router commands. Add a validation step after flag parsing (e.g., `PreRunE`/`PersistentPreRunE`, or immediately after flags are parsed) to enforce a safe range (e.g., `0 &amp;lt;= value &amp;lt;= MaxRetryLimit`). Consider applying the same validation to `RelayCountOnNodeError` for consistency.
## Fix Focus Areas
- protocol/rpcconsumer/rpcconsumer.go[752-754]
- protocol/rpcsmartrouter/rpcsmartrouter.go[1055-1057]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


7. Epoch mismatch bypasses limits 🐞 Bug ⛯ Reliability
Description
Epoch-mismatch protocol errors currently force a retry whenever resultsCount==0, bypassing both
RelayCountOnNodeError and RelayCountOnProtocolError limits (including when users set them to 0).
Code

protocol/relaycore/relay_processor.go[R300-306]

  // Check if we have epoch mismatch errors that warrant retry
-	_, _, protocolErrors := rp.GetResultsData()
+	_, _, protocolErrorResults := rp.GetResultsData()
  hasEpochMismatchError := false
-	for _, protocolError := range protocolErrors {
+	for _, protocolError := range protocolErrorResults {
  	if lavasession.EpochMismatchError.Is(protocolError.GetError()) {
  		hasEpochMismatchError = true
  		break
Evidence
The epoch mismatch check returns true immediately before evaluating either retry limit, so it can
keep retrying regardless of configured limits intended to stop retries (especially relevant now that
protocol error retries are explicitly configurable).

protocol/relaycore/relay_processor.go[300-318]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Epoch mismatch errors currently short-circuit `shouldRetryRelay()` to retry unconditionally when there are no successes, bypassing the configured retry limits.
### Issue Context
This is especially relevant now that protocol error retry behavior is exposed via `--set-retry-count-on-protocol-error`.
### Fix Focus Areas
- protocol/relaycore/relay_processor.go[300-331]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread protocol/rpcconsumer/rpcconsumer.go Outdated
@nimrod-teich nimrod-teich force-pushed the fix/relay-retry-count-dead-code branch from 9369f95 to ec666e8 Compare February 25, 2026 13:11
The --set-retry-count-on-node-error flag (now --set-relay-retry-limit)
was dead code on the consumer/smart router side. The shouldRetryRelay()
function was unreachable inside HasRequiredNodeResults() due to a
needsMoreRetries guard that was always true when reached. This meant
consumers always retried up to the hardcoded MaximumNumberOfTickerRelayRetries=10
regardless of the flag value.

- Remove dead needsMoreRetries guard in HasRequiredNodeResults
- Move selection mode check into shouldRetryRelay for consolidated retry policy
- Use single RelayRetryLimit flag for both node and protocol error limits
- Add unit tests for relay processor retry limits (node + protocol errors)
- Add integration tests for consumer and smart router state machines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nimrod-teich nimrod-teich force-pushed the fix/relay-retry-count-dead-code branch from ec666e8 to e4d5c2f Compare February 25, 2026 13:18
AnnaR-prog
AnnaR-prog previously approved these changes Feb 25, 2026
The header was computed as protocolErrors + nodeErrors, which counts
raw error responses. This is off-by-one when the final result is an
error (e.g., unsupported method with no retry), reporting 1 retry
when 0 occurred. Fix by computing totalAttempts - 1 instead, where
totalAttempts = successes + nodeErrors + protocolErrors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov

codecov Bot commented Mar 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 62.50000% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/relaycore/relay_processor.go 56.25% 5 Missing and 2 partials ⚠️
protocol/rpcconsumer/rpcconsumer.go 0.00% 1 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter.go 0.00% 1 Missing ⚠️
Flag Coverage Δ
consensus 8.71% <ø> (ø)
protocol 34.66% <62.50%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/common/cobra_common.go 0.00% <ø> (ø)
protocol/rpcconsumer/rpcconsumer_server.go 32.87% <100.00%> (+1.79%) ⬆️
protocol/rpcsmartrouter/rpcsmartrouter_server.go 33.72% <100.00%> (+1.84%) ⬆️
protocol/rpcconsumer/rpcconsumer.go 0.00% <0.00%> (ø)
protocol/rpcsmartrouter/rpcsmartrouter.go 4.30% <0.00%> (ø)
protocol/relaycore/relay_processor.go 54.87% <56.25%> (+5.62%) ⬆️

... and 6 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nimrod-teich nimrod-teich merged commit 9861746 into main Mar 1, 2026
32 checks passed
@nimrod-teich nimrod-teich deleted the fix/relay-retry-count-dead-code branch March 1, 2026 12:07
@nimrod-teich nimrod-teich restored the fix/relay-retry-count-dead-code branch March 1, 2026 12:16
@NadavLevi NadavLevi deleted the fix/relay-retry-count-dead-code branch March 17, 2026 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants