Skip to content

fix: context deadline/cancelled issues#2258

Merged
Harraken merged 1 commit into
mainfrom
fix/context-cancellation-misclassification
Apr 6, 2026
Merged

fix: context deadline/cancelled issues#2258
Harraken merged 1 commit into
mainfrom
fix/context-cancellation-misclassification

Conversation

@NadavLevi

Copy link
Copy Markdown
Contributor

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

@qodo-code-review

Copy link
Copy Markdown
ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan

Review Summary by Qodo

Fix context cancellation misclassification and make relay timeout configurable

🐞 Bug fix ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Skip ConnectionRefusals and MarkUnhealthy on client-side context cancellation
  - Guard both connection and relay paths to check ctx.Err() before attributing errors to provider
• Make minimum relay timeout configurable via --min-relay-timeout flag
  - Convert hardcoded MinimumTimePerRelayDelay from 1s to configurable variable
  - Add ValidateAndCapMinRelayTimeout() to prevent misconfiguration by capping to 50% of
  DefaultTimeout
• Add comprehensive test coverage for timeout validation and GetTimePerCu behavior
Diagram
flowchart LR
  A["Client Context Cancellation"] -->|Previously Treated As| B["Provider Failure"]
  A -->|Now Skipped| C["ConnectionRefusals & MarkUnhealthy"]
  D["Hardcoded MinimumTimePerRelayDelay 1s"] -->|Converted To| E["Configurable --min-relay-timeout"]
  E -->|Validated By| F["ValidateAndCapMinRelayTimeout"]
  F -->|Capped To| G["50% of DefaultTimeout if >= DefaultTimeout"]
Loading

Grey Divider

File Changes

1. protocol/common/cobra_common.go ⚙️ Configuration changes +1/-0

Add minimum relay timeout flag constant

• Add MinRelayTimeoutFlagName constant for the new --min-relay-timeout flag

protocol/common/cobra_common.go


2. protocol/common/timeout.go ✨ Enhancement +21/-2

Make minimum relay timeout configurable with validation

• Convert MinimumTimePerRelayDelay from constant to configurable variable
• Add ValidateAndCapMinRelayTimeout() function to cap MinimumTimePerRelayDelay at 50% of
 DefaultTimeout if misconfigured
• Add utils import for logging warnings
• Update documentation comments for DefaultTimeout and MinimumTimePerRelayDelay

protocol/common/timeout.go


3. protocol/common/timeout_test.go 🧪 Tests +106/-0

Add timeout validation and GetTimePerCu floor tests

• Add 9 comprehensive test cases for ValidateAndCapMinRelayTimeout() covering no-cap, equal,
 greater, and half-of-default scenarios
• Add 3 test cases for GetTimePerCu() respecting MinimumTimePerRelayDelay floor with default,
 raised, and post-validation floors
• Add resetTimeoutGlobals() helper to prevent test pollution

protocol/common/timeout_test.go


View more (4)
4. protocol/lavasession/consumer_types.go 🐞 Bug fix +9/-0

Skip ConnectionRefusals on context cancellation

• Add context cancellation check before incrementing ConnectionRefusals in
 fetchEndpointConnectionFromConsumerSession
• Log debug message when skipping ConnectionRefusals due to client-side context cancellation

protocol/lavasession/consumer_types.go


5. protocol/rpcconsumer/rpcconsumer.go ✨ Enhancement +2/-0

Add min-relay-timeout flag to rpcconsumer

• Call ValidateAndCapMinRelayTimeout() at startup in RunE function
• Add --min-relay-timeout flag binding to common.MinimumTimePerRelayDelay with descriptive help text

protocol/rpcconsumer/rpcconsumer.go


6. protocol/rpcsmartrouter/rpcsmartrouter.go ✨ Enhancement +2/-0

Add min-relay-timeout flag to rpcsmartrouter

• Call ValidateAndCapMinRelayTimeout() at startup in RunE function
• Add --min-relay-timeout flag binding to common.MinimumTimePerRelayDelay with descriptive help text

protocol/rpcsmartrouter/rpcsmartrouter.go


7. protocol/rpcsmartrouter/rpcsmartrouter_server.go 🐞 Bug fix +13/-2

Skip MarkUnhealthy on context cancellation

• Add errors import for errors.Is() check
• Add context cancellation check in relayInnerDirect before marking endpoint unhealthy
• Skip MarkUnhealthy when context.Canceled error occurs with canceled request context (client
 disconnect)
• Log debug message when skipping MarkUnhealthy due to client-side context cancellation

protocol/rpcsmartrouter/rpcsmartrouter_server.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Mar 31, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (2) 📎 Requirement gaps (0) 🎨 UX Issues (0)

Grey Divider


Action required

1. Missing newline in timeout_test.go 📘 Rule violation ⚙ Maintainability
Description
protocol/common/timeout_test.go is added without a trailing newline (`\ No newline at end of
file`), which indicates the file is not fully gofmt-clean. This can cause formatting diffs and
tooling/lint inconsistencies.
Code

protocol/common/timeout_test.go[106]

+}
Evidence
The compliance rule requires modified Go files to be gofmt-clean; the PR diff explicitly marks the
new test file as missing a final newline, which gofmt expects.

AGENTS.md
protocol/common/timeout_test.go[106-106]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new Go test file is missing a trailing newline, which violates gofmt-clean expectations.
## Issue Context
The PR diff indicates `\ No newline at end of file` for `protocol/common/timeout_test.go`.
## Fix Focus Areas
- protocol/common/timeout_test.go[106-106]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Min relay floor can be zero 🐞 Bug ≡ Correctness
Description
ValidateAndCapMinRelayTimeout can set MinimumTimePerRelayDelay to DefaultTimeout/2, which becomes 0
when --default-processing-timeout=0s (or extremely small), causing downstream CapContextTimeout(…,0)
to create an immediately-expired context and fail relays instantly for low-CU requests.
Code

protocol/common/timeout.go[R36-48]

+// ValidateAndCapMinRelayTimeout ensures MinimumTimePerRelayDelay < DefaultTimeout.
+// If not, it caps MinimumTimePerRelayDelay at 50% of DefaultTimeout and logs a warning.
+func ValidateAndCapMinRelayTimeout() {
+	if MinimumTimePerRelayDelay >= DefaultTimeout {
+		capped := DefaultTimeout / 2
+		utils.LavaFormatWarning("min-relay-timeout >= default-processing-timeout, capping to 50% of processing timeout",
+			nil,
+			utils.LogAttr("min_relay_timeout", MinimumTimePerRelayDelay),
+			utils.LogAttr("default_processing_timeout", DefaultTimeout),
+			utils.LogAttr("capped_to", capped),
+		)
+		MinimumTimePerRelayDelay = capped
+	}
Evidence
The new validation caps the minimum relay floor to DefaultTimeout/2 without guarding against
DefaultTimeout<=0, which can set the floor to 0. Relay timeout is derived from GetTimePerCu (which
uses that floor), then processing timeout is applied through NodeUrl.LowerContextTimeout which calls
CapContextTimeout; when timeout=0 and the parent context has no deadline (remaining time is
MaxInt64), CapContextTimeout uses context.WithTimeout(ctx, 0), producing an immediate deadline and
failing the request.

protocol/common/timeout.go[28-83]
protocol/chainlib/common.go[110-115]
protocol/chainlib/common.go[362-377]
protocol/common/endpoints.go[138-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`ValidateAndCapMinRelayTimeout()` can set `MinimumTimePerRelayDelay` to `DefaultTimeout/2`. If `DefaultTimeout` is configured as `0s` (or becomes very small), the computed cap becomes `0`, which later leads to `CapContextTimeout(ctx, 0)` and immediate request timeouts.
### Issue Context
- `MinimumTimePerRelayDelay` participates in relay timeout calculation via `GetTimePerCu()` / `GetRelayTimeout()`.
- `processingTimeout` is enforced via `NodeUrl.LowerContextTimeout()` → `CapContextTimeout()`.
### Fix Focus Areas
- Add explicit validation that `DefaultTimeout > 0` (and optionally `MinimumTimePerRelayDelay > 0`) and either:
- return an error / fatal at startup for invalid values, OR
- clamp the computed `capped` value to a safe minimum (e.g., `time.Millisecond` or `time.Second`) so it never becomes `0`.
- Consider also guarding the cap path for `DefaultTimeout < 2` nanoseconds (integer division rounding to 0).
- protocol/common/timeout.go[36-49]
- protocol/common/timeout.go[78-83]
- protocol/chainlib/common.go[110-115]
- protocol/common/endpoints.go[138-147]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. ValidateAndCapMinRelayTimeout tests not table-driven 📘 Rule violation ⚙ Maintainability
Description
Multiple closely-related cases for ValidateAndCapMinRelayTimeout() are implemented as separate
tests instead of a table-driven test, reducing consistency and making extension harder. This
conflicts with the table-driven style requirement for multi-case validations.
Code

protocol/common/timeout_test.go[R22-64]

+func TestValidateAndCapMinRelayTimeout_NoCapNeeded(t *testing.T) {
+	resetTimeoutGlobals(t, MinimumTimePerRelayDelay, DefaultTimeout)
+
+	MinimumTimePerRelayDelay = 5 * time.Second
+	DefaultTimeout = 30 * time.Second
+
+	ValidateAndCapMinRelayTimeout()
+
+	require.Equal(t, 5*time.Second, MinimumTimePerRelayDelay, "should not be modified when min < default")
+}
+
+func TestValidateAndCapMinRelayTimeout_EqualTriggersCap(t *testing.T) {
+	resetTimeoutGlobals(t, MinimumTimePerRelayDelay, DefaultTimeout)
+
+	MinimumTimePerRelayDelay = 30 * time.Second
+	DefaultTimeout = 30 * time.Second
+
+	ValidateAndCapMinRelayTimeout()
+
+	require.Equal(t, 15*time.Second, MinimumTimePerRelayDelay, "should cap to 50% when min == default")
+}
+
+func TestValidateAndCapMinRelayTimeout_GreaterTriggersCap(t *testing.T) {
+	resetTimeoutGlobals(t, MinimumTimePerRelayDelay, DefaultTimeout)
+
+	MinimumTimePerRelayDelay = 60 * time.Second
+	DefaultTimeout = 30 * time.Second
+
+	ValidateAndCapMinRelayTimeout()
+
+	require.Equal(t, 15*time.Second, MinimumTimePerRelayDelay, "should cap to 50% when min > default")
+}
+
+func TestValidateAndCapMinRelayTimeout_CapIsHalfOfDefault(t *testing.T) {
+	resetTimeoutGlobals(t, MinimumTimePerRelayDelay, DefaultTimeout)
+
+	DefaultTimeout = 20 * time.Second
+	MinimumTimePerRelayDelay = 25 * time.Second
+
+	ValidateAndCapMinRelayTimeout()
+
+	require.Equal(t, DefaultTimeout/2, MinimumTimePerRelayDelay)
+}
Evidence
The checklist requires updated tests to use table-driven style when multiple cases are being
validated; the added tests repeat the same arrange/act/assert structure across several functions and
are suitable for a single table-driven test.

AGENTS.md
protocol/common/timeout_test.go[22-64]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Several tests validate the same function with different inputs using repetitive standalone test functions; these should be consolidated into a table-driven test.
## Issue Context
The suite for `ValidateAndCapMinRelayTimeout()` repeats the same setup and assertions with only duration values changing.
## Fix Focus Areas
- protocol/common/timeout_test.go[22-64]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Deadline treated as canceled🐞 Bug ☼ Reliability
Description
fetchEndpointConnectionFromConsumerSessionWithProvider skips incrementing ConnectionRefusals for any
ctx.Err()!=nil (including context.DeadlineExceeded) while logging "request context canceled", which
can suppress refusal tracking under deadline-expired requests and produces misleading logs.
Code

protocol/lavasession/consumer_types.go[R780-788]

+					if ctx.Err() != nil {
+						utils.LavaFormatDebug("skipping ConnectionRefusals increment: request context canceled",
+							utils.LogAttr("err", err),
+							utils.LogAttr("ctx_err", ctx.Err()),
+							utils.LogAttr("provider endpoint", networkAddress),
+							utils.LogAttr("GUID", ctx),
+						)
+						return nil, false
+					}
Evidence
The new check uses ctx.Err() != nil, which includes both cancellation and deadline-exceeded;
however the log message and the smart-router counterpart logic specifically target cancellation
(errors.Is(err, context.Canceled)). The codebase already distinguishes deadline-exceeded
explicitly via ContextOutOfTime, suggesting this consumer-side condition is overly
broad/inconsistent.

protocol/lavasession/consumer_types.go[773-807]
protocol/rpcsmartrouter/rpcsmartrouter_server.go[1833-1845]
protocol/common/timeout.go[66-68]
protocol/lavasession/consumer_types.go[482-505]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The consumer endpoint connection code currently treats any `ctx.Err()!=nil` as a cancellation case, skipping `ConnectionRefusals++` and logging "request context canceled". This conflates `context.Canceled` and `context.DeadlineExceeded` and is inconsistent with the smart-router direct relay logic, which checks `errors.Is(err, context.Canceled)`.
### Issue Context
- `ctx.Err()` may be `context.DeadlineExceeded` when the request times out.
- Skipping refusal increments on deadline-exceeded can hide slow/unreachable endpoints under tight deadlines.
### Fix Focus Areas
- Change the condition to specifically detect cancellation (e.g., `errors.Is(ctx.Err(), context.Canceled)` or `errors.Is(err, context.Canceled)`) before skipping refusal increments.
- Update the log message to match the actual condition (canceled vs deadline exceeded).
- protocol/lavasession/consumer_types.go[779-807]
- protocol/rpcsmartrouter/rpcsmartrouter_server.go[1833-1845]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@codecov

codecov Bot commented Mar 31, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 52.32558% with 41 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/rpcsmartrouter/rpcsmartrouter_server.go 28.57% 35 Missing ⚠️
protocol/common/timeout.go 93.10% 1 Missing and 1 partial ⚠️
protocol/rpcconsumer/rpcconsumer.go 0.00% 2 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter.go 0.00% 2 Missing ⚠️
Flag Coverage Δ
consensus 8.74% <93.10%> (+0.03%) ⬆️
protocol 34.13% <52.32%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/chaintracker/chain_tracker.go 61.36% <100.00%> (+1.36%) ⬆️
protocol/common/cobra_common.go 0.00% <ø> (ø)
protocol/lavasession/consumer_types.go 74.10% <100.00%> (ø)
protocol/common/timeout.go 55.00% <93.10%> (+55.00%) ⬆️
protocol/rpcconsumer/rpcconsumer.go 0.00% <0.00%> (ø)
protocol/rpcsmartrouter/rpcsmartrouter.go 5.02% <0.00%> (-0.02%) ⬇️
protocol/rpcsmartrouter/rpcsmartrouter_server.go 14.31% <28.57%> (+0.77%) ⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@NadavLevi NadavLevi force-pushed the fix/context-cancellation-misclassification branch from 42e1f76 to b3a506f Compare March 31, 2026 13:53
Comment thread protocol/common/timeout_test.go Outdated
Comment thread protocol/common/timeout.go Outdated
@NadavLevi NadavLevi force-pushed the fix/context-cancellation-misclassification branch 3 times, most recently from 0d34291 to 1ecc17b Compare March 31, 2026 14:29
@github-actions

github-actions Bot commented Mar 31, 2026

Copy link
Copy Markdown

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 26fd64e. ± Comparison against base commit 1559d6b.

♻️ This comment has been updated with latest results.

Comment thread protocol/rpcsmartrouter/rpcsmartrouter_server.go
Comment thread protocol/common/timeout.go
Comment thread protocol/lavasession/consumer_types.go
@NadavLevi NadavLevi force-pushed the fix/context-cancellation-misclassification branch from 3f19d07 to 3180255 Compare April 5, 2026 18:36
@NadavLevi NadavLevi requested a review from avitenzer April 5, 2026 18:39
@NadavLevi NadavLevi force-pushed the fix/context-cancellation-misclassification branch from 3180255 to 3447f3c Compare April 5, 2026 18:44
Comment thread protocol/rpcconsumer/rpcconsumer.go
Comment thread protocol/common/timeout.go
Comment thread protocol/rpcsmartrouter/rpcsmartrouter_server.go
Comment thread protocol/rpcsmartrouter/rpcsmartrouter_server.go
@NadavLevi NadavLevi force-pushed the fix/context-cancellation-misclassification branch 2 times, most recently from b04db2f to baa663a Compare April 6, 2026 13:49
avitenzer
avitenzer previously approved these changes Apr 6, 2026
- Skip ConnectionRefusals and MarkUnhealthy on client-side context
  cancellation (relay race losers / client disconnects) — these are not
  provider faults. Also skip backoff since the client is gone.
- Add --min-relay-timeout flag to configure minimum relay timeout floor
  (default 1s), with validation against --default-processing-timeout.
- Increase chain tracker fetch timeout from hardcoded 3s to
  max(10s, MinimumTimePerRelayDelay) to prevent spurious deadline errors
  on nodes behind CDNs like Cloudflare.
- Extract classifyHTTPStatus helper to deduplicate status code
  classification logic between classifyRelayError and inline handling.
- Add debug logging for non-HTTP errors marked unhealthy in
  logRelayErrorClassification.
- Fix double call to LocalNodeTimePerCu in GetTimePerCu.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NadavLevi NadavLevi force-pushed the fix/context-cancellation-misclassification branch from baa663a to 26fd64e Compare April 6, 2026 14:36
@NadavLevi NadavLevi requested a review from avitenzer April 6, 2026 14:38
@Harraken Harraken merged commit b9c7b53 into main Apr 6, 2026
30 checks passed
@Harraken Harraken deleted the fix/context-cancellation-misclassification branch April 6, 2026 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants