Skip to content

fix(onboard): retry inference validation on transient upstream failures#3104

Merged
cv merged 6 commits into
mainfrom
fix/issue-2980-retry-transient-upstream
May 6, 2026
Merged

fix(onboard): retry inference validation on transient upstream failures#3104
cv merged 6 commits into
mainfrom
fix/issue-2980-retry-transient-upstream

Conversation

@laitingsheng

@laitingsheng laitingsheng commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

Treat HTTP 502/503/504 as retriable in addition to 429, and extend the curl-timeout retry into the same backoff schedule. Hosted providers (NVIDIA Endpoints in particular) periodically emit these for tens of seconds at a time, which currently fails onboard and nightly E2E on the first response.

Related Issue

Closes #2980
Closes #3033

Changes

  • src/lib/onboard-inference-probes.ts: add 502/503/504 to RETRIABLE_HTTP_PROBE_STATUSES, export it, and replace the single doubled-timeout curl-failure retry with a loop over HTTP_PROBE_RETRY_DELAYS_MS (5s/15s/30s).
  • src/lib/onboard-inference-probes.ts: gate sleepSync on NEMOCLAW_TEST_NO_SLEEP=1 so behavioural tests don't burn 50s of wall-clock per run.
  • src/lib/onboard-inference-probes.test.ts: add membership tests for the retriable-status set, plus a fake-curl behavioural regression test that returns 502 then 200 and asserts the probe recovers (counter pinned at 2 invocations).

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Improved retry logic to better handle transient HTTP failures (429, 502, 503, 504), with adaptive timeouts and backoff for chat completion requests to reduce interruptions from flaky endpoints.
  • Tests

    • Added comprehensive tests covering retriable HTTP statuses and realistic retry scenarios (including mocked upstream recoveries) to prevent regressions.
    • Tests include safeguards to avoid real wait times, speeding up and stabilizing runs.

Treat HTTP 502/503/504 as retriable in addition to 429, and extend the
curl-timeout retry into the same backoff schedule. Hosted providers
(NVIDIA Endpoints in particular) periodically emit these for tens of
seconds at a time, currently failing onboard and nightly E2E on the
first response.

Closes #2980
Closes #3033

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Adds retriable HTTP statuses and backoff retry logic to the onboard inference probe, doubles timeouts for retry attempts on timeout/connection failures, adds a test-mode sleep guard to avoid real delays, exports RETRIABLE_HTTP_PROBE_STATUSES for tests, and adds tests for retriable statuses and a regression retry scenario.

Changes

Endpoint Validation Retry Mechanism

Layer / File(s) Summary
Retry Configuration
src/lib/onboard-inference-probes.ts
Adds RETRIABLE_HTTP_PROBE_STATUSES = new Set([429, 502, 503, 504]) and HTTP_PROBE_RETRY_DELAYS_MS = [5_000, 15_000, 30_000].
Test Sleep Guard
src/lib/onboard-inference-probes.ts
sleepSync bypasses real sleep when process.env.VITEST === "true" or process.env.NEMOCLAW_TEST_NO_SLEEP === "1".
Retry / Backoff Logic
src/lib/onboard-inference-probes.ts
Adds isTimeoutOrConnFailure, isRetriableProbeResult, retriedAfterTimeout tracking, buildRetryArgs (doubles numeric timeouts), and a retry loop for Chat Completions probe that logs reasons and applies backoff delays.
Public Export
src/lib/onboard-inference-probes.ts
Exports RETRIABLE_HTTP_PROBE_STATUSES from the module for external/tests consumption.
Status Tests
src/lib/onboard-inference-probes.test.ts
Adds tests asserting 429, 502, 503, 504 are retriable and 400, 401, 403, 404, 500, 200 are not retriable.
Regression Recovery Test
src/lib/onboard-inference-probes.test.ts
Adds regression-style test that mocks curl to simulate an upstream 502 that clears on retry, verifying logs, a counter file increment, and successful probe completion.
sequenceDiagram
    participant Onboard as Onboard
    participant Probe as Probe
    participant Curl as Curl
    participant API as API

    Onboard->>Probe: Start inference endpoint validation
    Probe->>Curl: Invoke request
    Curl->>API: HTTP request
    alt Success (200)
        API-->>Curl: 200 OK
        Curl-->>Probe: Success
        Probe-->>Onboard: Validation passed
    else Retriable (429/502/503/504) or Timeout/Conn Failure
        API-->>Curl: 502 or Timeout
        Curl-->>Probe: Error/status
        Probe->>Probe: Check RETRIABLE_HTTP_PROBE_STATUSES or isTimeoutOrConnFailure
        Probe->>Probe: Backoff delay (5s,15s,30s)
        Probe->>Curl: Retry with doubled timeouts
        Curl->>API: HTTP request (retry)
        API-->>Curl: eventual 200 OK
        Curl-->>Probe: Success
        Probe-->>Onboard: Validation passed
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A rabbit hops between retries,
429 and 502 skies—
five, fifteen, thirty tap the drum,
doubled timeouts, try once more—then, hum,
logs sing: the probe returned OK.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding retry logic for transient upstream failures in inference validation.
Linked Issues check ✅ Passed The PR implements all coding requirements from both issues: treats 502/503/504 as retriable [#2980], adds backoff retry loop [#2980, #3033], gates sleepSync for tests, and includes comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes directly support the linked issues' objectives—retriable status constants, backoff retry logic, test safeguards, and test coverage for transient HTTP failures.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-2980-retry-transient-upstream

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard-inference-probes.ts (1)

445-455: 💤 Low value

Optional: Timeout-retry path doesn't handle retriable HTTP statuses.

If the doubled-timeout retry succeeds at the curl level but returns a retriable HTTP status (e.g., 502), the loop breaks because isTimeoutOrConnFailure(0) is false. This is an edge case since primary HTTP retry is handled by executeProbeWithHttpRetry earlier. No action required now, but worth noting for future hardening.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard-inference-probes.ts` around lines 445 - 455, The retry loop
only continues when isTimeoutOrConnFailure(retryResult.curlStatus) is true, so
if runCurlProbe/buildRetryArgs returns curlStatus 0 but a retriable HTTP status
(e.g., 502) the loop will break; update the loop condition or post-probe check
to also treat retriable HTTP statuses as retriable: add or call a helper like
isRetriableHttpStatus(retryResult.httpStatus) (or reuse existing HTTP retry
logic from executeProbeWithHttpRetry) so the loop continues on either a
timeout/connection failure OR a retriable HTTP status, and ensure the return
path still returns ok when retryResult.ok becomes true (symbols to edit:
HTTP_PROBE_RETRY_DELAYS_MS, isTimeoutOrConnFailure, retryResult.curlStatus,
retryResult.httpStatus, runCurlProbe, buildRetryArgs,
executeProbeWithHttpRetry).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 445-455: The retry loop only continues when
isTimeoutOrConnFailure(retryResult.curlStatus) is true, so if
runCurlProbe/buildRetryArgs returns curlStatus 0 but a retriable HTTP status
(e.g., 502) the loop will break; update the loop condition or post-probe check
to also treat retriable HTTP statuses as retriable: add or call a helper like
isRetriableHttpStatus(retryResult.httpStatus) (or reuse existing HTTP retry
logic from executeProbeWithHttpRetry) so the loop continues on either a
timeout/connection failure OR a retriable HTTP status, and ensure the return
path still returns ok when retryResult.ok becomes true (symbols to edit:
HTTP_PROBE_RETRY_DELAYS_MS, isTimeoutOrConnFailure, retryResult.curlStatus,
retryResult.httpStatus, runCurlProbe, buildRetryArgs,
executeProbeWithHttpRetry).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 50ff2100-87d7-4887-89c9-94e73059f56b

📥 Commits

Reviewing files that changed from the base of the PR and between 3800964 and 6d19fed.

📒 Files selected for processing (2)
  • src/lib/onboard-inference-probes.test.ts
  • src/lib/onboard-inference-probes.ts

The new curl-failure backoff loop ran real 5/15/30s sleeps under
vitest, which exceeded the 15s test timeout in
test/onboard-selection.test.ts:2565. Detect VITEST=true (set
automatically by the test runner) and short-circuit sleepSync, in
addition to the existing NEMOCLAW_TEST_NO_SLEEP opt-in.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard-inference-probes.ts (1)

432-455: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the original chat probe parameters in the timeout retry path.

Lines 432-443 rebuild the retry request manually, which drops options.authMode === "query-param" and options.isWsl. A transient timeout on those endpoints will retry with a different URL/auth/timeout profile than the initial probe, so the retry can fail for the wrong reason. Please derive the retry from the same appendKey(...) / authHeader / options.isWsl inputs as chatCompletionsProbe instead of reconstructing it here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard-inference-probes.ts` around lines 432 - 455, The retry path
is rebuilding curl args (buildRetryArgs) and losing the original authMode
(appendKey/query-param) and options.isWsl profile; instead, reuse the exact curl
args used by the initial chatCompletionsProbe so retries use the same
URL/auth/timeout. Capture or reconstruct the original probe args from the same
sources (the appendKey(...) result, authHeader, options.isWsl, and
getValidationProbeCurlArgs()) rather than hardcoding doubledArgs here — e.g.,
obtain the initial probe's final curl arguments (or call the same helper that
chatCompletionsProbe uses) and pass those unchanged into runCurlProbe for each
retry so appendKey, authMode, authHeader and isWsl are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard-inference-probes.ts`:
- Line 430: The current check uses failures[0] to detect curl
timeouts/connection errors, which misses cases where a later probe (e.g., the
chat-completions entry) timed out; update the logic in
onboard-inference-probes.ts to search the failures array for an entry where
isTimeoutOrConnFailure(failure.curlStatus) is true (e.g., find or findIndex) and
base the fallback/retry on that found failure (the chat-completions failure
entry) instead of always inspecting failures[0]; ensure you reference the
failures array and isTimeoutOrConnFailure exactly so the retry targets the
actual timed-out probe.

---

Outside diff comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 432-455: The retry path is rebuilding curl args (buildRetryArgs)
and losing the original authMode (appendKey/query-param) and options.isWsl
profile; instead, reuse the exact curl args used by the initial
chatCompletionsProbe so retries use the same URL/auth/timeout. Capture or
reconstruct the original probe args from the same sources (the appendKey(...)
result, authHeader, options.isWsl, and getValidationProbeCurlArgs()) rather than
hardcoding doubledArgs here — e.g., obtain the initial probe's final curl
arguments (or call the same helper that chatCompletionsProbe uses) and pass
those unchanged into runCurlProbe for each retry so appendKey, authMode,
authHeader and isWsl are preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ac16c734-418b-4418-9479-7a0d4fd5172a

📥 Commits

Reviewing files that changed from the base of the PR and between 6d19fed and c26d52f.

📒 Files selected for processing (1)
  • src/lib/onboard-inference-probes.ts

Comment thread src/lib/onboard-inference-probes.ts Outdated
Address two CodeRabbit findings on the inference probe retry loop:

* The trigger condition checked only `failures[0].curlStatus`, so a probe
  ordering like /responses (HTTP error) followed by /chat/completions
  (curl 28) skipped the retry path entirely. Trigger on any timeout entry
  in the failure list.
* The retry loop broke out as soon as the curl invocation succeeded at
  the transport level, even when the response was a retriable 5xx. Treat
  curl timeout/connection failure OR retriable HTTP status as a reason
  to keep retrying, and surface a status-aware "retrying in Ns..." log
  so the cause of each retry is clear.

Add behavioural coverage for both flows: /responses 404 + chat-completions
curl 28 + chat-completions 200 on retry, and curl 28 + 502 + 200.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard-inference-probes.ts (1)

458-484: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Retry the same chat-completions probe configuration here.

Lines 458-484 rebuild the fallback request with generic curl timings, a hardcoded Bearer header, and runCurlProbe(). That is no longer the same probe you ran at Lines 343-348, so the retry path drops authMode: "query-param", ignores options.isWsl, and skips the model-specific chat-completions configuration. Endpoints/models that only pass with the original probe shape can still fail on the “recovery” attempt.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard-inference-probes.ts` around lines 458 - 484, The retry block
rebuilds a different curl request and drops important flags (authMode:
"query-param", options.isWsl, and the original model-specific probe shape), so
change it to reuse the exact initial probe configuration instead of creating a
new generic buildRetryArgs: capture the original curl args used for the first
chat-completions probe (the array returned by getValidationProbeCurlArgs or the
variable passed to runCurlProbe earlier), and call runCurlProbe with that
preserved args array on each retry (or merge only the timing/timeout tweaks into
that same args array); ensure the retry still uses the same
getChatCompletionsProbePayload(model) shape, respects options.isWsl and the
authMode="query-param" handling (do not hardcode only Authorization: Bearer) and
keep normalizeCredentialValue(apiKey) usage consistent so endpoints that require
the original probe shape will succeed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 458-484: The retry block rebuilds a different curl request and
drops important flags (authMode: "query-param", options.isWsl, and the original
model-specific probe shape), so change it to reuse the exact initial probe
configuration instead of creating a new generic buildRetryArgs: capture the
original curl args used for the first chat-completions probe (the array returned
by getValidationProbeCurlArgs or the variable passed to runCurlProbe earlier),
and call runCurlProbe with that preserved args array on each retry (or merge
only the timing/timeout tweaks into that same args array); ensure the retry
still uses the same getChatCompletionsProbePayload(model) shape, respects
options.isWsl and the authMode="query-param" handling (do not hardcode only
Authorization: Bearer) and keep normalizeCredentialValue(apiKey) usage
consistent so endpoints that require the original probe shape will succeed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 91f35605-a20f-4717-9efb-ebe1fa25a3a3

📥 Commits

Reviewing files that changed from the base of the PR and between 3c35f7c and c41345f.

📒 Files selected for processing (2)
  • src/lib/onboard-inference-probes.test.ts
  • src/lib/onboard-inference-probes.ts

@prekshivyas prekshivyas self-assigned this May 6, 2026

@prekshivyas prekshivyas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct fix for #2980 / #3033. Adding 502/503/504 to RETRIABLE_HTTP_PROBE_STATUSES is the right call — NVIDIA Endpoints does emit these for tens of seconds during incidents, and the existing 429-only retry was too narrow. The failures.some(...) broadening from failures[0] also fixes a real ordering bug (/responses=404 → /chat-completions=timeout no longer hard-fails on the first attempt).

The new doubled-timeout retry loop in the chat-completions retry block uses isRetriableProbeResult (timeout OR retriable HTTP), which is the right predicate for that path — it sits after the timeout that triggered it, so retried calls can encounter either failure mode. The existing executeProbeWithHttpRetry keeps using shouldRetryHttpProbe (curlStatus===0 + retriable HTTP only) for the entry-level probe loop, where timeouts have their own dedicated downstream handler. The two predicates are correctly separated by stage.

Test coverage of the membership set and end-to-end recovery scenarios is solid — the fake-bash-curl + counter-file pattern is a faithful reproduction of the upstream failure modes. Bug fix correctness is well-pinned.

Worth a release note (non-blocking): Worst-case wait on a permanently-broken upstream now reaches ~150s of sleepSync waits across up to 12 curl invocations (entry-level /responses retries: ~50s, entry-level /chat-completions retries: ~50s, timeout-retry block backoff: ~50s). That's a real shift from the old ~5s instant-fail behavior. Mitigated by the per-attempt console.log lines so the wizard doesn't appear hung, but users on hard-failure paths will see the wizard sit for 2–3 minutes before reporting failure. Recommend calling this out in release notes so users on permanently-misconfigured endpoints know what to expect.

Minor follow-up nits (non-blocking):

  1. process.env.VITEST === \"true\" in sleepSync — production sleep code shouldn't read the test runner's env var. The new tests already set NEMOCLAW_TEST_NO_SLEEP=1 explicitly; the VITEST branch is doing double duty as a "don't update existing tests" hack and creates a small risk if anything else inadvertently sets that env in a non-test context.
  2. New retry loop is exercised at most 1 iteration deep across the three new tests. No test covers iterations 2 or 3 of the [5_000, 15_000, 30_000] schedule, and no test covers the exhaustion path (all 4 retries return retriable failures → final hard-failure surfaces). A future patch could add a counter==4 case.

CI: pr.yaml checks all pass (commit-lint, dco, layer-boundary, check-hash, changes, get-pr-info, CodeRabbit). checks (3m29s) and macos-e2e (2m38s) green. test-e2e-ollama-proxy pass. 4 prior pr-self-hosted runs success on pull-request/3104, current in_progress. build-sandbox-images pending.

@cv cv merged commit 283ee50 into main May 6, 2026
15 checks passed
miyoungc added a commit that referenced this pull request May 7, 2026
## Summary
- Bump docs metadata to 0.0.36 and refresh generated NemoClaw user
skills.
- Document Model Router onboarding, validation retries, Ollama tool
checks, Hermes policy behavior, and deployment verification updates.
- Remove suppressed experimental command references from public docs per
`docs/.docs-skip`.

## Source summary
- #2202 -> `docs/get-started/quickstart.md`,
`docs/inference/inference-options.md`, `docs/reference/architecture.md`:
Document Model Router setup and routed inference architecture.
- #3128 -> `docs/get-started/quickstart.md`,
`docs/reference/commands.md`: Document deployment verification and HTTP
401 health handling.
- #3104 -> `docs/inference/inference-options.md`: Document retry
behavior for transient provider validation failures.
- #3121 -> `docs/reference/architecture.md`: Document agent-scoped
model/provider compatibility manifests.
- #3046 -> `docs/reference/architecture.md`: Tie model-specific
compatibility setup to known model/provider behavior.
- #3097 -> `docs/inference/use-local-inference.md`: Document Ollama
tool-calling capability validation.
- #3082 -> `docs/reference/commands.md`: Document
`NEMOCLAW_SANDBOX_NAME` as the interactive sandbox-name default.
- f586cc5, 3442adf -> `docs/get-started/quickstart-hermes.md`,
`docs/reference/network-policies.md`: Document Hermes agent-specific
baseline policy endpoints.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `npm run build:cli`
- `rg` skip-term scan for `docs/` and generated user skills

Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Model Router provider for complexity-based routed inference.
* Ollama/local inference onboarding now validates tool-calling
capability.
  * Added `local-inference` network policy preset.

* **Documentation**
* New integration policy examples (Outlook, Telegram, Slack, Discord,
GitHub, Jira, etc.).
  * Clarified config immutability workflow and sandbox writable paths.
  * Hermes baseline network policy documented.

* **Improvements**
* Health checks treat device-auth responses as live; transient
validation retries.
* Installer performs pre-install reachability checks; CLI onboarding
gained a --fresh option.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

4 participants