fix(onboard): retry inference validation on transient upstream failures#3104
Conversation
Treat HTTP 502/503/504 as retriable in addition to 429, and extend the curl-timeout retry into the same backoff schedule. Hosted providers (NVIDIA Endpoints in particular) periodically emit these for tens of seconds at a time, currently failing onboard and nightly E2E on the first response. Closes #2980 Closes #3033 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
📝 WalkthroughWalkthroughAdds retriable HTTP statuses and backoff retry logic to the onboard inference probe, doubles timeouts for retry attempts on timeout/connection failures, adds a test-mode sleep guard to avoid real delays, exports ChangesEndpoint Validation Retry Mechanism
sequenceDiagram
participant Onboard as Onboard
participant Probe as Probe
participant Curl as Curl
participant API as API
Onboard->>Probe: Start inference endpoint validation
Probe->>Curl: Invoke request
Curl->>API: HTTP request
alt Success (200)
API-->>Curl: 200 OK
Curl-->>Probe: Success
Probe-->>Onboard: Validation passed
else Retriable (429/502/503/504) or Timeout/Conn Failure
API-->>Curl: 502 or Timeout
Curl-->>Probe: Error/status
Probe->>Probe: Check RETRIABLE_HTTP_PROBE_STATUSES or isTimeoutOrConnFailure
Probe->>Probe: Backoff delay (5s,15s,30s)
Probe->>Curl: Retry with doubled timeouts
Curl->>API: HTTP request (retry)
API-->>Curl: eventual 200 OK
Curl-->>Probe: Success
Probe-->>Onboard: Validation passed
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard-inference-probes.ts (1)
445-455: 💤 Low valueOptional: Timeout-retry path doesn't handle retriable HTTP statuses.
If the doubled-timeout retry succeeds at the curl level but returns a retriable HTTP status (e.g., 502), the loop breaks because
isTimeoutOrConnFailure(0)is false. This is an edge case since primary HTTP retry is handled byexecuteProbeWithHttpRetryearlier. No action required now, but worth noting for future hardening.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard-inference-probes.ts` around lines 445 - 455, The retry loop only continues when isTimeoutOrConnFailure(retryResult.curlStatus) is true, so if runCurlProbe/buildRetryArgs returns curlStatus 0 but a retriable HTTP status (e.g., 502) the loop will break; update the loop condition or post-probe check to also treat retriable HTTP statuses as retriable: add or call a helper like isRetriableHttpStatus(retryResult.httpStatus) (or reuse existing HTTP retry logic from executeProbeWithHttpRetry) so the loop continues on either a timeout/connection failure OR a retriable HTTP status, and ensure the return path still returns ok when retryResult.ok becomes true (symbols to edit: HTTP_PROBE_RETRY_DELAYS_MS, isTimeoutOrConnFailure, retryResult.curlStatus, retryResult.httpStatus, runCurlProbe, buildRetryArgs, executeProbeWithHttpRetry).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 445-455: The retry loop only continues when
isTimeoutOrConnFailure(retryResult.curlStatus) is true, so if
runCurlProbe/buildRetryArgs returns curlStatus 0 but a retriable HTTP status
(e.g., 502) the loop will break; update the loop condition or post-probe check
to also treat retriable HTTP statuses as retriable: add or call a helper like
isRetriableHttpStatus(retryResult.httpStatus) (or reuse existing HTTP retry
logic from executeProbeWithHttpRetry) so the loop continues on either a
timeout/connection failure OR a retriable HTTP status, and ensure the return
path still returns ok when retryResult.ok becomes true (symbols to edit:
HTTP_PROBE_RETRY_DELAYS_MS, isTimeoutOrConnFailure, retryResult.curlStatus,
retryResult.httpStatus, runCurlProbe, buildRetryArgs,
executeProbeWithHttpRetry).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 50ff2100-87d7-4887-89c9-94e73059f56b
📒 Files selected for processing (2)
src/lib/onboard-inference-probes.test.tssrc/lib/onboard-inference-probes.ts
The new curl-failure backoff loop ran real 5/15/30s sleeps under vitest, which exceeded the 15s test timeout in test/onboard-selection.test.ts:2565. Detect VITEST=true (set automatically by the test runner) and short-circuit sleepSync, in addition to the existing NEMOCLAW_TEST_NO_SLEEP opt-in. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard-inference-probes.ts (1)
432-455:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve the original chat probe parameters in the timeout retry path.
Lines 432-443 rebuild the retry request manually, which drops
options.authMode === "query-param"andoptions.isWsl. A transient timeout on those endpoints will retry with a different URL/auth/timeout profile than the initial probe, so the retry can fail for the wrong reason. Please derive the retry from the sameappendKey(...)/authHeader/options.isWslinputs aschatCompletionsProbeinstead of reconstructing it here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard-inference-probes.ts` around lines 432 - 455, The retry path is rebuilding curl args (buildRetryArgs) and losing the original authMode (appendKey/query-param) and options.isWsl profile; instead, reuse the exact curl args used by the initial chatCompletionsProbe so retries use the same URL/auth/timeout. Capture or reconstruct the original probe args from the same sources (the appendKey(...) result, authHeader, options.isWsl, and getValidationProbeCurlArgs()) rather than hardcoding doubledArgs here — e.g., obtain the initial probe's final curl arguments (or call the same helper that chatCompletionsProbe uses) and pass those unchanged into runCurlProbe for each retry so appendKey, authMode, authHeader and isWsl are preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard-inference-probes.ts`:
- Line 430: The current check uses failures[0] to detect curl
timeouts/connection errors, which misses cases where a later probe (e.g., the
chat-completions entry) timed out; update the logic in
onboard-inference-probes.ts to search the failures array for an entry where
isTimeoutOrConnFailure(failure.curlStatus) is true (e.g., find or findIndex) and
base the fallback/retry on that found failure (the chat-completions failure
entry) instead of always inspecting failures[0]; ensure you reference the
failures array and isTimeoutOrConnFailure exactly so the retry targets the
actual timed-out probe.
---
Outside diff comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 432-455: The retry path is rebuilding curl args (buildRetryArgs)
and losing the original authMode (appendKey/query-param) and options.isWsl
profile; instead, reuse the exact curl args used by the initial
chatCompletionsProbe so retries use the same URL/auth/timeout. Capture or
reconstruct the original probe args from the same sources (the appendKey(...)
result, authHeader, options.isWsl, and getValidationProbeCurlArgs()) rather than
hardcoding doubledArgs here — e.g., obtain the initial probe's final curl
arguments (or call the same helper that chatCompletionsProbe uses) and pass
those unchanged into runCurlProbe for each retry so appendKey, authMode,
authHeader and isWsl are preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ac16c734-418b-4418-9479-7a0d4fd5172a
📒 Files selected for processing (1)
src/lib/onboard-inference-probes.ts
Address two CodeRabbit findings on the inference probe retry loop: * The trigger condition checked only `failures[0].curlStatus`, so a probe ordering like /responses (HTTP error) followed by /chat/completions (curl 28) skipped the retry path entirely. Trigger on any timeout entry in the failure list. * The retry loop broke out as soon as the curl invocation succeeded at the transport level, even when the response was a retriable 5xx. Treat curl timeout/connection failure OR retriable HTTP status as a reason to keep retrying, and surface a status-aware "retrying in Ns..." log so the cause of each retry is clear. Add behavioural coverage for both flows: /responses 404 + chat-completions curl 28 + chat-completions 200 on retry, and curl 28 + 502 + 200. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard-inference-probes.ts (1)
458-484:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRetry the same chat-completions probe configuration here.
Lines 458-484 rebuild the fallback request with generic curl timings, a hardcoded Bearer header, and
runCurlProbe(). That is no longer the same probe you ran at Lines 343-348, so the retry path dropsauthMode: "query-param", ignoresoptions.isWsl, and skips the model-specific chat-completions configuration. Endpoints/models that only pass with the original probe shape can still fail on the “recovery” attempt.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard-inference-probes.ts` around lines 458 - 484, The retry block rebuilds a different curl request and drops important flags (authMode: "query-param", options.isWsl, and the original model-specific probe shape), so change it to reuse the exact initial probe configuration instead of creating a new generic buildRetryArgs: capture the original curl args used for the first chat-completions probe (the array returned by getValidationProbeCurlArgs or the variable passed to runCurlProbe earlier), and call runCurlProbe with that preserved args array on each retry (or merge only the timing/timeout tweaks into that same args array); ensure the retry still uses the same getChatCompletionsProbePayload(model) shape, respects options.isWsl and the authMode="query-param" handling (do not hardcode only Authorization: Bearer) and keep normalizeCredentialValue(apiKey) usage consistent so endpoints that require the original probe shape will succeed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/onboard-inference-probes.ts`:
- Around line 458-484: The retry block rebuilds a different curl request and
drops important flags (authMode: "query-param", options.isWsl, and the original
model-specific probe shape), so change it to reuse the exact initial probe
configuration instead of creating a new generic buildRetryArgs: capture the
original curl args used for the first chat-completions probe (the array returned
by getValidationProbeCurlArgs or the variable passed to runCurlProbe earlier),
and call runCurlProbe with that preserved args array on each retry (or merge
only the timing/timeout tweaks into that same args array); ensure the retry
still uses the same getChatCompletionsProbePayload(model) shape, respects
options.isWsl and the authMode="query-param" handling (do not hardcode only
Authorization: Bearer) and keep normalizeCredentialValue(apiKey) usage
consistent so endpoints that require the original probe shape will succeed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 91f35605-a20f-4717-9efb-ebe1fa25a3a3
📒 Files selected for processing (2)
src/lib/onboard-inference-probes.test.tssrc/lib/onboard-inference-probes.ts
prekshivyas
left a comment
There was a problem hiding this comment.
Correct fix for #2980 / #3033. Adding 502/503/504 to RETRIABLE_HTTP_PROBE_STATUSES is the right call — NVIDIA Endpoints does emit these for tens of seconds during incidents, and the existing 429-only retry was too narrow. The failures.some(...) broadening from failures[0] also fixes a real ordering bug (/responses=404 → /chat-completions=timeout no longer hard-fails on the first attempt).
The new doubled-timeout retry loop in the chat-completions retry block uses isRetriableProbeResult (timeout OR retriable HTTP), which is the right predicate for that path — it sits after the timeout that triggered it, so retried calls can encounter either failure mode. The existing executeProbeWithHttpRetry keeps using shouldRetryHttpProbe (curlStatus===0 + retriable HTTP only) for the entry-level probe loop, where timeouts have their own dedicated downstream handler. The two predicates are correctly separated by stage.
Test coverage of the membership set and end-to-end recovery scenarios is solid — the fake-bash-curl + counter-file pattern is a faithful reproduction of the upstream failure modes. Bug fix correctness is well-pinned.
Worth a release note (non-blocking): Worst-case wait on a permanently-broken upstream now reaches ~150s of sleepSync waits across up to 12 curl invocations (entry-level /responses retries: ~50s, entry-level /chat-completions retries: ~50s, timeout-retry block backoff: ~50s). That's a real shift from the old ~5s instant-fail behavior. Mitigated by the per-attempt console.log lines so the wizard doesn't appear hung, but users on hard-failure paths will see the wizard sit for 2–3 minutes before reporting failure. Recommend calling this out in release notes so users on permanently-misconfigured endpoints know what to expect.
Minor follow-up nits (non-blocking):
process.env.VITEST === \"true\"insleepSync— production sleep code shouldn't read the test runner's env var. The new tests already setNEMOCLAW_TEST_NO_SLEEP=1explicitly; theVITESTbranch is doing double duty as a "don't update existing tests" hack and creates a small risk if anything else inadvertently sets that env in a non-test context.- New retry loop is exercised at most 1 iteration deep across the three new tests. No test covers iterations 2 or 3 of the
[5_000, 15_000, 30_000]schedule, and no test covers the exhaustion path (all 4 retries return retriable failures → final hard-failure surfaces). A future patch could add a counter==4 case.
CI: pr.yaml checks all pass (commit-lint, dco, layer-boundary, check-hash, changes, get-pr-info, CodeRabbit). checks (3m29s) and macos-e2e (2m38s) green. test-e2e-ollama-proxy pass. 4 prior pr-self-hosted runs success on pull-request/3104, current in_progress. build-sandbox-images pending.
## Summary - Bump docs metadata to 0.0.36 and refresh generated NemoClaw user skills. - Document Model Router onboarding, validation retries, Ollama tool checks, Hermes policy behavior, and deployment verification updates. - Remove suppressed experimental command references from public docs per `docs/.docs-skip`. ## Source summary - #2202 -> `docs/get-started/quickstart.md`, `docs/inference/inference-options.md`, `docs/reference/architecture.md`: Document Model Router setup and routed inference architecture. - #3128 -> `docs/get-started/quickstart.md`, `docs/reference/commands.md`: Document deployment verification and HTTP 401 health handling. - #3104 -> `docs/inference/inference-options.md`: Document retry behavior for transient provider validation failures. - #3121 -> `docs/reference/architecture.md`: Document agent-scoped model/provider compatibility manifests. - #3046 -> `docs/reference/architecture.md`: Tie model-specific compatibility setup to known model/provider behavior. - #3097 -> `docs/inference/use-local-inference.md`: Document Ollama tool-calling capability validation. - #3082 -> `docs/reference/commands.md`: Document `NEMOCLAW_SANDBOX_NAME` as the interactive sandbox-name default. - f586cc5, 3442adf -> `docs/get-started/quickstart-hermes.md`, `docs/reference/network-policies.md`: Document Hermes agent-specific baseline policy endpoints. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `npm run build:cli` - `rg` skip-term scan for `docs/` and generated user skills Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Model Router provider for complexity-based routed inference. * Ollama/local inference onboarding now validates tool-calling capability. * Added `local-inference` network policy preset. * **Documentation** * New integration policy examples (Outlook, Telegram, Slack, Discord, GitHub, Jira, etc.). * Clarified config immutability workflow and sandbox writable paths. * Hermes baseline network policy documented. * **Improvements** * Health checks treat device-auth responses as live; transient validation retries. * Installer performs pre-install reachability checks; CLI onboarding gained a --fresh option. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Treat HTTP 502/503/504 as retriable in addition to 429, and extend the curl-timeout retry into the same backoff schedule. Hosted providers (NVIDIA Endpoints in particular) periodically emit these for tens of seconds at a time, which currently fails onboard and nightly E2E on the first response.
Related Issue
Closes #2980
Closes #3033
Changes
src/lib/onboard-inference-probes.ts: add 502/503/504 toRETRIABLE_HTTP_PROBE_STATUSES, export it, and replace the single doubled-timeout curl-failure retry with a loop overHTTP_PROBE_RETRY_DELAYS_MS(5s/15s/30s).src/lib/onboard-inference-probes.ts: gatesleepSynconNEMOCLAW_TEST_NO_SLEEP=1so behavioural tests don't burn 50s of wall-clock per run.src/lib/onboard-inference-probes.test.ts: add membership tests for the retriable-status set, plus a fake-curl behavioural regression test that returns 502 then 200 and asserts the probe recovers (counter pinned at 2 invocations).Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests