Skip to content

fix(onboard): smoke test inference before success#3603

Merged
jyaunches merged 10 commits into
mainfrom
fix-3253-onboard-inference-smoke
May 15, 2026
Merged

fix(onboard): smoke test inference before success#3603
jyaunches merged 10 commits into
mainfrom
fix-3253-onboard-inference-smoke

Conversation

@jyaunches

@jyaunches jyaunches commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #3253 by adding a post-route onboard inference smoke check before setup reports success.

After openshell inference set and route verification, OpenAI-compatible onboard providers now run a one-shot chat completions probe against the configured endpoint/model. If the probe fails, onboard exits before the success summary and prints actionable diagnostics: provider, model, API base, credential env, and upstream probe error.

Validation

  • npm run build:cli
  • bash test/e2e/test-onboard-inference-smoke.sh
  • npm test -- --run test/onboard.test.ts

Regression guard

Guard PR #3595 is merged. The fix is complete when this branch flips onboard-inference-smoke-e2e green:

gh workflow run regression-e2e.yaml --repo NVIDIA/NemoClaw -f jobs=onboard-inference-smoke-e2e --ref fix-3253-onboard-inference-smoke

Related: #3253, #3595

Summary by CodeRabbit

  • New Features
    • Added an extra onboard inference smoke check during startup to validate configured inference endpoints and credentials.
    • Introduced lightweight probing that resolves credentials, skips probes during test runs, logs redacted error context on failure, and stops startup if verification fails.

Review Change Stack

Add a post-route inference smoke probe so onboard fails before reporting success when the configured provider/model cannot serve chat completions.

Fixes #3253
@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a lightweight OpenAI-like endpoint smoke test to onboarding and invokes it immediately after verifyInferenceRoute() in both the Hermes and non-Hermes setupInference() paths.

Changes

Onboarding inference smoke check

Layer / File(s) Summary
Core smoke verification logic
src/lib/inference/onboard-probes.ts
shouldSmokeOpenAiLikeOnboardRoute() selects OpenAI-like providers. verifyOnboardInferenceSmoke() resolves endpoint and credentials, skips under Vitest, runs probeOpenAiLikeEndpoint with skipResponsesProbe: true, logs success, and exits on failure.
Wiring into onboarding flow
src/lib/onboard.ts
Import verifyOnboardInferenceSmoke and invoke it immediately after verifyInferenceRoute() in both the Hermes provider branch and the non-Hermes final verification path within setupInference().

Sequence Diagram(s)

sequenceDiagram
  participant SetupInference
  participant VerifyOnboardInferenceSmoke
  participant CredentialStore
  participant ProbeOpenAiLikeEndpoint
  SetupInference->>VerifyOnboardInferenceSmoke: verifyOnboardInferenceSmoke({provider, model, endpointUrl, credentialEnv})
  VerifyOnboardInferenceSmoke->>CredentialStore: resolveProviderCredential/getCredential
  VerifyOnboardInferenceSmoke->>ProbeOpenAiLikeEndpoint: probeOpenAiLikeEndpoint(skipResponsesProbe: true)
  alt probe succeeds
    ProbeOpenAiLikeEndpoint-->>VerifyOnboardInferenceSmoke: 200 OK
    VerifyOnboardInferenceSmoke-->>SetupInference: log ✓
  else probe fails
    ProbeOpenAiLikeEndpoint-->>VerifyOnboardInferenceSmoke: error
    VerifyOnboardInferenceSmoke-->>SetupInference: process.exit(1)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3595: Adds E2E smoke guard that exercises setupInference() expecting the new onboard inference smoke probe logic to reject broken OpenAI-like /chat/completions routes.

Suggested labels

fix, enhancement: inference, enhancement: testing, Getting Started

Suggested reviewers

  • ericksoa
  • prekshivyas

Poem

🐰 A tiny probe hops down the line,
It pings the route to make sure it's fine,
No more false "complete" at the end of the run,
A sniff, a hop, and a verified sun.
✓ Hooray — onboarding's work is done!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(onboard): smoke test inference before success' clearly and accurately summarizes the main change: adding a smoke test to verify inference functionality during onboarding before reporting success.
Linked Issues check ✅ Passed The PR implements all primary coding requirements from issue #3253: adds end-of-onboard smoke probe for configured inference route, prevents success reporting on probe failure, surfaces upstream error and configuration details for diagnosis.
Out of Scope Changes check ✅ Passed All changes in src/lib/onboard.ts and src/lib/inference/onboard-probes.ts are directly scoped to implementing the smoke test verification requirement specified in issue #3253; no extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-3253-onboard-inference-smoke

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/lib/onboard.ts Fixed
@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: onboard-inference-smoke-e2e, model-router-provider-routed-inference-e2e
Optional E2E: cloud-onboard-e2e, inference-routing-e2e, ubuntu-repo-cloud-openclaw

Dispatch hint: onboard-inference-smoke-e2e,model-router-provider-routed-inference-e2e

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • onboard-inference-smoke-e2e (low): Direct coverage for this change: verifies onboard does not accept a configured-but-runtime-broken inference route and surfaces provider/model/API-base/credential diagnostics before success.
  • model-router-provider-routed-inference-e2e (medium): The new shouldSmokeOpenAiLikeOnboardRoute() explicitly enables smoke checks for nvidia-router/routed inference; this E2E validates real model-router onboard and inference.local behavior with the routed provider.

Optional E2E

  • cloud-onboard-e2e (high): Useful full real-user confidence for non-interactive onboard after setupInference changes, including install, sandbox health, credential handling, and inference.local HTTPS probe.
  • inference-routing-e2e (medium): Adjacent coverage for provider routing, credential isolation, and inference error classification if maintainers want broader confidence around the new credential resolution and probe failure paths.
  • ubuntu-repo-cloud-openclaw (high): Scenario-runner coverage for the standard repo-current cloud OpenClaw onboarding path with smoke, inference, and credentials suites.

New E2E recommendations

  • OpenAI-like onboard provider matrix (medium): Existing targeted E2E covers a broken compatible endpoint and real nvidia-router, but there is no compact E2E matrix asserting which onboard providers should trigger or skip the smoke check across openai-api, gemini-api, compatible-endpoint, nvidia-nim, nvidia-prod, anthropic, and Hermes.
    • Suggested test: Add a hermetic onboard inference smoke provider-matrix E2E that stubs setupInference dependencies and asserts smoke invocation/skip behavior per provider.

Dispatch hint

  • Workflow: regression-e2e.yaml
  • jobs input: onboard-inference-smoke-e2e,model-router-provider-routed-inference-e2e

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2044-2082: The new smoke helper functions
(shouldSmokeOpenAiLikeOnboardRoute and verifyOnboardInferenceSmoke) must be
extracted into a new module and imported into src/lib/onboard.ts to reduce that
file's size: create a dedicated file that exports these two functions and
re-export any dependent symbols it needs (REMOTE_PROVIDER_CONFIG,
INFERENCE_ROUTE_URL, probeOpenAiLikeEndpoint, hydrateCredentialEnv,
getCredential, getProbeAuthMode, redact, compactText or import them from their
existing modules), move the implementations exactly as-is into that file, then
replace the local definitions in onboard.ts with imports of the two functions
and update any references; ensure the new module preserves the same signatures
and side-effects (process.exit(1), console logging) and run tests/CI to confirm
no behavioral changes.
- Around line 2051-2064: The helper verifyOnboardInferenceSmoke currently falls
back to INFERENCE_ROUTE_URL and null when options.endpointUrl/credentialEnv are
omitted, which can mismatch the actual provider config used during setup; update
verifyOnboardInferenceSmoke to first resolve endpointUrl and credentialEnv from
the provider's configuration (e.g., via the project's provider config lookup
such as getProviderConfig(options.provider) or
providerConfigs[options.provider]) and only if those are absent fall back to
options.* values and then INFERENCE_ROUTE_URL/null; keep existing behavior when
explicit options are passed, and continue to use
hydrateCredentialEnv/getCredential to resolve the final apiKey from the resolved
credentialEnv.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a0a245bf-15eb-450c-b873-c9bffb4dda75

📥 Commits

Reviewing files that changed from the base of the PR and between da1ad74 and c108c07.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25930044916
Target ref: c108c07b143fe3862721b43e3407a5aff54f63c2
Workflow ref: main
Requested jobs: messaging-compatible-endpoint-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
messaging-compatible-endpoint-e2e ✅ success

Comment thread src/lib/onboard/inference-smoke.ts Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2044-2057: The local wrapper verifyOnboardInferenceSmoke is adding
unnecessary lines; remove the local function and instead export a trimmed alias
that immediately wires hydrateCredentialEnv into
verifyOnboardInferenceSmokeWithDeps (so callers get the same behavior without
the extra wrapper). Concretely, drop the verifyOnboardInferenceSmoke function
body and replace it with a single export/re-export that binds/passes
hydrateCredentialEnv into verifyOnboardInferenceSmokeWithDeps (keep the existing
require/destructure of verifyOnboardInferenceSmokeWithDeps and
shouldSmokeOpenAiLikeOnboardRoute and continue to reference
hydrateCredentialEnv).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c258d80e-5988-4066-9083-b8050e1c8881

📥 Commits

Reviewing files that changed from the base of the PR and between c108c07 and 57769e4.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • src/lib/onboard/inference-smoke.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/inference/onboard-probes.ts Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/onboard-probes.ts`:
- Around line 835-842: The smoke-check logging prints endpointUrl raw and may
leak credentials; update the log to redact and compact the URL before printing
by passing endpointUrl through redact() and compactText() like the upstream
error, i.e., replace the raw endpointUrl usage in the Onboard probes logging
block that references endpointUrl with compactText(redact(endpointUrl ||
"unknown API base")) so the diagnostics show a redacted URL while preserving
existing provider/model/credentialEnv logging and behavior.

In `@src/lib/onboard.ts`:
- Line 7713: When the onboarding code takes the resume fast-path (when resume &&
isInferenceRouteReady(provider, model)) it skips setupInference and therefore
skips the smoke probe; ensure verifyOnboardInferenceSmoke({ provider, model,
endpointUrl, credentialEnv }) is invoked in the onboard resume fast-path before
returning success (i.e., add the same smoke validation call to the branch
guarded by resume && isInferenceRouteReady in the onboard function), and mirror
the same insertion for the other resume-fast-path occurrence referenced (the
other isInferenceRouteReady/resume branch) so resumed runs run the smoke probe
as well.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0d444574-c7a0-4d77-a773-f43ed2d01096

📥 Commits

Reviewing files that changed from the base of the PR and between 57769e4 and 6ee069f.

📒 Files selected for processing (2)
  • src/lib/inference/onboard-probes.ts
  • src/lib/onboard.ts

Comment on lines +835 to +842
const { compactText } = require("../core/url-utils");
const { redact } = require("../runner");
console.error(" Onboard inference smoke check failed.");
console.error(` Provider: ${options.provider}`);
console.error(` Model: ${options.model}`);
console.error(` API base: ${endpointUrl}`);
if (credentialEnv) console.error(` Credential env: ${credentialEnv}`);
console.error(` Upstream error: ${compactText(redact(probe.message || "unknown inference failure"))}`);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Redact API base in smoke-failure diagnostics.

Line 840 logs endpointUrl raw. If the configured base URL contains embedded credentials (query param or userinfo), this leaks secrets to console/log capture. Redact before printing.

Suggested patch
   const { compactText } = require("../core/url-utils");
   const { redact } = require("../runner");
+  const safeEndpointUrl = compactText(redact(String(endpointUrl || "")));
   console.error("  Onboard inference smoke check failed.");
   console.error(`  Provider: ${options.provider}`);
   console.error(`  Model: ${options.model}`);
-  console.error(`  API base: ${endpointUrl}`);
+  console.error(`  API base: ${safeEndpointUrl}`);
   if (credentialEnv) console.error(`  Credential env: ${credentialEnv}`);
   console.error(`  Upstream error: ${compactText(redact(probe.message || "unknown inference failure"))}`);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const { compactText } = require("../core/url-utils");
const { redact } = require("../runner");
console.error(" Onboard inference smoke check failed.");
console.error(` Provider: ${options.provider}`);
console.error(` Model: ${options.model}`);
console.error(` API base: ${endpointUrl}`);
if (credentialEnv) console.error(` Credential env: ${credentialEnv}`);
console.error(` Upstream error: ${compactText(redact(probe.message || "unknown inference failure"))}`);
const { compactText } = require("../core/url-utils");
const { redact } = require("../runner");
const safeEndpointUrl = compactText(redact(String(endpointUrl || "")));
console.error(" Onboard inference smoke check failed.");
console.error(` Provider: ${options.provider}`);
console.error(` Model: ${options.model}`);
console.error(` API base: ${safeEndpointUrl}`);
if (credentialEnv) console.error(` Credential env: ${credentialEnv}`);
console.error(` Upstream error: ${compactText(redact(probe.message || "unknown inference failure"))}`);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/onboard-probes.ts` around lines 835 - 842, The smoke-check
logging prints endpointUrl raw and may leak credentials; update the log to
redact and compact the URL before printing by passing endpointUrl through
redact() and compactText() like the upstream error, i.e., replace the raw
endpointUrl usage in the Onboard probes logging block that references
endpointUrl with compactText(redact(endpointUrl || "unknown API base")) so the
diagnostics show a redacted URL while preserving existing
provider/model/credentialEnv logging and behavior.

Comment thread src/lib/onboard.ts
}

verifyInferenceRoute(provider, model);
verifyOnboardInferenceSmoke({ provider, model, endpointUrl, credentialEnv });

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run the smoke probe on the resume fast-path too.

These probes run in setupInference, but onboard() can skip that function when resume && isInferenceRouteReady(provider, model), which allows a resumed run to report success without executing the new smoke validation.

Suggested fix
       if (isRoutedInferenceProvider(provider)) {
         try {
           await reconcileModelRouter();
         } catch (err) {
           console.error(
             `  ✗ Failed to reconcile model router: ${err instanceof Error ? err.message : String(err)}`,
           );
           process.exit(1);
         }
       }
       skippedStepMessage("inference", `${provider} / ${model}`);
+      verifyOnboardInferenceSmoke({ provider, model, endpointUrl, credentialEnv });
       if (nimContainer && sandboxName) {
         registry.updateSandbox(sandboxName, { nimContainer });
       }

Also applies to: 7989-7989

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` at line 7713, When the onboarding code takes the resume
fast-path (when resume && isInferenceRouteReady(provider, model)) it skips
setupInference and therefore skips the smoke probe; ensure
verifyOnboardInferenceSmoke({ provider, model, endpointUrl, credentialEnv }) is
invoked in the onboard resume fast-path before returning success (i.e., add the
same smoke validation call to the branch guarded by resume &&
isInferenceRouteReady in the onboard function), and mirror the same insertion
for the other resume-fast-path occurrence referenced (the other
isInferenceRouteReady/resume branch) so resumed runs run the smoke probe as
well.

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25931101243
Target ref: 57769e4001a23a02de87249d224015a5bb77bcd3
Workflow ref: main
Requested jobs: cloud-onboard-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25931525914
Target ref: 4c3bd0ff1ebe9b624d137ec015e603fe4c63aa39
Workflow ref: main
Requested jobs: inference-routing-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
inference-routing-e2e ✅ success

@cv cv added the v0.0.44 label May 15, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25932296025
Target ref: 68d9be996f5630a64bf1f8e1078a9a1d69448960
Workflow ref: main
Requested jobs: inference-routing-e2e,messaging-compatible-endpoint-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
inference-routing-e2e ⚠️ cancelled
messaging-compatible-endpoint-e2e ⚠️ cancelled

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25932590930
Target ref: 33b3136458c8c40a3a773089838bc97d221c444a
Workflow ref: main
Requested jobs: messaging-compatible-endpoint-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
messaging-compatible-endpoint-e2e ⚠️ cancelled

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25932946889
Target ref: 9af9c3b876dc20cfee7dedd3aa8de0dca4629dc2
Workflow ref: main
Requested jobs: inference-routing-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
inference-routing-e2e ✅ success

@jyaunches jyaunches enabled auto-merge (squash) May 15, 2026 18:05
@jyaunches jyaunches merged commit 0964a7e into main May 15, 2026
20 checks passed
@miyoungc miyoungc mentioned this pull request May 16, 2026
12 tasks
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
@jyaunches jyaunches deleted the fix-3253-onboard-inference-smoke branch June 12, 2026 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Brev][Onboard] Onboard reports SUCCESS without exercising inference path — config bugs reach end users instead of being caught

4 participants