[Brev][Onboard] Onboard reports SUCCESS without exercising inference path — config bugs reach end users instead of being caught

## Description

Description
<pre>`nemoclaw onboard` reports "Installation complete" and prints a healthy-looking summary even when the resulting deployment cannot actually serve any inference. There is no end-of-flow smoke test that exercises the configured model and provider. As a result, broken inference configurations (wrong endpoint, wrong model id, wrong env var, missing key plumbing, etc.) silently survive onboard and only blow up later as HTTP 503 inside the OpenClaw TUI. A 1-shot inference probe at the end of onboard would catch this entire class of failures before the user ever sees them.

Concrete recent example: NVBug 6158321 ([Brev][Onboard] Model Router (Provider Routed) inference broken) — three independent config errors stacked on top of each other; onboard reported all 8 steps SUCCESS; user only learned about it when the very first TUI message returned 503. None of those bugs would have been able to ship to a user if onboard had verified inference.
</pre>Environment
<pre>Device: Brev (shadeform brev-pz811qnfg) — H100 PCIe x1
OS: Ubuntu 22.04.5 LTS
Architecture: x86_64
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.1.3
OpenShell CLI: 0.0.36
NemoClaw: v0.0.37
OpenClaw: 2026.4.24 (cbcfdf6)
</pre>Steps to Reproduce
<pre>1. Run any onboard flow with an inference config that is structurally valid but functionally broken
 (in our case: option 8 "Model Router (experimental)" with the current pool-config.yaml shipped
 in the blueprint — see NVBug 6158321 for details).
2. Watch onboard go through all 8 steps:
 [1/8] Preflight checks
 [2/8] Starting OpenShell gateway
 [3/8] Configuring inference (NIM)
 [4/8] Setting up inference provider <- "Model router started (PID ...) on port 4000"
 [5/8] Messaging channels
 [6/8] Creating sandbox
 [7/8] Setting up OpenClaw inside sandbox
 [8/8] Policy presets
3. Onboard prints "Installation complete" and the standard summary block. No errors, no warnings.
4. Run `nemoclaw connect` and `openclaw tui`. Send any prompt. Get HTTP 503.
</pre>Expected Result
<pre>At the end of [4/8] (after the inference provider is registered) or as a final [9/9] verification
step, onboard should fire a single chat.completions.create request through the configured model
and assert it returns 200 with non-empty content. If the probe fails, onboard should:
 (a) NOT print "Installation complete";
 (b) print the actual upstream error so the user (and our QA telemetry) can see what went wrong;
 (c) surface the configured api_base, model, and credential env-var name to help diagnose.
</pre>Actual Result
<pre>After provider registration, onboard logs:
 [4/8] Setting up inference provider
 Starting model router...
 Initializing Model Router source...
 Model router started (PID 44356) on port 4000
 Created provider nvidia-router
 Inference route set: nvidia-router / nvidia-routed

This is treated as success. There is no follow-up request that proves the route actually works.
The first real request happens hours later, in a completely different code path (sandbox -> openshell-router -> host:4000 -> upstream), at which point the user is no longer in the onboarding context and the failure looks like a runtime/infra problem rather than a config bug.
</pre>Why this matters
<pre>- Defense-in-depth against config drift in any inference provider, not just Model Router.
- Catches credential-plumbing bugs (env var name mismatch, missing export to subprocess) at the
 exact moment the user paid the cost of pasting their key, not 20 minutes later in TUI.
- Converts a class of "silent breakage in production" bugs into "loud breakage in onboarding" —
 the cheapest place to fail.
- Implementation cost is small: one POST against the just-configured route, with a 10s timeout.
- Sets the bar for any future inference option (Brave/Ollama/NIM/etc.) — every provider should be
 required to pass this smoke test before onboard claims SUCCESS.
</pre>Logs
<pre>Not captured (this is an enhancement / process bug; the trigger example is in NVBug 6158321).
</pre>

## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard |

---
[NVB#6158325]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Brev][Onboard] Onboard reports SUCCESS without exercising inference path — config bugs reach end users instead of being caught #3253

Description

Bug Details

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Field	Value
Priority	Unprioritized
Action	Dev - Open - To fix
Disposition	Open issue
Module	Machine Learning - NemoClaw
Keyword	NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard

[Brev][Onboard] Onboard reports SUCCESS without exercising inference path — config bugs reach end users instead of being caught #3253

Description

Description

Bug Details

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions