Description
Description
`nemoclaw onboard` reports "Installation complete" and prints a healthy-looking summary even when the resulting deployment cannot actually serve any inference. There is no end-of-flow smoke test that exercises the configured model and provider. As a result, broken inference configurations (wrong endpoint, wrong model id, wrong env var, missing key plumbing, etc.) silently survive onboard and only blow up later as HTTP 503 inside the OpenClaw TUI. A 1-shot inference probe at the end of onboard would catch this entire class of failures before the user ever sees them.
Concrete recent example: NVBug 6158321 ([Brev][Onboard] Model Router (Provider Routed) inference broken) — three independent config errors stacked on top of each other; onboard reported all 8 steps SUCCESS; user only learned about it when the very first TUI message returned 503. None of those bugs would have been able to ship to a user if onboard had verified inference.
Environment
Device: Brev (shadeform brev-pz811qnfg) — H100 PCIe x1
OS: Ubuntu 22.04.5 LTS
Architecture: x86_64
Node.js: v22.22.2
npm: 10.9.7
Docker: 29.1.3
OpenShell CLI: 0.0.36
NemoClaw: v0.0.37
OpenClaw: 2026.4.24 (cbcfdf6)
Steps to Reproduce
1. Run any onboard flow with an inference config that is structurally valid but functionally broken
(in our case: option 8 "Model Router (experimental)" with the current pool-config.yaml shipped
in the blueprint — see NVBug 6158321 for details).
2. Watch onboard go through all 8 steps:
[1/8] Preflight checks
[2/8] Starting OpenShell gateway
[3/8] Configuring inference (NIM)
[4/8] Setting up inference provider <- "Model router started (PID ...) on port 4000"
[5/8] Messaging channels
[6/8] Creating sandbox
[7/8] Setting up OpenClaw inside sandbox
[8/8] Policy presets
3. Onboard prints "Installation complete" and the standard summary block. No errors, no warnings.
4. Run `nemoclaw connect` and `openclaw tui`. Send any prompt. Get HTTP 503.
Expected Result
At the end of [4/8] (after the inference provider is registered) or as a final [9/9] verification
step, onboard should fire a single chat.completions.create request through the configured model
and assert it returns 200 with non-empty content. If the probe fails, onboard should:
(a) NOT print "Installation complete";
(b) print the actual upstream error so the user (and our QA telemetry) can see what went wrong;
(c) surface the configured api_base, model, and credential env-var name to help diagnose.
Actual Result
After provider registration, onboard logs:
[4/8] Setting up inference provider
Starting model router...
Initializing Model Router source...
Model router started (PID 44356) on port 4000
Created provider nvidia-router
Inference route set: nvidia-router / nvidia-routed
This is treated as success. There is no follow-up request that proves the route actually works.
The first real request happens hours later, in a completely different code path (sandbox -> openshell-router -> host:4000 -> upstream), at which point the user is no longer in the onboarding context and the failure looks like a runtime/infra problem rather than a config bug.
Why this matters
- Defense-in-depth against config drift in any inference provider, not just Model Router.
- Catches credential-plumbing bugs (env var name mismatch, missing export to subprocess) at the
exact moment the user paid the cost of pasting their key, not 20 minutes later in TUI.
- Converts a class of "silent breakage in production" bugs into "loud breakage in onboarding" —
the cheapest place to fail.
- Implementation cost is small: one POST against the just-configured route, with a 10s timeout.
- Sets the bar for any future inference option (Brave/Ollama/NIM/etc.) — every provider should be
required to pass this smoke test before onboard claims SUCCESS.
Logs
Not captured (this is an enhancement / process bug; the trigger example is in NVBug 6158321).
Bug Details
| Field |
Value |
| Priority |
Unprioritized |
| Action |
Dev - Open - To fix |
| Disposition |
Open issue |
| Module |
Machine Learning - NemoClaw |
| Keyword |
NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard |
[NVB#6158325]
Description
Description
Environment Steps to Reproduce1. Run any onboard flow with an inference config that is structurally valid but functionally broken (in our case: option 8 "Model Router (experimental)" with the current pool-config.yaml shipped in the blueprint — see NVBug 6158321 for details). 2. Watch onboard go through all 8 steps: [1/8] Preflight checks [2/8] Starting OpenShell gateway [3/8] Configuring inference (NIM) [4/8] Setting up inference provider <- "Model router started (PID ...) on port 4000" [5/8] Messaging channels [6/8] Creating sandbox [7/8] Setting up OpenClaw inside sandbox [8/8] Policy presets 3. Onboard prints "Installation complete" and the standard summary block. No errors, no warnings. 4. Run `nemoclaw connect` and `openclaw tui`. Send any prompt. Get HTTP 503.Expected Result Actual ResultAfter provider registration, onboard logs: [4/8] Setting up inference provider Starting model router... Initializing Model Router source... Model router started (PID 44356) on port 4000 Created provider nvidia-router Inference route set: nvidia-router / nvidia-routed This is treated as success. There is no follow-up request that proves the route actually works. The first real request happens hours later, in a completely different code path (sandbox -> openshell-router -> host:4000 -> upstream), at which point the user is no longer in the onboarding context and the failure looks like a runtime/infra problem rather than a config bug.Why this matters LogsBug Details
[NVB#6158325]