Skip to content

[Brev][Onboard] Onboard reports SUCCESS without exercising inference path — config bugs reach end users instead of being caught #3253

@hulynn

Description

@hulynn

Description

Description

`nemoclaw onboard` reports "Installation complete" and prints a healthy-looking summary even when the resulting deployment cannot actually serve any inference. There is no end-of-flow smoke test that exercises the configured model and provider. As a result, broken inference configurations (wrong endpoint, wrong model id, wrong env var, missing key plumbing, etc.) silently survive onboard and only blow up later as HTTP 503 inside the OpenClaw TUI. A 1-shot inference probe at the end of onboard would catch this entire class of failures before the user ever sees them.

Concrete recent example: NVBug 6158321 ([Brev][Onboard] Model Router (Provider Routed) inference broken) — three independent config errors stacked on top of each other; onboard reported all 8 steps SUCCESS; user only learned about it when the very first TUI message returned 503. None of those bugs would have been able to ship to a user if onboard had verified inference.
Environment
Device:        Brev (shadeform brev-pz811qnfg) — H100 PCIe x1
OS:            Ubuntu 22.04.5 LTS
Architecture:  x86_64
Node.js:       v22.22.2
npm:           10.9.7
Docker:        29.1.3
OpenShell CLI: 0.0.36
NemoClaw:      v0.0.37
OpenClaw:      2026.4.24 (cbcfdf6)
Steps to Reproduce
1. Run any onboard flow with an inference config that is structurally valid but functionally broken
   (in our case: option 8 "Model Router (experimental)" with the current pool-config.yaml shipped
   in the blueprint — see NVBug 6158321 for details).
2. Watch onboard go through all 8 steps:
     [1/8] Preflight checks
     [2/8] Starting OpenShell gateway
     [3/8] Configuring inference (NIM)
     [4/8] Setting up inference provider     <- "Model router started (PID ...) on port 4000"
     [5/8] Messaging channels
     [6/8] Creating sandbox
     [7/8] Setting up OpenClaw inside sandbox
     [8/8] Policy presets
3. Onboard prints "Installation complete" and the standard summary block. No errors, no warnings.
4. Run `nemoclaw  connect` and `openclaw tui`. Send any prompt. Get HTTP 503.
Expected Result
At the end of [4/8] (after the inference provider is registered) or as a final [9/9] verification
step, onboard should fire a single chat.completions.create request through the configured model
and assert it returns 200 with non-empty content. If the probe fails, onboard should:
  (a) NOT print "Installation complete";
  (b) print the actual upstream error so the user (and our QA telemetry) can see what went wrong;
  (c) surface the configured api_base, model, and credential env-var name to help diagnose.
Actual Result
After provider registration, onboard logs:
  [4/8] Setting up inference provider
    Starting model router...
    Initializing Model Router source...
    Model router started (PID 44356) on port 4000
    Created provider nvidia-router
    Inference route set: nvidia-router / nvidia-routed

This is treated as success. There is no follow-up request that proves the route actually works.
The first real request happens hours later, in a completely different code path (sandbox -> openshell-router -> host:4000 -> upstream), at which point the user is no longer in the onboarding context and the failure looks like a runtime/infra problem rather than a config bug.
Why this matters
- Defense-in-depth against config drift in any inference provider, not just Model Router.
- Catches credential-plumbing bugs (env var name mismatch, missing export to subprocess) at the
  exact moment the user paid the cost of pasting their key, not 20 minutes later in TUI.
- Converts a class of "silent breakage in production" bugs into "loud breakage in onboarding" —
  the cheapest place to fail.
- Implementation cost is small: one POST against the just-configured route, with a 10s timeout.
- Sets the bar for any future inference option (Brave/Ollama/NIM/etc.) — every provider should be
  required to pass this smoke test before onboard claims SUCCESS.
Logs
Not captured (this is an enhancement / process bug; the trigger example is in NVBug 6158321).

Bug Details

Field Value
Priority Unprioritized
Action Dev - Open - To fix
Disposition Open issue
Module Machine Learning - NemoClaw
Keyword NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Onboard

[NVB#6158325]

Metadata

Metadata

Assignees

Labels

NV QABugs found by the NVIDIA QA Teamarea: e2eEnd-to-end tests, nightly failures, or validation infrastructureplatform: brevAffects Brev hosted development environments
No fields configured for Enhancement.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions