[All Platform][Onboard][Regression] Second onboard crashes with unhandled exception and leaks ghost sandbox when dashboard port 18789 is already forwarded #2174

@zNeill

Description

NemoClaw v0.0.21 onboard fails fast when onboarding a second sandbox while another sandbox already holds the default dashboard forward (port 18789). The error requires the user to manually set CHAT_UI_URL: there is no CLI flag, no automatic port allocation, and no way to discover the assigned port from nemoclaw list / nemoclaw status. This is a breaking change for automation and test pipelines that relied on the previous behavior.

Environment

  - Device: 245 GiB RAM host, CPU-only (no NVIDIA GPU attached)
  - OS: Ubuntu 25.10
  - OpenShell CLI: v0.0.26
  - NemoClaw: v0.0.21 (installed globally via npm)
  - OpenClaw: v2026.4.2
  - Inference: NVIDIA build endpoint (https://integrate.api.nvidia.com/v1), model nvidia/nemotron-3-super-120b-a12b
  - Reproduced in: NemoClaw DevTest automation (nemoclaw-test) during full regression run

Reproduction Steps

1. Onboard the first sandbox (uses default port 18789):
  nemoclaw onboard --non-interactive
2. After success, openshell forward list shows:
  my-assistant  127.0.0.1  18789    running
3. Without destroying my-assistant, attempt to onboard a second sandbox with a different name:
  nemoclaw onboard --non-interactive   # pick a different sandbox name, e.g. my-assistant-temp
4. The second onboard crashes in phase [6/8] Creating sandbox inside ensureDashboardForward.
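
Automation that keeps a baseline sandbox alive could run a pre-flight check on the openshell forward list output before step 3. The following is a minimal sketch, assuming every row has the whitespace-separated columns shown in step 2 (sandbox name, bind address, port, status). The helper names parseForwardList and findPortOwner are hypothetical and are not part of NemoClaw or OpenShell.

```javascript
// Hypothetical helper: parse rows shaped like the step-2 output
// ("my-assistant  127.0.0.1  18789    running").
// The column layout is an assumption based on that single sample row.
function parseForwardList(text) {
  return text
    .split("\n")
    .map((line) => line.trim().split(/\s+/))
    // Keep only rows with at least four columns and a numeric port column;
    // this drops blank lines and any header row.
    .filter((cols) => cols.length >= 4 && /^\d+$/.test(cols[2]))
    .map(([sandbox, bind, port, status]) => ({
      sandbox,
      bind,
      port: Number(port),
      status,
    }));
}

// Returns the sandbox that holds `port`, or null when no row uses it.
function findPortOwner(forwards, port) {
  const hit = forwards.find((f) => f.port === port);
  return hit ? hit.sandbox : null;
}

const sampleOutput = "my-assistant  127.0.0.1  18789    running\n";
const parsedForwards = parseForwardList(sampleOutput);
console.log(findPortOwner(parsedForwards, 18789)); // my-assistant
console.log(findPortOwner(parsedForwards, 18790)); // null
```

If findPortOwner returns a sandbox name other than the one about to be onboarded, the run will hit the conflict described below unless CHAT_UI_URL points at a different port.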

Actual Result

Onboard crashes with an uncaught error:

  Error: Port 18789 is already forwarded for sandbox 'my-assistant'.
  Set CHAT_UI_URL to a different local port (e.g. http://127.0.0.1:18790) before onboarding a second sandbox.
      at ensureDashboardForward (/.nemoclaw/source/dist/lib/onboard.js:4880:15)
      at createSandbox (dist/lib/onboard.js:3033:5)
      at async Object.onboard (dist/lib/onboard.js:5448:27)
      at async runOnboardCommand (dist/lib/onboard-command.js:82:5)
      at async onboard (dist/nemoclaw.js:723:5)

The second sandbox's creation is aborted mid-flight. The openshell gateway may already hold a partially created record (a ghost sandbox), but nemoclaw list shows only the first sandbox.

Reproduces every time; the output is byte-for-byte identical to the QA-machine failure log for T67 ([T5882262]).

Expected Result

The second onboard should complete successfully. Acceptable options:

  - (a) Auto-allocate the next free dashboard port (18790, 18791, …), store it with the sandbox, and expose it in nemoclaw list / nemoclaw status.
  - (b) Add a first-class CLI flag such as --control-ui-port (today the only override is the CHAT_UI_URL env var).
  - (c) On conflict, emit a warning and auto-pick the next free port instead of throwing.
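
Options (a) and (c) both need a "next free port" rule. A minimal sketch of one possible rule follows, assuming the set of already-forwarded ports is known (for example from openshell forward list). The function name nextFreeDashboardPort and the 100-port search window are assumptions for illustration, not existing NemoClaw behavior.

```javascript
// Hypothetical helper: pick the first port at or above `start` that is
// not in `usedPorts`. The default start (18789) is the dashboard port
// named in this issue; the 100-port search window is an assumption.
function nextFreeDashboardPort(usedPorts, start = 18789, span = 100) {
  const used = new Set(usedPorts);
  for (let port = start; port < start + span; port += 1) {
    if (!used.has(port)) return port;
  }
  throw new Error(`No free dashboard port in ${start}-${start + span - 1}`);
}

console.log(nextFreeDashboardPort([18789]));        // 18790
console.log(nextFreeDashboardPort([18789, 18790])); // 18791
console.log(nextFreeDashboardPort([]));             // 18789
```

This only consults the list of known forwards. It does not probe whether another process on the host has already bound the port, so a real implementation would probably need that check as well.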

Analysis:

1. New guard location: dist/lib/onboard.js:4878-4884 (v0.0.21):

  if (portOwner !== null && portOwner !== sandboxName) {
      throw new Error(`Port ${portToStop} is already forwarded for sandbox '${portOwner}'. ` +
          `Set CHAT_UI_URL to a different local port ...`);
  }

2. v0.0.20 behavior (prior): the same code path silently called openshell forward stop and then openshell forward start, effectively stealing the dashboard forward from the previous sandbox with no warning. That silent stealing was itself a latent bug; the new v0.0.21 guard is a step in the right direction, but it surfaces two downstream problems:

  - Breaking change for automation: flows that previously "worked" (even if the old sandbox's dashboard was quietly broken afterward) now throw.
  - The error message leaves the user stranded:
    - It doesn't name the next free port the user should pick.
    - CHAT_UI_URL is an env var only; there is no CLI flag equivalent.
    - After a successful onboard with a non-default port, no nemoclaw list / nemoclaw status field shows the dashboard URL, so the user has no way to rediscover it later.

3. Observed blast radius in DevTest automation: at least 12 P0/P1 test cases currently fail due to this single change (examples: T15 NVIDIA Cloud, T35 OpenAI-compatible, T36 Anthropic-compatible, T37 Onboard interrupt/resume, T66 Destroy-and-cleanup, T67 Re-onboard, T83 CI non-interactive, T22 No-GPU fallback, T87 Quickstart E2E, T40.1 npm preset, T157 --dangerously-skip-permissions). All share the same pattern: a baseline my-assistant sandbox is kept alive while a short-lived second sandbox is onboarded for the test. None of them set CHAT_UI_URL, so all hit the conflict on port 18789.

4. Suggested fix (keeping the v0.0.21 conflict detection):

  - Add a --control-ui-port CLI flag (takes precedence over the CHAT_UI_URL env var).
  - In ensureDashboardForward, when a conflict is detected, auto-allocate the next free port in a sane range and emit a warning instead of throwing.
  - Surface the assigned dashboard port in nemoclaw list and nemoclaw status so users can find it post-hoc.
  - If still throwing, include a concrete "use this port next" suggestion in the error message rather than the generic 18790 example.
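
The precedence rule and the more actionable error message from the suggested fix can be sketched as follows. This is an illustration under stated assumptions, not the NemoClaw implementation: --control-ui-port is the flag proposed in this issue and does not exist today, and resolveDashboardPort, validatePort, and conflictMessage are hypothetical names.

```javascript
// Default dashboard port named in this issue.
const DEFAULT_DASHBOARD_PORT = 18789;

// Rejects anything that is not an integer in the valid TCP port range.
function validatePort(port, source) {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid dashboard port from ${source}: ${port}`);
  }
  return port;
}

// Proposed precedence: CLI flag, then CHAT_UI_URL, then the default.
function resolveDashboardPort({ flagPort, chatUiUrl } = {}) {
  if (flagPort !== undefined) {
    return validatePort(Number(flagPort), "--control-ui-port");
  }
  if (chatUiUrl) {
    const { port } = new URL(chatUiUrl);
    // An empty string means the URL carries no explicit non-default port;
    // falling back to the default here is an assumption of this sketch.
    return port === ""
      ? DEFAULT_DASHBOARD_PORT
      : validatePort(Number(port), "CHAT_UI_URL");
  }
  return DEFAULT_DASHBOARD_PORT;
}

// Builds an error message that names a concrete port to use next.
function conflictMessage(port, owner, suggestedPort) {
  return (
    `Port ${port} is already forwarded for sandbox '${owner}'. ` +
    `Set CHAT_UI_URL=http://127.0.0.1:${suggestedPort} ` +
    `(or pass --control-ui-port ${suggestedPort}) before onboarding a second sandbox.`
  );
}

console.log(resolveDashboardPort({ flagPort: "18795", chatUiUrl: "http://127.0.0.1:18790" })); // 18795
console.log(resolveDashboardPort({ chatUiUrl: "http://127.0.0.1:18790" })); // 18790
console.log(resolveDashboardPort()); // 18789
console.log(conflictMessage(18789, "my-assistant", 18790));
```

Passing the suggested port in as a parameter keeps the message builder independent of however the next free port ends up being chosen.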

Bug Details

Field        Value
Priority     Unprioritized
Action       Dev - Open - To fix
Disposition  Open issue
Module       Machine Learning - NemoClaw
Keyword      NemoClaw, NemoClaw_Automation, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-Test-Blocker

[NVB#6099899]

Labels: NV QA (bugs found by the NVIDIA QA Team), UAT (issues flagged for User Acceptance Testing)
