[DGX Station][Onboard] wizard post-onboard verify falsely reports gateway+dashboard down — sandbox is healthy #3563

@wangericnv

Description

At the end of an express onboard on DGX Station, NemoClaw v0.0.43 prints
alarming "Deployment verification found issues" warnings claiming that the
gateway and dashboard are broken. However, running
`nemoclaw my-assistant status` immediately afterwards shows that everything is
healthy. This is a timing race in the wizard's post-onboard verification: it
polls before the gateway and the dashboard port forward are ready.

The warning message tells users to "Check /tmp/gateway.log inside the sandbox",
but that file does not exist (we verified via docker exec). So users following
the suggested diagnostic find no log and may think their install is broken
when it actually works.
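
One way to avoid pointing users at a missing file is to check for the log before recommending it. The sketch below is only an illustration, not NemoClaw's code: `file_exists_in_container` and `gateway_hint` are hypothetical names, the container name is a parameter because the issue does not give it, and the fallback wording is invented for the example.

```python
import subprocess
from typing import Callable


def file_exists_in_container(container: str, path: str,
                             run: Callable = subprocess.run) -> bool:
    """Return True if `path` is a regular file inside the container.

    Runs `docker exec <container> test -f <path>`; `run` is injectable so the
    function can be exercised without Docker.
    """
    result = run(["docker", "exec", container, "test", "-f", path],
                 capture_output=True)
    return result.returncode == 0


def gateway_hint(container: str,
                 log_path: str = "/tmp/gateway.log",
                 exists: Callable[[str, str], bool] = file_exists_in_container) -> str:
    """Build the follow-up hint, mentioning the log only if it really exists."""
    if exists(container, log_path):
        return f"Check {log_path} inside the sandbox."
    # Hypothetical fallback text: avoid sending users to a file that is not there.
    return (f"No gateway log found at {log_path}; "
            "re-run the status check after a few seconds.")
```

With this shape the hint text always matches what a user would actually find inside the sandbox.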

Environment
Device:        DGX Station GB300 (galaxy-sku2-018, host 10.176.192.158)
OS:            Ubuntu 24.04.4 LTS
Architecture:  aarch64
Kernel:        6.17.0-1014-nvidia-64k
GPU:           NVIDIA GB300 (256703 MB) + NVIDIA RTX PRO 6000 Blackwell Max-Q (97887 MB)
NVIDIA driver: 610.39
Node.js:       v22.22.3 (auto-installed by curl|bash)
npm:           10.9.8
Docker:        29.5.0 (build 98f1464)
nvidia-ctk:    1.19.0
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.43
OpenClaw:      2026.4.24 (cbcfdf6)

Steps to Reproduce
1. Start from a clean Station (no prior nemoclaw / openshell / ~/.nemoclaw)
2. export HF_TOKEN=
3. Run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
4. Type 'yes' to license, 'y' to express → express path
5. Wait for express onboard to complete (about 10 min on warm cache)
6. Observe the very last screen of the wizard

Expected Result
Wizard reports onboarding completed cleanly:
  - "Installation complete" banner
  - sandbox 'my-assistant' is ready
  - Gateway HTTP 200, dashboard port 18789 reachable
  - No false ✗ markers

Actual Result
Wizard prints:

    ⚠ Deployment verification found issues:
    ✗ gateway: HTTP 0 (gateway not responding)
      The gateway process may have crashed during startup. Check
      /tmp/gateway.log inside the sandbox.
    ✗ dashboard: port forward not working (connection refused)
      Port forward on 18789 is not working. Run:
        openshell forward start 18789 my-assistant
    The sandbox was created successfully but may not be fully functional.
    Run: nemoclaw  status — to re-check after a few seconds.

But immediately after on the host:

  $ nemoclaw my-assistant status
    Sandbox: my-assistant
        Model:    Qwen/Qwen3.6-27B-FP8
        Provider: vllm-local
        Inference (vllm backend): healthy (http://127.0.0.1:8000/v1/models)
        Host GPU: yes
        Sandbox GPU: enabled (auto)
        Phase: Ready

  $ openshell sandbox list
    NAME          PHASE
    my-assistant  Ready

  $ docker ps  # both containers up
    openshell-my-assistant-... Up
    nemoclaw-vllm Up

And the diagnostic file mentioned does not exist:
  docker exec <container> ls /tmp/*.log → no such file

So the verification ran before the gateway and dashboard port forward had
finished coming up, raising false alarms. The follow-up hint ("Check
/tmp/gateway.log") points at a file that doesn't get written at all.
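
The two failing checks can be reproduced by hand with a pair of small probes. This is a minimal sketch, not the wizard's actual verification code: `port_is_open` and `http_status` are hypothetical helpers, and the gateway URL is left as a parameter because the issue does not state it. Only port 18789 comes from the wizard output above.

```python
import socket
import urllib.error
import urllib.request


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for `url`, or 0 if nothing answered.

    The 0 convention mirrors the "HTTP 0" wording in the wizard output.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        return 0
```

Calling `port_is_open("127.0.0.1", 18789)` a few seconds apart right after onboarding would show whether the dashboard port forward simply comes up late, which is what the race described above predicts.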

Logs
Reproduced 2026-05-15 ~05:54 UTC during T6015057 manual test.
Diagnostic trace at /home/lab/day0-automation/20260511/_t6015057_console.log
on local-lab.

Impact
- Users see ✗ markers on their fresh install and think something is broken.
- The recommended diagnostic command (cat /tmp/gateway.log) finds nothing,
  wasting troubleshooting time.
- QA cannot mark T6015057 "Pass" without manually checking after the fact
  that the install actually works — wizard output suggests Fail.
- Wizard should either (a) wait long enough for gateway/dashboard to come up
  before verifying, (b) retry verification with backoff, or (c) remove the
  false-alarm warning until verification is reliable.
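
Options (a) and (b) can be combined into a single retry loop with exponential backoff. The following is a sketch under stated assumptions rather than a proposed patch: `verify_with_backoff` is a hypothetical name, the attempt count and delays are illustrative defaults that would need tuning against real startup times, and `check` stands for whatever probe the wizard already uses.

```python
import time
from typing import Callable


def verify_with_backoff(
    check: Callable[[], bool],
    attempts: int = 6,            # illustrative default, not a measured value
    initial_delay: float = 1.0,   # seconds before the second attempt
    factor: float = 2.0,          # delay multiplier between attempts
    max_delay: float = 15.0,      # cap so a single wait never grows unbounded
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as check() passes, False if every attempt fails."""
    delay = initial_delay
    for attempt in range(attempts):
        if check():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * factor, max_delay)
    return False
```

With these defaults the loop waits at most 1 + 2 + 4 + 8 + 15 = 30 seconds before giving up, so a warning would be printed only when the gateway or dashboard is still unreachable after that window.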

Bug Details

Field        Value
Priority     Unprioritized
Action       Dev - Open - To fix
Disposition  Open issue
Module       Machine Learning - NemoClaw
Keyword      NemoClaw, NemoClaw_CLI&UX, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Onboard

[NVB#6179568]

Metadata

Assignees

Labels

NV QA: Bugs found by the NVIDIA QA Team
