[DGX Spark][Install] Express setup sandbox creation fails — port 18789 occupied by stale openclaw-gateway from previous failed onboard #3398

@wangericnv

Description

On DGX Spark, running express setup (curl|bash) repeatedly fails because each failed onboard attempt leaves an orphaned openclaw-gateway process listening on port 18789. The next onboard detects the port conflict, falls back to 18790, but the sandbox never reaches Ready state (180s timeout). The destroy/uninstall cycle does not kill the stale openclaw-gateway process.

Root cause analysis:
1. destroy.ts:stopDockerDriverGatewayProcess() only kills openshell-gateway (checks cmdline for "openshell-gateway"), NOT openclaw-gateway
2. uninstall run-plan.ts kills openshell processes but does not specifically target openclaw-gateway
3. After a failed onboard, openclaw-gateway (spawned inside the sandbox container) survives: the sandbox container may be removed, but the gateway process was forwarded to the host network and keeps listening there
4. nemoclaw onboard --fresh does not check for or kill stale openclaw-gateway processes before starting
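
The cmdline check described in item 1 could be widened along the following lines. This is a minimal TypeScript sketch, not the code in destroy.ts: `GATEWAY_NAMES` and `isGatewayCmdline` are hypothetical names, and the sketch assumes the check receives a `/proc/<pid>/cmdline`-style string, where arguments are separated by NUL bytes.

```typescript
// Hypothetical helper: the real check inside stopDockerDriverGatewayProcess()
// is not shown in this issue, so names and shape here are assumptions.
const GATEWAY_NAMES = ["openshell-gateway", "openclaw-gateway"] as const;

function isGatewayCmdline(rawCmdline: string): boolean {
  // /proc/<pid>/cmdline separates arguments with NUL bytes; normalize to spaces.
  const cmdline = rawCmdline.replace(/\0/g, " ").trim();
  // Match either gateway name anywhere in the command line.
  return GATEWAY_NAMES.some((name) => cmdline.includes(name));
}

console.log(isGatewayCmdline("openclaw-gateway\0--port\u000018789\0"));
console.log(isGatewayCmdline("node\0server.js\0"));
```

Keeping the names in one list means uninstall and destroy could share the same matcher instead of each hard-coding "openshell-gateway".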

The port fallback path (18789→18790) also appears broken: the sandbox is created but never reaches Ready when using the fallback port, which suggests a CHAT_UI_URL mismatch between the Dockerfile ARG and the actual forwarded port.
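
One way to surface such a mismatch early would be to compare the port embedded in CHAT_UI_URL with the port that was actually forwarded, and fail with a clear message when they differ. The sketch below assumes CHAT_UI_URL is an ordinary http(s) URL (its real format is not shown in this issue); `checkChatUiUrl` is a hypothetical helper, not a NemoClaw API.

```typescript
// Hypothetical helper; assumes CHAT_UI_URL looks like http://127.0.0.1:18789.
// Returns null when the ports agree, otherwise a human-readable error string.
function checkChatUiUrl(chatUiUrl: string, forwardedPort: number): string | null {
  const parsed = new URL(chatUiUrl);
  // URL.port is "" when the URL relies on the scheme's default port.
  const defaultPort = parsed.protocol === "https:" ? 443 : 80;
  const urlPort = parsed.port === "" ? defaultPort : Number(parsed.port);
  if (urlPort === forwardedPort) return null;
  return `CHAT_UI_URL points at port ${urlPort} but the sandbox is forwarded on port ${forwardedPort}`;
}

// Simulates the fallback case from this report: URL built for 18789, forward on 18790.
console.log(checkChatUiUrl("http://127.0.0.1:18789", 18790));
```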
Environment
Device:        DGX Spark (spark-6087)
OS:            Ubuntu (aarch64)
Architecture:  aarch64
Node.js:       v22.22.2
npm:           10.9.7
Docker:        Docker CE 28.3.3
OpenShell CLI: 0.0.37
NemoClaw:      v0.0.39
OpenClaw:      2026.4.24
Steps to Reproduce
1. On DGX Spark, run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
2. Select Express setup
3. Sandbox creation fails with the 180s timeout, for any reason (slow build, network, etc.)
4. Observe: orphaned openclaw-gateway still running on port 18789
5. Run: nemoclaw uninstall --yes
6. Run the installer again: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
7. Observe: "Port 18789 is taken. Using port 18790 instead."
8. Sandbox creation fails again — 180s timeout, same pattern
9. Each retry leaves another openclaw-gateway process

Verification:
ss -tlnp | grep 18789
→ LISTEN 127.0.0.1:18789 openclaw-gatewa (stale pid from previous attempt)
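
The same check could be automated by parsing `ss -tlnp` output. The following is a rough TypeScript sketch under stated assumptions: `findListeners` is a hypothetical helper, the sample line follows the usual `ss -tlnp` column layout rather than being copied from this report, and pid 1234 is made up for illustration.

```typescript
interface Listener {
  address: string;
  port: number;
  command: string; // as reported by ss, truncated to 15 characters
  pid: number;
}

// Hypothetical parser for the text printed by `ss -tlnp`.
function findListeners(ssOutput: string, port: number): Listener[] {
  const found: Listener[] = [];
  for (const line of ssOutput.split("\n")) {
    if (!line.startsWith("LISTEN")) continue; // skip header and other states
    const tokens = line.trim().split(/\s+/);
    // The local address is the first column ending in ":<digits>";
    // the peer column of a listener ends in ":*" and is skipped.
    const local = tokens.find((t) => /:\d+$/.test(t));
    if (!local) continue;
    const sep = local.lastIndexOf(":");
    if (Number(local.slice(sep + 1)) !== port) continue;
    // The owner appears as users:(("name",pid=N,fd=M)) when ss can see it.
    const owner = /users:\(\("([^"]+)",pid=(\d+)/.exec(line);
    if (!owner) continue;
    found.push({
      address: local.slice(0, sep),
      port,
      command: owner[1] ?? "",
      pid: Number(owner[2]),
    });
  }
  return found;
}

const sample = [
  "State  Recv-Q Send-Q Local Address:Port Peer Address:Port Process",
  'LISTEN 0      511    127.0.0.1:18789    0.0.0.0:*         users:(("openclaw-gatewa",pid=1234,fd=21))',
  'LISTEN 0      128    0.0.0.0:22         0.0.0.0:*         users:(("sshd",pid=900,fd=3))',
].join("\n");

console.log(findListeners(sample, 18789));
```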
Expected Result
1. nemoclaw onboard --fresh should detect and kill any stale openclaw-gateway processes before starting
2. nemoclaw uninstall should kill openclaw-gateway processes (not just openshell-gateway)
3. destroy.ts:stopDockerDriverGatewayProcess() should also match "openclaw-gateway" in cmdline check
4. Port fallback (18789→18790) should produce a working sandbox, or fail with actionable error
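
Items 1 and 4 could be combined into a single pre-flight decision before onboarding starts. The sketch below is one possible shape, not the project's implementation: `planPreflight`, `PortOwner`, and `Preflight` are hypothetical names, and the match relies on the 15-character process name `openclaw-gatewa` that `ss` reports.

```typescript
type PortOwner = { command: string; pid: number };

type Preflight =
  | { action: "proceed" }
  | { action: "kill-stale"; pids: number[] }
  | { action: "abort"; message: string };

// Assumption: ss shows the gateway under its truncated 15-character name.
const STALE_COMM_PREFIX = "openclaw-gatewa";

// Decide what onboard should do with whatever currently holds the port.
function planPreflight(port: number, owners: PortOwner[]): Preflight {
  if (owners.length === 0) return { action: "proceed" };
  const foreign = owners.filter((o) => !o.command.startsWith(STALE_COMM_PREFIX));
  if (foreign.length > 0) {
    // Never kill a process we do not recognize; report it instead.
    const names = foreign.map((o) => `${o.command} (pid ${o.pid})`).join(", ");
    return {
      action: "abort",
      message: `Port ${port} is held by ${names}; stop it or choose another port.`,
    };
  }
  return { action: "kill-stale", pids: owners.map((o) => o.pid) };
}

console.log(planPreflight(18789, [{ command: "openclaw-gatewa", pid: 4321 }]));
```

Separating the decision from the actual kill keeps the logic testable without touching real processes.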
Actual Result
- Each failed onboard leaves orphaned openclaw-gateway on 18789
- uninstall does not clean it up
- onboard --fresh does not clean it up
- Port fallback to 18790 also fails (sandbox never reaches Ready)
- User is stuck in a retry loop that can only be broken by killing the process manually
- Workaround: manually run "pkill -f openclaw-gatewa" before retrying
Logs
! Port 18789 is taken. Using port 18790 instead.
Direct sandbox GPU enabled; allowing only /proc task comm writes.
Creating sandbox 'my-assistant' (this takes a few minutes on first run)...
...63 Docker build steps complete...
Create stream exited with code 1 after sandbox was created.
Sandbox 'my-assistant' was created but did not become ready within 180s.
The orphaned sandbox has been removed — you can safely retry.

Bug Details

Field        Value
Priority     Unprioritized
Action       Dev - Open - To fix
Disposition  Open issue
Module       Machine Learning - NemoClaw
Keyword      NemoClaw, NemoClaw_Automation, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Install, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended

[NVB#6168123]

Labels

NV QA: Bugs found by the NVIDIA QA Team
UAT: Issues flagged for User Acceptance Testing
