[WSL2 x86_64][Sandbox] NEMOCLAW_GATEWAY_PORT=N onboard recreates global gateway and destroys previous sandbox — concurrent instances unsupported #4422

@wangericnv

Description

The NEMOCLAW_GATEWAY_PORT env var is intended to allow running multiple concurrent NemoClaw instances on different gateway ports (per DevTest case T642366). However, on v0.0.53 (WSL2 x86_64), invoking nemoclaw onboard with NEMOCLAW_GATEWAY_PORT=8081 to create a second sandbox does NOT spin up a second gateway. Instead, it RECREATES the global OpenShell gateway on the new port, which kills the first sandbox's container (SIGKILL, exit 137) and leaves it in Phase=Error. Only the most recently created sandbox remains functional.

The wizard's own step [1/8] warning describes this behavior accurately: "Gateway will be recreated when sandbox creation starts — this will affect running sandboxes." However, the case spec (and the env var's existence) imply that concurrent instances are expected to work. In practice, the implementation is a global singleton gateway whose port can be changed via the env var, not a set of parallel gateways.

Environment

Device:        2u2g-gen-0689 (RTX 5090, 32607 MB GPU, 128 GB RAM)
OS:            Microsoft Windows 11 Enterprise (build 10.0.26100.0)
Architecture:  x86_64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        29.5.2 (native docker-ce in WSL2)
OpenShell CLI: openshell 0.0.44
NemoClaw:      v0.0.53
OpenClaw:      v2026.5.22 (inside sandboxes)
WSL:           Ubuntu 24.04.4 LTS, kernel 6.6.87.2-microsoft-standard-WSL2 x86_64

Steps to Reproduce

Pre-state: clean WSL2, NemoClaw v0.0.53 installed, no existing sandboxes, ports 8080/18789 free, ports 8081/18790 free.

  1. Create sandbox A on default port:
    nemoclaw onboard --fresh --name sandbox-a
    # provider: NVIDIA Endpoints; default 8080/18789
  2. Verify A is up:
    nemoclaw list                # shows sandbox-a
    ss -ltn                      # 127.0.0.1:8080 + 172.18.0.1:8080 + 127.0.0.1:18789
    docker ps                    # openshell-sandbox-a-* Up
  3. Create sandbox B on port 8081:
    NEMOCLAW_GATEWAY_PORT=8081 nemoclaw onboard --fresh --name sandbox-b
  4. Wait for sandbox-b build to complete (~10 min)
  5. Check final state:
    nemoclaw list
    ss -ltn
    docker ps -a
    nemoclaw sandbox-a status
    nemoclaw sandbox-b status

Expected Result

Per DevTest case T642366:

  • Both sandboxes coexist
  • First gateway still listening on 8080 + dashboard on 18789 + sandbox-a container Up + sandbox-a Phase=Ready
  • Second gateway listening on 8081 + dashboard on 18790 (auto-allocated) + sandbox-b container Up + sandbox-b Phase=Ready
  • nemoclaw list shows both sandboxes with distinct dashboard URLs
  • nemoclaw sandbox-a connect and nemoclaw sandbox-b connect both work
  • Independent agent state, no cross-talk
  • Destroying sandbox-b leaves sandbox-a healthy
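
The listener expectations above can be checked mechanically. The following is a minimal sketch, not part of NemoClaw: has_listener and check_expected_ports are hypothetical helpers that read `ss -ltn` output on stdin. It assumes the local address is the fourth column, as in the `ss -ltn` output shown in this report, and uses the four ports named in the expected result.

```shell
# Hypothetical helper (not part of NemoClaw): succeeds if the `ss -ltn` output
# on stdin contains a LISTEN socket whose local address ends in ":<port>".
has_listener() {
  local port="$1"
  awk -v port="$port" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")   # column 4 is the local address; the port is the last ":" field
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Hypothetical helper: report the state of each port that T642366 expects.
check_expected_ports() {
  local snapshot port
  snapshot="$(cat)"               # read the whole `ss -ltn` snapshot once
  for port in 8080 18789 8081 18790; do
    if printf '%s\n' "$snapshot" | has_listener "$port"; then
      echo "port $port: listening"
    else
      echo "port $port: MISSING"
    fi
  done
}
```

Usage: `ss -ltn | check_expected_ports`. If concurrent instances worked as the case describes, no port would be reported as MISSING.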

Actual Result

After sandbox-b onboard completes:

$ ss -ltn | grep -E ':(8080|8081|18789|18790)'
LISTEN 0  128  172.18.0.1:8081  0.0.0.0:*
LISTEN 0  128   0.0.0.0:18789   0.0.0.0:*
LISTEN 0  128   127.0.0.1:8081  0.0.0.0:*
  • No listener on 8080 — sandbox-a's gateway is gone
  • No listener on 18790 — no second dashboard auto-allocated
$ docker ps -a --format 'table {{.Names}}\t{{.Status}}'
NAMES                                                       STATUS
openshell-sandbox-b-040e46a3-...                            Up 3 minutes (unhealthy)
openshell-sanity-nv-51ce3f12-...                            Exited (137) 8 minutes ago

sandbox-a's container was SIGKILLed during sandbox-b's onboard.
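
Exit status 137 follows the common convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL). A small sketch for decoding such a status; signal_from_exit is a hypothetical helper, not a NemoClaw or Docker command.

```shell
# Hypothetical helper: map an exit code to the number of the signal that
# terminated the process, using the 128+N convention.
# Prints nothing and returns non-zero for codes of 128 or below.
signal_from_exit() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    return 1
  fi
}

signal_from_exit 137   # prints 9, the number of SIGKILL
```

The exit code of a stopped container can be read with `docker inspect --format '{{.State.ExitCode}}' <container>`, substituting the full container name, and passed to the helper.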

$ nemoclaw sandbox-a status
Phase: Error
$ nemoclaw list
Sandboxes:
  sandbox-a       dashboard: http://127.0.0.1:18789/
  sandbox-b *     dashboard: http://127.0.0.1:18789/    ← SAME URL

Both list entries show the same dashboard URL — there is only one.
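
The shared dashboard URL can also be detected from the list output. This is a sketch only: duplicate_dashboards is a hypothetical helper, and it assumes `nemoclaw list` prints each sandbox on one line with the URL immediately after the word "dashboard:", as in the output above.

```shell
# Hypothetical helper: read `nemoclaw list` output on stdin and print every
# dashboard URL that is assigned to more than one sandbox.
duplicate_dashboards() {
  awk '
    {
      # The URL is assumed to be the field right after "dashboard:".
      for (i = 1; i < NF; i++) {
        if ($i == "dashboard:") count[$(i + 1)]++
      }
    }
    END {
      for (url in count) if (count[url] > 1) print url
    }
  '
}
```

Usage: `nemoclaw list | duplicate_dashboards`. Any URL it prints is shared by at least two list entries; with truly independent instances it would print nothing.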

Logs

OpenShell gateway log shows the gateway being shut down and restarted on the new port mid-onboard:

/home/lab/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log

... INFO openshell_server: Shutdown signal received; stopping gateway
Error: ... execution error: gateway shutdown cleanup failed: ...
... INFO openshell_server::cli: Starting OpenShell server bind=127.0.0.1:8081
... INFO openshell_server: Server listening address=127.0.0.1:8081
... INFO openshell_server: Server listening address=172.18.0.1:8081
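
To see which addresses the gateway has been started on over time, the bind values can be pulled out of the log. A minimal sketch: gateway_binds is a hypothetical helper, and it assumes the "Starting OpenShell server bind=<addr>" line format seen in the excerpt above.

```shell
# Hypothetical helper: read the gateway log on stdin and print, in order,
# every address from a "Starting OpenShell server bind=<addr>" line.
gateway_binds() {
  grep -o 'Starting OpenShell server bind=[^ ]*' | sed 's/.*bind=//'
}
```

Usage: `gateway_binds < /home/lab/.local/state/nemoclaw/openshell-docker-gateway/openshell-gateway.log`. More than one distinct address in the output suggests that the single gateway was restarted on a different port rather than a second gateway being added.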

The wizard's step [1/8] preflight clearly says:

⚠ Gateway will be recreated when sandbox creation starts — this will
  affect running sandboxes.
Replacing legacy OpenShell gateway metadata with Docker-driver gateway.

So this is documented in the wizard output, but contradicts T642366's expectation of parallel gateways. Either:

  • (a) The case should be marked Cannot Automate / not-supported, OR
  • (b) The implementation should be changed to spin up a second gateway process on the new port instead of recreating the global one.

Suggest (b) since the env var name and the test case both imply parallel support, and the openshell-docker driver already binds to per-port sockets that could in principle coexist.


NVB#6235520

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: wsl (Affects Windows Subsystem for Linux)
