[WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase #4664

@wangericnv

Description

On WSL2 (Ubuntu 24.04 inside Win 11 + Docker Desktop, RTX 5090), NemoClaw onboard reaches [6/8] Creating sandbox successfully — 80/80 Dockerfile steps complete, the GPU-patched container starts with --gpus all OK and shows Status=running, ExitCode=0, Health=starting. But the OpenShell supervisor fails to reconnect to the new GPU-enabled container within the patch timeout, causing the sandbox to enter Error phase. Onboard exits 1 before first-prompt verification can run.

Environment

Device:        WSL2 x86_64 (Win 11 host) — 2u2g-gen-0689 (10.57.210.126)
OS:            Ubuntu 24.04 (WSL2) on Windows 11 Enterprise Build 26100
Architecture:  x86_64
Node.js:       v22.x
npm:           10.x
Docker:        Docker Desktop with WSL2 backend (29.x)
OpenShell CLI: 0.0.44
NemoClaw:      0.1.0 (main HEAD, NEMOCLAW_INSTALL_REF=main; release-tag-equivalent v0.0.57 unreleased)
OpenClaw:      N/A (sandbox never reached Ready)
GPU:           NVIDIA GeForce RTX 5090 (32 GB, sm_120/sm_121 Blackwell)
Provider:      Win-host Ollama (passwordless sudo + NEMOCLAW_PROVIDER=ollama)

Steps to Reproduce

  1. WSL2 prep on 10.57.210.126 (one-time): run sudo visudo -f /etc/sudoers.d/lab and add the line lab ALL=(ALL) NOPASSWD: ALL
  2. Clean wipe in WSL:
    rm -rf ~/.nemoclaw ~/.config/openshell ~/.local/state/nemoclaw \
           ~/.local/bin/nemoclaw ~/.local/bin/openshell*
  3. Run non-interactive install:
    NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_NON_INTERACTIVE=1 \
    NEMOCLAW_PROVIDER=ollama NEMOCLAW_MODEL=qwen2.5:7b \
    NEMOCLAW_YES=1 NEMOCLAW_POLICY_MODE=suggested NEMOCLAW_INSTALL_REF=main \
      bash -c 'curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash'
  4. Wait for full flow: installer download (~3s), node + nemoclaw CLI build (~30s), openshell install, onboard [1-4/8] inference setup, [6/8] sandbox build (80/80 steps, ~10 min)
  5. Right after "Built image openshell/sandbox-from:<tag>" and "Waiting for sandbox to become ready..." the GPU patch begins
  6. Within ~12s the onboard prints "Docker GPU patch failed" and exits RC=1

Expected Result

After GPU patch recreates the sandbox container with --gpus all, OpenShell supervisor should reconnect to the new container, the sandbox phase should transition to Ready, and onboard should proceed to [7/8] / [8/8].

Actual Result

From ~/.nemoclaw/onboard-failures/<ts>-my-assistant-docker-gpu-patch/summary.txt:

  • error = OpenShell supervisor did not reconnect to the GPU-enabled container.
  • failure_kind = sandbox_error_phase
  • failure_headline = OpenShell sandbox entered Error phase before the GPU proof could run.
  • sandbox_phase = Error
  • gpu_mode_attempts: --gpus all: ok (the patch itself succeeded)
  • patched_container_status = running
  • patched_container_exit_code = 0
  • patched_container_health = starting

The new GPU-patched container IS running (Status=running, ExitCode=0, Health=starting, so its health check has not yet reported healthy), but the supervisor reconnect step times out, so onboard treats it as fatal even though the underlying container is fine.

Looks like a reconnect-timing race specific to WSL2/Docker-Desktop (the new container takes a moment longer to fully initialize OCSF supervisor than on native Linux, but the reconnect deadline doesn't account for it).
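For triage, the three patched_container_* values above can be re-read straight from Docker. The sketch below is only an illustration: patched_container_verdict is a hypothetical helper (not part of NemoClaw or OpenShell), and the retry/ok/fatal mapping is an assumption about how these states might be treated. The docker inspect --format fields used (.State.Status, .State.ExitCode, .State.Health.Status) are standard Docker output.

```shell
#!/usr/bin/env bash
# patched_container_verdict CONTAINER  (hypothetical helper, not part of NemoClaw)
# Prints "retry" when the container is running but its health check is still
# starting, "ok" when it is running and healthy (or has no health check),
# and "fatal" in every other case, including a failed docker inspect.
patched_container_verdict() {
  local state
  state=$(docker inspect --format \
    '{{.State.Status}} {{.State.ExitCode}} {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
    "$1") || { echo "fatal"; return 0; }
  case "$state" in
    "running 0 starting")                 echo "retry" ;;
    "running 0 healthy"|"running 0 none") echo "ok" ;;
    *)                                    echo "fatal" ;;
  esac
}

# Usage (pass the ID of the patched container):
#   patched_container_verdict <container-id>
```

For the state recorded in summary.txt (running, 0, starting) this helper prints retry, which matches the observation that the container itself was still coming up rather than dead.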

Logs

docker-logs.txt (new GPU-patched container 88a80839d028…) — supervisor DID start:

2026-06-02T07:55:59.211Z INFO openshell_sandbox: Starting sandbox
2026-06-02T07:55:59.222Z INFO openshell_sandbox: Creating OPA engine from proto policy data
2026-06-02T07:55:59.336Z OCSF CONFIG:DEGRADED [MED] nft not found; bypass detection rules will not be installed
2026-06-02T07:55:59.349Z OCSF NET:LISTEN [INFO] 10.200.0.1:3128
2026-06-02T07:55:59.352Z OCSF SSH:LISTEN [INFO]
2026-06-02T07:55:59.352Z OCSF LIFECYCLE:INSTALL [INFO] OpenShell Sandbox Supervisor success
2026-06-02T07:55:59.352Z INFO openshell_sandbox: supervisor session task spawned

Then ~12 seconds later onboard says GPU patch failed and sandbox = Error.

Phase timings (from INSTALL-START at 0.00s)

Phase                                                                      Wall time
preflight [1/8]                                                            48.5s
gateway [2/8]                                                              50.7s
inference config [3/8]                                                     51.7s
route set ollama-local / qwen2.5:7b                                        64.8s
sandbox build start                                                        89.4s
step 80/80 done                                                            677.2s (build took ~10 min; ~2× slower than Spark Path A's 310s, Docker Desktop overhead)
"Built image" emitted                                                      683.8s
"Recreating OpenShell Docker sandbox container with NVIDIA GPU access..."  688s
"Docker GPU patch failed"                                                  697.5s (≈ 9.5s after recreate started, ≈ 14s after "Built image")

Sandbox built fine, gpu_mode_attempts: --gpus all: ok — only the supervisor reconnect failed.
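The gaps between the last three timestamps can be recomputed from the cumulative wall times. This is a small arithmetic sketch; delta is a hypothetical helper name, and the three values are copied from the timings listed above.

```shell
#!/usr/bin/env bash
# Cumulative timestamps in seconds since INSTALL-START, copied from this report.
built_image=683.8
recreate_start=688
patch_failed=697.5

# delta A B: prints B - A with one decimal place (awk does the float math).
delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", b - a }'
}

echo "recreate start -> patch failed: $(delta "$recreate_start" "$patch_failed")s"
echo "built image    -> patch failed: $(delta "$built_image" "$patch_failed")s"
```

This prints 9.5s for the first gap and 13.7s for the second, so the reconnect window that actually elapsed after the container was recreated was under 10 seconds.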

SECURITY-degraded markers in OCSF log

Documented for completeness — these are NOT the cause; they just document the WSL2 environment:

  • [SECURITY WARNING] setpriv or CAP_SETPCAP unavailable — falling back to gosu
  • [SECURITY] CAP_SETPCAP not available — cannot drop bounding-set caps via capsh
  • [SECURITY] Residual CapBnd=00000004a82c35fb
  • Dangerous caps remain in bounding set: cap_sys_admin,cap_sys_ptrace,cap_net_raw,cap_dac_override,cap_net_bind_service
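The Residual CapBnd value can be cross-checked against the listed capability names by testing individual bits. The sketch below assumes the standard Linux capability numbers from linux/capability.h (cap_dac_override=1, cap_net_bind_service=10, cap_net_raw=13, cap_sys_ptrace=19, cap_sys_admin=21); cap_in_mask is a hypothetical helper name. Where libcap is installed, capsh --decode=<mask> gives the full list directly.

```shell
#!/usr/bin/env bash
# cap_in_mask HEX_MASK BIT: succeeds when capability number BIT is set in HEX_MASK.
cap_in_mask() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Bounding-set mask copied from the OCSF log in this report.
mask=00000004a82c35fb

# name:number pairs, numbers per linux/capability.h.
for entry in cap_dac_override:1 cap_net_bind_service:10 cap_net_raw:13 \
             cap_sys_ptrace:19 cap_sys_admin:21; do
  name=${entry%%:*}
  bit=${entry##*:}
  if cap_in_mask "$mask" "$bit"; then
    echo "$name: present"
  else
    echo "$name: absent"
  fi
done
```

All five capabilities named in the log line come out as present for this mask, which agrees with the "Dangerous caps remain in bounding set" message.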

Diag bundle

~/.nemoclaw/onboard-failures/2026-06-02T07-55-59-515Z-my-assistant-docker-gpu-patch/ (8 files: summary.txt, docker-inspect.json, docker-logs.txt, docker-network-summary.txt, docker-ps.txt, openshell-sandbox-get.txt, openshell-sandbox-list.txt, patched-container-state.json).

Related context

NOT duplicate of:

  • NVBug 6175942 — Spark/Station GPU patch fails at /proc/comm write (different mechanism: patch itself fails on Spark; here the patch succeeds but the supervisor-reconnect step fails)
  • NVBug 6235316 (Ollama proxy port 11435 blocks 5 cases) — Win-host Ollama via NEMOCLAW_PROVIDER=ollama worked here, so unrelated

Suggested fixes

  1. Increase the supervisor-reconnect timeout when the platform is detected as WSL2/Docker Desktop
  2. If the patched container is Status=running + Health=starting, retry reconnect rather than declaring fatal
  3. OR: don't tear down the old container until the new one's supervisor responds successfully (rollback path)
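Fixes 1 and 2 could be combined roughly as sketched below. This is not NemoClaw's implementation: is_wsl, reconnect_deadline_secs, and retry_until are hypothetical names, the 15s/60s deadlines are illustrative values rather than current defaults, and supervisor_is_connected in the usage comment stands in for whatever probe onboard actually uses. The only environmental assumption is that WSL kernels report "microsoft" or "WSL" in uname -r.

```shell
#!/usr/bin/env bash
# is_wsl [RELEASE]: succeeds when the kernel release string identifies WSL.
# RELEASE defaults to `uname -r`; passing it explicitly allows testing off-WSL.
is_wsl() {
  local release="${1:-$(uname -r)}"
  case "$release" in
    *[Mm]icrosoft*|*WSL*) return 0 ;;
    *)                    return 1 ;;
  esac
}

# reconnect_deadline_secs [RELEASE]: hypothetical policy with a longer
# deadline on WSL. The numbers are illustrative only.
reconnect_deadline_secs() {
  if is_wsl "$@"; then echo 60; else echo 15; fi
}

# retry_until DEADLINE_SECS INTERVAL_SECS CMD...:
# re-runs CMD until it succeeds or DEADLINE_SECS have elapsed.
retry_until() {
  local deadline=$(( $(date +%s) + $1 ))
  local interval="$2"
  shift 2
  while ! "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

# Usage (supervisor_is_connected is a hypothetical probe):
#   retry_until "$(reconnect_deadline_secs)" 2 supervisor_is_connected \
#     || echo "supervisor did not reconnect before the deadline"
```

Polling until a deadline, rather than failing on the first unsuccessful reconnect, keeps the native-Linux path fast while giving a container that is still in Health=starting time to bring up its supervisor.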

NVB#6256537

Metadata

Labels

  • NV QA (Bugs found by the NVIDIA QA Team)
  • UAT (Issues flagged for User Acceptance Testing.)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: wsl (Affects Windows Subsystem for Linux)
  • v0.0.59 (Release target)
