Description
On WSL2 (Ubuntu 24.04 inside Win 11 + Docker Desktop, RTX 5090), NemoClaw onboard reaches [6/8] Creating sandbox successfully — 80/80 Dockerfile steps complete, the GPU-patched container starts with --gpus all OK and shows Status=running, ExitCode=0, Health=starting. But the OpenShell supervisor fails to reconnect to the new GPU-enabled container within the patch timeout, causing the sandbox to enter Error phase. Onboard exits 1 before first-prompt verification can run.
Environment
Device: WSL2 x86_64 (Win 11 host) — 2u2g-gen-0689 (10.57.210.126)
OS: Ubuntu 24.04 (WSL2) on Windows 11 Enterprise Build 26100
Architecture: x86_64
Node.js: v22.x
npm: 10.x
Docker: Docker Desktop with WSL2 backend (29.x)
OpenShell CLI: 0.0.44
NemoClaw: 0.1.0 (main HEAD, NEMOCLAW_INSTALL_REF=main; release-tag-equivalent v0.0.57 unreleased)
OpenClaw: N/A (sandbox never reached Ready)
GPU: NVIDIA GeForce RTX 5090 (32 GB, sm_120/sm_121 Blackwell)
Provider: Win-host Ollama (passwordless sudo + NEMOCLAW_PROVIDER=ollama)
Steps to Reproduce
- WSL2 prep on 10.57.210.126 (one-time): sudo visudo -f /etc/sudoers.d/lab, add lab ALL=(ALL) NOPASSWD: ALL
- Clean wipe in WSL:
rm -rf ~/.nemoclaw ~/.config/openshell ~/.local/state/nemoclaw \
~/.local/bin/nemoclaw ~/.local/bin/openshell*
- Run non-interactive install:
NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_NON_INTERACTIVE=1 \
NEMOCLAW_PROVIDER=ollama NEMOCLAW_MODEL=qwen2.5:7b \
NEMOCLAW_YES=1 NEMOCLAW_POLICY_MODE=suggested NEMOCLAW_INSTALL_REF=main \
bash -c 'curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash'
- Wait for full flow: installer download (~3s), node + nemoclaw CLI build (~30s), openshell install, onboard [1-4/8] inference setup, [6/8] sandbox build (80/80 steps, ~10 min)
- Right after "Built image openshell/sandbox-from:<tag>" and "Waiting for sandbox to become ready..." the GPU patch begins
- Within ~12s the onboard prints "Docker GPU patch failed" and exits RC=1
Expected Result
After GPU patch recreates the sandbox container with --gpus all, OpenShell supervisor should reconnect to the new container, the sandbox phase should transition to Ready, and onboard should proceed to [7/8] / [8/8].
Actual Result
From ~/.nemoclaw/onboard-failures/<ts>-my-assistant-docker-gpu-patch/summary.txt:
error = OpenShell supervisor did not reconnect to the GPU-enabled container.
failure_kind = sandbox_error_phase
failure_headline = OpenShell sandbox entered Error phase before the GPU proof could run.
sandbox_phase = Error
gpu_mode_attempts: --gpus all: ok — the patch itself succeeded
patched_container_status = running
patched_container_exit_code = 0
patched_container_health = starting
The new GPU-patched container IS running (ExitCode=0, Health=starting, so not yet reported healthy), but the supervisor reconnect step times out, so onboard treats it as fatal even though the underlying container appears fine.
Looks like a reconnect-timing race specific to WSL2/Docker-Desktop (the new container takes a moment longer to fully initialize OCSF supervisor than on native Linux, but the reconnect deadline doesn't account for it).
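For triage, the summary fields above can be checked mechanically. The following is a minimal sketch, assuming summary.txt consists of "key = value" and "key: value" lines as quoted in this report; the function names and the classification rule are hypothetical and not part of NemoClaw.

```python
# Sketch only: the summary.txt layout is assumed from the lines quoted in this report.
SAMPLE = """\
error = OpenShell supervisor did not reconnect to the GPU-enabled container.
failure_kind = sandbox_error_phase
sandbox_phase = Error
gpu_mode_attempts: --gpus all: ok
patched_container_status = running
patched_container_exit_code = 0
patched_container_health = starting
"""


def parse_summary(text: str) -> dict[str, str]:
    """Collect 'key = value' and 'key: value' lines into a dict."""
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if " = " in line:
            key, value = line.split(" = ", 1)
        elif ": " in line:
            key, value = line.split(": ", 1)
        else:
            continue  # skip lines that match neither assumed shape
        fields[key.strip()] = value.strip()
    return fields


def looks_like_reconnect_race(fields: dict[str, str]) -> bool:
    """Hypothetical rule: the patch succeeded and the container is up, yet the sandbox is in Error."""
    return (
        fields.get("failure_kind") == "sandbox_error_phase"
        and fields.get("gpu_mode_attempts", "").startswith("--gpus all: ok")
        and fields.get("patched_container_status") == "running"
        and fields.get("patched_container_exit_code") == "0"
    )


fields = parse_summary(SAMPLE)
print(looks_like_reconnect_race(fields))
```

A bundle matching this rule points at the reconnect step rather than at the GPU patch itself, which is the distinction this report draws against NVBug 6175942.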
Logs
docker-logs.txt (new GPU-patched container 88a80839d028…) — supervisor DID start:
2026-06-02T07:55:59.211Z INFO openshell_sandbox: Starting sandbox
2026-06-02T07:55:59.222Z INFO openshell_sandbox: Creating OPA engine from proto policy data
2026-06-02T07:55:59.336Z OCSF CONFIG:DEGRADED [MED] nft not found; bypass detection rules will not be installed
2026-06-02T07:55:59.349Z OCSF NET:LISTEN [INFO] 10.200.0.1:3128
2026-06-02T07:55:59.352Z OCSF SSH:LISTEN [INFO]
2026-06-02T07:55:59.352Z OCSF LIFECYCLE:INSTALL [INFO] OpenShell Sandbox Supervisor success
2026-06-02T07:55:59.352Z INFO openshell_sandbox: supervisor session task spawned
Then ~12 seconds later onboard says GPU patch failed and sandbox = Error.
Phase timings (from INSTALL-START at 0.00s)
| Phase | Wall time |
|---|---|
| preflight [1/8] | 48.5s |
| gateway [2/8] | 50.7s |
| inference config [3/8] | 51.7s |
| route set ollama-local / qwen2.5:7b | 64.8s |
| sandbox build start | 89.4s |
| step 80/80 done | 677.2s (~10 min, ~2× slower than Spark Path A's 310s; Docker Desktop overhead) |
| "Built image" emitted | 683.8s |
| "Recreating OpenShell Docker sandbox container with NVIDIA GPU access..." | 688s |
| "Docker GPU patch failed" | 697.5s (≈ 9.5s after the recreate message, ≈ 14s after "Built image") |
Sandbox built fine, gpu_mode_attempts: --gpus all: ok — only the supervisor reconnect failed.
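The per-phase durations follow from the cumulative wall times reported above. This is a small sketch that only does the subtraction; the shortened phase labels are mine, and the numbers are copied from the timings in this report.

```python
# Cumulative wall times (seconds since INSTALL-START), copied from this report.
PHASES = [
    ("preflight [1/8]", 48.5),
    ("gateway [2/8]", 50.7),
    ("inference config [3/8]", 51.7),
    ("route set", 64.8),
    ("sandbox build start", 89.4),
    ("step 80/80 done", 677.2),
    ("Built image", 683.8),
    ("Recreating container", 688.0),
    ("Docker GPU patch failed", 697.5),
]


def deltas(phases):
    """Return (phase, seconds elapsed since the previous phase) pairs."""
    return [
        (name, round(t - prev_t, 1))
        for (_, prev_t), (name, t) in zip(phases, phases[1:])
    ]


elapsed = dict(deltas(PHASES))
for name, seconds in elapsed.items():
    print(f"{name:<28} +{seconds}s")
```

By these numbers the image build takes 587.8s, and the failure is printed 9.5s after the recreate message (13.7s after "Built image"), so the reconnect window is a small fraction of the total run.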
SECURITY-degraded markers in OCSF log
Documented for completeness — these are NOT the cause; they just document the WSL2 environment:
[SECURITY WARNING] setpriv or CAP_SETPCAP unavailable — falling back to gosu
[SECURITY] CAP_SETPCAP not available — cannot drop bounding-set caps via capsh
[SECURITY] Residual CapBnd=00000004a82c35fb
Dangerous caps remain in bounding set: cap_sys_admin,cap_sys_ptrace,cap_net_raw,cap_dac_override,cap_net_bind_service
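The residual CapBnd value can be decoded by hand to confirm which capabilities the mask contains. The sketch below assumes the standard Linux capability numbering (bit 0 = cap_chown through bit 40 = cap_checkpoint_restore); the helper name is hypothetical, and `capsh --decode=<hex>` should give an equivalent listing where capsh is installed.

```python
# Capability names indexed by bit number, per the Linux capability numbering.
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
    "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_capbnd(hex_mask: str) -> list[str]:
    """Return the capability names whose bits are set in a CapBnd hex mask.

    Bits above the last known capability are ignored in this sketch.
    """
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


caps = decode_capbnd("00000004a82c35fb")
print(", ".join(caps))
```

The decoded set includes the five capabilities flagged in the log line above, which is consistent with the warning being an accurate description of the WSL2 environment rather than a symptom of this failure.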
Diag bundle
~/.nemoclaw/onboard-failures/2026-06-02T07-55-59-515Z-my-assistant-docker-gpu-patch/ (8 files: summary.txt, docker-inspect.json, docker-logs.txt, docker-network-summary.txt, docker-ps.txt, openshell-sandbox-get.txt, openshell-sandbox-list.txt, patched-container-state.json).
Related context
NOT duplicate of:
- NVBug 6175942: Spark/Station GPU patch fails at /proc/comm write (different mechanism: patch itself fails on Spark; here the patch succeeds but the supervisor-reconnect step fails)
- NVBug 6235316 (Ollama proxy port 11435 blocks 5 cases): Win-host Ollama via NEMOCLAW_PROVIDER=ollama worked here, so unrelated
Suggested fixes
- Increase the supervisor-reconnect timeout when WSL2/Docker Desktop is detected
- If the patched container is Status=running + Health=starting, retry reconnect rather than declaring fatal
- OR: don't tear down the old container until the new one's supervisor responds successfully (rollback path)
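The retry idea in the second suggestion could look roughly like the sketch below. It is illustrative only: the function, its parameters, and the 60s/2s defaults are hypothetical and do not reflect NemoClaw's or OpenShell's actual code, and the two probes are injected so the loop makes no claim about how either tool reports state.

```python
import time
from typing import Callable


def wait_for_supervisor(
    probe_supervisor: Callable[[], bool],
    probe_container: Callable[[], tuple[str, str]],
    deadline_s: float = 60.0,   # hypothetical default, not taken from NemoClaw
    interval_s: float = 2.0,    # hypothetical default
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Keep retrying the reconnect while the container is still coming up.

    probe_supervisor() returns True once the supervisor has reconnected.
    probe_container() returns (status, health), e.g. ("running", "starting").
    """
    start = clock()
    while True:
        if probe_supervisor():
            return True
        status, health = probe_container()
        # Fail fast only when the container is genuinely down or unhealthy.
        if status != "running" or health == "unhealthy":
            return False
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)


# Usage with stand-in probes: the supervisor answers on the third attempt.
attempts = {"n": 0}


def fake_supervisor() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


ok = wait_for_supervisor(
    fake_supervisor,
    lambda: ("running", "starting"),
    deadline_s=10.0,
    interval_s=0.0,
    sleep=lambda _s: None,
)
print(ok, attempts["n"])
```

The design choice here is that Health=starting is treated as "keep waiting" rather than as failure, so only an exited or unhealthy container, or an exhausted deadline, ends the wait.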
NVB#6256537