[WSL2 x86_64][Sandbox] OpenShell supervisor fails to reconnect to GPU-patched sandbox container; sandbox enters Error phase

## Description

On WSL2 (Ubuntu 24.04 inside Windows 11 + Docker Desktop, RTX 5090), NemoClaw onboard reaches `[6/8] Creating sandbox` and builds the image successfully: all 80/80 Dockerfile steps complete, and the GPU-patched container starts with `--gpus all` and reports `Status=running`, `ExitCode=0`, `Health=starting`. However, the OpenShell supervisor fails to reconnect to the new GPU-enabled container within the patch timeout, so the sandbox enters the Error phase. Onboard exits with code 1 before first-prompt verification can run.

## Environment

```text
Device:        WSL2 x86_64 (Win 11 host) — 2u2g-gen-0689 (10.57.210.126)
OS:            Ubuntu 24.04 (WSL2) on Windows 11 Enterprise Build 26100
Architecture:  x86_64
Node.js:       v22.x
npm:           10.x
Docker:        Docker Desktop with WSL2 backend (29.x)
OpenShell CLI: 0.0.44
NemoClaw:      0.1.0 (main HEAD, NEMOCLAW_INSTALL_REF=main; release-tag-equivalent v0.0.57 unreleased)
OpenClaw:      N/A (sandbox never reached Ready)
GPU:           NVIDIA GeForce RTX 5090 (32 GB, sm_120 Blackwell)
Provider:      Win-host Ollama (passwordless sudo + NEMOCLAW_PROVIDER=ollama)
```

## Steps to Reproduce

1. One-time WSL2 prep on `10.57.210.126`: run `sudo visudo -f /etc/sudoers.d/lab` and add the line `lab ALL=(ALL) NOPASSWD: ALL`
2. Clean wipe in WSL:
   ```bash
   rm -rf ~/.nemoclaw ~/.config/openshell ~/.local/state/nemoclaw \
          ~/.local/bin/nemoclaw ~/.local/bin/openshell*
   ```
3. Run non-interactive install:
   ```bash
   NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_NON_INTERACTIVE=1 \
   NEMOCLAW_PROVIDER=ollama NEMOCLAW_MODEL=qwen2.5:7b \
   NEMOCLAW_YES=1 NEMOCLAW_POLICY_MODE=suggested NEMOCLAW_INSTALL_REF=main \
     bash -c 'curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash'
   ```
4. Wait for full flow: installer download (~3s), node + nemoclaw CLI build (~30s), openshell install, onboard `[1-4/8]` inference setup, `[6/8]` sandbox build (80/80 steps, ~10 min)
5. Right after `"Built image openshell/sandbox-from:<tag>"` and `"Waiting for sandbox to become ready..."` the GPU patch begins
6. Within ~12s, onboard prints `"Docker GPU patch failed"` and exits with code 1

## Expected Result

After the GPU patch recreates the sandbox container with `--gpus all`, the OpenShell supervisor should reconnect to the new container, the sandbox phase should transition to `Ready`, and onboard should proceed to `[7/8]` / `[8/8]`.

## Actual Result

From `~/.nemoclaw/onboard-failures/<ts>-my-assistant-docker-gpu-patch/summary.txt`:

- `error = OpenShell supervisor did not reconnect to the GPU-enabled container.`
- `failure_kind = sandbox_error_phase`
- `failure_headline = OpenShell sandbox entered Error phase before the GPU proof could run.`
- `sandbox_phase = Error`
- `gpu_mode_attempts: --gpus all: ok` — **the patch itself succeeded**
- `patched_container_status = running`
- `patched_container_exit_code = 0`
- `patched_container_health = starting`

The new GPU-patched container **IS** running with exit code 0 (its health check is still `starting`), but the supervisor reconnect step times out, so onboard treats the failure as fatal even though the underlying container is fine.

This looks like a reconnect-timing race specific to WSL2/Docker Desktop: the new container may take slightly longer to fully initialize the OpenShell supervisor than on native Linux, and the reconnect deadline does not account for that.

## Logs

`docker-logs.txt` (new GPU-patched container `88a80839d028…`) — supervisor DID start:

```text
2026-06-02T07:55:59.211Z INFO openshell_sandbox: Starting sandbox
2026-06-02T07:55:59.222Z INFO openshell_sandbox: Creating OPA engine from proto policy data
2026-06-02T07:55:59.336Z OCSF CONFIG:DEGRADED [MED] nft not found; bypass detection rules will not be installed
2026-06-02T07:55:59.349Z OCSF NET:LISTEN [INFO] 10.200.0.1:3128
2026-06-02T07:55:59.352Z OCSF SSH:LISTEN [INFO]
2026-06-02T07:55:59.352Z OCSF LIFECYCLE:INSTALL [INFO] OpenShell Sandbox Supervisor success
2026-06-02T07:55:59.352Z INFO openshell_sandbox: supervisor session task spawned
```

About **12 seconds later**, onboard reports that the GPU patch failed and `sandbox_phase = Error`.
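
To put a number on how quickly the supervisor came up inside the new container, here is a minimal sketch that diffs the first and last timestamps from the excerpt above. It assumes the leading field of each line is an ISO-8601 UTC timestamp, as shown; `parse_ts` and `startup_ms` are illustrative helper names, not part of any NemoClaw or OpenShell tooling.

```python
from datetime import datetime

# First and last lines copied from the docker-logs.txt excerpt above.
LOG_LINES = [
    "2026-06-02T07:55:59.211Z INFO openshell_sandbox: Starting sandbox",
    "2026-06-02T07:55:59.352Z INFO openshell_sandbox: supervisor session task spawned",
]


def parse_ts(line: str) -> datetime:
    """Parse the leading ISO-8601 UTC timestamp of a log line."""
    stamp = line.split(" ", 1)[0]
    # Swap the trailing "Z" for an explicit offset so older Pythons accept it.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def startup_ms(lines: list[str]) -> float:
    """Milliseconds elapsed between the first and the last log line."""
    return (parse_ts(lines[-1]) - parse_ts(lines[0])).total_seconds() * 1000.0


print(f"supervisor startup: {startup_ms(LOG_LINES):.0f} ms")
```

For this excerpt the gap is about 141 ms between `Starting sandbox` and `supervisor session task spawned`.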

### Phase timings (from `INSTALL-START` at 0.00s)

| Phase | Wall time |
|---|---|
| preflight `[1/8]` | 48.5s |
| gateway `[2/8]` | 50.7s |
| inference config `[3/8]` | 51.7s |
| route set `ollama-local / qwen2.5:7b` | 64.8s |
| sandbox build start | 89.4s |
| step 80/80 done | 677.2s (build took ≈ 588s, ~10 min; **~2× slower than Spark Path A's 310s**, Docker Desktop overhead) |
| "Built image" emitted | 683.8s |
| "Recreating OpenShell Docker sandbox container with NVIDIA GPU access..." | 688s |
| **"Docker GPU patch failed"** | **697.5s** (≈ 9.5s after recreate started, ≈ 14s after "Built image") |

Sandbox built fine, `gpu_mode_attempts: --gpus all: ok` — only the supervisor reconnect failed.
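
The table lists cumulative elapsed times, so the interesting intervals have to be derived. A small sketch of that arithmetic, using only the values from the table (the shortened labels and the `gap` helper are illustrative):

```python
# Elapsed seconds since INSTALL-START, copied from the table above.
MARKS = [
    ("preflight [1/8]", 48.5),
    ("gateway [2/8]", 50.7),
    ("inference config [3/8]", 51.7),
    ("route set", 64.8),
    ("sandbox build start", 89.4),
    ("step 80/80 done", 677.2),
    ("Built image emitted", 683.8),
    ("recreate with GPU started", 688.0),
    ("Docker GPU patch failed", 697.5),
]


def gap(marks, start, end):
    """Seconds between two named marks, rounded to 0.1 s."""
    times = dict(marks)
    return round(times[end] - times[start], 1)


print(gap(MARKS, "sandbox build start", "step 80/80 done"))            # 587.8
print(gap(MARKS, "recreate with GPU started", "Docker GPU patch failed"))  # 9.5
print(gap(MARKS, "Built image emitted", "Docker GPU patch failed"))     # 13.7
```

So the image build took roughly 588 s, and the failure was declared about 9.5 s after the recreate message (about 13.7 s after "Built image").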

### SECURITY-degraded markers in OCSF log

Listed for completeness. These are NOT the cause; they only describe the WSL2 environment:

- `[SECURITY WARNING] setpriv or CAP_SETPCAP unavailable — falling back to gosu`
- `[SECURITY] CAP_SETPCAP not available — cannot drop bounding-set caps via capsh`
- `[SECURITY] Residual CapBnd=00000004a82c35fb`
- `Dangerous caps remain in bounding set: cap_sys_admin,cap_sys_ptrace,cap_net_raw,cap_dac_override,cap_net_bind_service`
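
As a cross-check, the residual `CapBnd` mask can be decoded by hand. The sketch below maps bit positions to capability names using the numbering from `linux/capability.h` (bits 0 to 40); `decode_caps` is an illustrative helper, not something the sandbox ships.

```python
# Capability names indexed by bit number, per linux/capability.h (0..40).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask_hex: str) -> list[str]:
    """Return the capability names whose bits are set in a hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


caps = decode_caps("00000004a82c35fb")
reported = {
    "cap_sys_admin", "cap_sys_ptrace", "cap_net_raw",
    "cap_dac_override", "cap_net_bind_service",
}
print(len(caps), "capabilities set")
print("missing from mask:", sorted(reported - set(caps)))
```

All five capabilities named in the warning are present in the mask, which has 18 bits set in total. Bit 8 (`cap_setpcap`) is also set in this bounding-set value; the logs above do not say which capability set the "CAP_SETPCAP not available" message refers to.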

### Diag bundle

`~/.nemoclaw/onboard-failures/2026-06-02T07-55-59-515Z-my-assistant-docker-gpu-patch/` (8 files: `summary.txt`, `docker-inspect.json`, `docker-logs.txt`, `docker-network-summary.txt`, `docker-ps.txt`, `openshell-sandbox-get.txt`, `openshell-sandbox-list.txt`, `patched-container-state.json`).

## Related context

**NOT duplicate** of:

- NVBug [6175942](https://nvbugspro.nvidia.com/bug/6175942) — Spark/Station GPU patch fails at `/proc/comm` write (different mechanism: patch itself fails on Spark; here the patch succeeds but the supervisor-reconnect step fails)
- NVBug [6235316](https://nvbugspro.nvidia.com/bug/6235316) (Ollama proxy port 11435 blocks 5 cases) — Win-host Ollama via `NEMOCLAW_PROVIDER=ollama` worked here, so unrelated

## Suggested fixes

1. Increase the supervisor-reconnect timeout when WSL2/Docker Desktop is detected
2. If the patched container is `Status=running` + `Health=starting`, retry the reconnect rather than treating the failure as fatal
3. Alternatively, don't tear down the old container until the new one's supervisor responds successfully (this keeps a rollback path)
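
A rough sketch of what fixes 1 and 2 could look like together. Everything here is hypothetical: `ContainerState`, `still_coming_up`, `reconnect_with_retry`, and the deadline value are illustration only and do not reflect the actual NemoClaw or OpenShell code. The reconnect attempt and the container inspection are passed in as callables so the loop stays independent of how they are implemented.

```python
import time
from typing import Callable, NamedTuple


class ContainerState(NamedTuple):
    status: str     # e.g. "running", "exited"
    exit_code: int
    health: str     # e.g. "starting", "healthy", "unhealthy"


def still_coming_up(state: ContainerState) -> bool:
    """True while the container looks alive, so a retry is worthwhile."""
    return (
        state.status == "running"
        and state.exit_code == 0
        and state.health in ("starting", "healthy")
    )


def reconnect_with_retry(
    try_reconnect: Callable[[], bool],
    inspect: Callable[[], ContainerState],
    deadline_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry until reconnect succeeds, the container dies, or the deadline passes."""
    give_up_at = clock() + deadline_s
    while True:
        if try_reconnect():
            return True
        if not still_coming_up(inspect()):
            return False  # the container itself failed; retrying will not help
        if clock() >= give_up_at:
            return False  # container alive, but the deadline is exhausted
        sleep(interval_s)


# Demo with fakes: the reconnect succeeds on the third attempt.
attempts = iter([False, False, True])
state = ContainerState("running", 0, "starting")
ok = reconnect_with_retry(
    lambda: next(attempts), lambda: state, deadline_s=30.0, sleep=lambda _s: None
)
print(ok)
```

A platform check could then pass a larger `deadline_s` on WSL2/Docker Desktop; the right value is not something this report can determine.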

---
[NVB#6256537](https://nvbugspro.nvidia.com/bug/6256537)


Issue #4664
