[Ubuntu 24.04][Onboard] nemoclaw onboard --gpu always fails with Docker GPU patch supervisor reconnect timeout on aarch64 dual-GPU host

## Description

On an Ubuntu 24.04 aarch64 host with two GPUs (RTX PRO 6000 Blackwell and GB300), `nemoclaw onboard --gpu` consistently fails at the Docker GPU patch stage. After NemoClaw recreates the sandbox container with `--gpus`, the OpenShell supervisor cannot reconnect (`openshell sandbox exec -n <name> -- true` keeps failing), the sandbox enters the Error phase, and the pre-patch sandbox is rolled back.

The rolled-back container is left degraded: only PID 1 (`openshell-sandbox`) and `sleep infinity` are running, with no `nemoclaw-start` and no `openclaw` gateway, yet `nemoclaw status` still reports `Phase: Ready`.

Tried 3 GPU mode variants; all of them fail:

- `--gpu` auto (`--gpus all`) → sandbox Error phase
- `--gpu` explicit (`--gpus all`) → sandbox Error phase
- `--sandbox-gpu-device <UUID>` (`--gpus device=<uuid>`) → patch failed, pre-patch restored

Blocks ALL test cases requiring GPU sandbox on aarch64 (e.g. T6115528 GPU sandbox DNS, T6115545 Shields Down on GPU sandbox).

The escape hatch `NEMOCLAW_DOCKER_GPU_PATCH=0` works but disables GPU passthrough entirely.

Note: The NemoClaw-side rollback logic works correctly. The actual reconnect failure is in OpenShell, not in NemoClaw's TypeScript. But the user-visible impact on NemoClaw is total: no `--gpu` onboard succeeds on this host.
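
To check by hand whether the supervisor ever becomes reachable, the reconnect probe quoted above can be wrapped in a small retry loop. This is a minimal sketch, not NemoClaw's actual wait logic: `wait_for_probe` is a hypothetical helper, and the attempt count and delay in the example are arbitrary values, not NemoClaw's real timeout.

```bash
# wait_for_probe ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or ATTEMPTS is used up.
# Returns 0 on the first success, 1 if every attempt failed.
wait_for_probe() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "probe succeeded on attempt $i"
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "probe failed after $attempts attempts"
  return 1
}

# Example (sandbox name from this report; 30 attempts x 2 s is an assumed budget):
#   wait_for_probe 30 2 openshell sandbox exec -n dns-gpu-test -- true
```

Running the example against the GPU-patched container while onboard is waiting would show whether the probe fails permanently or only needs longer than the onboard wait allows.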

## Environment

```text
Host:           galaxy-sku2-018 (10.176.173.194)
OS:             Ubuntu 24.04.4 LTS
Architecture:   aarch64
Kernel:         6.17.0-1021-nvidia-64k
GPUs:           NVIDIA RTX PRO 6000 Blackwell Max-Q (97887 MB) + NVIDIA GB300 (256703 MB)
NVIDIA driver:  610.43.02 (CUDA 13.3)
Docker:         29.2.1
nvidia-container-toolkit: 1.19.0 (CDI specs at /var/run/cdi/nvidia.yaml — nvidia-ctk cdi list shows 5 devices)
NemoClaw:       v0.0.60
OpenShell:      0.0.44 (docker driver)
```

## Steps to Reproduce

1. On the aarch64 dual-GPU host, install NemoClaw v0.0.60:

   ```bash
   curl -fsSL https://www.nvidia.com/nemoclaw.sh \
     | NEMOCLAW_INSTALL_TAG=v0.0.60 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 bash
   ```

2. Set credentials and run `--gpu` onboard:

   ```bash
   export NVIDIA_API_KEY=nvapi-...
   export NEMOCLAW_PROVIDER=build
   export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
   nemoclaw onboard --gpu --name dns-gpu-test --non-interactive
   ```

3. Observe the Docker GPU patch failure message.
4. Inspect the resulting container:

   ```bash
   docker exec $(docker ps --filter name=openshell-dns-gpu --format '{{.Names}}' | head -1) ps -ef
   ```

## Expected Result

Onboard completes successfully with sandbox GPU enabled. The sandbox container has `nemoclaw-start` and the `openclaw` gateway running, and `nvidia-smi` works inside the sandbox.

## Actual Result

Onboard ends with:

```text
Docker GPU patch failed.
OpenShell supervisor did not reconnect to the GPU-enabled container; pre-patch sandbox restored.
OpenShell sandbox entered Error phase before the GPU proof could run.
  sandbox_phase=Error
  patched_create_option=--gpus all
Diagnostics saved: /localhome/local-mercl/.nemoclaw/onboard-failures/2026-06-08T10-09-03-095Z-dns-gpu-test-docker-gpu-patch
```

`docker exec` into the restored container shows only:

```text
UID     PID  PPID  CMD
root    1    0     /opt/openshell/bin/openshell-sandbox
sandbox 94   1     sleep infinity
```
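
Because `nemoclaw status` still reports `Phase: Ready` for this container, a process-level check is a more reliable signal of the degraded state. The sketch below scans `ps -ef` output for the two processes this report expects. `check_sandbox_procs` is a hypothetical helper, and it assumes the command lines of a healthy sandbox contain the strings `nemoclaw-start` and `openclaw`.

```bash
# check_sandbox_procs: reads `ps -ef` output on stdin and reports whether the
# processes expected in a healthy sandbox are present.
# Assumption: their command lines contain "nemoclaw-start" and "openclaw".
check_sandbox_procs() {
  local listing proc missing=""
  listing="$(cat)"
  for proc in nemoclaw-start openclaw; do
    if ! printf '%s\n' "$listing" | grep -qF -- "$proc"; then
      missing="$missing $proc"
    fi
  done
  if [ -n "$missing" ]; then
    echo "degraded: missing$missing"
    return 1
  fi
  echo "healthy"
}

# Example, reusing the container lookup from Steps to Reproduce:
#   docker exec "$(docker ps --filter name=openshell-dns-gpu --format '{{.Names}}' | head -1)" ps -ef \
#     | check_sandbox_procs
```

Fed the process listing above, the function prints `degraded: missing nemoclaw-start openclaw` and returns a non-zero status.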

## Logs

```text
Diagnostic directory on host:
  /localhome/local-mercl/.nemoclaw/onboard-failures/2026-06-08T10-09-03-095Z-dns-gpu-test-docker-gpu-patch/
    summary.txt (failure metadata: sandbox_phase, patched_create_option, etc.)
    docker-inspect.json
    docker-logs.txt
    openshell-sandbox-get.txt (Error phase reading)
    openshell-sandbox-list.txt

Code references:
  src/lib/onboard/docker-gpu-supervisor-reconnect.ts:118-148 (detects + rolls back correctly)
  src/lib/onboard/docker-gpu-sandbox-create.ts:165-214 (reconnect wait + failure exit)

Suggested investigation:
  Root cause is in OpenShell supervisor's ability to reconnect to a --gpus-patched
  container on aarch64 + kernel 6.17 nvidia-64k + dual GPU. May be specific to
  nvidia-container-toolkit 1.19.0 + this Blackwell GPU combo. Route to OpenShell
  team for the underlying fix; NemoClaw side is correct.
```
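
One way to narrow the suggested investigation is to test GPU passthrough in a plain Docker container, with no OpenShell supervisor involved, using the same `--gpus` values NemoClaw patched in. This is a sketch under assumptions: `gpu_smoke_test` is a hypothetical helper, the `ubuntu:24.04` image is an arbitrary choice, and it relies on nvidia-container-toolkit injecting `nvidia-smi` into the container.

```bash
# gpu_smoke_test GPUS_VALUE
# Hypothetical helper: lists GPUs from inside a throwaway container started
# with the given --gpus value (for example "all" or "device=<uuid>").
# DOCKER and IMAGE can be overridden; ubuntu:24.04 is an assumed default image.
gpu_smoke_test() {
  local gpus="$1"
  "${DOCKER:-docker}" run --rm --gpus "$gpus" "${IMAGE:-ubuntu:24.04}" nvidia-smi -L
}

# Examples mirroring the variants tried in this report:
#   gpu_smoke_test all
#   gpu_smoke_test "device=<UUID>"
```

If both invocations list the expected GPUs, plain Docker passthrough works on this host and the failure is more likely in the supervisor reconnect path. If they fail, the toolkit or driver combination is the first thing to look at.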

---
[NVB#6282407](https://nvbugspro.nvidia.com/bug/6282407)

