Description
On an Ubuntu 24.04 aarch64 host with two GPUs (RTX PRO 6000 Blackwell + GB300), nemoclaw onboard --gpu consistently fails at the Docker GPU patch stage. After NemoClaw recreates the sandbox container with --gpus, the OpenShell supervisor cannot reconnect (openshell sandbox exec -n <name> -- true keeps failing), the sandbox enters the Error phase, and the pre-patch sandbox is rolled back.
The rolled-back container is left degraded: only PID 1 (openshell-sandbox) and sleep infinity are running, with no nemoclaw-start and no openclaw gateway, yet nemoclaw status still reports Phase: Ready.
Three GPU mode variants were tried, and all of them fail:
--gpu auto (--gpus all) → sandbox Error phase
--gpu explicit (--gpus all) → sandbox Error phase
--sandbox-gpu-device <UUID> (--gpus device=<uuid>) → patch failed, pre-patch restored
This blocks all test cases that require a GPU sandbox on aarch64 (e.g. T6115528 GPU sandbox DNS, T6115545 Shields Down on GPU sandbox).
The escape hatch NEMOCLAW_DOCKER_GPU_PATCH=0 works, but it disables GPU passthrough entirely.
Note: NemoClaw side rollback logic works correctly. The actual reconnect failure is in OpenShell, not in NemoClaw's TypeScript. But the user-visible impact on NemoClaw is total: no --gpu onboard succeeds on this host.
Environment
Host: galaxy-sku2-018 (10.176.173.194)
OS: Ubuntu 24.04.4 LTS
Architecture: aarch64
Kernel: 6.17.0-1021-nvidia-64k
GPUs: NVIDIA RTX PRO 6000 Blackwell Max-Q (97887 MB) + NVIDIA GB300 (256703 MB)
NVIDIA driver: 610.43.02 (CUDA 13.3)
Docker: 29.2.1
nvidia-container-toolkit: 1.19.0 (CDI specs at /var/run/cdi/nvidia.yaml — nvidia-ctk cdi list shows 5 devices)
NemoClaw: v0.0.60
OpenShell: 0.0.44 (docker driver)
Steps to Reproduce
- On an aarch64 dual-GPU host, install NemoClaw v0.0.60:
curl -fsSL https://www.nvidia.com/nemoclaw.sh \
| NEMOCLAW_INSTALL_TAG=v0.0.60 NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 bash
- Set credentials and run the --gpu onboard:
export NVIDIA_API_KEY=nvapi-...
export NEMOCLAW_PROVIDER=build
export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
nemoclaw onboard --gpu --name dns-gpu-test --non-interactive
- Observe the Docker GPU patch failure message.
- Inspect the resulting container:
docker exec $(docker ps --filter name=openshell-dns-gpu --format '{{.Names}}' | head -1) ps -ef
Expected Result
Onboard completes successfully with Sandbox GPU enabled. Sandbox container has nemoclaw-start + openclaw gateway running. nvidia-smi works inside the sandbox.
Actual Result
Onboard ends with:
Docker GPU patch failed.
OpenShell supervisor did not reconnect to the GPU-enabled container; pre-patch sandbox restored.
OpenShell sandbox entered Error phase before the GPU proof could run.
sandbox_phase=Error
patched_create_option=--gpus all
Diagnostics saved: /localhome/local-mercl/.nemoclaw/onboard-failures/2026-06-08T10-09-03-095Z-dns-gpu-test-docker-gpu-patch
docker exec into the restored container shows only:
UID PID PPID CMD
root 1 0 /opt/openshell/bin/openshell-sandbox
sandbox 94 1 sleep infinity
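Because nemoclaw status reports Phase: Ready for this container, the degraded state is easier to catch from the process list itself. The following is a minimal sketch, not part of NemoClaw or OpenShell: the function name check_sandbox_procs is hypothetical, and it assumes that a healthy sandbox shows both nemoclaw-start and openclaw in ps -ef output, as described under Expected Result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a sandbox from `ps -ef` output read on stdin.
# Assumption: a healthy sandbox lists both nemoclaw-start and openclaw.
check_sandbox_procs() {
  local ps_out name missing=""
  ps_out="$(cat)"
  for name in nemoclaw-start openclaw; do
    # -F: match the name as a fixed string, not a regular expression
    if ! grep -qF -- "$name" <<<"$ps_out"; then
      missing="$missing $name"
    fi
  done
  if [ -z "$missing" ]; then
    echo "healthy"
  else
    echo "degraded: missing$missing"
    return 1
  fi
}
```

It can be fed from the same command used in Steps to Reproduce, for example: docker exec <container> ps -ef | check_sandbox_procs. For the process list shown above it prints "degraded: missing nemoclaw-start openclaw" and returns a non-zero status.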
Logs
Diagnostic directory on host:
/localhome/local-mercl/.nemoclaw/onboard-failures/2026-06-08T10-09-03-095Z-dns-gpu-test-docker-gpu-patch/
summary.txt (failure metadata: sandbox_phase, patched_create_option, etc.)
docker-inspect.json
docker-logs.txt
openshell-sandbox-get.txt (Error phase reading)
openshell-sandbox-list.txt
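When comparing several failure directories, it can help to pull individual fields out of summary.txt. The sketch below assumes the file holds plain key=value lines, matching the sandbox_phase=Error and patched_create_option=--gpus all lines printed by the onboard; the actual file layout has not been verified, and summary_field is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of KEY from key=value lines on stdin.
# Assumption: summary.txt uses one unindented key=value pair per line.
summary_field() {
  local key="$1"
  # Match only lines that start with "KEY=" and print everything after the "=".
  awk -v k="$key" 'index($0, k "=") == 1 { print substr($0, length(k) + 2); exit }'
}
```

Example use: summary_field sandbox_phase < summary.txt, run inside the diagnostic directory. Only the first matching line is printed, and values containing spaces (such as --gpus all) are returned whole.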
Code references:
src/lib/onboard/docker-gpu-supervisor-reconnect.ts:118-148 (detects + rolls back correctly)
src/lib/onboard/docker-gpu-sandbox-create.ts:165-214 (reconnect wait + failure exit)
Suggested investigation:
The root cause is in the OpenShell supervisor's ability to reconnect to a
--gpus-patched container on aarch64 + kernel 6.17 nvidia-64k + dual GPU. It may
be specific to nvidia-container-toolkit 1.19.0 combined with this Blackwell GPU
pairing. Route to the OpenShell team for the underlying fix; the NemoClaw side
is correct.
NVB#6282407