Description
NemoClaw onboard step 6/8 fails when adding GPU access to the sandbox container on hosts with Docker 29.x where the nvidia container runtime is not registered. Docker's --gpus all fallback path tries to enumerate an AMD CDI spec and aborts, killing sandbox creation even though the host has only NVIDIA GPUs. Preflight reports all-green for GPU support, so the failure surfaces only after the user has waited ~3 minutes for the sandbox image to build.
Environment
Device: Ubuntu 24.04 server (RTX 5050, 8151 MB)
OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
Architecture: x86_64
Node.js: v22.22.3
npm: 10.9.8
Docker: 29.4.3 (Server Version)
OpenShell CLI: openshell 0.0.39
NemoClaw: v0.0.43
OpenClaw: 2026.4.24
Docker runtimes registered (docker info): io.containerd.runc.v2, runc
Default runtime: runc
nvidia runtime registered: NO
nvidia-container-toolkit (dpkg): 1.19.0-1
/etc/docker/daemon.json: does not exist
/etc/cdi/nvidia.yaml: present
/var/run/cdi/nvidia.yaml: present
NVIDIA driver: 595.58.03 (CUDA 13.2)
Steps to Reproduce
1. Provision a clean Ubuntu 24.04 host with an NVIDIA GPU.
2. Install Docker 29.x and nvidia-container-toolkit 1.19 via apt. Do NOT run "sudo nvidia-ctk runtime configure --runtime=docker".
3. Confirm "docker info" shows Runtimes: io.containerd.runc.v2 runc (no nvidia).
4. Run the express installer:
curl -fsSL https://www.nvidia.com/nemoclaw.sh \
| NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_PROVIDER=install-ollama bash
5. Allow preflight, inference setup, and sandbox image build to complete (about 3 minutes, 65 Dockerfile steps).
6. At step 6/8 the onboard aborts with the error below.
Expected Result
Either preflight detects the missing nvidia runtime and prompts (or fixes) before the sandbox build, or docker-gpu-patch falls back to the cdi mode (--device nvidia.com/gpu=0) when --gpus all fails. Onboard completes with a working GPU-enabled sandbox.
Actual Result
Preflight reports all-green: Docker CDI GPU support detected, Sandbox GPU: enabled (auto). Sandbox image builds successfully. At step 6/8 onboard exits 1 with:
Recreating OpenShell Docker sandbox container with NVIDIA GPU access...
openshell-my-assistant-e9cbbfaa-61a9-4ed3-8983-44090359d623
Docker GPU patch failed.
Could not start GPU-enabled sandbox container: docker: Error response from daemon: AMD CDI spec not found
Run 'docker run --help' for more information
Diagnostics saved: ~/.nemoclaw/onboard-failures/TIMESTAMP-my-assistant-docker-gpu-patch
Escape hatch: set NEMOCLAW_DOCKER_GPU_PATCH=0 to skip this patch.
Reproduction outside NemoClaw
Running directly on the same host (no nemoclaw involved) confirms the issue is in Docker's --gpus fallback path, not nemoclaw runtime logic:
$ docker run --rm --gpus all hello-world
docker: Error response from daemon: AMD CDI spec not found
$ docker run --rm --device nvidia.com/gpu=0 hello-world
Hello from Docker! # works
$ docker run --rm --device nvidia.com/gpu=all hello-world
Hello from Docker! # works
Root Cause
Docker 29.x resolves --gpus all by first looking for a GPU-aware runtime (e.g. nvidia). When none is registered, it falls back to CDI auto-enumeration that hardcodes an AMD CDI vendor lookup; the missing AMD spec is treated as fatal even though no AMD GPU is present. NemoClaw's docker-gpu-patch (src/lib/onboard/docker-gpu-patch.ts) tries the gpus mode first (--gpus all), trips this docker bug, treats it as fatal, and never tries the cdi mode (--device nvidia.com/gpu=0) which works on the same host.
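The missing fallback described above could look roughly like the following. This is a minimal sketch, not the code in docker-gpu-patch.ts: every name here (GpuMode, DockerRunner, buildRunArgs, startWithGpuFallback) is hypothetical, the argument lists mirror the manual "docker run --rm" reproduction commands rather than the real sandbox recreation, and the Docker call is injected as a function so the logic can be exercised without a daemon.

```typescript
// Hypothetical sketch: try the "gpus" mode first, then retry with "cdi".
// None of these names come from docker-gpu-patch.ts.
type GpuMode = "gpus" | "cdi";

interface RunResult {
  ok: boolean;
  stderr: string;
}

// Receives the argv that would follow the `docker` binary.
type DockerRunner = (args: string[]) => RunResult;

interface GpuStartOutcome {
  mode: GpuMode | null; // null when every mode failed
  errors: string[];     // one entry per failed attempt, in order
}

function buildRunArgs(mode: GpuMode, image: string, device: string): string[] {
  const gpuArgs =
    mode === "gpus"
      ? ["--gpus", "all"]
      : ["--device", `nvidia.com/gpu=${device}`];
  return ["run", "--rm", ...gpuArgs, image];
}

function startWithGpuFallback(
  run: DockerRunner,
  image: string,
  device: string = "all",
): GpuStartOutcome {
  const errors: string[] = [];
  const modes: GpuMode[] = ["gpus", "cdi"];
  for (const mode of modes) {
    const result = run(buildRunArgs(mode, image, device));
    if (result.ok) {
      return { mode, errors };
    }
    // Keep the failure (e.g. "AMD CDI spec not found") for diagnostics,
    // but do not treat it as fatal until the remaining modes are tried.
    errors.push(`${mode}: ${result.stderr}`);
  }
  return { mode: null, errors };
}
```

The sketch retries on any "gpus" failure rather than matching only the "AMD CDI spec not found" string; narrowing the retry to that pattern would be an equally valid choice.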
Why some Ubuntu 24 + GPU hosts hit it, others don't
Three factors must all be true:
Factor                                    | Affected (hits bug)    | Unaffected
------------------------------------------|------------------------|---------------------------
Docker version                            | 29.x                   | up to 26.x
nvidia runtime in /etc/docker/daemon.json | absent                 | present
nvidia-ctk runtime configure ever run?    | no (plain apt install) | yes (manual or cloud-init)
Cloud-init / Ansible templates commonly run nvidia-ctk runtime configure, which is why most pre-baked internal images are immune. Freshly provisioned hosts that picked up Docker 29 and only ran apt install nvidia-container-toolkit hit it.
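The affected-host condition can be expressed as a small predicate that a preflight check might evaluate. This is a sketch under stated assumptions: the names HostState and needsNvidiaRuntimeConfigure are hypothetical, the DockerInfo fields are the ServerVersion and Runtimes keys of the JSON printed by "docker info --format '{{json .}}'", and toolkitInstalled is assumed to be determined elsewhere (for example from dpkg). The threshold of 29 follows the versions named in this report, which does not state the behavior of Docker 27.x or 28.x.

```typescript
// Hypothetical preflight predicate; names are illustrative only.

// Subset of the JSON emitted by `docker info --format '{{json .}}'`.
interface DockerInfo {
  ServerVersion: string;             // e.g. "29.4.3"
  Runtimes: Record<string, unknown>; // keyed by runtime name, e.g. "runc"
}

interface HostState {
  dockerInfo: DockerInfo;
  toolkitInstalled: boolean; // assumed to be probed separately, e.g. via dpkg
}

// True when the host matches the affected configuration: Docker 29 or newer,
// nvidia-container-toolkit installed, and no "nvidia" runtime registered.
function needsNvidiaRuntimeConfigure(host: HostState): boolean {
  // parseInt stops at the first non-digit, so "29.4.3" yields 29.
  const major = Number.parseInt(host.dockerInfo.ServerVersion, 10);
  const hasNvidiaRuntime = "nvidia" in host.dockerInfo.Runtimes;
  return major >= 29 && host.toolkitInstalled && !hasNvidiaRuntime;
}
```

For the host in the Environment section (Docker 29.4.3, runtimes io.containerd.runc.v2 and runc, toolkit 1.19.0-1 installed) the predicate evaluates to true, which is the state in which preflight would prompt for "sudo nvidia-ctk runtime configure --runtime=docker".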
Workarounds verified on this host
1. Skip the patch entirely:
NEMOCLAW_DOCKER_GPU_PATCH=0 nemoclaw onboard ...
Sandbox creates without GPU passthrough into the container. Ollama still runs on the host GPU; sandbox reaches ollama via loopback. End-to-end inference functional. Confirmed via "nemoclaw my-assistant doctor" reporting inference OK and both provider endpoints (11434 backend + 11435 auth proxy) reachable.
2. Fix the host:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Writes /etc/docker/daemon.json with the nvidia runtime. After this, "docker run --gpus all" works and onboard succeeds with full sandbox GPU passthrough.
Suggested Fixes
1. Preflight (preferred): when docker version >= 29 AND nvidia-container-toolkit is installed AND no nvidia runtime appears in "docker info", either (a) prompt the user to run "nvidia-ctk runtime configure --runtime=docker" and restart docker, or (b) fail preflight with a clear remediation message. Do not let onboard reach step 6/8 in this state.
2. docker-gpu-patch (defensive): when the "gpus" mode fails (specifically on the "AMD CDI spec not found" pattern, or as a general fallback), automatically retry the "cdi" mode ("--device nvidia.com/gpu=DEVICE") before giving up. The cdi path is verified working on the same host where the gpus path fails.
Closest existing bugs (searched, not duplicates)
6175942 / GitHub #3511 - same docker-gpu-patch.ts file, different failure point (/proc/comm write) - already Fixed
6175658 / GitHub #3506 - "nvidia-ctk command not found" preflight messaging - different scenario (toolkit missing entirely)
6154930 / GitHub #3174 - preflight nvidia-smi reporting when toolkit missing
Bug Details
| Field | Value |
|---|---|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Install, NemoClaw_Onboard, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-Sprint4-Blocker |
[NVB#6180013]