
[Ubuntu 24.04][Onboard] sandbox creation aborts on Docker 29 hosts without nvidia container runtime #3575

@hulynn

Description

NemoClaw onboard step 6/8 fails when adding GPU access to the sandbox container on hosts with Docker 29.x where the nvidia container runtime is not registered. Docker's --gpus all fallback path tries to enumerate an AMD CDI spec and aborts, killing sandbox creation even though the host has only NVIDIA GPUs. Preflight reports all-green for GPU support, so the failure surfaces only after the user has waited ~3 minutes for the sandbox image to build.
Environment

Device:        Ubuntu 24.04 server (RTX 5050, 8151 MB)
OS:            Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
Architecture:  x86_64
Node.js:       v22.22.3
npm:           10.9.8
Docker:        29.4.3 (Server Version)
OpenShell CLI: openshell 0.0.39
NemoClaw:      v0.0.43
OpenClaw:      2026.4.24

Docker runtimes registered (docker info):   io.containerd.runc.v2, runc
Default runtime:                             runc
nvidia runtime registered:                   NO
nvidia-container-toolkit (dpkg):             1.19.0-1
/etc/docker/daemon.json:                     does not exist
/etc/cdi/nvidia.yaml:                        present
/var/run/cdi/nvidia.yaml:                    present
NVIDIA driver:                               595.58.03 (CUDA 13.2)
Steps to Reproduce
1. Provision a clean Ubuntu 24.04 host with an NVIDIA GPU.
2. Install Docker 29.x and nvidia-container-toolkit 1.19 via apt. Do NOT run "sudo nvidia-ctk runtime configure --runtime=docker".
3. Confirm "docker info" shows Runtimes: io.containerd.runc.v2 runc  (no nvidia).
4. Run the express installer:
     curl -fsSL https://www.nvidia.com/nemoclaw.sh \
       | NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_PROVIDER=install-ollama bash
5. Allow preflight, inference setup, and sandbox image build to complete (about 3 minutes, 65 Dockerfile steps).
6. At step 6/8 the onboard aborts with the error below.
Expected Result

Either preflight detects the missing nvidia runtime and prompts (or fixes) before the sandbox build, or docker-gpu-patch falls back to the cdi mode (--device nvidia.com/gpu=0) when --gpus all fails. Onboard completes with a working GPU-enabled sandbox.

Actual Result
Preflight reports all-green: Docker CDI GPU support detected, Sandbox GPU: enabled (auto). Sandbox image builds successfully. At step 6/8 onboard exits 1 with:

Recreating OpenShell Docker sandbox container with NVIDIA GPU access...
openshell-my-assistant-e9cbbfaa-61a9-4ed3-8983-44090359d623

  Docker GPU patch failed.
  Could not start GPU-enabled sandbox container: docker: Error response from daemon: AMD CDI spec not found

Run 'docker run --help' for more information
  Diagnostics saved: ~/.nemoclaw/onboard-failures/TIMESTAMP-my-assistant-docker-gpu-patch
  Escape hatch: set NEMOCLAW_DOCKER_GPU_PATCH=0 to skip this patch.
Reproduction outside NemoClaw

Running Docker directly on the same host (no NemoClaw involved) confirms that the issue is in Docker's --gpus fallback path, not in NemoClaw's runtime logic:

$ docker run --rm --gpus all hello-world
docker: Error response from daemon: AMD CDI spec not found

$ docker run --rm --device nvidia.com/gpu=0 hello-world
Hello from Docker!         # works

$ docker run --rm --device nvidia.com/gpu=all hello-world
Hello from Docker!         # works
Root Cause

Docker 29.x resolves --gpus all by first looking for a GPU-aware runtime (e.g. nvidia). When none is registered, it falls back to CDI auto-enumeration, which hardcodes an AMD CDI vendor lookup; the missing AMD spec is treated as fatal even though no AMD GPU is present. NemoClaw's docker-gpu-patch (src/lib/onboard/docker-gpu-patch.ts) tries the gpus mode first (--gpus all), hits this Docker bug, treats the failure as fatal, and never tries the cdi mode (--device nvidia.com/gpu=0), which works on the same host.

Why some Ubuntu 24 + GPU hosts hit it, others don't
Three factors must all be true:

Factor                                    | Affected (hits bug)    | Unaffected
------------------------------------------|------------------------|---------------------------
Docker version                            | 29.x                   | up to 26.x
nvidia runtime in /etc/docker/daemon.json | absent                 | present
nvidia-ctk runtime configure ever run?    | no (plain apt install) | yes (manual or cloud-init)
Cloud-init / Ansible templates commonly run nvidia-ctk runtime configure, which is why most pre-baked internal images are immune. Freshly provisioned hosts that picked up Docker 29 and only ran apt install nvidia-container-toolkit hit it.
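
The affected combination can be checked from observable host state. The sketch below is illustrative only: `HostFacts`, `parseRuntimes`, and `isAffectedHost` are hypothetical names, not NemoClaw code. It assumes the plain-text `docker info` output contains a `Runtimes:` line like the one shown in the reproduction steps, and it treats "no nvidia runtime listed" as a stand-in for the daemon.json and nvidia-ctk factors. The threshold of major version 29 is an assumption taken from this report; versions between 26.x and 29.x were not tested here.

```typescript
// Hypothetical shape for the facts a preflight step would collect.
interface HostFacts {
  dockerServerVersion: string; // e.g. "29.4.3"
  dockerInfoText: string;      // plain-text output of `docker info`
  toolkitInstalled: boolean;   // nvidia-container-toolkit is installed
}

// Extract runtime names from the "Runtimes:" line of `docker info` output.
// Returns an empty list when no such line is found.
function parseRuntimes(dockerInfoText: string): string[] {
  for (const line of dockerInfoText.split(/\r?\n/)) {
    const match = line.match(/^\s*Runtimes:\s*(.*)$/);
    if (match) {
      return (match[1] ?? "").split(/\s+/).filter((name) => name.length > 0);
    }
  }
  return [];
}

// True when the host matches the "Affected" column: Docker 29 or newer
// (assumed threshold), toolkit installed, and no nvidia runtime registered.
function isAffectedHost(facts: HostFacts): boolean {
  const major = parseInt(facts.dockerServerVersion.split(".")[0] ?? "", 10);
  const hasNvidiaRuntime =
    parseRuntimes(facts.dockerInfoText).indexOf("nvidia") >= 0;
  return major >= 29 && facts.toolkitInstalled && !hasNvidiaRuntime;
}
```

With the values from the Environment section (Docker 29.4.3, toolkit 1.19.0-1 installed, runtimes io.containerd.runc.v2 and runc), `isAffectedHost` returns true.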

Workarounds verified on this host

1. Skip the patch entirely:
     NEMOCLAW_DOCKER_GPU_PATCH=0 nemoclaw onboard ...
   The sandbox is created without GPU passthrough into the container. Ollama still runs on the host GPU, and the sandbox reaches Ollama via loopback. End-to-end inference is functional, confirmed by "nemoclaw my-assistant doctor" reporting inference OK and both provider endpoints (11434 backend + 11435 auth proxy) reachable.

2. Fix the host:
     sudo nvidia-ctk runtime configure --runtime=docker
     sudo systemctl restart docker
   This writes /etc/docker/daemon.json with the nvidia runtime. After this, "docker run --gpus all" works and onboard succeeds with full sandbox GPU passthrough.
Suggested Fixes
1. Preflight (preferred): when the Docker version is >= 29 AND nvidia-container-toolkit is installed AND no nvidia runtime appears in "docker info", either (a) prompt the user to run "nvidia-ctk runtime configure --runtime=docker" and restart Docker, or (b) fail preflight with a clear remediation message. Do not let onboard reach step 6/8 in this state.

2. docker-gpu-patch (defensive): when the "gpus" mode fails (specifically on the "AMD CDI spec not found" pattern, or as a general fallback), automatically retry with the "cdi" mode ("--device nvidia.com/gpu=DEVICE") before giving up. The cdi path is verified working on the same host where the gpus path fails.
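
A minimal sketch of the defensive fallback in fix 2, assuming the general-fallback variant (retry on any gpus-mode failure). The names `GpuMode`, `StartResult`, `gpuArgs`, and `startWithGpuFallback` are hypothetical and do not reflect the actual structure of docker-gpu-patch.ts; the `start` callback stands in for whatever the patch uses to launch the container. Only the two flag forms verified in this report are used.

```typescript
// The two GPU modes named in this report.
type GpuMode = "gpus" | "cdi";

// Result of one container start attempt; stderr carries Docker's error text.
interface StartResult {
  ok: boolean;
  stderr: string;
}

interface Attempt {
  mode: GpuMode;
  result: StartResult;
}

// Build the docker flags for a mode. `device` is the CDI device name,
// e.g. "0" or "all" (both worked on the reporting host).
function gpuArgs(mode: GpuMode, device: string): string[] {
  return mode === "gpus"
    ? ["--gpus", "all"]
    : ["--device", `nvidia.com/gpu=${device}`];
}

// Try the "gpus" mode first, then fall back to "cdi" if it fails.
// Returns the mode that succeeded (or null) plus every attempt made,
// so diagnostics can still record the original gpus-mode error.
function startWithGpuFallback(
  start: (args: string[]) => StartResult, // hypothetical launcher callback
  device: string = "0",
): { mode: GpuMode | null; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  const order: GpuMode[] = ["gpus", "cdi"];
  for (const mode of order) {
    const result = start(gpuArgs(mode, device));
    attempts.push({ mode, result });
    if (result.ok) {
      return { mode, attempts };
    }
  }
  return { mode: null, attempts };
}
```

Keeping the failed attempts in the return value means the existing diagnostics bundle could still capture the "AMD CDI spec not found" message even when the cdi retry succeeds.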
Closest existing bugs (searched, not duplicates)
6175942 / GitHub #3511 - same docker-gpu-patch.ts file, different failure point (/proc/comm write) - already Fixed
6175658 / GitHub #3506 - "nvidia-ctk command not found" preflight messaging - different scenario (toolkit missing entirely)
6154930 / GitHub #3174 - preflight nvidia-smi reporting when toolkit missing

Bug Details

Field       | Value
------------|------------------------------------------------------------
Priority    | Unprioritized
Action      | Dev - Open - To fix
Disposition | Open issue
Module      | Machine Learning - NemoClaw
Keyword     | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Install, NemoClaw_Onboard, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-Sprint4-Blocker

[NVB#6180013]

Metadata

Labels

NV QA - Bugs found by the NVIDIA QA Team
UAT - Issues flagged for User Acceptance Testing.
platform: container - Affects Docker, containerd, Podman, or images
platform: ubuntu - Affects Ubuntu Linux environments
v0.0.51 - Release target
