[Ubuntu 24.04][Onboard] sandbox creation aborts on Docker 29 hosts without nvidia container runtime

## Description

NemoClaw onboard step 6/8 fails when adding GPU access to the sandbox container on hosts with Docker 29.x where the nvidia container runtime is not registered. Docker's <code>--gpus all</code> fallback path tries to enumerate an AMD CDI spec and aborts, killing sandbox creation even though the host has only NVIDIA GPUs. Preflight reports all-green for GPU support, so the failure surfaces only after the user has waited ~3 minutes for the sandbox image to build.
Environment
<pre>Device: Ubuntu 24.04 server (RTX 5050, 8151 MB)
OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
Architecture: x86_64
Node.js: v22.22.3
npm: 10.9.8
Docker: 29.4.3 (Server Version)
OpenShell CLI: openshell 0.0.39
NemoClaw: v0.0.43
OpenClaw: 2026.4.24

Docker runtimes registered (docker info): io.containerd.runc.v2, runc
Default runtime: runc
nvidia runtime registered: NO
nvidia-container-toolkit (dpkg): 1.19.0-1
/etc/docker/daemon.json: does not exist
/etc/cdi/nvidia.yaml: present
/var/run/cdi/nvidia.yaml: present
NVIDIA driver: 595.58.03 (CUDA 13.2)</pre>

Steps to Reproduce

<pre>1. Provision a clean Ubuntu 24.04 host with an NVIDIA GPU.
2. Install Docker 29.x and nvidia-container-toolkit 1.19 via apt. Do NOT run "sudo nvidia-ctk runtime configure --runtime=docker".
3. Confirm "docker info" shows Runtimes: io.containerd.runc.v2 runc (no nvidia).
4. Run the express installer:
 curl -fsSL https://www.nvidia.com/nemoclaw.sh \
 | NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1 NEMOCLAW_PROVIDER=install-ollama bash
5. Allow preflight, inference setup, and sandbox image build to complete (about 3 minutes, 65 Dockerfile steps).
6. At step 6/8 the onboard aborts with the error below.</pre>

Expected Result
Either preflight detects the missing nvidia runtime and prompts the user (or fixes it) before the sandbox build, or docker-gpu-patch falls back to the cdi mode (<code>--device nvidia.com/gpu=0</code>) when <code>--gpus all</code> fails. In either case, onboard completes with a working GPU-enabled sandbox.

Actual Result
Preflight reports all-green: <code>Docker CDI GPU support detected</code>, <code>Sandbox GPU: enabled (auto)</code>. Sandbox image builds successfully. At step 6/8 onboard exits 1 with:

<pre>Recreating OpenShell Docker sandbox container with NVIDIA GPU access...
openshell-my-assistant-e9cbbfaa-61a9-4ed3-8983-44090359d623

 Docker GPU patch failed.
 Could not start GPU-enabled sandbox container: docker: Error response from daemon: AMD CDI spec not found

Run 'docker run --help' for more information
 Diagnostics saved: ~/.nemoclaw/onboard-failures/TIMESTAMP-my-assistant-docker-gpu-patch
 Escape hatch: set NEMOCLAW_DOCKER_GPU_PATCH=0 to skip this patch.</pre>

Reproduction outside NemoClaw
Running Docker directly on the same host (no NemoClaw involved) shows that the failure originates in Docker's <code>--gpus</code> fallback path, not in NemoClaw's runtime logic:
<pre>$ docker run --rm --gpus all hello-world
docker: Error response from daemon: AMD CDI spec not found

$ docker run --rm --device nvidia.com/gpu=0 hello-world
Hello from Docker! # works

$ docker run --rm --device nvidia.com/gpu=all hello-world
Hello from Docker! # works</pre>

Root Cause
Docker 29.x resolves <code>--gpus all</code> by first looking for a GPU-aware runtime (e.g. <code>nvidia</code>). When none is registered, it falls back to a CDI auto-enumeration path that hardcodes an AMD CDI vendor lookup; the missing AMD spec is treated as fatal even though no AMD GPU is present. NemoClaw's docker-gpu-patch (<code>src/lib/onboard/docker-gpu-patch.ts</code>) tries the <code>gpus</code> mode first (<code>--gpus all</code>), trips this Docker bug, treats the error as fatal, and never tries the <code>cdi</code> mode (<code>--device nvidia.com/gpu=0</code>), which works on the same host.
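
The mode ordering described above, with the retry that is currently missing, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in <code>docker-gpu-patch.ts</code>: <code>startWithGpuFallback</code>, <code>RunResult</code>, <code>MODE_FLAGS</code>, and the injected <code>run</code> callback are hypothetical names, and the callback stands in for however the patch actually invokes <code>docker run</code>. The flags are the ones exercised manually in this report.

```typescript
// Hypothetical sketch: none of these names are taken from docker-gpu-patch.ts.
type GpuMode = "gpus" | "cdi";

interface RunResult {
  ok: boolean;     // true when the container started
  stderr: string;  // daemon error text when it did not
}

// Flags per mode, matching the manual reproduction in this report.
const MODE_FLAGS: Record<GpuMode, string[]> = {
  gpus: ["--gpus", "all"],
  cdi: ["--device", "nvidia.com/gpu=all"],
};

interface GpuStartOutcome {
  mode: GpuMode | null;                            // mode that succeeded, or null
  attempts: { mode: GpuMode; stderr: string }[];   // failed attempts, in order
}

// Try each mode in order and stop at the first one that starts the container.
// Failures are recorded instead of being treated as fatal immediately.
function startWithGpuFallback(
  run: (gpuFlags: string[]) => RunResult,
  order: GpuMode[] = ["gpus", "cdi"],
): GpuStartOutcome {
  const attempts: { mode: GpuMode; stderr: string }[] = [];
  for (const mode of order) {
    const result = run(MODE_FLAGS[mode]);
    if (result.ok) {
      return { mode, attempts };
    }
    attempts.push({ mode, stderr: result.stderr });
  }
  return { mode: null, attempts };
}

// Fake runner that imitates the affected host: --gpus fails, --device works.
const simulatedHost = (gpuFlags: string[]): RunResult =>
  gpuFlags[0] === "--gpus"
    ? { ok: false, stderr: "AMD CDI spec not found" }
    : { ok: true, stderr: "" };

console.log(startWithGpuFallback(simulatedHost).mode); // prints "cdi"
```

Injecting the runner keeps the ordering logic testable without a Docker daemon; a real implementation would also need to decide whether to retry on any failure or only on the <code>AMD CDI spec not found</code> pattern.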

Why some Ubuntu 24 + GPU hosts hit it, others don't
Three factors must all be true:
<pre>Factor                                    | Affected (hits bug)    | Unaffected
------------------------------------------|------------------------|---------------------------
Docker version                            | 29.x                   | up to 26.x
nvidia runtime in /etc/docker/daemon.json | absent                 | present
nvidia-ctk runtime configure ever run?    | no (plain apt install) | yes (manual or cloud-init)</pre>

Cloud-init / Ansible templates commonly run <code>nvidia-ctk runtime configure</code>, which is why most pre-baked internal images are immune. Freshly provisioned hosts that picked up Docker 29 and only ran <code>apt install nvidia-container-toolkit</code> hit the bug.
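
The combination of factors above can be expressed as a small check. The sketch below is only an illustration: <code>HostFacts</code>, <code>hasNvidiaRuntime</code>, and <code>isAffectedHost</code> are hypothetical names, the inputs are assumed to have been collected beforehand (for example from <code>docker version</code>, <code>docker info</code>, and <code>dpkg</code>), and the <code>>= 29</code> threshold is an assumption, since this report only covers 29.x versus 26.x and earlier.

```typescript
// Hypothetical sketch: helper names are illustrative, not NemoClaw APIs.
// Inputs are assumed to come from commands such as:
//   docker version --format '{{.Server.Version}}'
//   docker info --format '{{json .Runtimes}}'
//   dpkg -s nvidia-container-toolkit
interface HostFacts {
  dockerServerVersion: string;  // e.g. "29.4.3"
  runtimesJson: string;         // JSON object keyed by runtime name
  toolkitInstalled: boolean;    // nvidia-container-toolkit present
}

// True when a runtime named "nvidia" is registered with the daemon.
// Assumes runtimesJson is a valid JSON object; invalid input throws.
function hasNvidiaRuntime(runtimesJson: string): boolean {
  const runtimes = JSON.parse(runtimesJson) as Record<string, unknown>;
  return "nvidia" in runtimes;
}

// True when the host matches the affected column of the table:
// Docker 29 or later, toolkit installed, no nvidia runtime registered.
function isAffectedHost(facts: HostFacts): boolean {
  // parseInt stops at the first ".", so "29.4.3" yields 29.
  const major = Number.parseInt(facts.dockerServerVersion, 10);
  return (
    major >= 29 &&
    facts.toolkitInstalled &&
    !hasNvidiaRuntime(facts.runtimesJson)
  );
}

// Values taken from the Environment section of this report; the per-runtime
// payload is illustrative, only the keys matter to the check.
const reportedHost: HostFacts = {
  dockerServerVersion: "29.4.3",
  runtimesJson:
    '{"io.containerd.runc.v2":{"path":"runc"},"runc":{"path":"runc"}}',
  toolkitInstalled: true,
};

console.log(isAffectedHost(reportedHost)); // prints true
```

A preflight step could use a check like this to stop before the sandbox image build and point the user at <code>sudo nvidia-ctk runtime configure --runtime=docker</code>.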

Workarounds verified on this host

<pre>1. Skip the patch entirely:
 NEMOCLAW_DOCKER_GPU_PATCH=0 nemoclaw onboard ...
 The sandbox is created without GPU passthrough into the container. Ollama still runs on the host GPU, and the sandbox reaches Ollama via loopback. End-to-end inference is functional, confirmed via "nemoclaw my-assistant doctor" reporting inference OK and both provider endpoints (11434 backend + 11435 auth proxy) reachable.

2. Fix the host:
 sudo nvidia-ctk runtime configure --runtime=docker
 sudo systemctl restart docker
 Writes /etc/docker/daemon.json with the nvidia runtime. After this, "docker run --gpus all" works and onboard succeeds with full sandbox GPU passthrough.</pre>

Suggested Fixes

<pre>1. Preflight (preferred): when the Docker version is >= 29 AND nvidia-container-toolkit is installed AND no nvidia runtime appears in "docker info", either (a) prompt the user to run "nvidia-ctk runtime configure --runtime=docker" and restart Docker, or (b) fail preflight with a clear remediation message. Do not let onboard reach step 6/8 in this state.

2. docker-gpu-patch (defensive): when the "gpus" mode fails (specifically on the "AMD CDI spec not found" pattern, or as a general fallback), automatically retry with the "cdi" mode ("--device nvidia.com/gpu=DEVICE") before giving up. The cdi path is verified working on the same host where the gpus path fails.</pre>

Closest existing bugs (searched, not duplicates)

<pre>6175942 / GitHub #3511 - same docker-gpu-patch.ts file, different failure point (/proc/comm write) - already Fixed
6175658 / GitHub #3506 - "nvidia-ctk command not found" preflight messaging - different scenario (toolkit missing entirely)
6154930 / GitHub #3174 - preflight nvidia-smi reporting when toolkit missing</pre>

## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Keyword | NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Install, NemoClaw_Onboard, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended, NemoClaw-SWQA-Sprint4-Blocker |

---
[NVB#6180013]

[Ubuntu 24.04][Onboard] sandbox creation aborts on Docker 29 hosts without nvidia container runtime #3575
