fix(onboard): scan default CDI dirs for NVIDIA specs (#3575) by latenighthackathon · Pull Request #3675 · NVIDIA/NemoClaw

latenighthackathon · 2026-05-18T03:47:22Z

Summary

Closes #3575. Extends dockerReportsNvidiaCdiDevices to scan Docker's well-known default CDI directories in addition to whatever docker info --format '{{json .CDISpecDirs}}' returns, so cdi mode stays in the GPU candidate list on Docker 29 hosts that have nvidia-container-toolkit installed but no /etc/docker/daemon.json.

Problem

On those hosts, docker info returns an empty CDISpecDirs even though /etc/cdi/nvidia.yaml is present and Docker reads it at runtime. The previous detector only scanned the dirs docker info reported, so cdiAvailable came back false, cdi was dropped from the candidate list, and selectDockerGpuPatchMode never tried --device nvidia.com/gpu=DEVICE. --gpus all then tripped Docker 29's "AMD CDI spec not found" fallback bug and onboard aborted at step 6/8 with no retry, even though the reporter verified that docker run --device nvidia.com/gpu=0 hello-world worked on the same host.

This matches the reporter's Option 2 ("docker-gpu-patch defensive") but does the work upfront in detection rather than after --gpus all fails. Net result: on affected hosts, the existing candidate iteration in selectDockerGpuPatchMode picks up cdi automatically after gpus/nvidia-runtime probes fail, no retry-after-failure plumbing needed.

Out of scope

The reporter's Option 1 (preflight prompt to run nvidia-ctk runtime configure) is deferred. It is a UX/wizard change worth its own PR if Option 2 does not fully cover the ergonomics.

Test plan

4 new vitest cases in src/lib/onboard/docker-gpu-patch.test.ts covering:
- empty CDISpecDirs + nvidia.yaml in /etc/cdi resolves cdiAvailable=true
- default dirs with no NVIDIA specs resolves cdiAvailable=false
- docker info throws -> still falls back to defaults
- reported dir matching a default -> scanned once, not twice
npx vitest run src/lib/onboard/ - 410/410 pass (38 files)
prek run clean on changed files

Signed-off-by: latenighthackathon latenighthackathon@users.noreply.github.com

Summary by CodeRabbit

Tests
- Added comprehensive test coverage for NVIDIA GPU device detection across multiple scenarios.
Improvements
- Improved robustness when Docker-provided info is unavailable by falling back to standard CDI locations.
- Optimized scanning to avoid redundant directory checks, reducing unnecessary work and false negatives.

copy-pr-bot · 2026-05-18T03:47:26Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-18T03:47:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4e10857a-53cc-4cbf-a491-386bc78f8992

📥 Commits

Reviewing files that changed from the base of the PR and between 19e81ce and fc3e9b1.

📒 Files selected for processing (2)

src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts

🚧 Files skipped from review as they are similar to previous changes (2)

src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts

📝 Walkthrough

Walkthrough

Adds injected filesystem helpers, default CDI directories, and heuristics to detect NVIDIA CDI specs; dockerReportsNvidiaCdiDevices now falls back to default CDI dirs when docker info fails. Four Vitest cases cover detection, no-match, docker-info exception fallback, and duplicate-scan avoidance.

Changes

NVIDIA CDI Detection for Docker 29.x Fallback

Layer / File(s)	Summary
CDI detection interfaces and helpers `src/lib/onboard/docker-gpu-patch.ts`	`DockerGpuPatchDeps` gains optional `readDir` and `readFile`. Adds exported `DEFAULT_DOCKER_CDI_SPEC_DIRS` (`["/etc/cdi", "/var/run/cdi"]`). Implements helpers to resolve/deduplicate scan dirs, list directory entries via injected fs, read file content, and identify NVIDIA CDI spec files by extension/content heuristics.
NVIDIA CDI fallback detection `src/lib/onboard/docker-gpu-patch.ts`, `src/lib/onboard/docker-gpu-patch.test.ts`	`dockerReportsNvidiaCdiDevices` now falls back to default CDI dirs when `docker info` fails, uses `deps.readDir`/`deps.readFile` if provided, deduplicates directories, and applies the NVIDIA CDI predicate. Four Vitest tests validate default scanning, absence of NVIDIA specs, docker-info exception fallback, and avoiding duplicate rescans.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Docker, fix, v0.0.44

Suggested reviewers

cjagwani
cv
zyang-dev

Poem

🐰 I hop through /etc and /var/run with glee,
I read each yaml for NVIDIA's key,
When docker hides dirs or throws a fit,
I fallback and sniff — I never quit.
Sandbox ready, GPUs hum for thee.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(onboard): scan default CDI dirs for NVIDIA specs (`#3575`)' clearly and concisely describes the main change: scanning default CDI directories for NVIDIA specs to address Docker 29 compatibility issues.
Linked Issues check	✅ Passed	The PR directly addresses issue `#3575` by implementing defensive fallback logic to scan default CDI directories when docker info fails or returns empty CDISpecDirs, enabling GPU detection on Docker 29 hosts without registered nvidia runtime.
Out of Scope Changes check	✅ Passed	All changes are scoped to the docker-gpu-patch module and directly support the CDI spec detection objective. The addition of readDir/readFile dependencies and DEFAULT_DOCKER_CDI_SPEC_DIRS constant are necessary enablers for testability and the main functionality.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

wscurran · 2026-05-18T14:54:00Z

✨ Thanks for submitting this detailed PR to fix the onboard issue with scanning default CDI dirs for NVIDIA specs. This change aims to improve the reliability of the onboard process by extending the dockerReportsNvidiaCdiDevices function to scan Docker's well-known default CDI directories.

Related open issues:

#3575 [Ubuntu 24.04][Onboard] sandbox creation aborts on Docker 29 hosts without nvidia container runtime

On Docker 29 hosts with `nvidia-container-toolkit` installed but no `/etc/docker/daemon.json`, `docker info` returns an empty CDISpecDirs list. The previous detection in `dockerReportsNvidiaCdiDevices` only scanned what docker info reported, so the `cdi` GPU mode was dropped from the candidate list. `--gpus all` then tripped Docker 29's "AMD CDI spec not found" fallback bug and onboard aborted at step 6/8 with no retry, even though `docker run --device nvidia.com/gpu=0` worked on the same host. Extend the detector to also scan Docker's well-known default CDI directories (`/etc/cdi`, `/var/run/cdi`) in addition to whatever docker info reports. The defaults are deduplicated against the reported set so a host that surfaces `/etc/cdi` explicitly is not scanned twice. Adds optional `readDir` and `readFile` deps so the new behavior is fully unit-testable. Tests: - Empty CDISpecDirs + nvidia.yaml in /etc/cdi -> detected. - Default dirs with no NVIDIA specs -> not detected. - docker info throws -> still falls back to defaults. - Reported dir matching a default -> scanned once, not twice. Closes NVIDIA#3575 Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

## Summary - keep `~/.nemoclaw` directory hardening for user-owned config dirs - skip directory `chmod` when the config dir is owned by another UID, which preserves OpenClaw's root-owned `/sandbox/.nemoclaw` layout - keep `config.json` restricted to `0600` - add shell-level regression coverage for owned and non-owned config directories ## Root Cause PR #4054 changed the step-7 sandbox config sync script from writing `~/.nemoclaw/config.json` to unconditionally running `chmod 700 ~/.nemoclaw` first. In the OpenClaw sandbox image, `/sandbox/.nemoclaw` is intentionally root-owned with mode `1755`, while `/sandbox/.nemoclaw/config.json` is sandbox-owned and writable. The directory chmod fails under `set -euo pipefail`, so `openshell sandbox connect <sandbox>` exits 1 during: `[7/8] Setting up OpenClaw inside sandbox` This restores compatibility with that image contract without reverting the file-permission hardening from #4054. ## Evidence - Last known good public install: run `26413471038`, installed ref `8ed9a2c04c80627deed519a7c5ffe8f95ebb5ddd`, `cloud-onboard-e2e` reached `✓ OpenClaw gateway launched inside sandbox`. - First observed failing public install: run `26414889786`, installed ref `94fefa6fd08e33d48eede651e33770a0712da854`, `cloud-onboard-e2e` failed at `openshell sandbox connect e2e-cloud-onboard`. - The commit window contains #4054, #3980, and #3675; #4054 is the only PR that changed the config-sync script executed at the failing boundary. ## Validation - `npm ci --include=dev --ignore-scripts` - `npx vitest run src/lib/onboard/config-sync.test.ts --reporter=verbose` - `npx @biomejs/biome check src/lib/onboard/config-sync.ts src/lib/onboard/config-sync.test.ts` - `git diff --check` - `npm run build:cli` - `npm run typecheck:cli`  ## Summary by CodeRabbit * **Refactor** * Improved NemoClaw configuration directory permission handling to be more robust in different user ownership scenarios. * Enhanced sandbox configuration script to conditionally apply permissions based on directory ownership. * **Tests** * Added comprehensive integration tests for configuration directory permission management and ownership verification.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4199?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

latenighthackathon force-pushed the fix/docker-gpu-cdi-fallback-3575 branch from 19e81ce to fc3e9b1 Compare May 18, 2026 03:54

wscurran added Docker platform: ubuntu Affects Ubuntu Linux environments labels May 18, 2026

github-actions Bot mentioned this pull request May 21, 2026

fix(hermes): restore Spark GPU recreate startup #3963

Merged

latenighthackathon force-pushed the fix/docker-gpu-cdi-fallback-3575 branch from 4a3c6b6 to c87c20c Compare May 21, 2026 04:43

github-actions Bot mentioned this pull request May 21, 2026

fix(onboard): use NVIDIA runtime for Jetson sandbox GPU #4008

Merged

cv added the v0.0.51 Release target label May 25, 2026

cv approved these changes May 25, 2026

View reviewed changes

Merge branch 'main' into fix/docker-gpu-cdi-fallback-3575

663b4e6

cv enabled auto-merge (squash) May 25, 2026 18:24

cv merged commit 94fefa6 into NVIDIA:main May 25, 2026
17 checks passed

ericksoa mentioned this pull request May 25, 2026

fix(onboard): preserve root-owned config sync dirs #4199

Merged

wscurran added area: packaging Packages, images, registries, installers, or distribution bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images and removed Docker labels Jun 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(onboard): scan default CDI dirs for NVIDIA specs (#3575)#3675

fix(onboard): scan default CDI dirs for NVIDIA specs (#3575)#3675
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/docker-gpu-cdi-fallback-3575

latenighthackathon commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

wscurran commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

latenighthackathon commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Out of scope

Test plan

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

wscurran commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

latenighthackathon commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading