Skip to content

fix(onboard): scan default CDI dirs for NVIDIA specs (#3575)#3675

Merged
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/docker-gpu-cdi-fallback-3575
May 25, 2026
Merged

fix(onboard): scan default CDI dirs for NVIDIA specs (#3575)#3675
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/docker-gpu-cdi-fallback-3575

Conversation

@latenighthackathon

@latenighthackathon latenighthackathon commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #3575. Extends dockerReportsNvidiaCdiDevices to scan Docker's well-known default CDI directories in addition to whatever docker info --format '{{json .CDISpecDirs}}' returns, so cdi mode stays in the GPU candidate list on Docker 29 hosts that have nvidia-container-toolkit installed but no /etc/docker/daemon.json.

Problem

On those hosts, docker info returns an empty CDISpecDirs even though /etc/cdi/nvidia.yaml is present and Docker reads it at runtime. The previous detector only scanned the dirs docker info reported, so cdiAvailable came back false, cdi was dropped from the candidate list, and selectDockerGpuPatchMode never tried --device nvidia.com/gpu=DEVICE. --gpus all then tripped Docker 29's "AMD CDI spec not found" fallback bug and onboard aborted at step 6/8 with no retry, even though the reporter verified that docker run --device nvidia.com/gpu=0 hello-world worked on the same host.

This matches the reporter's Option 2 ("docker-gpu-patch defensive") but does the work upfront in detection rather than after --gpus all fails. Net result: on affected hosts, the existing candidate iteration in selectDockerGpuPatchMode picks up cdi automatically after gpus/nvidia-runtime probes fail, no retry-after-failure plumbing needed.

Out of scope

The reporter's Option 1 (preflight prompt to run nvidia-ctk runtime configure) is deferred. It is a UX/wizard change worth its own PR if Option 2 does not fully cover the ergonomics.

Test plan

  • 4 new vitest cases in src/lib/onboard/docker-gpu-patch.test.ts covering:
    • empty CDISpecDirs + nvidia.yaml in /etc/cdi resolves cdiAvailable=true
    • default dirs with no NVIDIA specs resolves cdiAvailable=false
    • docker info throws -> still falls back to defaults
    • reported dir matching a default -> scanned once, not twice
  • npx vitest run src/lib/onboard/ - 410/410 pass (38 files)
  • prek run clean on changed files

Signed-off-by: latenighthackathon latenighthackathon@users.noreply.github.com

Summary by CodeRabbit

  • Tests

    • Added comprehensive test coverage for NVIDIA GPU device detection across multiple scenarios.
  • Improvements

    • Improved robustness when Docker-provided info is unavailable by falling back to standard CDI locations.
    • Optimized scanning to avoid redundant directory checks, reducing unnecessary work and false negatives.

Review Change Stack

@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4e10857a-53cc-4cbf-a491-386bc78f8992

📥 Commits

Reviewing files that changed from the base of the PR and between 19e81ce and fc3e9b1.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts

📝 Walkthrough

Walkthrough

Adds injected filesystem helpers, default CDI directories, and heuristics to detect NVIDIA CDI specs; dockerReportsNvidiaCdiDevices now falls back to default CDI dirs when docker info fails. Four Vitest cases cover detection, no-match, docker-info exception fallback, and duplicate-scan avoidance.

Changes

NVIDIA CDI Detection for Docker 29.x Fallback

Layer / File(s) Summary
CDI detection interfaces and helpers
src/lib/onboard/docker-gpu-patch.ts
DockerGpuPatchDeps gains optional readDir and readFile. Adds exported DEFAULT_DOCKER_CDI_SPEC_DIRS (["/etc/cdi", "/var/run/cdi"]). Implements helpers to resolve/deduplicate scan dirs, list directory entries via injected fs, read file content, and identify NVIDIA CDI spec files by extension/content heuristics.
NVIDIA CDI fallback detection
src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-patch.test.ts
dockerReportsNvidiaCdiDevices now falls back to default CDI dirs when docker info fails, uses deps.readDir/deps.readFile if provided, deduplicates directories, and applies the NVIDIA CDI predicate. Four Vitest tests validate default scanning, absence of NVIDIA specs, docker-info exception fallback, and avoiding duplicate rescans.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Docker, fix, v0.0.44

Suggested reviewers

  • cjagwani
  • cv
  • zyang-dev

Poem

🐰 I hop through /etc and /var/run with glee,
I read each yaml for NVIDIA's key,
When docker hides dirs or throws a fit,
I fallback and sniff — I never quit.
Sandbox ready, GPUs hum for thee.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(onboard): scan default CDI dirs for NVIDIA specs (#3575)' clearly and concisely describes the main change: scanning default CDI directories for NVIDIA specs to address Docker 29 compatibility issues.
Linked Issues check ✅ Passed The PR directly addresses issue #3575 by implementing defensive fallback logic to scan default CDI directories when docker info fails or returns empty CDISpecDirs, enabling GPU detection on Docker 29 hosts without registered nvidia runtime.
Out of Scope Changes check ✅ Passed All changes are scoped to the docker-gpu-patch module and directly support the CDI spec detection objective. The addition of readDir/readFile dependencies and DEFAULT_DOCKER_CDI_SPEC_DIRS constant are necessary enablers for testability and the main functionality.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@latenighthackathon latenighthackathon force-pushed the fix/docker-gpu-cdi-fallback-3575 branch from 19e81ce to fc3e9b1 Compare May 18, 2026 03:54
@wscurran wscurran added Docker platform: ubuntu Affects Ubuntu Linux environments labels May 18, 2026
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this detailed PR to fix the onboard issue with scanning default CDI dirs for NVIDIA specs. This change aims to improve the reliability of the onboard process by extending the dockerReportsNvidiaCdiDevices function to scan Docker's well-known default CDI directories.


Related open issues:

On Docker 29 hosts with `nvidia-container-toolkit` installed but no
`/etc/docker/daemon.json`, `docker info` returns an empty CDISpecDirs
list. The previous detection in `dockerReportsNvidiaCdiDevices` only
scanned what docker info reported, so the `cdi` GPU mode was dropped
from the candidate list. `--gpus all` then tripped Docker 29's "AMD CDI
spec not found" fallback bug and onboard aborted at step 6/8 with no
retry, even though `docker run --device nvidia.com/gpu=0` worked on
the same host.

Extend the detector to also scan Docker's well-known default CDI
directories (`/etc/cdi`, `/var/run/cdi`) in addition to whatever
docker info reports. The defaults are deduplicated against the
reported set so a host that surfaces `/etc/cdi` explicitly is not
scanned twice. Adds optional `readDir` and `readFile` deps so the new
behavior is fully unit-testable.

Tests:
  - Empty CDISpecDirs + nvidia.yaml in /etc/cdi -> detected.
  - Default dirs with no NVIDIA specs -> not detected.
  - docker info throws -> still falls back to defaults.
  - Reported dir matching a default -> scanned once, not twice.

Closes NVIDIA#3575

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
@latenighthackathon latenighthackathon force-pushed the fix/docker-gpu-cdi-fallback-3575 branch from 4a3c6b6 to c87c20c Compare May 21, 2026 04:43
@cv cv added the v0.0.51 Release target label May 25, 2026
@cv cv enabled auto-merge (squash) May 25, 2026 18:24
@cv cv merged commit 94fefa6 into NVIDIA:main May 25, 2026
17 checks passed
cv pushed a commit that referenced this pull request May 26, 2026
## Summary

- keep `~/.nemoclaw` directory hardening for user-owned config dirs
- skip directory `chmod` when the config dir is owned by another UID,
which preserves OpenClaw's root-owned `/sandbox/.nemoclaw` layout
- keep `config.json` restricted to `0600`
- add shell-level regression coverage for owned and non-owned config
directories

## Root Cause

PR #4054 changed the step-7 sandbox config sync script from writing
`~/.nemoclaw/config.json` to unconditionally running `chmod 700
~/.nemoclaw` first. In the OpenClaw sandbox image, `/sandbox/.nemoclaw`
is intentionally root-owned with mode `1755`, while
`/sandbox/.nemoclaw/config.json` is sandbox-owned and writable. The
directory chmod fails under `set -euo pipefail`, so `openshell sandbox
connect <sandbox>` exits 1 during:

`[7/8] Setting up OpenClaw inside sandbox`

This restores compatibility with that image contract without reverting
the file-permission hardening from #4054.

## Evidence

- Last known good public install: run `26413471038`, installed ref
`8ed9a2c04c80627deed519a7c5ffe8f95ebb5ddd`, `cloud-onboard-e2e` reached
`✓ OpenClaw gateway launched inside sandbox`.
- First observed failing public install: run `26414889786`, installed
ref `94fefa6fd08e33d48eede651e33770a0712da854`, `cloud-onboard-e2e`
failed at `openshell sandbox connect e2e-cloud-onboard`.
- The commit window contains #4054, #3980, and #3675; #4054 is the only
PR that changed the config-sync script executed at the failing boundary.

## Validation

- `npm ci --include=dev --ignore-scripts`
- `npx vitest run src/lib/onboard/config-sync.test.ts
--reporter=verbose`
- `npx @biomejs/biome check src/lib/onboard/config-sync.ts
src/lib/onboard/config-sync.test.ts`
- `git diff --check`
- `npm run build:cli`
- `npm run typecheck:cli`


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Improved NemoClaw configuration directory permission handling to be
more robust in different user ownership scenarios.
* Enhanced sandbox configuration script to conditionally apply
permissions based on directory ownership.

* **Tests**
* Added comprehensive integration tests for configuration directory
permission management and ownership verification.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4199?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added area: packaging Packages, images, registries, installers, or distribution bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images and removed Docker labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: packaging Packages, images, registries, installers, or distribution bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images platform: ubuntu Affects Ubuntu Linux environments v0.0.51 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Onboard] sandbox creation aborts on Docker 29 hosts without nvidia container runtime

3 participants