fix(onboard): scan default CDI dirs for NVIDIA specs (#3575)#3675
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds injected filesystem helpers, default CDI directories, and heuristics to detect NVIDIA CDI specs; ChangesNVIDIA CDI Detection for Docker 29.x Fallback
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
19e81ce to
fc3e9b1
Compare
|
✨ Thanks for submitting this detailed PR to fix the onboard issue with scanning default CDI dirs for NVIDIA specs. This change aims to improve the reliability of the onboard process by extending the Related open issues: |
On Docker 29 hosts with `nvidia-container-toolkit` installed but no `/etc/docker/daemon.json`, `docker info` returns an empty CDISpecDirs list. The previous detection in `dockerReportsNvidiaCdiDevices` only scanned what docker info reported, so the `cdi` GPU mode was dropped from the candidate list. `--gpus all` then tripped Docker 29's "AMD CDI spec not found" fallback bug and onboard aborted at step 6/8 with no retry, even though `docker run --device nvidia.com/gpu=0` worked on the same host. Extend the detector to also scan Docker's well-known default CDI directories (`/etc/cdi`, `/var/run/cdi`) in addition to whatever docker info reports. The defaults are deduplicated against the reported set so a host that surfaces `/etc/cdi` explicitly is not scanned twice. Adds optional `readDir` and `readFile` deps so the new behavior is fully unit-testable. Tests: - Empty CDISpecDirs + nvidia.yaml in /etc/cdi -> detected. - Default dirs with no NVIDIA specs -> not detected. - docker info throws -> still falls back to defaults. - Reported dir matching a default -> scanned once, not twice. Closes NVIDIA#3575 Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
4a3c6b6 to
c87c20c
Compare
## Summary - keep `~/.nemoclaw` directory hardening for user-owned config dirs - skip directory `chmod` when the config dir is owned by another UID, which preserves OpenClaw's root-owned `/sandbox/.nemoclaw` layout - keep `config.json` restricted to `0600` - add shell-level regression coverage for owned and non-owned config directories ## Root Cause PR #4054 changed the step-7 sandbox config sync script from writing `~/.nemoclaw/config.json` to unconditionally running `chmod 700 ~/.nemoclaw` first. In the OpenClaw sandbox image, `/sandbox/.nemoclaw` is intentionally root-owned with mode `1755`, while `/sandbox/.nemoclaw/config.json` is sandbox-owned and writable. The directory chmod fails under `set -euo pipefail`, so `openshell sandbox connect <sandbox>` exits 1 during: `[7/8] Setting up OpenClaw inside sandbox` This restores compatibility with that image contract without reverting the file-permission hardening from #4054. ## Evidence - Last known good public install: run `26413471038`, installed ref `8ed9a2c04c80627deed519a7c5ffe8f95ebb5ddd`, `cloud-onboard-e2e` reached `✓ OpenClaw gateway launched inside sandbox`. - First observed failing public install: run `26414889786`, installed ref `94fefa6fd08e33d48eede651e33770a0712da854`, `cloud-onboard-e2e` failed at `openshell sandbox connect e2e-cloud-onboard`. - The commit window contains #4054, #3980, and #3675; #4054 is the only PR that changed the config-sync script executed at the failing boundary. ## Validation - `npm ci --include=dev --ignore-scripts` - `npx vitest run src/lib/onboard/config-sync.test.ts --reporter=verbose` - `npx @biomejs/biome check src/lib/onboard/config-sync.ts src/lib/onboard/config-sync.test.ts` - `git diff --check` - `npm run build:cli` - `npm run typecheck:cli` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Improved NemoClaw configuration directory permission handling to be more robust in different user ownership scenarios. * Enhanced sandbox configuration script to conditionally apply permissions based on directory ownership. * **Tests** * Added comprehensive integration tests for configuration directory permission management and ownership verification. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4199?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Closes #3575. Extends
dockerReportsNvidiaCdiDevicesto scan Docker's well-known default CDI directories in addition to whateverdocker info --format '{{json .CDISpecDirs}}'returns, socdimode stays in the GPU candidate list on Docker 29 hosts that havenvidia-container-toolkitinstalled but no/etc/docker/daemon.json.Problem
On those hosts,
docker inforeturns an emptyCDISpecDirseven though/etc/cdi/nvidia.yamlis present and Docker reads it at runtime. The previous detector only scanned the dirs docker info reported, socdiAvailablecame backfalse,cdiwas dropped from the candidate list, andselectDockerGpuPatchModenever tried--device nvidia.com/gpu=DEVICE.--gpus allthen tripped Docker 29's "AMD CDI spec not found" fallback bug and onboard aborted at step 6/8 with no retry, even though the reporter verified thatdocker run --device nvidia.com/gpu=0 hello-worldworked on the same host.This matches the reporter's Option 2 ("docker-gpu-patch defensive") but does the work upfront in detection rather than after
--gpus allfails. Net result: on affected hosts, the existing candidate iteration inselectDockerGpuPatchModepicks upcdiautomatically aftergpus/nvidia-runtimeprobes fail, no retry-after-failure plumbing needed.Out of scope
The reporter's Option 1 (preflight prompt to run
nvidia-ctk runtime configure) is deferred. It is a UX/wizard change worth its own PR if Option 2 does not fully cover the ergonomics.Test plan
CDISpecDirs+nvidia.yamlin/etc/cdiresolvescdiAvailable=truecdiAvailable=falsedocker infothrows -> still falls back to defaultsnpx vitest run src/lib/onboard/- 410/410 pass (38 files)prek runclean on changed filesSigned-off-by: latenighthackathon latenighthackathon@users.noreply.github.com
Summary by CodeRabbit
Tests
Improvements