fix(onboard): grant Jetson Tegra device-node group so sandbox CUDA can init (#4231) #5018
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Detect Tegra device-node owning GIDs and propagate them into recreated GPU sandboxes via
Changes
Jetson CUDA Group Permission Fix
Sequence Diagram
sequenceDiagram
participant Test as E2E Script
participant Docker as Docker/NVIDIA
participant Installer as install.sh
participant Sandbox as Sandbox Container
participant CudaRuntime as CUDA Runtime
participant NemoClaw as nemoclaw status
Test->>Docker: Verify NVIDIA runtime configured
Test->>Installer: Run install.sh --non-interactive
Note over Installer: Installer logs expected --group-add for nvmap owner gid
Installer->>Sandbox: Create container with supplementary group
Test->>Sandbox: id -> verify user in nvmap owner group
Test->>Sandbox: ls -la /dev/nvmap -> confirm accessible
Test->>Sandbox: cuInit(0) probe via libcuda.so.1
Sandbox->>CudaRuntime: cuInit(0) call
CudaRuntime-->>Sandbox: Return 0 (success)
Test->>NemoClaw: Query nemoclaw <sandbox> status
NemoClaw-->>Test: "Sandbox GPU: enabled" + "CUDA verified"
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
…n init (NVIDIA#4231)

On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with `NvRmMemInitNvmap ... Permission denied` / `cuInit(0)=999` because the unprivileged sandbox user was not a member of the host group (`video`) that owns `/dev/nvmap` (`crw-rw---- root video`). PR NVIDIA#4599 improved the status/proof semantics but did not propagate that group, so QA reopened: CUDA stayed unusable and status still read "enabled".

The Jetson Docker GPU recreate now detects the host group(s) owning the Tegra device nodes (`/dev/nvmap`, `/dev/nvhost-*`, `/dev/nvgpu/*`) and grants the sandbox user matching `--group-add <gid>` membership, so CUDA's nvmap init can open them. The existing post-recreate `cuInit(0)` proof then passes and `nemoclaw status` reports `(CUDA verified)`; if the group cannot be resolved, onboard warns and the proof still gates success, so status falls back to the honest `(last CUDA proof failed)` with `/dev/nvmap` remediation instead of a misleading "enabled". This automates the remediation the existing `jetsonGpuProofRemediationLines()` already documents.

- detectTegraDeviceGroupGids(): stat the Tegra device nodes, return owning numeric GIDs (skip missing and root-owned); numeric GIDs work even when the sandbox image has no matching video/render group entry.
- recreateOpenShellDockerSandboxWithGpu(): for the jetson backend, thread the detected GIDs into DockerGpuCloneRunOptions.extraGroupGids; buildDockerGpuCloneRunArgs emits --group-add (deduped vs baseline GroupAdd).
- applyDockerGpuPatchOrExit(): thread `backend` explicitly so the fallback create path also grants the group.
- Regression tests for GID detection, --group-add emission/dedupe, and the Jetson-vs-generic recreate plumbing.
- Reporter-workflow E2E (test/e2e/test-jetson-nvmap-gpu.sh): onboard with GPU, inspect sandbox groups + /dev/nvmap, run cuInit(0) in-sandbox, assert status reports (CUDA verified). Wired as gpu-jetson-nvmap-e2e (Jetson-gated) and inventoried; skips cleanly on non-Jetson hosts.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
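The GID detection described in the commit message above could look roughly like the sketch below. It is not the PR's code: the function name `collectDeviceGroupGids`, the injected `StatGid` callback, the example device paths, and GID `44` are all assumptions made for illustration. In a real implementation the callback would presumably wrap something like Node's `fs.statSync(path).gid` and return `undefined` when the node does not exist.

```typescript
// Hypothetical sketch, NOT the PR's detectTegraDeviceGroupGids() implementation.
// The stat lookup is injected so the logic can be exercised without a Jetson host.

/** Returns the owning GID of a path, or undefined if the path is missing. */
type StatGid = (path: string) => number | undefined;

/**
 * Collect the distinct non-root GIDs that own the given device nodes.
 * Missing nodes and root-owned (gid 0) nodes are skipped.
 */
function collectDeviceGroupGids(paths: string[], statGid: StatGid): number[] {
  const gids = new Set<number>();
  for (const path of paths) {
    const gid = statGid(path);
    if (gid === undefined) continue; // node not present on this host
    if (gid === 0) continue; // root-owned: no supplementary group to grant
    gids.add(gid);
  }
  return Array.from(gids).sort((a, b) => a - b);
}

// Fake host layout; 44 is an assumed GID for the `video` group.
const fakeHost: Record<string, number> = {
  "/dev/nvmap": 44,
  "/dev/nvhost-ctrl": 44,
  "/dev/nvgpu/igpu0/ctrl": 0,
};

const detected = collectDeviceGroupGids(
  ["/dev/nvmap", "/dev/nvhost-ctrl", "/dev/nvgpu/igpu0/ctrl", "/dev/nvhost-missing"],
  (path) => fakeHost[path],
);
console.log(detected); // [ 44 ]
```

Returning numeric GIDs rather than group names matches the commit's stated reason: the sandbox image may have no `video`/`render` entry to resolve a name against.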
eff17d2 to 55c4c6d
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard/docker-gpu-patch.test.ts (1)
773-779: 💤 Low value
Consider moving the Jetson describe block to a sibling scope.

The Jetson `/dev/nvmap` group propagation tests are nested inside `describe("docker-gpu-patch sandbox DNS fallback (#3579)")`, but Jetson GID handling is semantically unrelated to DNS fallback. Moving this to a sibling describe block would improve test organization clarity.

    - // Jetson `/dev/nvmap` group-permission propagation (`#4231`). ...
    - describe("Jetson /dev/nvmap group propagation (`#4231`)", () => {
    +});
    +
    +// Jetson `/dev/nvmap` group-permission propagation (`#4231`). ...
    +describe("docker-gpu-patch Jetson /dev/nvmap group propagation (`#4231`)", () => {

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/docker-gpu-patch.test.ts` around lines 773 - 779, The Jetson `/dev/nvmap` tests (describe("Jetson /dev/nvmap group propagation (`#4231`)")) are nested inside the unrelated describe("docker-gpu-patch sandbox DNS fallback (`#3579`)") block; extract the entire Jetson describe block and move it out so it becomes a sibling describe at the same top-level scope as the DNS fallback describe, preserving all its tests, hooks and imports, and update any relative references if needed so the tests run independently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 2122-2129: The .coderabbit.yaml is missing a path_instructions
entry for the gpu-jetson-nvmap-e2e job so changes to
test/e2e/test-jetson-nvmap-gpu.sh won't trigger this job; add a
path_instructions mapping that references the job id "gpu-jetson-nvmap-e2e" and
includes the path "test/e2e/test-jetson-nvmap-gpu.sh" (and any related
directories or patterns) so selective path-based runs will correctly pick up
changes to that script.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5a8976ea-d489-44b0-957f-48a08692708e
📒 Files selected for processing (5)
- .github/workflows/nightly-e2e.yaml
- src/lib/onboard/docker-gpu-patch.test.ts
- src/lib/onboard/docker-gpu-patch.ts
- test/e2e-scenario/migration/legacy-inventory.json
- test/e2e/test-jetson-nvmap-gpu.sh
- .coderabbit.yaml: add path_instructions mapping test/e2e/test-jetson-nvmap-gpu.sh and src/lib/onboard/docker-gpu-patch.ts to the gpu-jetson-nvmap-e2e job, per the documented convention for new nightly E2E jobs.
- docker-gpu-patch.test.ts: move the Jetson /dev/nvmap describe block out of the unrelated NVIDIA#3579 DNS-fallback describe into a top-level sibling scope.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
## Summary

- Add v0.0.62 release notes from Discussion #5100 and link release highlights to the relevant docs pages.
- Document the release's GPU sandbox recreation, sandbox-side local inference verification, and Hermes dashboard port guard in the command and inference references.
- Refresh generated NemoClaw user skills for the release-prep docs set.

## Source Summary

- #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU recreation behavior for Linux Docker-driver sandboxes.
- #5024 -> `docs/inference/use-local-inference.mdx`: Document sandbox-runtime verification of the `inference.local` local inference route.
- #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra device-node group propagation for sandbox CUDA initialization.
- #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`: Summarize onboarding and recovery reliability fixes, including the reserved Hermes API port guard.
- #5017 and #5043 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Summarize mutable OpenClaw config recovery and host-side `agents list` coverage.
- #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes upstream metadata visibility and WhatsApp QR rendering reliability.
- #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`: Refresh generated user-skill references from the current docs source.

## Skipped

- #5019 -> skipped for new prose because it touched `openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`. Existing source docs remain the source for generated skill synchronization.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning)
- Pre-commit hooks passed during commit, including docs-to-skills verification, markdown lint, gitleaks, and skills YAML tests.

## Summary by CodeRabbit

* **New Features**
  * Added `nemoclaw <name> agents list` command.
  * v0.0.62 release notes added summarizing onboarding and recovery improvements.
* **Bug Fixes**
  * Improved GPU sandbox onboarding reliability (NVIDIA CDI path, Jetson/Tegra device handling).
  * Better local inference verification and recovery for Linux Docker-driver GPU sandboxes.
  * Quieter/earlier handling of onboarding drift and port collisions.
* **Documentation**
  * Expanded GPU passthrough, inference verification, writable paths (`/dev/pts`), port 8642 restriction, and command examples.

---------

Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
Summary
On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with `NvRmMemInitNvmap ... Permission denied` / `cuInit(0)=999` because the unprivileged sandbox user was not a member of the host group (`video`) that owns `/dev/nvmap` (`crw-rw---- root video`). This grants the sandbox user that group on the Jetson Docker GPU recreate so CUDA actually initializes, and the existing post-recreate `cuInit(0)` proof makes `nemoclaw status` report proven CUDA usability instead of a misleading bare "enabled".
Related Issue
Fixes #4231
PR #4599 improved status/proof semantics but did not propagate Jetson `/dev/nvmap` group access, so QA reopened: CUDA stayed unusable inside the sandbox. This PR fixes the device-permission root cause.
Changes
- `docker-gpu-patch.ts`: grant the Tegra device-node group on Jetson recreate (the fix). New `detectTegraDeviceGroupGids()` stats the Jetson Tegra device nodes (`/dev/nvmap`, `/dev/nvhost-*`, `/dev/nvgpu/*`) on the host and returns the owning numeric GID(s) (skipping missing and root-owned nodes). `recreateOpenShellDockerSandboxWithGpu` passes those through `DockerGpuCloneRunOptions.extraGroupGids` into `buildDockerGpuCloneRunArgs`, which emits `--group-add <gid>` (deduped against any baseline `GroupAdd`). Numeric GIDs are used on purpose: the sandbox image need not define a matching `video`/`render` group. This only runs for the `jetson` backend; `backend` is now threaded explicitly through `applyDockerGpuPatchOrExit` so the fallback create path is covered too. This automates the exact remediation the existing `jetsonGpuProofRemediationLines()` already documents.
- `cuInit(0)` proof from #4599 (fix(inference): prove WSL Docker Desktop GPUs and report sandbox CUDA proof state) now passes once the device group is granted, so `nemoclaw status` shows `(CUDA verified)`. If the group cannot be resolved, onboard warns and the proof still gates success, so status falls back to the honest `(last CUDA proof failed: …)` with `/dev/nvmap` remediation rather than a misleading "enabled".
- `docker-gpu-patch.test.ts`: GID detection (dedupe, skip missing/root), `--group-add` emission + dedupe, and end-to-end plumbing through the Jetson recreate; plus a guard that the generic backend never adds Tegra groups.
- `test/e2e/test-jetson-nvmap-gpu.sh`, `gpu-jetson-nvmap-e2e` in `nightly-e2e.yaml`: runs the reporter's exact Jetson steps and is inventoried in `legacy-inventory.json` + `.coderabbit.yaml`.

Type of Change
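The `--group-add` emission and dedupe described under Changes can be sketched as below. This is an illustration under stated assumptions, not the PR's `buildDockerGpuCloneRunArgs`: the helper name `buildGroupAddArgs`, the `GpuBackend` type, and the argument shapes are hypothetical. The only external fact relied on is that `docker run` accepts `--group-add <group>` and that Docker reports a container's existing `GroupAdd` entries as strings.

```typescript
// Hypothetical sketch, NOT the PR's buildDockerGpuCloneRunArgs implementation.

type GpuBackend = "jetson" | "generic";

/**
 * Build `--group-add` arguments for a recreated sandbox container.
 * `baselineGroupAdd` holds groups the original container already had;
 * `extraGroupGids` are host GIDs to grant, applied only on the jetson backend.
 */
function buildGroupAddArgs(
  backend: GpuBackend,
  baselineGroupAdd: string[],
  extraGroupGids: number[],
): string[] {
  const seen = new Set<string>();
  const args: string[] = [];
  const add = (group: string): void => {
    if (seen.has(group)) return; // dedupe against anything already emitted
    seen.add(group);
    args.push("--group-add", group);
  };
  baselineGroupAdd.forEach((group) => add(group));
  if (backend === "jetson") {
    extraGroupGids.forEach((gid) => add(String(gid)));
  }
  return args;
}

// GID 44 is already in the baseline, so only 104 is newly added.
const jetsonArgs = buildGroupAddArgs("jetson", ["44"], [44, 104]);
console.log(jetsonArgs.join(" ")); // --group-add 44 --group-add 104

// The generic backend never receives Tegra groups.
const genericArgs = buildGroupAddArgs("generic", [], [44]);
console.log(genericArgs.length); // 0
```

Keeping the backend check inside the builder mirrors the guard the unit tests describe: a non-Jetson recreate should produce no extra group arguments even if GIDs are passed in.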
Verification
- `npm test` (CLI project) passes: full `vitest --project cli` green on this PR head after rebase (the only 2 reds are the pre-existing `snapshot-shields` / `e2e-fixture-context` flakes, confirmed failing on base with my changes stashed).
- `npm run typecheck:cli` passes.
- `codex review --uncommitted` clean (two flagged CI-integration gaps fixed: aggregate `needs` lists + migration inventory).

Reporter-workflow E2E evidence
This is verified at two levels that together cover the exact reporter workflow:
- `src/lib/onboard/docker-gpu-patch.test.ts` (describe `Jetson /dev/nvmap group propagation (#4231)`) reproduces the precise reporter condition hermetically: a sandbox user lacking the `/dev/nvmap` owning group, and asserts the Jetson recreate now emits `--group-add <gid>` for the Tegra device-node group so the proof can pass. 56/56 pass on this PR head.
- `test/e2e/test-jetson-nvmap-gpu.sh`, wired as the `gpu-jetson-nvmap-e2e` job in `nightly-e2e.yaml`, performs the reporter's exact steps on a Jetson host: onboard with GPU, inspect the sandbox user's groups and `/dev/nvmap`, run the in-sandbox `cuInit(0)` CUDA proof, and assert `nemoclaw status` reports `(CUDA verified)` (a bare "enabled" fails the job). Trigger it on a Jetson runner with:
All required CI checks are green on this PR head (`cli-test-shards`, `build-typecheck`, `codebase-growth-guardrails`, `ShellCheck`, `dco-check`, `CodeRabbit`); see the PR Checks tab for the run ids and job logs.

Merge gate / remaining work
The live `gpu-jetson-nvmap-e2e` job is gated behind `vars.JETSON_E2E_ENABLED` and a Jetson/Tegra GPU runner label (`vars.JETSON_E2E_RUNNER_LABEL`). The project does not yet host an arm64/Jetson GPU runner, so a live green log on real Jetson hardware is pending that runner being provisioned: set the variable and label, then dispatch the job above. Issue #4231 stays assigned to @yimoj until that live log is captured.
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests
`/dev/nvmap` GPU validation.