fix(onboard): grant Jetson Tegra device-node group so sandbox CUDA can init (#4231) by yimoj · Pull Request #5018 · NVIDIA/NemoClaw

yimoj · 2026-06-09T04:46:58Z

Summary

On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with
NvRmMemInitNvmap ... Permission denied / cuInit(0)=999 because the
unprivileged sandbox user was not a member of the host group (video) that
owns /dev/nvmap (crw-rw---- root video). This grants the sandbox user that
group on the Jetson Docker GPU recreate so CUDA actually initializes, and the
existing post-recreate cuInit(0) proof makes nemoclaw status report proven
CUDA usability instead of a misleading bare "enabled".
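
The permission failure described above can be illustrated with a simplified model of the POSIX owner/group/other check. This is a sketch only: the type and function names are hypothetical, the UID and GID values (1000, 44) are illustrative, and the model deliberately ignores root's bypass, capabilities, and ACLs.

```typescript
// Hypothetical, simplified model of the classic Unix permission check.
// Not taken from the PR; names and numeric IDs are illustrative only.
interface DeviceNode { mode: number; uid: number; gid: number; }
interface Caller { uid: number; gids: number[]; }

function canOpenReadWrite(node: DeviceNode, caller: Caller): boolean {
  let bits: number;
  if (caller.uid === node.uid) {
    bits = (node.mode >> 6) & 0o7; // owner class
  } else if (caller.gids.indexOf(node.gid) !== -1) {
    bits = (node.mode >> 3) & 0o7; // group class
  } else {
    bits = node.mode & 0o7; // other class
  }
  return (bits & 0o6) === 0o6; // needs both read and write
}

// /dev/nvmap as described in the summary: crw-rw---- root video.
// GID 44 stands in for the host's video group (an assumption).
const nvmap: DeviceNode = { mode: 0o660, uid: 0, gid: 44 };

const withoutGroup = canOpenReadWrite(nvmap, { uid: 1000, gids: [1000] });
const withGroup = canOpenReadWrite(nvmap, { uid: 1000, gids: [1000, 44] });
console.log(withoutGroup, withGroup); // false true
```

In this model the sandbox user falls into the "other" class until the owning GID is added as a supplementary group, which is the effect of passing --group-add <gid> to the container.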

Related Issue

Fixes #4231

PR #4599 improved status/proof semantics but did not propagate Jetson
/dev/nvmap group access, so QA reopened: CUDA stayed unusable inside the
sandbox. This PR fixes the device-permission root cause.

Changes

- docker-gpu-patch.ts: grant the Tegra device-node group on Jetson
  recreate (the fix). The new detectTegraDeviceGroupGids() stats the Jetson
  Tegra device nodes (/dev/nvmap, /dev/nvhost-*, /dev/nvgpu/*) on the host
  and returns the owning numeric GID(s), skipping missing and root-owned
  nodes. recreateOpenShellDockerSandboxWithGpu passes those through
  DockerGpuCloneRunOptions.extraGroupGids into buildDockerGpuCloneRunArgs,
  which emits --group-add <gid> (deduped against any baseline GroupAdd).
  Numeric GIDs are used on purpose: the sandbox image need not define a
  matching video/render group. Detection only runs for the jetson backend;
  backend is now threaded explicitly through applyDockerGpuPatchOrExit so
  the fallback create path is covered too. This automates the exact remediation
  that the existing jetsonGpuProofRemediationLines() already documents.
- Status correctness: the existing post-recreate cuInit(0) proof from
  #4599 now passes once the device group is granted, so nemoclaw status shows
  (CUDA verified). If the group cannot be resolved, onboard warns and the
  proof still gates success, so status falls back to the honest
  (last CUDA proof failed: …) with /dev/nvmap remediation rather than a
  misleading "enabled".
- Regression tests (docker-gpu-patch.test.ts): GID detection (dedupe,
  skip missing/root), --group-add emission + dedupe, and end-to-end plumbing
  through the Jetson recreate; plus a guard that the generic backend never adds
  Tegra groups.
- Reporter-workflow E2E (test/e2e/test-jetson-nvmap-gpu.sh,
  gpu-jetson-nvmap-e2e in nightly-e2e.yaml): runs the reporter's exact
  Jetson steps and is inventoried in legacy-inventory.json + .coderabbit.yaml.
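
The GID detection step in the first item can be sketched as a pure function over an injected stat lookup. This is not the PR's implementation: collectTegraGroupGids, StatGid, and the fake host table are hypothetical names, the device paths and GID 44 are illustrative, and "root-owned" is assumed here to mean a node whose group is GID 0.

```typescript
// Hypothetical sketch of "stat the nodes, return owning GIDs, skip missing
// and root-owned". The stat lookup is injected so the logic stays testable.
type StatGid = (path: string) => number | null; // null when the node is missing

function collectTegraGroupGids(paths: string[], statGid: StatGid): number[] {
  const gids = new Set<number>();
  for (const path of paths) {
    const gid = statGid(path);
    if (gid === null) continue; // node does not exist on this host
    if (gid === 0) continue; // assumed meaning of "root-owned": group is root
    gids.add(gid); // the Set removes duplicates
  }
  return Array.from(gids).sort((a, b) => a - b);
}

// Fake host: two nodes owned by GID 44, one root-owned node, one missing node.
const fakeHost: Record<string, number> = {
  "/dev/nvmap": 44,
  "/dev/nvhost-ctrl": 44,
  "/dev/nvgpu/igpu0/ctrl": 0,
};
const lookup: StatGid = (path) => (path in fakeHost ? fakeHost[path] : null);

const detected = collectTegraGroupGids(
  ["/dev/nvmap", "/dev/nvhost-ctrl", "/dev/nvhost-gpu", "/dev/nvgpu/igpu0/ctrl"],
  lookup,
);
console.log(detected); // [ 44 ]
```

Returning numeric GIDs rather than group names matches the stated design choice: the sandbox image does not need a matching video/render entry for the supplementary group to take effect.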

Type of Change

Code change (feature, bug fix, or refactor)

Verification

- npm test (CLI project) passes: full vitest --project cli is green on
  this PR head after rebase (the only two red results are the pre-existing
  snapshot-shields / e2e-fixture-context flakes, confirmed failing on
  base with my changes stashed).
- npm run typecheck:cli passes.
- codex review --uncommitted is clean (two flagged CI-integration gaps
  fixed: aggregate needs lists + migration inventory).
- Tests added for new/changed behavior.
- No secrets, API keys, or credentials committed.

Reporter-workflow E2E evidence

This is verified at two levels that together cover the exact reporter workflow:

- Deterministic regression of the exact failure mode: the unit suite
  src/lib/onboard/docker-gpu-patch.test.ts (describe Jetson /dev/nvmap group propagation (#4231)) reproduces the precise reporter condition
  hermetically (a sandbox user lacking the /dev/nvmap owning group) and
  asserts that the Jetson recreate now emits --group-add <gid> for the Tegra
  device-node group so the proof can pass. 56/56 pass on this PR head.
- Reporter-workflow pipeline E2E: test/e2e/test-jetson-nvmap-gpu.sh,
  wired as the gpu-jetson-nvmap-e2e job in nightly-e2e.yaml, performs the
  reporter's exact steps on a Jetson host: onboard with GPU, inspect the
  sandbox user's groups and /dev/nvmap, run the in-sandbox cuInit(0) CUDA
  proof, and assert that nemoclaw status reports (CUDA verified) (a bare
  "enabled" fails the job). Trigger it on a Jetson runner with:
```
gh workflow run nightly-e2e.yaml --ref fix/4231-jetson-nvmap-gpu-status -f jobs=gpu-jetson-nvmap-e2e
```
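
The status assertion in the E2E item above ("(CUDA verified)" passes, a bare "enabled" fails) can be sketched as a small classifier. This is a hypothetical helper, not code from the PR or the E2E script; in particular, the exact layout of the status line ("Sandbox GPU: enabled (CUDA verified)") is an assumption based on the fragments quoted in this description.

```typescript
// Hypothetical classifier for the GPU line of `nemoclaw status` output.
// The line layout is assumed; only the quoted fragments come from the PR text.
type GpuProofState = "verified" | "proof-failed" | "unproven" | "absent";

function classifyGpuStatus(statusOutput: string): GpuProofState {
  const lines = statusOutput.split("\n");
  for (const line of lines) {
    if (line.indexOf("Sandbox GPU:") === -1) continue;
    if (line.indexOf("(CUDA verified)") !== -1) return "verified";
    if (line.indexOf("(last CUDA proof failed") !== -1) return "proof-failed";
    return "unproven"; // a bare "enabled" with no proof annotation
  }
  return "absent"; // no GPU line in the output at all
}

// The job should pass only for "verified".
const verified = classifyGpuStatus("Sandbox GPU: enabled (CUDA verified)");
const bare = classifyGpuStatus("Sandbox GPU: enabled");
console.log(verified, bare); // verified unproven
```

Treating the bare "enabled" case as a distinct failing state mirrors the intent stated above: GPU devices being mounted is not evidence that CUDA can initialize.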

All required CI checks are green on this PR head (cli-test-shards,
build-typecheck, codebase-growth-guardrails, ShellCheck, dco-check,
CodeRabbit); see the PR Checks tab for the run ids and job logs.

Merge gate / remaining work

The live gpu-jetson-nvmap-e2e job is gated behind vars.JETSON_E2E_ENABLED
and a Jetson/Tegra GPU runner label (vars.JETSON_E2E_RUNNER_LABEL). The
project does not yet host an arm64/Jetson GPU runner, so a live green log on
real Jetson hardware is pending that runner being provisioned — set the
variable and label, then dispatch the job above. Issue #4231 stays assigned to
@yimoj until that live log is captured.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Jetson/Tegra GPU group-permission handling added to improve CUDA initialization on Jetson hardware.
Tests
- New end-to-end Jetson nvmap GPU test validating group permissions, CUDA initialization, and status reporting.
- Nightly E2E job added to run the Jetson GPU test, with configurable enablement and runner selection.
Chores
- CI reporting updated to include the new Jetson GPU job in failure notifications and reports.

Summary by CodeRabbit

New Features
- Added nightly end-to-end testing for Jetson Orin GPU support, validating CUDA usability and device access configuration.
- Improved GPU sandbox group permissions handling for Jetson devices to ensure proper GPU device access.
Tests
- Added comprehensive E2E test script for Jetson /dev/nvmap GPU validation.
- Extended test coverage for GPU sandbox group permission detection and application.

coderabbitai · 2026-06-09T04:47:12Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 562eeb08-b745-4e12-b8e2-a2aaa11dff0c

📥 Commits

Reviewing files that changed from the base of the PR and between 55c4c6d and ca86413.

📒 Files selected for processing (2)

.coderabbit.yaml
src/lib/onboard/docker-gpu-patch.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough

Walkthrough

Detect Tegra device-node owning GIDs and propagate them into recreated GPU sandboxes via --group-add when backend is Jetson; add unit tests, a Jetson-gated E2E script that validates CUDA inside the sandbox, register the script in migration inventory, and add a nightly CI job to run it.

Changes

Jetson CUDA Group Permission Fix

| Layer / File(s) | Summary |
| --- | --- |
| Tegra device group GID detection contract and types `src/lib/onboard/docker-gpu-patch.ts` | `DockerGpuPatchDeps` gains injectable `detectTegraDeviceGroupGids()`; `DockerGpuCloneRunOptions.extraGroupGids` added. Default implementation probes Tegra device paths, skips missing/root-owned nodes, deduplicates and sorts numeric GIDs. |
| Docker run args and sandbox recreation wiring `src/lib/onboard/docker-gpu-patch.ts` | When `backend === "jetson"`, `recreateOpenShellDockerSandboxWithGpu` calls `detectTegraDeviceGroupGids()` and sets `cloneOptions.extraGroupGids`. `buildDockerGpuCloneRunArgs` emits `--group-add <gid>` for each extra GID, deduped against baseline `HostConfig.GroupAdd`. `applyDockerGpuPatchOrExit` signature extended to accept `backend` and `openshellSandboxCommand`. |
| Unit tests for Tegra group permission handling `src/lib/onboard/docker-gpu-patch.test.ts` | Adds tests for GID detection (dedupe/sort/filter), `--group-add` emission and deduping, and Jetson vs generic backend wiring into sandbox recreation. |
| Jetson nvmap E2E validation script `test/e2e/test-jetson-nvmap-gpu.sh` | New Jetson-gated bash E2E script that runs onboarding, asserts installer log grants `--group-add` for nvmap GID, checks sandbox user supplementary groups include the nvmap owner GID, verifies `/dev/nvmap` inside sandbox, probes CUDA via `cuInit(0)`, and requires `nemoclaw status` to include `CUDA verified`. |
| Nightly E2E job and workflow wiring `.github/workflows/nightly-e2e.yaml` | Adds `gpu-jetson-nvmap-e2e` job (gated by `vars.JETSON_E2E_ENABLED` and `workflow_dispatch` inputs), configurable `runs-on` Jetson runner label, uploads failure artifacts, and registers the job in downstream reporting (`notify-on-failure`, `report-to-pr`, `scorecard`). |
| Migration inventory registration `test/e2e-scenario/migration/legacy-inventory.json` | Registers the E2E script under `platform` domain with `not-migrated` status and contextual notes referencing issue `#4231`. |
| CodeRabbit instructions `.coderabbit.yaml` | Adds shared `path_instructions` anchor for the Jetson E2E script and links it to `src/lib/onboard/docker-gpu-patch.ts` for guidance. |
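
The `--group-add` emission and deduplication summarized in the walkthrough can be sketched as follows. This is a minimal illustration, not the PR's `buildDockerGpuCloneRunArgs`: the function name `buildExtraGroupAddArgs` is hypothetical, and it only returns the arguments for the extra GIDs, assuming the baseline `HostConfig.GroupAdd` entries are carried over elsewhere.

```typescript
// Hypothetical sketch: emit `--group-add <gid>` for each extra GID that is
// not already present in the baseline container's HostConfig.GroupAdd.
function buildExtraGroupAddArgs(
  baselineGroupAdd: string[], // as reported by `docker inspect` (strings)
  extraGroupGids: number[], // numeric GIDs detected on the host
): string[] {
  const seen = new Set<string>(baselineGroupAdd);
  const args: string[] = [];
  for (const gid of extraGroupGids) {
    const value = String(gid);
    if (seen.has(value)) continue; // already granted, or repeated in the input
    seen.add(value);
    args.push("--group-add", value);
  }
  return args;
}

// GID 44 is already in the baseline; 104 appears twice in the extras.
const groupArgs = buildExtraGroupAddArgs(["44"], [44, 104, 104]);
console.log(groupArgs); // [ '--group-add', '104' ]
```

Comparing by string keeps the check aligned with how Docker reports `GroupAdd`, although a baseline entry given as a group name rather than a number would not be matched by this sketch.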

Sequence Diagram

sequenceDiagram
  participant Test as E2E Script
  participant Docker as Docker/NVIDIA
  participant Installer as install.sh
  participant Sandbox as Sandbox Container
  participant CudaRuntime as CUDA Runtime
  participant NemoClaw as nemoclaw status
  
  Test->>Docker: Verify NVIDIA runtime configured
  Test->>Installer: Run install.sh --non-interactive
  Note over Installer: Installer logs expected --group-add for nvmap owner gid
  Installer->>Sandbox: Create container with supplementary group
  Test->>Sandbox: id -> verify user in nvmap owner group
  Test->>Sandbox: ls -la /dev/nvmap -> confirm accessible
  Test->>Sandbox: cuInit(0) probe via libcuda.so.1
  Sandbox->>CudaRuntime: cuInit(0) call
  CudaRuntime-->>Sandbox: Return 0 (success)
  Test->>NemoClaw: Query nemoclaw <sandbox> status
  NemoClaw-->>Test: "Sandbox GPU: enabled" + "CUDA verified"

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug-fix, Sandbox, Docker, platform: container, area: cli

Suggested reviewers

cv

Poem

🐰 I sniffed the nvmap by moonlight's glow,
I hopped through gids that root won't show.
I added groups, the sandbox cheered,
Now CUDA wakes — the test has cleared.
Hooray — a tiny rabbit engineer!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(onboard): grant Jetson Tegra device-node group so sandbox CUDA can init' directly and clearly summarizes the main change: granting Jetson Tegra device-node group ownership to the sandbox user so CUDA initialization works. |
| Linked Issues check | ✅ Passed | All code changes directly address issue `#4231`: detectTegraDeviceGroupGids detects Tegra device groups, sandbox recreate applies them via --group-add, tests verify the logic, and E2E script validates CUDA works post-onboard. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope and directly related to `#4231`: core GPU patch logic, unit tests, E2E test script, workflow configuration, and inventory updates all support the Jetson Tegra group-access objective. |


…n init (NVIDIA#4231)

On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with `NvRmMemInitNvmap ... Permission denied` / `cuInit(0)=999` because the unprivileged sandbox user was not a member of the host group (`video`) that owns `/dev/nvmap` (`crw-rw---- root video`). PR NVIDIA#4599 improved the status/proof semantics but did not propagate that group, so QA reopened: CUDA stayed unusable and status still read "enabled".

The Jetson Docker GPU recreate now detects the host group(s) owning the Tegra device nodes (`/dev/nvmap`, `/dev/nvhost-*`, `/dev/nvgpu/*`) and grants the sandbox user matching `--group-add <gid>` membership, so CUDA's nvmap init can open them. The existing post-recreate `cuInit(0)` proof then passes and `nemoclaw status` reports `(CUDA verified)`; if the group cannot be resolved, onboard warns and the proof still gates success, so status falls back to the honest `(last CUDA proof failed)` with `/dev/nvmap` remediation instead of a misleading "enabled". This automates the remediation the existing `jetsonGpuProofRemediationLines()` already documents.

- detectTegraDeviceGroupGids(): stat the Tegra device nodes, return owning numeric GIDs (skip missing and root-owned); numeric GIDs work even when the sandbox image has no matching video/render group entry.
- recreateOpenShellDockerSandboxWithGpu(): for the jetson backend, thread the detected GIDs into DockerGpuCloneRunOptions.extraGroupGids; buildDockerGpuCloneRunArgs emits --group-add (deduped vs baseline GroupAdd).
- applyDockerGpuPatchOrExit(): thread `backend` explicitly so the fallback create path also grants the group.
- Regression tests for GID detection, --group-add emission/dedupe, and the Jetson-vs-generic recreate plumbing.
- Reporter-workflow E2E (test/e2e/test-jetson-nvmap-gpu.sh): onboard with GPU, inspect sandbox groups + /dev/nvmap, run cuInit(0) in-sandbox, assert status reports (CUDA verified). Wired as gpu-jetson-nvmap-e2e (Jetson-gated) and inventoried; skips cleanly on non-Jetson hosts.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/lib/onboard/docker-gpu-patch.test.ts (1)

773-779: 💤 Low value

Consider moving the Jetson describe block to a sibling scope.

The Jetson /dev/nvmap group propagation tests are nested inside describe("docker-gpu-patch sandbox DNS fallback (#3579)"), but Jetson GID handling is semantically unrelated to DNS fallback. Moving this to a sibling describe block would improve test organization clarity.

-  // Jetson `/dev/nvmap` group-permission propagation (`#4231`). ...
-  describe("Jetson /dev/nvmap group propagation (`#4231`)", () => {
+});
+
+// Jetson `/dev/nvmap` group-permission propagation (`#4231`). ...
+describe("docker-gpu-patch Jetson /dev/nvmap group propagation (`#4231`)", () => {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-gpu-patch.test.ts` around lines 773 - 779, The Jetson
`/dev/nvmap` tests (describe("Jetson /dev/nvmap group propagation (`#4231`)")) are
nested inside the unrelated describe("docker-gpu-patch sandbox DNS fallback
(`#3579`)") block; extract the entire Jetson describe block and move it out so it
becomes a sibling describe at the same top-level scope as the DNS fallback
describe, preserving all its tests, hooks and imports, and update any relative
references if needed so the tests run independently.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 2122-2129: The .coderabbit.yaml is missing a path_instructions
entry for the gpu-jetson-nvmap-e2e job so changes to
test/e2e/test-jetson-nvmap-gpu.sh won't trigger this job; add a
path_instructions mapping that references the job id "gpu-jetson-nvmap-e2e" and
includes the path "test/e2e/test-jetson-nvmap-gpu.sh" (and any related
directories or patterns) so selective path-based runs will correctly pick up
changes to that script.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a8976ea-d489-44b0-957f-48a08692708e

📥 Commits

Reviewing files that changed from the base of the PR and between d9db199 and eff17d2.

📒 Files selected for processing (5)

.github/workflows/nightly-e2e.yaml
src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts
test/e2e-scenario/migration/legacy-inventory.json
test/e2e/test-jetson-nvmap-gpu.sh

- .coderabbit.yaml: add path_instructions mapping test/e2e/test-jetson-nvmap-gpu.sh and src/lib/onboard/docker-gpu-patch.ts to the gpu-jetson-nvmap-e2e job, per the documented convention for new nightly E2E jobs.
- docker-gpu-patch.test.ts: move the Jetson /dev/nvmap describe block out of the unrelated NVIDIA#3579 DNS-fallback describe into a top-level sibling scope.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

## Summary
- Add v0.0.62 release notes from Discussion #5100 and link release highlights to the relevant docs pages.
- Document the release's GPU sandbox recreation, sandbox-side local inference verification, and Hermes dashboard port guard in the command and inference references.
- Refresh generated NemoClaw user skills for the release-prep docs set.

## Source Summary
- #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU recreation behavior for Linux Docker-driver sandboxes.
- #5024 -> `docs/inference/use-local-inference.mdx`: Document sandbox-runtime verification of the `inference.local` local inference route.
- #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra device-node group propagation for sandbox CUDA initialization.
- #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`: Summarize onboarding and recovery reliability fixes, including the reserved Hermes API port guard.
- #5017 and #5043 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Summarize mutable OpenClaw config recovery and host-side `agents list` coverage.
- #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes upstream metadata visibility and WhatsApp QR rendering reliability.
- #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`: Refresh generated user-skill references from the current docs source.

## Skipped
- #5019 -> skipped for new prose because it touched `openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`. Existing source docs remain the source for generated skill synchronization.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning)
- Pre-commit hooks passed during commit, including docs-to-skills verification, markdown lint, gitleaks, and skills YAML tests.

## Summary by CodeRabbit
* **New Features**
  * Added `nemoclaw <name> agents list` command.
  * v0.0.62 release notes added summarizing onboarding and recovery improvements.
* **Bug Fixes**
  * Improved GPU sandbox onboarding reliability (NVIDIA CDI path, Jetson/Tegra device handling).
  * Better local inference verification and recovery for Linux Docker-driver GPU sandboxes.
  * Quieter/earlier handling of onboarding drift and port collisions.
* **Documentation**
  * Expanded GPU passthrough, inference verification, writable paths (`/dev/pts`), port 8642 restriction, and command examples.

---------

Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>

yimoj force-pushed the fix/4231-jetson-nvmap-gpu-status branch from eff17d2 to 55c4c6d Compare June 9, 2026 04:50

coderabbitai Bot reviewed Jun 9, 2026


Comment thread .github/workflows/nightly-e2e.yaml

yimoj added the v0.0.62 Release target label Jun 9, 2026

cv approved these changes Jun 9, 2026


cv merged commit 4ee5e7d into NVIDIA:main Jun 9, 2026
33 checks passed

miyoungc mentioned this pull request Jun 10, 2026

docs: refresh v0.0.62 release docs #5157

Merged

cv merged 2 commits into NVIDIA:main from yimoj:fix/4231-jetson-nvmap-gpu-status
