fix(onboard): grant Jetson Tegra device-node group so sandbox CUDA can init (#4231) #5018

Merged
cv merged 2 commits into NVIDIA:main from yimoj:fix/4231-jetson-nvmap-gpu-status
Jun 9, 2026

Conversation

@yimoj yimoj commented Jun 9, 2026

Contributor

Summary

On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with
NvRmMemInitNvmap ... Permission denied / cuInit(0)=999 because the
unprivileged sandbox user was not a member of the host group (video) that
owns /dev/nvmap (crw-rw---- root video). This grants the sandbox user that
group on the Jetson Docker GPU recreate so CUDA actually initializes, and the
existing post-recreate cuInit(0) proof makes nemoclaw status report proven
CUDA usability instead of a misleading bare "enabled".
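
The failure described above follows from ordinary Unix permission checks on the device node. The sketch below is illustrative only: the types, the function name, and the numeric IDs (uid 1000, GID 44) are assumptions, not values taken from the PR or from any particular Jetson image.

```typescript
// Hypothetical types for illustration; not taken from the PR's source.
interface DeviceNode {
  mode: number; // permission bits, e.g. 0o660 for crw-rw----
  uid: number; // owning user id
  gid: number; // owning group id
}

interface ProcessIdentity {
  uid: number;
  gid: number; // primary group
  supplementaryGids: number[]; // the set that `docker run --group-add` extends
}

// Classic Unix check: owner bits if the uid matches, else group bits if any
// of the process's groups matches, else the "other" bits.
// The root (uid 0) bypass is deliberately left out of this sketch.
function canOpenReadWrite(node: DeviceNode, who: ProcessIdentity): boolean {
  const groups = [who.gid, ...who.supplementaryGids];
  let bits: number;
  if (who.uid === node.uid) {
    bits = (node.mode >> 6) & 0o7;
  } else if (groups.indexOf(node.gid) !== -1) {
    bits = (node.mode >> 3) & 0o7;
  } else {
    bits = node.mode & 0o7;
  }
  return (bits & 0o6) === 0o6;
}

// Placeholder IDs: 44 stands in for whatever GID the host's video group has.
const nvmap: DeviceNode = { mode: 0o660, uid: 0, gid: 44 };
const sandboxUser: ProcessIdentity = { uid: 1000, gid: 1000, supplementaryGids: [] };
const sandboxUserWithGroup: ProcessIdentity = { ...sandboxUser, supplementaryGids: [44] };

console.log(canOpenReadWrite(nvmap, sandboxUser)); // false
console.log(canOpenReadWrite(nvmap, sandboxUserWithGroup)); // true
```

With mode `crw-rw----` the "other" bits are empty, so a user outside the owning group cannot open the node; adding the GID as a supplementary group is enough to change the outcome.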

Related Issue

Fixes #4231

PR #4599 improved status/proof semantics but did not propagate Jetson
/dev/nvmap group access, so QA reopened: CUDA stayed unusable inside the
sandbox. This PR fixes the device-permission root cause.

Changes

  • docker-gpu-patch.ts — grant the Tegra device-node group on Jetson
    recreate (the fix):
    new detectTegraDeviceGroupGids() stats the Jetson
    Tegra device nodes (/dev/nvmap, /dev/nvhost-*, /dev/nvgpu/*) on the host
    and returns the owning numeric GID(s) (skipping missing and root-owned
    nodes). recreateOpenShellDockerSandboxWithGpu passes those through
    DockerGpuCloneRunOptions.extraGroupGids into buildDockerGpuCloneRunArgs,
    which emits --group-add <gid> (deduped against any baseline GroupAdd).
    Numeric GIDs are used on purpose — the sandbox image need not define a
    matching video/render group. Only runs for the jetson backend;
    backend is now threaded explicitly through applyDockerGpuPatchOrExit so
    the fallback create path is covered too. This automates the exact remediation
    the existing jetsonGpuProofRemediationLines() already documents.
  • Status correctness: the existing post-recreate cuInit(0) proof from
    #4599 (fix(inference): prove WSL Docker Desktop GPUs and report sandbox
    CUDA proof state) now passes once the device group is granted, so
    nemoclaw status shows (CUDA verified). If the group cannot be resolved,
    onboard warns and the proof still gates success, so status falls back to
    the honest (last CUDA proof failed: …) with /dev/nvmap remediation rather
    than a misleading "enabled".
  • Regression tests (docker-gpu-patch.test.ts): GID detection (dedupe,
    skip missing/root), --group-add emission + dedupe, and end-to-end plumbing
    through the Jetson recreate; plus a guard that the generic backend never adds
    Tegra groups.
  • Reporter-workflow E2E (test/e2e/test-jetson-nvmap-gpu.sh,
    gpu-jetson-nvmap-e2e in nightly-e2e.yaml): runs the reporter's exact
    Jetson steps and is inventoried in legacy-inventory.json + .coderabbit.yaml.
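
The GID detection described in the first bullet can be sketched roughly as follows. This is not the PR's implementation: `StatGid`, `detectGroupGids`, the fake host table, and its device paths and GIDs are all hypothetical, "root-owned" is assumed to mean group GID 0, and glob patterns such as /dev/nvhost-* are assumed to have been expanded by the caller.

```typescript
// Hypothetical injectable probe: returns the owning GID of a path,
// or null when the node does not exist on the host.
type StatGid = (path: string) => number | null;

// Collect the distinct non-root owning GIDs of the given device nodes,
// sorted numerically so the result is stable across runs.
function detectGroupGids(paths: string[], statGid: StatGid): number[] {
  const gids: number[] = [];
  for (const path of paths) {
    const gid = statGid(path);
    if (gid === null) continue; // node missing on this host
    if (gid === 0) continue; // root-owned: assumed to grant nothing useful
    if (gids.indexOf(gid) === -1) gids.push(gid); // dedupe
  }
  return gids.sort((a, b) => a - b);
}

// Fake host for illustration; paths and GIDs are placeholders.
const fakeHost: Record<string, number> = {
  "/dev/nvmap": 44,
  "/dev/nvhost-ctrl": 44,
  "/dev/nvhost-gpu": 0,
  "/dev/nvgpu/igpu0/ctrl": 104,
};
const fakeStat: StatGid = (path) => (path in fakeHost ? fakeHost[path] : null);

const detected = detectGroupGids(
  ["/dev/nvgpu/igpu0/ctrl", "/dev/nvmap", "/dev/nvhost-ctrl", "/dev/nvhost-gpu", "/dev/missing"],
  fakeStat,
);
console.log(detected); // [ 44, 104 ]
```

Injecting the probe keeps the logic testable without a Jetson host, which matches the PR's description of an injectable detectTegraDeviceGroupGids() dependency.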

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm test (CLI project) passes — full vitest --project cli green on
    this PR head after rebase (the only 2 reds are the pre-existing
    snapshot-shields / e2e-fixture-context flakes, confirmed failing on
    base with my changes stashed).
  • npm run typecheck:cli passes.
  • codex review --uncommitted clean (two flagged CI-integration gaps
    fixed: aggregate needs lists + migration inventory).
  • Tests added for new/changed behavior.
  • No secrets, API keys, or credentials committed.

Reporter-workflow E2E evidence

This is verified at two levels that together cover the exact reporter workflow:

  1. Deterministic regression of the exact failure mode — the unit suite
    src/lib/onboard/docker-gpu-patch.test.ts (describe Jetson /dev/nvmap group propagation (#4231)) reproduces the precise reporter condition
    hermetically: a sandbox user lacking the /dev/nvmap owning group, and
    asserts the Jetson recreate now emits --group-add <gid> for the Tegra
    device-node group so the proof can pass. 56/56 pass on this PR head.
  2. Reporter-workflow pipeline E2E: test/e2e/test-jetson-nvmap-gpu.sh,
    wired as the gpu-jetson-nvmap-e2e job in nightly-e2e.yaml, performs the
    reporter's exact steps on a Jetson host: onboard with GPU, inspect the
    sandbox user's groups and /dev/nvmap, run the in-sandbox cuInit(0) CUDA
    proof, and assert nemoclaw status reports (CUDA verified) (a bare
    "enabled" fails the job). Trigger it on a Jetson runner with:
    gh workflow run nightly-e2e.yaml --ref fix/4231-jetson-nvmap-gpu-status -f jobs=gpu-jetson-nvmap-e2e
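
The final status assertion in step 2 amounts to a marker check on the command output. The sketch below assumes only the two markers quoted in this PR, "(CUDA verified)" and "last CUDA proof failed"; the function name, the result type, and the sample status lines are hypothetical, since the exact output format of nemoclaw status is not shown here.

```typescript
// Hypothetical classification of a status line for the sandbox GPU.
type CudaProofState = "verified" | "proof-failed" | "unproven";

// Classify by the markers quoted in the PR description. A bare "enabled"
// without either marker counts as unproven, which the E2E treats as failure.
function classifyCudaStatus(statusOutput: string): CudaProofState {
  if (statusOutput.indexOf("(CUDA verified)") !== -1) return "verified";
  if (statusOutput.indexOf("last CUDA proof failed") !== -1) return "proof-failed";
  return "unproven";
}

// Throws unless the status output proves CUDA usability.
function assertCudaVerified(statusOutput: string): void {
  const state = classifyCudaStatus(statusOutput);
  if (state !== "verified") {
    throw new Error("expected (CUDA verified) in status output, got state: " + state);
  }
}

// Sample lines are assumptions about the shape of the output.
console.log(classifyCudaStatus("Sandbox GPU: enabled (CUDA verified)")); // verified
console.log(classifyCudaStatus("Sandbox GPU: enabled")); // unproven
```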
    

All required CI checks are green on this PR head (cli-test-shards,
build-typecheck, codebase-growth-guardrails, ShellCheck, dco-check,
CodeRabbit); see the PR Checks tab for the run ids and job logs.

Merge gate / remaining work

The live gpu-jetson-nvmap-e2e job is gated behind vars.JETSON_E2E_ENABLED
and a Jetson/Tegra GPU runner label (vars.JETSON_E2E_RUNNER_LABEL). The
project does not yet host an arm64/Jetson GPU runner, so a live green log on
real Jetson hardware is pending that runner being provisioned — set the
variable and label, then dispatch the job above. Issue #4231 stays assigned to
@yimoj until that live log is captured.


Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Jetson/Tegra GPU group-permission handling added to improve CUDA initialization on Jetson hardware.
  • Tests

    • New end-to-end Jetson nvmap GPU test validating group permissions, CUDA initialization, and status reporting.
    • Nightly E2E job added to run the Jetson GPU test, with configurable enablement and runner selection.
  • Chores

    • CI reporting updated to include the new Jetson GPU job in failure notifications and reports.

Summary by CodeRabbit

  • New Features

    • Added nightly end-to-end testing for Jetson Orin GPU support, validating CUDA usability and device access configuration.
    • Improved GPU sandbox group permissions handling for Jetson devices to ensure proper GPU device access.
  • Tests

    • Added comprehensive E2E test script for Jetson /dev/nvmap GPU validation.
    • Extended test coverage for GPU sandbox group permission detection and application.

coderabbitai Bot commented Jun 9, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 562eeb08-b745-4e12-b8e2-a2aaa11dff0c

📥 Commits

Reviewing files that changed from the base of the PR and between 55c4c6d and ca86413.

📒 Files selected for processing (2)
  • .coderabbit.yaml
  • src/lib/onboard/docker-gpu-patch.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-gpu-patch.test.ts

Walkthrough

Detect Tegra device-node owning GIDs and propagate them into recreated GPU sandboxes via --group-add when backend is Jetson; add unit tests, a Jetson-gated E2E script that validates CUDA inside the sandbox, register the script in migration inventory, and add a nightly CI job to run it.

Changes

Jetson CUDA Group Permission Fix

| Layer | File(s) | Summary |
| --- | --- | --- |
| Tegra device group GID detection contract and types | src/lib/onboard/docker-gpu-patch.ts | DockerGpuPatchDeps gains injectable detectTegraDeviceGroupGids(); DockerGpuCloneRunOptions.extraGroupGids added. Default implementation probes Tegra device paths, skips missing/root-owned nodes, deduplicates and sorts numeric GIDs. |
| Docker run args and sandbox recreation wiring | src/lib/onboard/docker-gpu-patch.ts | When backend === "jetson", recreateOpenShellDockerSandboxWithGpu calls detectTegraDeviceGroupGids() and sets cloneOptions.extraGroupGids. buildDockerGpuCloneRunArgs emits --group-add <gid> for each extra GID, deduped against baseline HostConfig.GroupAdd. applyDockerGpuPatchOrExit signature extended to accept backend and openshellSandboxCommand. |
| Unit tests for Tegra group permission handling | src/lib/onboard/docker-gpu-patch.test.ts | Adds tests for GID detection (dedupe/sort/filter), --group-add emission and deduping, and Jetson vs generic backend wiring into sandbox recreation. |
| Jetson nvmap E2E validation script | test/e2e/test-jetson-nvmap-gpu.sh | New Jetson-gated bash E2E script that runs onboarding, asserts installer log grants --group-add for nvmap GID, checks sandbox user supplementary groups include the nvmap owner GID, verifies /dev/nvmap inside sandbox, probes CUDA via cuInit(0), and requires nemoclaw status to include CUDA verified. |
| Nightly E2E job and workflow wiring | .github/workflows/nightly-e2e.yaml | Adds gpu-jetson-nvmap-e2e job (gated by vars.JETSON_E2E_ENABLED and workflow_dispatch inputs), configurable runs-on Jetson runner label, uploads failure artifacts, and registers the job in downstream reporting (notify-on-failure, report-to-pr, scorecard). |
| Migration inventory registration | test/e2e-scenario/migration/legacy-inventory.json | Registers the E2E script under platform domain with not-migrated status and contextual notes referencing issue #4231. |
| CodeRabbit instructions | .coderabbit.yaml | Adds shared path_instructions anchor for the Jetson E2E script and links it to src/lib/onboard/docker-gpu-patch.ts for guidance. |
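
The --group-add emission and backend gating summarized above might look roughly like this. It is a sketch under stated assumptions, not the PR's buildDockerGpuCloneRunArgs: the function names, the `Backend` type, and the sample GIDs are hypothetical, and the sketch emits only the extra GIDs without re-emitting the baseline entries. The one external fact relied on is that Docker's HostConfig.GroupAdd is a list of strings.

```typescript
// Hypothetical backend names; only "jetson" is named in the PR.
type Backend = "jetson" | "generic";

// Emit `--group-add <gid>` for each extra GID that the baseline container
// config does not already carry, skipping duplicates within the extras too.
function buildGroupAddArgs(baselineGroupAdd: string[], extraGroupGids: number[]): string[] {
  const seen = baselineGroupAdd.slice(); // copy so the caller's array is untouched
  const args: string[] = [];
  for (const gid of extraGroupGids) {
    const value = String(gid);
    if (seen.indexOf(value) !== -1) continue;
    seen.push(value);
    args.push("--group-add", value);
  }
  return args;
}

// Only the Jetson backend contributes Tegra groups; other backends add none.
function extraGidsForBackend(backend: Backend, detectedGids: number[]): number[] {
  return backend === "jetson" ? detectedGids : [];
}

// Placeholder GIDs: "44" is already in the baseline, 104 is new.
const jetsonArgs = buildGroupAddArgs(["44"], extraGidsForBackend("jetson", [44, 104, 104]));
const genericArgs = buildGroupAddArgs(["44"], extraGidsForBackend("generic", [44, 104]));
console.log(jetsonArgs); // [ '--group-add', '104' ]
console.log(genericArgs); // []
```

Passing numeric GIDs rather than group names means the container image does not need a matching group entry, which is the rationale the PR description gives for that choice.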

Sequence Diagram

sequenceDiagram
  participant Test as E2E Script
  participant Docker as Docker/NVIDIA
  participant Installer as install.sh
  participant Sandbox as Sandbox Container
  participant CudaRuntime as CUDA Runtime
  participant NemoClaw as nemoclaw status
  
  Test->>Docker: Verify NVIDIA runtime configured
  Test->>Installer: Run install.sh --non-interactive
  Note over Installer: Installer logs expected --group-add for nvmap owner gid
  Installer->>Sandbox: Create container with supplementary group
  Test->>Sandbox: id -> verify user in nvmap owner group
  Test->>Sandbox: ls -la /dev/nvmap -> confirm accessible
  Test->>Sandbox: cuInit(0) probe via libcuda.so.1
  Sandbox->>CudaRuntime: cuInit(0) call
  CudaRuntime-->>Sandbox: Return 0 (success)
  Test->>NemoClaw: Query nemoclaw <sandbox> status
  NemoClaw-->>Test: "Sandbox GPU: enabled" + "CUDA verified"

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug-fix, Sandbox, Docker, platform: container, area: cli

Suggested reviewers

  • cv

Poem

🐰 I sniffed the nvmap by moonlight's glow,
I hopped through gids that root won't show.
I added groups, the sandbox cheered,
Now CUDA wakes — the test has cleared.
Hooray — a tiny rabbit engineer!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(onboard): grant Jetson Tegra device-node group so sandbox CUDA can init' directly and clearly summarizes the main change—granting Jetson Tegra device-node group ownership to the sandbox user so CUDA initialization works. |
| Linked Issues check | ✅ Passed | All code changes directly address issue #4231: detectTegraDeviceGroupGids detects Tegra device groups, sandbox recreate applies them via --group-add, tests verify the logic, and E2E script validates CUDA works post-onboard. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope and directly related to #4231: core GPU patch logic, unit tests, E2E test script, workflow configuration, and inventory updates all support the Jetson Tegra group-access objective. |

…n init (NVIDIA#4231)

On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with
`NvRmMemInitNvmap ... Permission denied` / `cuInit(0)=999` because the
unprivileged sandbox user was not a member of the host group (`video`) that
owns `/dev/nvmap` (`crw-rw---- root video`). PR NVIDIA#4599 improved the status/proof
semantics but did not propagate that group, so QA reopened: CUDA stayed
unusable and status still read "enabled".

The Jetson Docker GPU recreate now detects the host group(s) owning the Tegra
device nodes (`/dev/nvmap`, `/dev/nvhost-*`, `/dev/nvgpu/*`) and grants the
sandbox user matching `--group-add <gid>` membership, so CUDA's nvmap init can
open them. The existing post-recreate `cuInit(0)` proof then passes and
`nemoclaw status` reports `(CUDA verified)`; if the group cannot be resolved,
onboard warns and the proof still gates success, so status falls back to the
honest `(last CUDA proof failed)` with `/dev/nvmap` remediation instead of a
misleading "enabled". This automates the remediation the existing
`jetsonGpuProofRemediationLines()` already documents.

- detectTegraDeviceGroupGids(): stat the Tegra device nodes, return owning
  numeric GIDs (skip missing and root-owned); numeric GIDs work even when the
  sandbox image has no matching video/render group entry.
- recreateOpenShellDockerSandboxWithGpu(): for the jetson backend, thread the
  detected GIDs into DockerGpuCloneRunOptions.extraGroupGids;
  buildDockerGpuCloneRunArgs emits --group-add (deduped vs baseline GroupAdd).
- applyDockerGpuPatchOrExit(): thread `backend` explicitly so the fallback
  create path also grants the group.
- Regression tests for GID detection, --group-add emission/dedupe, and the
  Jetson-vs-generic recreate plumbing.
- Reporter-workflow E2E (test/e2e/test-jetson-nvmap-gpu.sh): onboard with GPU,
  inspect sandbox groups + /dev/nvmap, run cuInit(0) in-sandbox, assert
  status reports (CUDA verified). Wired as gpu-jetson-nvmap-e2e (Jetson-gated)
  and inventoried; skips cleanly on non-Jetson hosts.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/4231-jetson-nvmap-gpu-status branch from eff17d2 to 55c4c6d Compare June 9, 2026 04:50

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard/docker-gpu-patch.test.ts (1)

773-779: 💤 Low value

Consider moving the Jetson describe block to a sibling scope.

The Jetson /dev/nvmap group propagation tests are nested inside describe("docker-gpu-patch sandbox DNS fallback (#3579)"), but Jetson GID handling is semantically unrelated to DNS fallback. Moving this to a sibling describe block would improve test organization clarity.

-  // Jetson `/dev/nvmap` group-permission propagation (`#4231`). ...
-  describe("Jetson /dev/nvmap group propagation (`#4231`)", () => {
+});
+
+// Jetson `/dev/nvmap` group-permission propagation (`#4231`). ...
+describe("docker-gpu-patch Jetson /dev/nvmap group propagation (`#4231`)", () => {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-gpu-patch.test.ts` around lines 773 - 779, The Jetson
`/dev/nvmap` tests (describe("Jetson /dev/nvmap group propagation (`#4231`)")) are
nested inside the unrelated describe("docker-gpu-patch sandbox DNS fallback
(`#3579`)") block; extract the entire Jetson describe block and move it out so it
becomes a sibling describe at the same top-level scope as the DNS fallback
describe, preserving all its tests, hooks and imports, and update any relative
references if needed so the tests run independently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/nightly-e2e.yaml:
- Around line 2122-2129: The .coderabbit.yaml is missing a path_instructions
entry for the gpu-jetson-nvmap-e2e job so changes to
test/e2e/test-jetson-nvmap-gpu.sh won't trigger this job; add a
path_instructions mapping that references the job id "gpu-jetson-nvmap-e2e" and
includes the path "test/e2e/test-jetson-nvmap-gpu.sh" (and any related
directories or patterns) so selective path-based runs will correctly pick up
changes to that script.

---

Nitpick comments:
In `@src/lib/onboard/docker-gpu-patch.test.ts`:
- Around line 773-779: The Jetson `/dev/nvmap` tests (describe("Jetson
/dev/nvmap group propagation (`#4231`)")) are nested inside the unrelated
describe("docker-gpu-patch sandbox DNS fallback (`#3579`)") block; extract the
entire Jetson describe block and move it out so it becomes a sibling describe at
the same top-level scope as the DNS fallback describe, preserving all its tests,
hooks and imports, and update any relative references if needed so the tests run
independently.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5a8976ea-d489-44b0-957f-48a08692708e

📥 Commits

Reviewing files that changed from the base of the PR and between d9db199 and eff17d2.

📒 Files selected for processing (5)
  • .github/workflows/nightly-e2e.yaml
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
  • test/e2e-scenario/migration/legacy-inventory.json
  • test/e2e/test-jetson-nvmap-gpu.sh

Comment thread .github/workflows/nightly-e2e.yaml
- .coderabbit.yaml: add path_instructions mapping test/e2e/test-jetson-nvmap-gpu.sh
  and src/lib/onboard/docker-gpu-patch.ts to the gpu-jetson-nvmap-e2e job, per the
  documented convention for new nightly E2E jobs.
- docker-gpu-patch.test.ts: move the Jetson /dev/nvmap describe block out of the
  unrelated NVIDIA#3579 DNS-fallback describe into a top-level sibling scope.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj added the v0.0.62 Release target label Jun 9, 2026
@cv cv merged commit 4ee5e7d into NVIDIA:main Jun 9, 2026
33 checks passed
jyaunches pushed a commit that referenced this pull request Jun 10, 2026
## Summary
- Add v0.0.62 release notes from Discussion #5100 and link release
highlights to the relevant docs pages.
- Document the release's GPU sandbox recreation, sandbox-side local
inference verification, and Hermes dashboard port guard in the command
and inference references.
- Refresh generated NemoClaw user skills for the release-prep docs set.

## Source Summary
- #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU
recreation behavior for Linux Docker-driver sandboxes.
- #5024 -> `docs/inference/use-local-inference.mdx`: Document
sandbox-runtime verification of the `inference.local` local inference
route.
- #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra
device-node group propagation for sandbox CUDA initialization.
- #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`:
Summarize onboarding and recovery reliability fixes, including the
reserved Hermes API port guard.
- #5017 and #5043 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Summarize mutable OpenClaw config
recovery and host-side `agents list` coverage.
- #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes
upstream metadata visibility and WhatsApp QR rendering reliability.
- #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`:
Refresh generated user-skill references from the current docs source.

## Skipped
- #5019 -> skipped for new prose because it touched
`openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`.
Existing source docs remain the source for generated skill
synchronization.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning)
- Pre-commit hooks passed during commit, including docs-to-skills
verification, markdown lint, gitleaks, and skills YAML tests.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added `nemoclaw <name> agents list` command.
* v0.0.62 release notes added summarizing onboarding and recovery
improvements.

* **Bug Fixes**
* Improved GPU sandbox onboarding reliability (NVIDIA CDI path,
Jetson/Tegra device handling).
* Better local inference verification and recovery for Linux
Docker-driver GPU sandboxes.
  * Quieter/earlier handling of onboarding drift and port collisions.

* **Documentation**
* Expanded GPU passthrough, inference verification, writable paths
(`/dev/pts`), port 8642 restriction, and command examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
Labels

v0.0.62 Release target

Development

Successfully merging this pull request may close these issues.

[Jetson Orin][CLI&UX] nemoclaw status shows "Sandbox GPU: enabled" but CUDA is unusable inside sandbox — misleading status

3 participants