fix(inference): prove WSL Docker Desktop GPUs and report sandbox CUDA proof state by yimoj · Pull Request #4599 · NVIDIA/NemoClaw

yimoj · 2026-06-01T04:40:52Z

Summary

Two grouped GPU trust/proof/status fixes. nemoclaw now accepts real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs when a bounded Docker --gpus CUDA proof succeeds (#4565), and nemoclaw status reports proven CUDA usability instead of treating any configured GPU as healthy (#4231).

Related Issue

Fixes #4565
Fixes #4231

Changes

#4565: accept N1X without reopening the Snapdragon false positive (#3988/#4424). detectGpu() still rejects a denylisted JMJWOA-Generic-* name by default; the only escape is createArm64WslDockerDesktopGpuProver, which runs one bounded docker run --gpus all … CUDA workload on ARM64 Docker Desktop WSL hosts. The proof image is nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 (a genuine aarch64 binary running a real CUDA kernel: device alloc + add + result verification). The previous cuda-sample:nbody image was wrong for this ARM64-only path: its arm64 manifest entry actually ships an x86-64 ELF, so it fails with exec format error on the exact N1X hardware this feature targets (reported in-thread). Only a real GPU passes, so N1X is accepted while the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed. The proof timeout is bounded (default 180s, NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS override) and failures keep the CPU fallback with --no-gpu guidance. An explicit exec format error diagnostic now distinguishes an image-architecture problem from a missing GPU.
#4231: status reflects CUDA proof, not just config. The direct sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed) keyed on the cuInit(0)=0 usability proof instead of silently swallowing optional-proof failures. A zero exit that still printed a non-zero cuInit(0) code (a wrapper that swallowed the real exit) is treated as failed, not verified. The result is persisted to the sandbox registry and rendered by nemoclaw status as (CUDA verified) / (CUDA unverified) / (last CUDA proof failed: …). CUDA failures print Jetson /dev/nvmap + video/render group remediation. The proof is captured by the existing verifyGpuSandboxAfterReady wrapper (so src/lib/onboard.ts is unchanged / net-zero), and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox's CUDA verified state.
Fail-closed CPU fallback and explicit --no-gpu guidance preserved on every proof-failure path.
Captured stderr in runCaptureEx so Docker/CUDA failure diagnostics are no longer dropped.
The default ARM64 prover only swallows MODULE_NOT_FOUND and rethrows internal initialization errors (earlier CodeRabbit nit).

Type of Change

Code change (feature, bug fix, or refactor)

Verification

Rebased onto current upstream/main; resolved conflicts in status.ts/status-snapshot.ts/status.test.ts (upstream extracted the snapshot/report code into status-snapshot.ts) and threaded the proof result through the #4509 verifyGpuSandboxAfterReady wrapper.
Targeted GPU/status/registry/snapshot suites green (wsl-docker-desktop-gpu, nim, sandbox-gpu-preflight, docker-gpu-local-inference, status, registry, snapshot*).
npm test (cli project): only pre-existing, environment-only failures remain (test/cli.test.ts, test/ssrf-parity.test.ts, config-sync/nemoclaw-start root-ownership tests — file-mode/ownership/network checks unrelated to this change; none touch the modified files).
codex review --base upstream/main clean after addressing two P2 findings (stale proof on snapshot clone; require cuInit(0)=0 before verifying).
Tests added or updated for new or changed behavior.
No secrets, API keys, or credentials committed.
npx prek pre-commit/pre-push hooks pass (format, lint, typecheck).

Notes

The proof-image bug was diagnosed from the image manifest + file on the extracted binary (the nbody arm64 tag contains an x86-64 ELF; the vectoradd-cuda12.5.0 arm64 tag contains a real aarch64 binary). No live Windows-ARM/WSL GPU hardware was available on the triage host, so the N1X run was not reproduced live — see the in-thread reply for the exact commands and evidence.
Both issues were reproduced hermetically (no GPU hardware): detectGpu proof gating via injected prover, and the verifier/status proof classification via fixtures, confirming the pre-fix reject (#4565) and the misleading "enabled" status (#4231) before fixing.
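
For readers following the hermetic reproduction, the proof gating can be sketched roughly as below. This is an illustrative sketch, not the code in `nim.ts`: `acceptGpu`, `Prover`, and `DockerGpuProof` are hypothetical names, and the only details taken from this PR are the `JMJWOA-Generic-*` denylist prefix and the fail-closed behavior when no prover is available or the proof fails.

```typescript
// Hypothetical shapes; the real DockerGpuProofResult / prover types may differ.
interface DockerGpuProof {
  passed: boolean;
  diagnostic?: string;
}
type Prover = () => DockerGpuProof;

// Denylisted placeholder names, per the PR description.
const DENYLISTED_NAME = /^JMJWOA-Generic-/;

function acceptGpu(name: string, prover: Prover | null): boolean {
  // Names outside the denylist take the normal path (simplified to "accept" here).
  if (!DENYLISTED_NAME.test(name)) return true;
  // No prover injected or available: stay fail-closed.
  if (prover === null) return false;
  // Only a passing bounded proof lets a denylisted name through.
  return prover().passed;
}
```

Injecting the prover is what makes the accept path, the fail-closed path, and the null-prover path testable without GPU hardware.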

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

New Features
- Persistent per-sandbox CUDA proof tracking and reporting (verified / unverified / failed) with human-readable status lines and platform-specific remediation guidance.
- ARM64 WSL Docker Desktop GPU verification path with configurable timeout and clearer diagnostics.
Bug Fixes
- Snapshot restore no longer inherits a source sandbox’s GPU proof status.
Tests
- Updated unit and E2E GPU tests to validate CUDA usability states instead of a generic GPU-enabled marker.

coderabbitai · 2026-06-01T04:41:06Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f9f87825-a65f-4a85-983f-75aed6836e6f

📥 Commits

Reviewing files that changed from the base of the PR and between eef7250 and a56e438.

📒 Files selected for processing (19)

src/lib/actions/sandbox/snapshot.ts
src/lib/actions/sandbox/status-snapshot.ts
src/lib/actions/sandbox/status.test.ts
src/lib/actions/sandbox/status.ts
src/lib/inference/gpu-trust.ts
src/lib/inference/nim.test.ts
src/lib/inference/nim.ts
src/lib/onboard/docker-gpu-local-inference.test.ts
src/lib/onboard/docker-gpu-local-inference.ts
src/lib/onboard/docker-gpu-sandbox-create.ts
src/lib/onboard/sandbox-gpu-mode.ts
src/lib/onboard/sandbox-gpu-preflight.test.ts
src/lib/onboard/sandbox-gpu-preflight.ts
src/lib/onboard/sandbox-registry-metadata.ts
src/lib/onboard/wsl-docker-desktop-gpu.test.ts
src/lib/onboard/wsl-docker-desktop-gpu.ts
src/lib/runner.ts
src/lib/state/registry.ts
test/e2e/test-gpu-e2e.sh

🚧 Files skipped from review as they are similar to previous changes (18)

src/lib/actions/sandbox/snapshot.ts
src/lib/actions/sandbox/status.test.ts
src/lib/inference/gpu-trust.ts
test/e2e/test-gpu-e2e.sh
src/lib/runner.ts
src/lib/actions/sandbox/status-snapshot.ts
src/lib/onboard/sandbox-gpu-mode.ts
src/lib/onboard/sandbox-registry-metadata.ts
src/lib/onboard/docker-gpu-sandbox-create.ts
src/lib/onboard/docker-gpu-local-inference.test.ts
src/lib/state/registry.ts
src/lib/inference/nim.ts
src/lib/onboard/wsl-docker-desktop-gpu.ts
src/lib/onboard/docker-gpu-local-inference.ts
src/lib/onboard/wsl-docker-desktop-gpu.test.ts
src/lib/inference/nim.test.ts
src/lib/onboard/sandbox-gpu-preflight.ts
src/lib/onboard/sandbox-gpu-preflight.test.ts

📝 Walkthrough

Walkthrough

Adds structured CUDA-usability proof results, runs bounded Docker/WSL proofs (ARM64), persists proof outcomes in the registry, integrates optional proof injection into GPU detection, captures proofs during onboarding, and renders proof status (verified/failed/unverified) in sandbox status output with tests and e2e update.

Changes

Sandbox GPU proof state and verification

Layer / File(s)	Summary
GPU proof types and registry persistence `src/lib/state/registry.ts`	`SandboxGpuProofStatus` and `SandboxGpuProofResult` added; `SandboxEntry` extended with `sandboxGpuProof`; `registerSandbox` persists proof as `entry.sandboxGpuProof ?? null`.
Direct sandbox GPU verifier refactoring `src/lib/onboard/sandbox-gpu-preflight.ts`, `src/lib/onboard/sandbox-gpu-preflight.test.ts`	`createDirectSandboxGpuVerifier` now returns `SandboxGpuProofResult`, classifies CUDA usability via `cuInit(0)=` marker, accumulates optional failures, throws on required failures, and logs platform-specific remediation (Jetson vs WSL). Tests cover verified/failed/unverified and optional handling.
Runner stderr capture extension `src/lib/runner.ts`	`CaptureResult` now includes optional `stderr`; `runCaptureEx` returns trimmed `stderr` alongside `stdout` for diagnostic use in proof runners.
Docker GPU proof result contract `src/lib/inference/gpu-trust.ts`	Adds `DockerGpuProofResult` and `Arm64WslDockerDesktopGpuProver` types describing bounded Docker `--gpus` proof outcomes and prover contract.
ARM64 WSL Docker Desktop prover and tests `src/lib/onboard/wsl-docker-desktop-gpu.ts`, `src/lib/onboard/wsl-docker-desktop-gpu.test.ts`	Introduces `WSL_DOCKER_DESKTOP_GPU_PROOF_COMMAND` update, timeout helper `wslDockerDesktopGpuProofTimeoutMs`, `isExecFormatErrorDiagnostic`, an ARM64 prover factory mapping runner output to `DockerGpuProofResult`, and tests validating platform gating, argv/image selection, timeout, and exec-format diagnostics.
GPU detection integration with proof injection `src/lib/inference/nim.ts`, `src/lib/inference/nim.test.ts`	`detectGpu(deps)` accepts `DetectGpuDeps` with optional ARM64 WSL prover; on denylisted placeholder rows runs bounded proof, rejects on missing/failed proof, or accepts and sets `wslDockerDesktopGpuProofPassed` when proof passes. Tests for accept-path, fail-closed, and null-prover cases added.
Onboarding sandbox GPU configuration `src/lib/onboard/sandbox-gpu-mode.ts`	Adds optional `sandboxGpuProof?: SandboxGpuProofResult
Local inference verification and proof persistence `src/lib/onboard/docker-gpu-local-inference.ts`, `src/lib/onboard/docker-gpu-local-inference.test.ts`	`verifyGpuSandboxAfterReady` captures `SandboxGpuProofResult` from verification hooks and assigns to `config.sandboxGpuProof`; option types updated and test added to verify persistence.
Docker GPU sandbox creation proof wiring `src/lib/onboard/docker-gpu-sandbox-create.ts`	`verifyGpuOrExit` now returns `SandboxGpuProofResult` and forwards `verifyDirectSandboxGpu` result through the create-sandbox flow.
Registry metadata proof persistence `src/lib/onboard/sandbox-registry-metadata.ts`	`getSandboxRuntimeRegistryFields` includes `sandboxGpuProof` in returned fields and conditionally adds it to metadata only when `config.sandboxGpuProof` is present.
Sandbox status reporting and tests `src/lib/actions/sandbox/status-snapshot.ts`, `src/lib/actions/sandbox/status.ts`, `src/lib/actions/sandbox/status.test.ts`	Adds `sandboxGpuProof` to `SandboxStatusReport`, populates it from registry, adds `sandboxGpuProofUnverified()` and `sandboxGpuProofStatusSuffix()` helpers, appends proof suffix to "Sandbox GPU" line, and prints guidance on failed/unverified. Tests verify suffix rendering and legacy null behavior.
Snapshot restore sandbox isolation `src/lib/actions/sandbox/snapshot.ts`	Auto-created destination sandbox for snapshot restore explicitly clears `sandboxGpuProof` to `null` to avoid inheriting GPU proof when no GPU flags are passed.
End-to-end GPU status validation `test/e2e/test-gpu-e2e.sh`	E2E script updated to assert CUDA proof state: pass on "CUDA verified", fail on "CUDA unverified" or "last CUDA proof failed", skip if absent.
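
As a rough illustration of the status rendering summarized in the table, a suffix helper could look like the sketch below. The three suffix strings come from the PR description; the function name `proofSuffix`, the `detail` field, and the choice to render nothing for legacy `null` entries are assumptions, not the actual `sandboxGpuProofStatusSuffix()` implementation.

```typescript
type ProofStatus = "verified" | "unverified" | "failed";

// Hypothetical shape; the real SandboxGpuProofResult may carry other fields.
interface ProofResult {
  status: ProofStatus;
  detail?: string;
}

function proofSuffix(proof: ProofResult | null | undefined): string {
  // Assumption: entries written before this change have no proof and get no suffix.
  if (!proof) return "";
  if (proof.status === "verified") return " (CUDA verified)";
  if (proof.status === "unverified") return " (CUDA unverified)";
  return ` (last CUDA proof failed: ${proof.detail ?? "unknown"})`;
}

// Example line composition; the surrounding wording is illustrative only.
const line = `Sandbox GPU: enabled${proofSuffix({ status: "failed", detail: "cuInit(0)=999" })}`;
```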

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/NemoClaw#4407: Related changes to Docker GPU onboarding verification flow and proof-result handling.
NVIDIA/NemoClaw#4600: Related showSandboxStatus() rendering changes affecting sandbox status output.
NVIDIA/NemoClaw#4609: Related modifications to verifyGpuSandboxAfterReady onboarding gating.

Suggested labels

Docker, Sandbox, NemoClaw CLI, Integration: OpenClaw, v0.0.57

Suggested reviewers

prekshivyas

Poem

🐰 I hopped through proofs in midnight code,
cuInit traces on a winding road.
ARM64 provers bounded in time,
status shows truth in succinct rhyme—
CUDA verified, cocoa and code.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.83% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title clearly summarizes the main changes: implementing Docker Desktop GPU proof for Windows-ARM and reporting sandbox CUDA proof state in the status output.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

src/lib/inference/nim.ts (1)

96-100: ⚡ Quick win

Don't swallow internal prover initialization failures.

This catch turns bugs inside ../onboard/wsl-docker-desktop-gpu into a silent CPU fallback, which makes supported N1X hosts look like “no GPU” instead of surfacing the regression. Only the optional module-resolution case should return null; other failures should bubble.

Suggested change

 function defaultArm64WslDockerDesktopGpuProver(): Arm64WslDockerDesktopGpuProver | null {
   try {
     return require("../onboard/wsl-docker-desktop-gpu").createArm64WslDockerDesktopGpuProver();
-  } catch {
-    return null;
+  } catch (error) {
+    if (
+      error &&
+      typeof error === "object" &&
+      "code" in error &&
+      (error as NodeJS.ErrnoException).code === "MODULE_NOT_FOUND"
+    ) {
+      return null;
+    }
+    throw error;
   }
 }

Based on learnings: "avoid adding defensive error handling (e.g., try/catch wrappers, fallbacks, or extra validation) around internal helper logic when there is no realistic throwing path or failure mode."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/nim.ts` around lines 96 - 100, The function
defaultArm64WslDockerDesktopGpuProver currently swallows all errors from
requiring or initializing ../onboard/wsl-docker-desktop-gpu; change it so only
the optional-module-resolution case returns null and any other initialization
errors bubble up: call require("../onboard/wsl-docker-desktop-gpu") and attempt
to createArm64WslDockerDesktopGpuProver inside a try/catch that only returns
null when the caught error is a module-not-found resolution (e.g., error.code
=== 'MODULE_NOT_FOUND' or equivalent), otherwise rethrow the error so failures
inside createArm64WslDockerDesktopGpuProver are not silenced.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a97cb9ac-bff2-4a0c-b99c-99c28f76ae2b

📥 Commits

Reviewing files that changed from the base of the PR and between df7d054 and 559211e.

📒 Files selected for processing (15)

src/lib/actions/sandbox/status.test.ts
src/lib/actions/sandbox/status.ts
src/lib/inference/gpu-trust.ts
src/lib/inference/nim.test.ts
src/lib/inference/nim.ts
src/lib/onboard.ts
src/lib/onboard/sandbox-gpu-mode.ts
src/lib/onboard/sandbox-gpu-preflight.test.ts
src/lib/onboard/sandbox-gpu-preflight.ts
src/lib/onboard/sandbox-registry-metadata.ts
src/lib/onboard/wsl-docker-desktop-gpu.test.ts
src/lib/onboard/wsl-docker-desktop-gpu.ts
src/lib/runner.ts
src/lib/state/registry.ts
test/e2e/test-gpu-e2e.sh

yimoj · 2026-06-01T05:02:48Z

@coderabbitai review

The incremental commit addresses your nitpick on defaultArm64WslDockerDesktopGpuProver (now only swallows MODULE_NOT_FOUND and rethrows other errors). Please re-review to confirm.

coderabbitai · 2026-06-01T05:04:16Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current state of wsl-docker-desktop-gpu.ts for MODULE_NOT_FOUND handling
rg -n "MODULE_NOT_FOUND\|defaultArm64Wsl\|require.*runCaptureEx\|catch" src/lib/onboard/wsl-docker-desktop-gpu.ts -A3 -B3

echo "---nim.ts defaultArm64WslDockerDesktopGpuProver---"
rg -n "defaultArm64Wsl\|MODULE_NOT_FOUND" src/lib/inference/nim.ts -A5 -B3

Length of output: 646

🏁 Script executed:

#!/bin/bash
# Find the relevant files
fd "wsl-docker-desktop-gpu.ts" --type f
fd "nim.ts" --type f | grep -v test | grep -v node_modules

echo "---"
# Try broader search
rg -rn "MODULE_NOT_FOUND" --type ts -C3

Length of output: 1760

🏁 Script executed:

#!/bin/bash
echo "=== nim.ts defaultArm64WslDockerDesktopGpuProver / lazy require section ==="
cat src/lib/inference/nim.ts | grep -n "MODULE_NOT_FOUND\|require\|catch\|proveArm64\|defaultArm64\|lazy" | head -60

echo ""
echo "=== Full nim.ts catch blocks ==="
rg -n "catch" src/lib/inference/nim.ts -A 8 -B 2

echo ""
echo "=== wsl-docker-desktop-gpu.ts full file ==="
cat src/lib/onboard/wsl-docker-desktop-gpu.ts

Length of output: 16786

🏁 Script executed:

#!/bin/bash
# Read the exact lines around the defaultArm64WslDockerDesktopGpuProver function to confirm rethrow
awk 'NR>=92 && NR<=125' src/lib/inference/nim.ts

Length of output: 1920

@yimoj The fix is confirmed correct. Here's what I verified in src/lib/inference/nim.ts (Lines 96–113):

function defaultArm64WslDockerDesktopGpuProver(): Arm64WslDockerDesktopGpuProver | null {
  try {
    return require("../onboard/wsl-docker-desktop-gpu").createArm64WslDockerDesktopGpuProver();
  } catch (error) {
    // Only the optional module-resolution case should degrade to "no prover";
    // a real bug inside the prover module must bubble up rather than masquerade
    // as a missing GPU on an otherwise-supported N1X host.
    if (
      error &&
      typeof error === "object" &&
      (error as NodeJS.ErrnoException).code === "MODULE_NOT_FOUND"
    ) {
      return null;
    }
    throw error;  // ← all other errors rethrown
  }
}

The narrowed catch is exactly right:

MODULE_NOT_FOUND → return null (graceful fallback when the onboard module isn't present in a slim bundle)
Any other error → throw error (genuine prover failures bubble up and are no longer silently converted to "no GPU detected" on a real N1X host)

The previous nitpick is fully addressed. The rest of the PR also looks solid — the wsl-docker-desktop-gpu.ts prover correctly splits the proof command into argv (no shell), captures stderr for diagnostics, and applies the configurable timeout. The sandbox CUDA proof result flow (verified/unverified/failed → registry → nemoclaw status) is coherent end-to-end.
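
The "argv, no shell" point can be sketched as follows. This is a simplified illustration under the assumption that the proof command contains no quoted or escaped arguments; `toArgv` is a hypothetical helper, not the function used in `wsl-docker-desktop-gpu.ts`.

```typescript
// Proof command as described in this PR (vectoradd image).
const PROOF_COMMAND =
  "docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0";

// Assumption: whitespace splitting is sufficient because the command has no
// quoting. A general command line would need a real parser.
function toArgv(command: string): { file: string; args: string[] } {
  const parts = command.trim().split(/\s+/);
  const file = parts[0];
  if (file === undefined || file === "") throw new Error("empty command");
  return { file, args: parts.slice(1) };
}

const proofArgv = toArgv(PROOF_COMMAND);
```

Passing `file` and `args` to a non-shell spawn API means nothing in the command string is interpreted by a shell.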

[approve]

wscurran · 2026-06-01T14:45:47Z

✨
Related open PRs:

#4424 fix(preflight): gate NVIDIA detection on JMJWOA denylist + ARM64 kernel-interface check

Related open issues:

#4565 [Windows ARM][Onboard] NemoClaw preflight reports 'no NVIDIA GPU detected' despite nvidia-smi and Docker --gpus confirming 65GB GPU
#4231 [Jetson Orin][CLI&UX] nemoclaw status shows "Sandbox GPU: enabled" but CUDA is unusable inside sandbox — misleading status
#3988 [WSL2][Onboard] preflight false-positive: Snapdragon iGPU reported as "NVIDIA GPU detected" on Windows ARM

prekshivyas · 2026-06-02T15:51:10Z

@yimoj this one's showing CONFLICTING against main — could you re-sync (merge/rebase main) so it can go through review? CI is otherwise green and there are no open threads. Thanks!

zyang-dev · 2026-06-02T20:56:31Z

@yimoj
I ran this command and got the error "exec /cuda-samples/sample: exec format error".

export const WSL_DOCKER_DESKTOP_GPU_PROOF_COMMAND =
  "docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark";

Did you see this error when you tested it on Windows-ARM with WSL?

yimoj · 2026-06-03T03:26:11Z

@zyang-dev Thank you — that exec /cuda-samples/sample: exec format error is a real bug, not just a docs question, and it would have defeated the whole point of this PR on the exact N1X hardware #4565 is about. I've fixed it.

To be upfront: I do not have Windows-ARM + WSL hardware, so I did not reproduce your run live. I diagnosed it from the image itself, which is conclusive here:

# The nbody image's manifest list does advertise an arm64 entry...
docker manifest inspect nvcr.io/nvidia/k8s/cuda-sample:nbody
#   -> manifests: [ {amd64}, {arm64} ]

# ...but the binary inside that arm64 entry is actually x86-64:
docker pull --platform linux/arm64 nvcr.io/nvidia/k8s/cuda-sample@sha256:3652af0e06f16b41eaae6f1b3134b3dd07144156549923ca8a0bccb432bf0a1b
file /cuda-samples/nbody
#   -> ELF 64-bit LSB pie executable, x86-64, ... (/cuda-samples/sample -> /cuda-samples/nbody)

So on an aarch64 host Docker pulls the "arm64" variant whose binary is x86-64, and execve fails with exec format error. Since this proof path only ever runs on ARM64, that image could never pass — exactly what you hit.
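
The same check that `file` performs can be approximated by reading the ELF header directly. The sketch below is not part of this PR; `elfMachine` is a hypothetical helper that relies only on the ELF layout (magic bytes, the `EI_DATA` byte at offset 5, and the 16-bit `e_machine` field at offset 18).

```typescript
// e_machine values from the ELF specification.
const EM_X86_64 = 0x3e;
const EM_AARCH64 = 0xb7;

type ElfArch = "x86-64" | "aarch64" | "other" | "not-elf";

function elfMachine(bytes: Uint8Array): ElfArch {
  // ELF magic: 0x7f 'E' 'L' 'F'. We need at least 20 bytes to read e_machine.
  const isElf =
    bytes.length >= 20 &&
    bytes[0] === 0x7f &&
    bytes[1] === 0x45 &&
    bytes[2] === 0x4c &&
    bytes[3] === 0x46;
  if (!isElf) return "not-elf";

  // EI_DATA: 1 means little-endian, 2 means big-endian.
  const littleEndian = bytes[5] === 1;
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const machine = view.getUint16(18, littleEndian);

  if (machine === EM_X86_64) return "x86-64";
  if (machine === EM_AARCH64) return "aarch64";
  return "other";
}
```

Run against the first bytes of the extracted sample binary, a result of `"x86-64"` inside an arm64 manifest entry would correspond to the mismatch described above.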

Fix (pushed):

Switched the proof to nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0, whose arm64 variant ships a genuine ARM aarch64 binary (verified the same way: file reports ELF 64-bit LSB pie executable, ARM aarch64). vectorAdd runs a real CUDA kernel (device alloc + add + result verification), so it's still a strong usability proof that stays fail-closed on the Snapdragon nvidia-smi shim.
Added explicit exec format error detection so any future image/arch mismatch is reported as a proof-image problem rather than silently reading as "no GPU".
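
A minimal sketch of how that detection might separate the two failure modes is below. The `/exec format error/i` check and the stderr-first, stdout-fallback order are mentioned in the review thread; `describeProofFailure` and its message wording are hypothetical.

```typescript
function isExecFormatError(diagnostic: string): boolean {
  return /exec format error/i.test(diagnostic);
}

// Hypothetical helper: the messages are illustrative, not the CLI's actual output.
function describeProofFailure(stderr: string, stdout: string): string {
  // Prefer stderr for the diagnostic and fall back to stdout.
  const diagnostic = stderr.trim() || stdout.trim();
  if (isExecFormatError(diagnostic)) {
    return `proof-image architecture problem, not a missing GPU: ${diagnostic}`;
  }
  return `GPU proof failed, treating GPU as unproven: ${diagnostic || "no output"}`;
}
```

Either way the caller would still fall back to CPU; the difference is only in what the user is told to fix.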

If you're able to re-run docker run --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0 on your N1X WSL setup, I'd be grateful for confirmation that it now reports a passing kernel — that's the one piece I can't verify without the hardware.

yimoj · 2026-06-03T04:00:12Z

@coderabbitai review

Rebased onto current main (resolves the prior CONFLICTING state) and addressed the Windows-ARM exec format error reported in-thread: the proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0 (the old nbody arm64 tag shipped an x86-64 binary). Also hardened the #4231 path: cuInit(0)=0 is now required before reporting CUDA verified, and the proof is cleared on snapshot clone. Please re-review the latest head.

coderabbitai · 2026-06-03T04:01:29Z

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== wsl-docker-desktop-gpu.ts: proof command + timeout + isExecFormatError ==="
cat src/lib/onboard/wsl-docker-desktop-gpu.ts

echo ""
echo "=== sandbox-gpu-preflight.ts: cuInit(0)=0 requirement + jetson remediation ==="
cat src/lib/onboard/sandbox-gpu-preflight.ts

echo ""
echo "=== snapshot.ts: sandboxGpuProof cleared on clone ==="
rg -n "sandboxGpuProof" src/lib/actions/sandbox/snapshot.ts -C5

echo ""
echo "=== status.ts: proof suffix rendering ==="
cat src/lib/actions/sandbox/status.ts

echo ""
echo "=== registry.ts: SandboxGpuProofResult + SandboxEntry ==="
rg -n "SandboxGpuProof\|sandboxGpuProof" src/lib/state/registry.ts -C3

echo ""
echo "=== nim.ts: MODULE_NOT_FOUND narrowed catch (confirm still in place) ==="
awk 'NR>=88 && NR<=125' src/lib/inference/nim.ts

Length of output: 50273

@yimoj All three change groups verified against the latest head.

#4565 — vectoradd-cuda12.5.0 proof image + exec-format-error hardening

WSL_DOCKER_DESKTOP_GPU_PROOF_COMMAND now uses the arch-correct vectoradd-cuda12.5.0 image, with inline documentation explaining exactly why nbody failed (x86-64 ELF inside the arm64 manifest entry). ✅
wslDockerDesktopGpuProofTimeoutMs validates the env override with Number.isFinite(raw) && raw > 0. ✅
isExecFormatErrorDiagnostic is correctly gated by /exec format error/i and the createArm64WslDockerDesktopGpuProver closure logs a distinct "proof-image issue, not a missing GPU" message for that branch — it does not fall through to "treating GPU as unproven", so the failure mode is actionable rather than silent. ✅
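
For reference, the timeout validation described above could be sketched like this. The `Number.isFinite(raw) && raw > 0` check, the 180s default, and the `NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS` variable come from this thread; the function name `proofTimeoutMs` and passing the environment in as a parameter are assumptions made to keep the sketch self-contained.

```typescript
// "default 180s" per the PR description.
const DEFAULT_PROOF_TIMEOUT_MS = 180_000;

function proofTimeoutMs(env: Record<string, string | undefined>): number {
  // Number(undefined) is NaN and Number("") is 0, so both fall back to the default.
  const raw = Number(env["NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS"]);
  return Number.isFinite(raw) && raw > 0 ? raw : DEFAULT_PROOF_TIMEOUT_MS;
}
```

With this shape, unset, empty, non-numeric, zero, and negative overrides all resolve to the default rather than producing an unbounded or instantly expiring proof.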

#4231 — cuInit(0)=0 required for CUDA verified

CUDA_INIT_RESULT_PATTERN captures the return code; only cudaInitMatch?.[1] === "0" promotes to cudaVerified = true. ✅
Symmetry on the success path is present: exit 0 + cuInit ran + non-zero return code → cudaFailure (not silently verified), which correctly handles the "wrapper swallows exit code but prints cuInit(0)=<err>" scenario. ✅
Jetson remediation path wired via resolvedPlatform === "jetson" branch in the failed case. ✅
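
The classification checked above can be summarized with a sketch like the following. It is not the verifier in `sandbox-gpu-preflight.ts`: the regex is an assumed form of `CUDA_INIT_RESULT_PATTERN`, `classifyCudaProof` is a hypothetical name, and the handling of output with no `cuInit(0)=` marker (unverified on exit 0, failed otherwise) is an assumption, since the thread does not spell that case out.

```typescript
type ProofStatus = "verified" | "unverified" | "failed";

interface ProofResult {
  status: ProofStatus;
  detail?: string;
}

// Assumed marker format "cuInit(0)=<code>"; the real pattern may differ.
const CUDA_INIT_PATTERN = /cuInit\(0\)=(-?\d+)/;

function classifyCudaProof(exitCode: number, output: string): ProofResult {
  const match = CUDA_INIT_PATTERN.exec(output);
  if (match === null) {
    // Assumption: without the marker there is no usability proof either way.
    return exitCode === 0
      ? { status: "unverified", detail: "no cuInit(0) marker in output" }
      : { status: "failed", detail: `exit code ${exitCode}` };
  }
  const code = match[1] ?? "";
  // Only cuInit(0)=0 together with a zero exit counts as proof of CUDA usability.
  if (code === "0" && exitCode === 0) return { status: "verified" };
  // Covers a wrapper that exits 0 but printed a non-zero cuInit(0) code.
  return { status: "failed", detail: `cuInit(0)=${code}` };
}
```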

Snapshot clone

sandboxGpuProof: null explicitly set for snapshot restore --to <dst> auto-create with a comment explaining why inheritance is wrong. ✅

MODULE_NOT_FOUND narrowed catch (from previous review)

Still in place in nim.ts Lines 96–113. ✅

One cosmetic nit — not blocking: Line ~188 in wsl-docker-desktop-gpu.ts has a stale comment // prefer it for the diagnostic and fall back to stdout (nbody banner). The "nbody banner" phrase is a leftover from the old proof command and could be updated to reference vectorAdd output, but it has no functional impact.

Everything is correct end-to-end. Approving.

[approve]

… proof state

Two grouped GPU trust/proof/status fixes, rebased onto current main.

NVIDIA#4565: accept real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs without reopening the Snapdragon false positive (NVIDIA#3988/NVIDIA#4424). detectGpu() still rejects a denylisted JMJWOA-Generic-* name by default; the only escape is the ARM64 WSL Docker Desktop prover, which runs one bounded Docker --gpus CUDA workload. The proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0 (a genuine aarch64 binary running a real CUDA kernel) instead of cuda-sample:nbody, whose arm64 manifest entry actually ships an x86-64 ELF and therefore fails with "exec format error" on the very N1X target this feature accepts. An explicit exec-format-error diagnostic now distinguishes an image-architecture problem from a missing GPU. A real GPU passes; the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed.

NVIDIA#4231: nemoclaw status reflects CUDA proof, not just config. The direct sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed) keyed on cuInit(0)=0, persisted to the registry and rendered by status as "(CUDA verified)" / "(CUDA unverified)" / "(last CUDA proof failed: …)". A zero exit that printed a non-zero cuInit code (swallowed exit) is treated as failed, not verified. The proof is captured by the verifyGpuSandboxAfterReady wrapper (net-zero onboard.ts) and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox's "CUDA verified" state. CUDA failures print Jetson /dev/nvmap + video/render group remediation.

Fail-closed CPU fallback with explicit --no-gpu guidance is preserved on every proof-failure path. Captured stderr in runCaptureEx so Docker/CUDA diagnostics are no longer dropped. The default ARM64 prover only swallows MODULE_NOT_FOUND and rethrows internal initialization errors.

Fixes NVIDIA#4565
Fixes NVIDIA#4231

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

yimoj · 2026-06-03T04:03:33Z

@coderabbitai review

Latest head only updates one stale code comment (the nbody banner → vectorAdd output nit you flagged); no functional change from the approved revision.

coderabbitai · 2026-06-03T04:04:14Z

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== Comment around line 188 in wsl-docker-desktop-gpu.ts ==="
awk 'NR>=183 && NR<=200' src/lib/onboard/wsl-docker-desktop-gpu.ts

Length of output: 1101

🏁 Script executed:

#!/bin/bash
echo "=== Search for vectorAdd / nbody / diagnostic comment in wsl-docker-desktop-gpu.ts ==="
rg -n "vectorAdd\|nbody\|prefer.*diagnostic\|fall back.*stdout" src/lib/onboard/wsl-docker-desktop-gpu.ts -C2

Length of output: 545

🏁 Script executed:

#!/bin/bash
echo "=== Lines 160-192 in wsl-docker-desktop-gpu.ts ==="
awk 'NR>=160 && NR<=192' src/lib/onboard/wsl-docker-desktop-gpu.ts

Length of output: 1747

@yimoj Confirmed — Line 176 in wsl-docker-desktop-gpu.ts now reads:

// prefer it for the diagnostic and fall back to stdout (vectorAdd output).

The nbody banner reference is gone, replaced with the accurate vectorAdd output. That's the only change from the approved revision, and it's exactly right.

[approve]

…n init (NVIDIA#4231)

On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with `NvRmMemInitNvmap ... Permission denied` / `cuInit(0)=999` because the unprivileged sandbox user was not a member of the host group (`video`) that owns `/dev/nvmap` (`crw-rw---- root video`). PR NVIDIA#4599 improved the status/proof semantics but did not propagate that group, so QA reopened: CUDA stayed unusable and status still read "enabled".

The Jetson Docker GPU recreate now detects the host group(s) owning the Tegra device nodes (`/dev/nvmap`, `/dev/nvhost-*`, `/dev/nvgpu/*`) and grants the sandbox user matching `--group-add <gid>` membership, so CUDA's nvmap init can open them. The existing post-recreate `cuInit(0)` proof then passes and `nemoclaw status` reports `(CUDA verified)`; if the group cannot be resolved, onboard warns and the proof still gates success, so status falls back to the honest `(last CUDA proof failed)` with `/dev/nvmap` remediation instead of a misleading "enabled". This automates the remediation the existing `jetsonGpuProofRemediationLines()` already documents.

- detectTegraDeviceGroupGids(): stat the Tegra device nodes, return owning numeric GIDs (skip missing and root-owned); numeric GIDs work even when the sandbox image has no matching video/render group entry.
- recreateOpenShellDockerSandboxWithGpu(): for the jetson backend, thread the detected GIDs into DockerGpuCloneRunOptions.extraGroupGids; buildDockerGpuCloneRunArgs emits --group-add (deduped vs baseline GroupAdd).
- applyDockerGpuPatchOrExit(): thread `backend` explicitly so the fallback create path also grants the group.
- Regression tests for GID detection, --group-add emission/dedupe, and the Jetson-vs-generic recreate plumbing.
- Reporter-workflow E2E (test/e2e/test-jetson-nvmap-gpu.sh): onboard with GPU, inspect sandbox groups + /dev/nvmap, run cuInit(0) in-sandbox, assert status reports (CUDA verified). Wired as gpu-jetson-nvmap-e2e (Jetson-gated) and inventoried; skips cleanly on non-Jetson hosts.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
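
The GID detection and `--group-add` emission described in this commit message can be sketched as below. This is an illustration under stated assumptions, not the code in `docker-gpu-patch.ts`: `detectGroupGids`, `groupAddArgs`, and the injected `StatGid` callback are hypothetical, and the real functions may resolve globbed paths and baseline groups differently. Only the skip rules (missing and root-owned nodes), the dedupe, and the `--group-add <gid>` flag come from the text.

```typescript
// Injected stat so the logic is testable without Jetson hardware.
// Returns the owning GID of a device node, or null when the node is missing.
type StatGid = (path: string) => number | null;

function detectGroupGids(paths: string[], statGid: StatGid): number[] {
  const gids = new Set<number>();
  for (const path of paths) {
    const gid = statGid(path);
    // Skip missing nodes and nodes whose group is root (GID 0).
    if (gid === null || gid === 0) continue;
    gids.add(gid);
  }
  return Array.from(gids);
}

function groupAddArgs(gids: number[], baselineGroupAdd: string[]): string[] {
  const seen = new Set(baselineGroupAdd);
  const args: string[] = [];
  for (const gid of gids) {
    const value = String(gid);
    // Dedupe against groups the container already receives.
    if (seen.has(value)) continue;
    seen.add(value);
    args.push("--group-add", value);
  }
  return args;
}
```

Numeric GIDs are passed through unchanged, which matches the stated reason for using them: the sandbox image does not need a matching group entry.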

@yimoj

…n init (#4231) (#5018)

## Summary

On Jetson Orin the sandbox saw the GPU devices mounted but CUDA failed with `NvRmMemInitNvmap ... Permission denied` / `cuInit(0)=999` because the unprivileged sandbox user was not a member of the host group (`video`) that owns `/dev/nvmap` (`crw-rw---- root video`). This grants the sandbox user that group on the Jetson Docker GPU recreate so CUDA actually initializes, and the existing post-recreate `cuInit(0)` proof makes `nemoclaw status` report proven CUDA usability instead of a misleading bare "enabled".

## Related Issue

Fixes #4231

PR #4599 improved status/proof semantics but did not propagate Jetson `/dev/nvmap` group access, so QA reopened: CUDA stayed unusable inside the sandbox. This PR fixes the device-permission root cause.

## Changes

- **`docker-gpu-patch.ts`: grant the Tegra device-node group on Jetson recreate (the fix):** new `detectTegraDeviceGroupGids()` stats the Jetson Tegra device nodes (`/dev/nvmap`, `/dev/nvhost-*`, `/dev/nvgpu/*`) on the host and returns the owning numeric GID(s) (skipping missing and root-owned nodes). `recreateOpenShellDockerSandboxWithGpu` passes those through `DockerGpuCloneRunOptions.extraGroupGids` into `buildDockerGpuCloneRunArgs`, which emits `--group-add <gid>` (deduped against any baseline `GroupAdd`). Numeric GIDs are used on purpose: the sandbox image need not define a matching `video`/`render` group. Only runs for the `jetson` backend; `backend` is now threaded explicitly through `applyDockerGpuPatchOrExit` so the fallback create path is covered too. This automates the exact remediation the existing `jetsonGpuProofRemediationLines()` already documents.
- **Status correctness:** the existing post-recreate `cuInit(0)` proof from #4599 now passes once the device group is granted, so `nemoclaw status` shows `(CUDA verified)`. If the group cannot be resolved, onboard warns and the proof still gates success, so status falls back to the honest `(last CUDA proof failed: …)` with `/dev/nvmap` remediation rather than a misleading "enabled".
- **Regression tests** (`docker-gpu-patch.test.ts`): GID detection (dedupe, skip missing/root), `--group-add` emission + dedupe, and end-to-end plumbing through the Jetson recreate; plus a guard that the generic backend never adds Tegra groups.
- **Reporter-workflow E2E** (`test/e2e/test-jetson-nvmap-gpu.sh`, `gpu-jetson-nvmap-e2e` in `nightly-e2e.yaml`): runs the reporter's exact Jetson steps and inventoried in `legacy-inventory.json` + `.coderabbit.yaml`.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] `npm test` (CLI project) passes: full `vitest --project cli` green on this PR head after rebase (the only 2 reds are the pre-existing `snapshot-shields` / `e2e-fixture-context` flakes, confirmed failing on base with my changes stashed).
- [x] `npm run typecheck:cli` passes.
- [x] `codex review --uncommitted` clean (two flagged CI-integration gaps fixed: aggregate `needs` lists + migration inventory).
- [x] Tests added for new/changed behavior.
- [x] No secrets, API keys, or credentials committed.

### Reporter-workflow E2E evidence

This is verified at two levels that together cover the exact reporter workflow:

1. **Deterministic regression of the exact failure mode:** the unit suite `src/lib/onboard/docker-gpu-patch.test.ts` (describe `Jetson /dev/nvmap group propagation (#4231)`) reproduces the precise reporter condition hermetically: a sandbox user lacking the `/dev/nvmap` owning group, and asserts the Jetson recreate now emits `--group-add <gid>` for the Tegra device-node group so the proof can pass. **56/56 pass on this PR head.**
2. **Reporter-workflow pipeline E2E:** `test/e2e/test-jetson-nvmap-gpu.sh`, wired as the `gpu-jetson-nvmap-e2e` job in `nightly-e2e.yaml`, performs the reporter's exact steps on a Jetson host: onboard with GPU, inspect the sandbox user's groups and `/dev/nvmap`, run the in-sandbox `cuInit(0)` CUDA proof, and assert `nemoclaw status` reports `(CUDA verified)` (a bare "enabled" fails the job). Trigger it on a Jetson runner with:

   ```
   gh workflow run nightly-e2e.yaml --ref fix/4231-jetson-nvmap-gpu-status -f jobs=gpu-jetson-nvmap-e2e
   ```

All required CI checks are green on this PR head (`cli-test-shards`, `build-typecheck`, `codebase-growth-guardrails`, `ShellCheck`, `dco-check`, `CodeRabbit`); see the PR Checks tab for the run ids and job logs.

## Merge gate / remaining work

The live `gpu-jetson-nvmap-e2e` job is gated behind `vars.JETSON_E2E_ENABLED` and a Jetson/Tegra GPU runner label (`vars.JETSON_E2E_RUNNER_LABEL`). The project does not yet host an arm64/Jetson GPU runner, so a live green log on real Jetson hardware is pending that runner being provisioned; set the variable and label, then dispatch the job above. Issue #4231 stays assigned to @yimoj until that live log is captured.

---

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **New Features**
  * Jetson/Tegra GPU group-permission handling added to improve CUDA initialization on Jetson hardware.
* **Tests**
  * New end-to-end Jetson nvmap GPU test validating group permissions, CUDA initialization, and status reporting.
  * Nightly E2E job added to run the Jetson GPU test, with configurable enablement and runner selection.
* **Chores**
  * CI reporting updated to include the new Jetson GPU job in failure notifications and reports.

## Summary by CodeRabbit

* **New Features**
  * Added nightly end-to-end testing for Jetson Orin GPU support, validating CUDA usability and device access configuration.
  * Improved GPU sandbox group permissions handling for Jetson devices to ensure proper GPU device access.
* **Tests**
  * Added comprehensive E2E test script for Jetson `/dev/nvmap` GPU validation.
  * Extended test coverage for GPU sandbox group permission detection and application.

---------

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
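The `--group-add` emission with dedupe against a baseline `GroupAdd`, as described under Changes, might look roughly like the sketch below. It is only an illustration under stated assumptions: `groupAddArgs` is a hypothetical helper (the real `buildDockerGpuCloneRunArgs` signature is not shown in this thread), the baseline is assumed to be a list of strings that is re-emitted for the recreated container, and the GID values are made up.

```typescript
// Hypothetical sketch; not the PR's buildDockerGpuCloneRunArgs.
function groupAddArgs(baselineGroupAdd: string[], extraGroupGids: number[]): string[] {
  const groups: string[] = [];
  // Baseline entries go first so the recreated container keeps what it already had;
  // detected GIDs are appended only when they are not already present.
  for (const group of baselineGroupAdd.concat(extraGroupGids.map(String))) {
    if (groups.indexOf(group) === -1) groups.push(group);
  }
  const args: string[] = [];
  for (const group of groups) args.push("--group-add", group);
  return args;
}

// GID 44 is already in the baseline, so only 104 adds a new flag.
const runArgs = groupAddArgs(["44"], [44, 104]);
console.log(runArgs.join(" ")); // --group-add 44 --group-add 104
```

Passing numeric GIDs rather than group names matches the rationale given in the PR text: `docker run --group-add` accepts a numeric GID even when the image defines no group with that ID.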

coderabbitai Bot reviewed Jun 1, 2026

yimoj added v0.0.60 Release target v0.0.56 Release target and removed v0.0.60 Release target labels Jun 1, 2026

wscurran added enhancement: inference labels Jun 1, 2026

cv added v0.0.57 Release target and removed v0.0.56 Release target labels Jun 1, 2026

cv assigned zyang-dev Jun 2, 2026

zyang-dev self-requested a review June 2, 2026 20:57

cv added v0.0.58 Release target and removed v0.0.57 Release target labels Jun 3, 2026

wscurran added area: inference Inference routing, serving, model selection, or outputs bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality labels Jun 3, 2026

yimoj force-pushed the fix/4565-gpu-proof-trust-status branch from cbdf60f to eef7250 Compare June 3, 2026 03:25

yimoj force-pushed the fix/4565-gpu-proof-trust-status branch from eef7250 to a56e438 Compare June 3, 2026 04:03

wscurran removed the fix label Jun 3, 2026

wscurran removed the enhancement: inference label Jun 3, 2026

cv approved these changes Jun 3, 2026

cv merged commit b8e9531 into NVIDIA:main Jun 3, 2026
21 checks passed

miyoungc mentioned this pull request Jun 4, 2026

docs: refresh 0.0.58 release docs and refresh skills #4743

Merged

wscurran removed the feature PR adds or expands user-visible functionality label Jun 9, 2026

yimoj mentioned this pull request Jun 9, 2026

fix(onboard): grant Jetson Tegra device-node group so sandbox CUDA can init (#4231) #5018

Merged

6 tasks

coderabbitai Bot mentioned this pull request Jun 9, 2026

fix(onboard): prove GPU sandbox local inference from the agent runtime (#4509) #5024

Merged

6 tasks

Conversation

yimoj commented Jun 1, 2026 • edited by coderabbitai Bot

coderabbitai Bot commented Jun 1, 2026 • edited

❌ Failed checks (1 warning)

coderabbitai Bot left a comment

yimoj commented Jun 1, 2026

coderabbitai Bot commented Jun 1, 2026

wscurran commented Jun 1, 2026

prekshivyas commented Jun 2, 2026

zyang-dev commented Jun 2, 2026

yimoj commented Jun 3, 2026

yimoj commented Jun 3, 2026

coderabbitai Bot commented Jun 3, 2026

yimoj commented Jun 3, 2026

coderabbitai Bot commented Jun 3, 2026

5 participants