fix(preflight): gate NVIDIA detection on JMJWOA denylist + ARM64 kernel-interface check#4424
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughBroaden the WDDM/WSL2 placeholder denylist to ChangesNVIDIA placeholder hardening
Sequence Diagram(s)sequenceDiagram
participant Client as detectGpu()
participant SMI as nvidia-smi
participant FW as firmware detection
participant Kernel as nvidiaHostLooksGenuine()
Client->>SMI: run nvidia-smi probe (parse rows)
SMI-->>Client: CSV rows / names
Client->>FW: is platform vouched? (spark/station/jetson)
FW-->>Client: vouched | unvouched
alt unvouched
Client->>Kernel: check /proc/driver/nvidia on linux/arm64
Kernel-->>Client: present | absent
Client->>Client: if any row matches denylist -> return null
else vouched
Client->>Client: bypass denylist/kernel gate, apply plausibility filter
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
PR Review AdvisorFindings: 1 needs attention, 2 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/inference/nim.test.ts (1)
431-461: ⚡ Quick winAdd vendor-prefixed denylist cases to the forged-strict-path safety-net test.
This table only asserts bare
JMJWOA-Generic-*names. Add prefixed variants (for example,NVIDIA JMJWOA-Generic-GPU) so the denylist guard is also covered when strict identity is forged-valid.♻️ Proposed test expansion
it.each([ "JMJWOA-Generic-GPU", "JMJWOA-Generic-NPU", "JMJWOA-Generic-Future", + "NVIDIA JMJWOA-Generic-GPU", + "NVIDIA JMJWOA-Generic-NPU", + "NVIDIA JMJWOA-Generic-Future", ])(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/inference/nim.test.ts` around lines 431 - 461, Update the table in the test "denylist rejects %s even when strict probe somehow validates (`#3988`)" to include vendor-prefixed variants of the denylisted names (e.g., "NVIDIA JMJWOA-Generic-GPU", "NVIDIA JMJWOA-Generic-NPU", "NVIDIA JMJWOA-Generic-Future") alongside the existing bare names so the denylist guard is exercised even when isStrictNvidiaIdentityProbe and strictNvidiaIdentitiesCsv return a forged-valid identity; modify the array passed to it.each (the list consumed by the test using runCapture, loadNimWithMockedRunner, and nimModule.detectGpu()) to add these prefixed strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/inference/nim.test.ts`:
- Around line 431-461: Update the table in the test "denylist rejects %s even
when strict probe somehow validates (`#3988`)" to include vendor-prefixed variants
of the denylisted names (e.g., "NVIDIA JMJWOA-Generic-GPU", "NVIDIA
JMJWOA-Generic-NPU", "NVIDIA JMJWOA-Generic-Future") alongside the existing bare
names so the denylist guard is exercised even when isStrictNvidiaIdentityProbe
and strictNvidiaIdentitiesCsv return a forged-valid identity; modify the array
passed to it.each (the list consumed by the test using runCapture,
loadNimWithMockedRunner, and nimModule.detectGpu()) to add these prefixed
strings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5ca5e453-ef2b-451d-bc73-36026f077f40
📒 Files selected for processing (2)
src/lib/inference/nim.test.tssrc/lib/inference/nim.ts
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
…prerequisite Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4424.docs.buildwithfern.com/nemoclaw |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
|
Actionable comments posted: 0 |
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
|
Actionable comments posted: 0 |
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
|
Actionable comments posted: 0 |
…omments Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
|
Actionable comments posted: 0 |
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/commands.mdx`:
- Around line 285-286: Rewrite the passive/incomplete sentence into active voice
and place each sentence on its own line: replace "NVIDIA GPU drivers installed
and working." with "Ensure NVIDIA GPU drivers are installed and working." and
move the remainder ("On generic NVIDIA hosts this means `nvidia-smi` must
succeed; on Jetson/Tegra hosts shipping without `nvidia-smi`, the devicetree
firmware fallback substitutes.") onto a separate line so each sentence occupies
its own source line.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f65c37f8-8b42-4c7e-8ad0-29f10779182f
📒 Files selected for processing (2)
docs/reference/commands.mdxsrc/lib/inference/nim.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/inference/nim.test.ts
… bullet Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Actionable comments posted: 0 |
… tests Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Actionable comments posted: 0 |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
…path Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Selective E2E Results — ✅ All requested jobs passedRun: 26668800443
|
… proof state Two grouped GPU trust/proof/status fixes, rebased onto current main. NVIDIA#4565 — accept real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs without reopening the Snapdragon false positive (NVIDIA#3988/NVIDIA#4424). detectGpu() still rejects a denylisted JMJWOA-Generic-* name by default; the only escape is the ARM64 WSL Docker Desktop prover, which runs one bounded Docker --gpus CUDA workload. The proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0 (a genuine aarch64 binary running a real CUDA kernel) instead of cuda-sample:nbody, whose arm64 manifest entry actually ships an x86-64 ELF and therefore fails with "exec format error" on the very N1X target this feature accepts. An explicit exec-format-error diagnostic now distinguishes an image-architecture problem from a missing GPU. A real GPU passes; the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed. NVIDIA#4231 — nemoclaw status reflects CUDA proof, not just config. The direct sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed) keyed on cuInit(0)=0, persisted to the registry and rendered by status as "(CUDA verified)" / "(CUDA unverified)" / "(last CUDA proof failed: …)". A zero exit that printed a non-zero cuInit code (swallowed exit) is treated as failed, not verified. The proof is captured by the verifyGpuSandboxAfterReady wrapper (net-zero onboard.ts) and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox's "CUDA verified" state. CUDA failures print Jetson /dev/nvmap + video/render group remediation. Fail-closed CPU fallback with explicit --no-gpu guidance is preserved on every proof-failure path. Captured stderr in runCaptureEx so Docker/CUDA diagnostics are no longer dropped. The default ARM64 prover only swallows MODULE_NOT_FOUND and rethrows internal initialization errors. Fixes NVIDIA#4565 Fixes NVIDIA#4231 Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
… proof state Two grouped GPU trust/proof/status fixes, rebased onto current main. NVIDIA#4565 — accept real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs without reopening the Snapdragon false positive (NVIDIA#3988/NVIDIA#4424). detectGpu() still rejects a denylisted JMJWOA-Generic-* name by default; the only escape is the ARM64 WSL Docker Desktop prover, which runs one bounded Docker --gpus CUDA workload. The proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0 (a genuine aarch64 binary running a real CUDA kernel) instead of cuda-sample:nbody, whose arm64 manifest entry actually ships an x86-64 ELF and therefore fails with "exec format error" on the very N1X target this feature accepts. An explicit exec-format-error diagnostic now distinguishes an image-architecture problem from a missing GPU. A real GPU passes; the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed. NVIDIA#4231 — nemoclaw status reflects CUDA proof, not just config. The direct sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed) keyed on cuInit(0)=0, persisted to the registry and rendered by status as "(CUDA verified)" / "(CUDA unverified)" / "(last CUDA proof failed: …)". A zero exit that printed a non-zero cuInit code (swallowed exit) is treated as failed, not verified. The proof is captured by the verifyGpuSandboxAfterReady wrapper (net-zero onboard.ts) and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox's "CUDA verified" state. CUDA failures print Jetson /dev/nvmap + video/render group remediation. Fail-closed CPU fallback with explicit --no-gpu guidance is preserved on every proof-failure path. Captured stderr in runCaptureEx so Docker/CUDA diagnostics are no longer dropped. The default ARM64 prover only swallows MODULE_NOT_FOUND and rethrows internal initialization errors. Fixes NVIDIA#4565 Fixes NVIDIA#4231 Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
… proof state (#4599) ## Summary Two grouped GPU trust/proof/status fixes. `nemoclaw` now accepts real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs when a bounded Docker `--gpus` CUDA proof succeeds (#4565), and `nemoclaw status` reports proven CUDA usability instead of treating any configured GPU as healthy (#4231). ## Related Issue Fixes #4565 Fixes #4231 ## Changes - **#4565 — accept N1X without reopening the Snapdragon false positive (#3988/#4424):** `detectGpu()` still rejects a denylisted `JMJWOA-Generic-*` name by default; the only escape is `createArm64WslDockerDesktopGpuProver`, which runs one bounded `docker run --gpus all …` CUDA workload on ARM64 Docker Desktop WSL hosts. **The proof image is `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0`** (a genuine aarch64 binary running a real CUDA kernel — device alloc + add + result verification). The previous `cuda-sample:nbody` image was wrong for this ARM64-only path: its arm64 manifest entry actually ships an **x86-64 ELF**, so it fails with `exec format error` on the exact N1X hardware this feature targets (reported in-thread). Only a real GPU passes, so N1X is accepted while the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed. The proof timeout is bounded (default 180s, `NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS` override) and failures keep the CPU fallback with `--no-gpu` guidance. An explicit `exec format error` diagnostic now distinguishes an image-architecture problem from a missing GPU. - **#4231 — status reflects CUDA proof, not just config:** the direct sandbox GPU verifier returns a `SandboxGpuProofResult` (`verified`/`unverified`/`failed`) keyed on the `cuInit(0)=0` usability proof instead of silently swallowing optional-proof failures. A zero exit that still printed a non-zero `cuInit(0)` code (a wrapper that swallowed the real exit) is treated as **failed**, not verified. The result is persisted to the sandbox registry and rendered by `nemoclaw status` as `(CUDA verified)` / `(CUDA unverified)` / `(last CUDA proof failed: …)`. CUDA failures print Jetson `/dev/nvmap` + `video`/`render` group remediation. The proof is captured by the existing `verifyGpuSandboxAfterReady` wrapper (so `src/lib/onboard.ts` is unchanged / net-zero), and **cleared on snapshot clone** so a restored sandbox cannot inherit another sandbox's `CUDA verified` state. - Fail-closed CPU fallback and explicit `--no-gpu` guidance preserved on every proof-failure path. - Captured stderr in `runCaptureEx` so Docker/CUDA failure diagnostics are no longer dropped. - The default ARM64 prover only swallows `MODULE_NOT_FOUND` and rethrows internal initialization errors (earlier CodeRabbit nit). ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] Rebased onto current `upstream/main`; resolved conflicts in `status.ts`/`status-snapshot.ts`/`status.test.ts` (upstream extracted the snapshot/report code into `status-snapshot.ts`) and threaded the proof result through the `#4509` `verifyGpuSandboxAfterReady` wrapper. - [x] Targeted GPU/status/registry/snapshot suites green (`wsl-docker-desktop-gpu`, `nim`, `sandbox-gpu-preflight`, `docker-gpu-local-inference`, `status`, `registry`, `snapshot*`). - [x] `npm test` (cli project): only pre-existing, environment-only failures remain (`test/cli.test.ts`, `test/ssrf-parity.test.ts`, `config-sync`/`nemoclaw-start` root-ownership tests — file-mode/ownership/network checks unrelated to this change; none touch the modified files). - [x] `codex review --base upstream/main` clean after addressing two P2 findings (stale proof on snapshot clone; require `cuInit(0)=0` before verifying). - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [x] `npx prek` pre-commit/pre-push hooks pass (format, lint, typecheck). ## Notes - The proof-image bug was diagnosed from the image manifest + `file` on the extracted binary (the `nbody` arm64 tag contains an x86-64 ELF; the `vectoradd-cuda12.5.0` arm64 tag contains a real aarch64 binary). No live Windows-ARM/WSL GPU hardware was available on the triage host, so the N1X run was not reproduced live — see the in-thread reply for the exact commands and evidence. - Both issues were reproduced hermetically (no GPU hardware): `detectGpu` proof gating via injected prover, and the verifier/status proof classification via fixtures, confirming the pre-fix reject (#4565) and misleading "enabled" (#4231) before fixing. --- Signed-off-by: Yimo Jiang <yimoj@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Persistent per-sandbox CUDA proof tracking and reporting (verified / unverified / failed) with human-readable status lines and platform-specific remediation guidance. * ARM64 WSL Docker Desktop GPU verification path with configurable timeout and clearer diagnostics. * **Bug Fixes** * Snapshot restore no longer inherits a source sandbox’s GPU proof status. * **Tests** * Updated unit and E2E GPU tests to validate CUDA usability states instead of a generic GPU-enabled marker. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Summary
The observed Windows-on-ARM (WoA) WSL2
nvidia-smishim fakes thenameandmemory.totalfields of a real NVIDIA card, including emitting format-validuuid/compute_cap/vbios_versiontriples and a Windows-sideWin32_VideoController.AdapterCompatibility = "NVIDIA"that pass every userland check (QA-confirmed on the affected WoA host — see #3988 comment). The shim does, however, ship no NVIDIA kernel module, so the kernel-side/proc/driver/nvidia/interface that a real driver populates is absent. The observedJMJWOA-Generic-*shim profile is also WoA/ARM64-only — Microsoft's WoA platform is ARM-only by spec, so any non-ARM64 Linux host that exposesnvidia-smicannot be the observed shim. (Broader WSL2 GPU-PV / D3D12 plumbing ships on x86_64 too; the constraint applies specifically to this shim profile, not to all WSL2 GPU acceleration.) The detection gate now composes those signals as a trust-tier check on hosts whose firmware does not vouch for Spark/Station/Jetson, and the same gate also applies to the unified-memory fallback path so a shim cannot side-step the primary--query-gpu=memory.totalprobe.Related Issue
Fixes #3988.
Trust-tier gate
Off firmware vouch (i.e. when
detectNvidiaPlatform()does not return"spark"/"station"/"jetson"):\bJMJWOA-Generic-rejects the whole probe regardless of architecture or kernel-interface state. Catches the GPU and NPU placeholder variants QA observed plus any future suffix from this shim family without a code change./proc/driver/nvidia/exists — definite NVIDIA: a real kernel driver is bound, and the shim never creates this path. Trusted.process.arch !== "arm64"— trusted: the observedJMJWOA-Generic-*shim profile is WoA/ARM64-only. A Linux x86_64 host that exposesnvidia-smicannot be this shim./proc/driver/nvidia/+ denylist clean) — WoA shim profile, rejected.Firmware-vouched platforms (Spark, Station, Jetson) continue to bypass the gate entirely so real DGX Spark with the legitimate
JMJWOA-Generic-GPUplaceholder name keeps working (#3510).Changes
src/lib/inference/nim.ts:NVIDIA_GPU_NAME_DENYLIST_PATTERNwidens from the literal\bJMJWOA-Generic-GPU\bto the family prefix\bJMJWOA-Generic-.nvidiaHostLooksGenuine()helper applies the trust-tier check: returnstruewhen the platform is not Linux, or when the architecture is notarm64, or when/proc/driver/nvidia/exists. The remaining ARM64-Linux-without-kernel-interface case returnsfalseand is rejected by the caller.detectGpu()primary path: on non-firmware-vouched hosts, any GPU row matching the widened denylist rejects the whole probe (no partial slicing — a mixed-row spoof must not let one normal row through), and the host is additionally rejected whennvidiaHostLooksGenuine()returnsfalse.detectGpu()unified-memory fallback: same denylist + trust-tier gate on non-firmware-vouched hosts so the names-only fallback cannot be used to side-step the primary-path probe.docs/reference/commands.mdx: the GPU passthrough section now documents the trust-tier rule and theJMJWOA-Generic-*denylist for non-firmware-vouched hosts.src/lib/inference/nim.test.ts:withProcessArch(arch, fn)helper temporarily overridesprocess.archso tests that exercise the trust-tier gate can simulate an ARM64 host on x64 CI runners.it.eachover the denylisted name family on generic firmware now coversJMJWOA-Generic-GPU,JMJWOA-Generic-NPU,JMJWOA-Generic-Future, plus the vendor-prefixedNVIDIA JMJWOA-Generic-{GPU,NPU,Future}variants./proc/driver/nvidia/is absent./proc/driver/nvidia/is present./proc/driver/nvidia/is absent./proc/driver/nvidia/absent and aJMJWOA-Generic-GPUplaceholder name on ARM64 ([DGX Spark][Install] install-ollama pulls 35B model after preflight reports "no GPU detected" — no guard or model downgrade #3510 regression guard).NVIDIA Jetson AGX Orin) on ARM64 generic firmware when/proc/driver/nvidia/is absent.Why not WMI?
WMI /
Win32_VideoController.AdapterCompatibilityis not a usable discriminator here. The issue evidence shows the affected driver self-reports asNVIDIAat the Windows WMI layer (see the issue body), so a positiveAdapterCompatibility = "NVIDIA"does not prove a real NVIDIA device. Adding a WMI veto would only catch a hypothetical "lazy shim" that skips WMI spoofing — the actually observed shim would still slip past it — at the cost of a powershell.exe interop spawn (~200–500 ms) on every WSL2 GPU detection, plus a new interop /appendWindowsPathdependency. The trust-tier gate above covers the observed cases without that overhead.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation