fix(preflight): gate NVIDIA detection on JMJWOA denylist + ARM64 kernel-interface check by laitingsheng · Pull Request #4424 · NVIDIA/NemoClaw

laitingsheng · 2026-05-28T09:14:25Z

Summary

The observed Windows-on-ARM (WoA) WSL2 nvidia-smi shim fakes the name and memory.total fields of a real NVIDIA card. It also emits format-valid uuid/compute_cap/vbios_version triples and reports Win32_VideoController.AdapterCompatibility = "NVIDIA" on the Windows side, so it passes every userland check (QA-confirmed on the affected WoA host; see the #3988 comment). The shim ships no NVIDIA kernel module, however, so the kernel-side /proc/driver/nvidia/ interface that a real driver populates is absent. The observed JMJWOA-Generic-* shim profile is also WoA/ARM64-only: Microsoft's WoA platform is ARM-only by spec, so a non-ARM64 Linux host that exposes nvidia-smi cannot be the observed shim. (Broader WSL2 GPU-PV / D3D12 plumbing ships on x86_64 too; the constraint applies specifically to this shim profile, not to all WSL2 GPU acceleration.) The detection gate now composes those signals as a trust-tier check on hosts whose firmware does not vouch for Spark/Station/Jetson. The same gate also applies to the unified-memory fallback path, so a shim cannot side-step the primary --query-gpu=memory.total probe.

Related Issue

Fixes #3988.

Trust-tier gate

Without a firmware vouch (that is, when detectNvidiaPlatform() does not return "spark", "station", or "jetson"), the gate evaluates these tiers in order:

1. Denylist (universal reject): any GPU name matching \bJMJWOA-Generic- rejects the whole probe regardless of architecture or kernel-interface state. This catches the GPU and NPU placeholder variants QA observed, plus any future suffix from this shim family, without a code change.
2. /proc/driver/nvidia/ exists: definite NVIDIA. A real kernel driver is bound, and the shim never creates this path. Trusted.
3. process.arch !== "arm64": trusted. The observed JMJWOA-Generic-* shim profile is WoA/ARM64-only, so a Linux x86_64 host that exposes nvidia-smi cannot be this shim.
4. Otherwise (ARM64 Linux, no /proc/driver/nvidia/, denylist clean): WoA shim profile, rejected.

Firmware-vouched platforms (Spark, Station, Jetson) continue to bypass the gate entirely, so that a real DGX Spark reporting the legitimate JMJWOA-Generic-GPU placeholder name keeps working (#3510).
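
The tiers above can be sketched as a pure function. This is a minimal illustration, not the code in src/lib/inference/nim.ts: the `HostFacts` shape and the `acceptProbe` and `hostLooksGenuine` names are hypothetical, the kernel-interface state is passed in as a boolean instead of being read from the filesystem, and the plausibility filter that still applies on firmware-vouched hosts is left out.

```typescript
// Hypothetical input shape for illustration. The real code derives these
// values from detectNvidiaPlatform(), process.platform, process.arch and
// a check for /proc/driver/nvidia/.
type FirmwarePlatform = "spark" | "station" | "jetson" | null;

interface HostFacts {
  firmware: FirmwarePlatform; // null when firmware does not vouch
  platform: string; // e.g. "linux"
  arch: string; // e.g. "arm64" or "x64"
  kernelInterfacePresent: boolean; // does /proc/driver/nvidia/ exist?
}

// Family-prefix denylist from the PR description. No "g" flag, so
// RegExp.prototype.test() keeps no state between calls.
const DENYLIST = /\bJMJWOA-Generic-/;

// Trust tiers 2-4: only ARM64 Linux without the kernel interface is rejected.
function hostLooksGenuine(host: HostFacts): boolean {
  if (host.platform !== "linux") return true;
  if (host.arch !== "arm64") return true;
  return host.kernelInterfacePresent;
}

// Returns true when the probed GPU names may be trusted on this host.
function acceptProbe(gpuNames: string[], host: HostFacts): boolean {
  // Firmware-vouched platforms bypass the gate entirely.
  if (host.firmware !== null) return true;
  // Tier 1: a single denylisted row rejects the whole probe (no slicing).
  if (gpuNames.some((name) => DENYLIST.test(name))) return false;
  return hostLooksGenuine(host);
}
```

Taking the host facts as an argument keeps the decision testable without touching process.arch or the filesystem; the actual helper reads those values directly.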

Changes

src/lib/inference/nim.ts:
- NVIDIA_GPU_NAME_DENYLIST_PATTERN widens from the literal \bJMJWOA-Generic-GPU\b to the family prefix \bJMJWOA-Generic-.
- New nvidiaHostLooksGenuine() helper applies the trust-tier check: returns true when the platform is not Linux, or when the architecture is not arm64, or when /proc/driver/nvidia/ exists. The remaining ARM64-Linux-without-kernel-interface case returns false and is rejected by the caller.
- detectGpu() primary path: on non-firmware-vouched hosts, any GPU row matching the widened denylist rejects the whole probe (no partial slicing — a mixed-row spoof must not let one normal row through), and the host is additionally rejected when nvidiaHostLooksGenuine() returns false.
- detectGpu() unified-memory fallback: same denylist + trust-tier gate on non-firmware-vouched hosts so the names-only fallback cannot be used to side-step the primary-path probe.
docs/reference/commands.mdx: the GPU passthrough section now documents the trust-tier rule and the JMJWOA-Generic-* denylist for non-firmware-vouched hosts.
src/lib/inference/nim.test.ts:
- New withProcessArch(arch, fn) helper temporarily overrides process.arch so tests that exercise the trust-tier gate can simulate an ARM64 host on x64 CI runners.
- it.each over the denylisted name family on generic firmware now covers JMJWOA-Generic-GPU, JMJWOA-Generic-NPU, JMJWOA-Generic-Future, plus the vendor-prefixed NVIDIA JMJWOA-Generic-{GPU,NPU,Future} variants.
- Mixed-row spoof on generic firmware (one denylisted row alongside a normal NVIDIA row) is rejected as a whole.
- Primary path on ARM64 generic firmware rejects a plausibly-named NVIDIA GPU when /proc/driver/nvidia/ is absent.
- Primary path on ARM64 generic firmware accepts a plausibly-named NVIDIA GPU when /proc/driver/nvidia/ is present.
- Primary path on x86_64 generic firmware trusts a plausibly-named NVIDIA GPU even when /proc/driver/nvidia/ is absent.
- Primary path on x86_64 generic firmware still rejects denylisted names.
- Spark firmware continues to vouch even with /proc/driver/nvidia/ absent and a JMJWOA-Generic-GPU placeholder name on ARM64 (#3510 regression guard).
- Unified-memory fallback rejects a denylisted name on generic firmware.
- Unified-memory fallback rejects a tagged name (e.g. NVIDIA Jetson AGX Orin) on ARM64 generic firmware when /proc/driver/nvidia/ is absent.
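
For readers unfamiliar with the pattern, a helper like withProcessArch(arch, fn) can be sketched as follows. This is a synchronous sketch written for illustration, not the helper from nim.test.ts; it assumes process.arch is a configurable own property of process, which is how Node.js defines it. An async callback would need the restore step to wait for the returned promise.

```typescript
// Temporarily override process.arch while fn runs, then restore it.
// Synchronous only: the original value is restored as soon as fn returns.
function withProcessArch<T>(arch: string, fn: () => T): T {
  const original = Object.getOwnPropertyDescriptor(process, "arch");
  Object.defineProperty(process, "arch", {
    value: arch,
    configurable: true, // keep the property redefinable for the restore
    enumerable: true,
    writable: false,
  });
  try {
    return fn();
  } finally {
    if (original) {
      Object.defineProperty(process, "arch", original);
    } else {
      // No own descriptor existed before the override, so drop ours.
      Reflect.deleteProperty(process, "arch");
    }
  }
}
```

The try/finally guarantees the real architecture is restored even when the wrapped assertion throws, so one failing test cannot leak a simulated ARM64 host into the next.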

Why not WMI?

WMI / Win32_VideoController.AdapterCompatibility is not a usable discriminator here. The issue evidence shows the affected driver self-reports as NVIDIA at the Windows WMI layer (see the issue body), so a positive AdapterCompatibility = "NVIDIA" does not prove a real NVIDIA device. Adding a WMI veto would only catch a hypothetical "lazy shim" that skips WMI spoofing — the actually observed shim would still slip past it — at the cost of a powershell.exe interop spawn (~200–500 ms) on every WSL2 GPU detection, plus a new interop / appendWindowsPath dependency. The trust-tier gate above covers the observed cases without that overhead.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

New Features
- Stricter NVIDIA host validation: require kernel-driver evidence on non-firmware-vouched hosts and tighten unified-memory fallback checks.
Bug Fixes
- Broadened placeholder denylist to wildcard JMJWOA-Generic-*, reject probes with mixed spoofed rows, and enforce ARM64/Linux-specific gating to avoid false positives.
Tests
- Expanded coverage for placeholder families, kernel-driver presence/absence, mixed-row spoofing, firmware gates, and unified-memory fallback.
Documentation
- Updated onboard passthrough docs to reflect the stricter detection rules.

coderabbitai · 2026-05-28T09:14:38Z

Walkthrough

Broaden the WDDM/WSL2 placeholder denylist to JMJWOA-Generic-*, add a Linux /proc/driver/nvidia kernel-interface check, and make detectGpu() early-reject probes with denylisted names or missing kernel driver when firmware doesn't vouch; update tests and docs to cover these gates and unified-memory fallback paths.

Changes

NVIDIA placeholder hardening

| Layer / File(s) | Summary |
| --- | --- |
| Denylist and kernel-interface helper `src/lib/inference/nim.ts`, `docs/reference/commands.mdx` | Replace the single `JMJWOA-Generic-GPU` string with a broader `JMJWOA-Generic-*` denylist and add a Linux `nvidiaHostLooksGenuine()` check for `/proc/driver/nvidia` (non-Linux/non-`arm64` returns `true`); update onboarding docs to describe the gate. |
| detectGpu firmware-unconfirmed gate changes `src/lib/inference/nim.ts` | When firmware does not confirm NVIDIA, `detectGpu()` now rejects the entire probe if any parsed GPU row matches the denylist or if the kernel-interface check fails; unified-memory fallback gains analogous denylist and kernel-interface gates for tagged names. |
| Test harness: kernel-interface & arch helpers `src/lib/inference/nim.test.ts` | Add helpers to mock `fs.existsSync('/proc/driver/nvidia')` and to temporarily override `process.platform`/`process.arch` so ARM64/x86 tests exercise the trust-tier gates deterministically; default shim makes kernel interface appear present unless overridden. |
| Parameterized denylist regression tests `src/lib/inference/nim.test.ts` | Replace single-placeholder regression test with `it.each` covering multiple `JMJWOA-Generic-*` variants (including `NVIDIA`-prefixed forms) and stub `nvidia-smi` outputs accordingly. |
| Mixed-row and kernel-interface gate tests `src/lib/inference/nim.test.ts` | Add a mixed-row spoof test asserting any denylisted row rejects the probe; add tests verifying rejection when kernel interface is absent and acceptance when present for known NVIDIA names; include a Spark firmware bypass test. |
| Unified-memory fallback tests `src/lib/inference/nim.test.ts` | Add fallback tests ensuring denylist and kernel-interface gates also apply to unified-memory fallback paths on generic firmware (reject placeholders and reject tagged names when kernel interface absent). |

Sequence Diagram(s)

sequenceDiagram
  participant Client as detectGpu()
  participant SMI as nvidia-smi
  participant FW as firmware detection
  participant Kernel as nvidiaHostLooksGenuine()
  Client->>SMI: run nvidia-smi probe (parse rows)
  SMI-->>Client: CSV rows / names
  Client->>FW: is platform vouched? (spark/station/jetson)
  FW-->>Client: vouched | unvouched
  alt unvouched
    Client->>Kernel: check /proc/driver/nvidia on linux/arm64
    Kernel-->>Client: present | absent
    Client->>Client: if any row matches denylist -> return null
  else vouched
    Client->>Client: bypass denylist/kernel gate, apply plausibility filter
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/NemoClaw#4062: Similar changes expanding JMJWOA-Generic-* denylist and adding kernel-interface early-reject logic; overlaps in detection logic and tests.

Suggested labels

Platform: Windows/WSL, v0.0.53

Suggested reviewers

ericksoa

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary changes: hardening NVIDIA detection via a JMJWOA denylist and ARM64 kernel-interface check. |
| Linked Issues check | ✅ Passed | Code changes directly address all linked issue `#3988` objectives: denylist rejects JMJWOA-Generic-* placeholders, kernel-interface gate distinguishes real NVIDIA hosts, and non-firmware-vouched hosts require both signals. |
| Out of Scope Changes check | ✅ Passed | All changes scope to NVIDIA detection hardening: test harness for reproducible behavior, denylist logic and kernel-interface gate in detection, and documentation of the trust-tier rule. |

github-actions · 2026-05-28T09:16:27Z

E2E Advisor Recommendation

Required E2E: gpu-e2e, cloud-onboard-e2e
Optional E2E: gpu-double-onboard-e2e, issue-3600-gpu-proof-optional-e2e, wsl-repo-cloud-openclaw

Dispatch hint: gpu-e2e,cloud-onboard-e2e

Auto-dispatched E2E: gpu-e2e, cloud-onboard-e2e via nightly-e2e.yaml at 4218159b9ca885871c4cee618827060c47847603 — nightly run

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gpu-e2e (high): Validates the highest-risk runtime path changed by this PR: a real NVIDIA GPU host is still trusted, onboard enables GPU/local Ollama flow, sandbox creation succeeds, and inference works through the sandbox.
cloud-onboard-e2e (medium): Validates standard Ubuntu cloud onboarding on a non-GPU runner after the preflight/GPU-detection changes, including sandbox health and security checks with GPU passthrough left disabled unless a trusted GPU is found.

Optional E2E

gpu-double-onboard-e2e (high): Additional confidence that repeated GPU-provider onboarding still handles GPU detection, gateway/sandbox reuse, and local Ollama proxy state consistently after the trust-gate refactor.
issue-3600-gpu-proof-optional-e2e (low): Adjacent GPU preflight guard that checks optional direct sandbox GPU proof handling; useful because this PR changes sandbox GPU preflight behavior, but it does not exercise the new nvidia-smi trust decision end-to-end.
wsl-repo-cloud-openclaw (high): Adjacent WSL onboarding scenario because the motivating spoof source is WSL-related. Current runner is Windows/x64 and is unlikely to reproduce Windows-on-ARM nvidia-smi spoofing, so this is confidence-only rather than merge-blocking.

New E2E recommendations

gpu-detection-trust-security-boundary (high): No existing E2E appears to exercise an ARM64 Linux or Windows-on-ARM WSL environment where nvidia-smi emits JMJWOA-Generic-* while /proc/driver/nvidia is absent. Unit tests cover this, but the user-visible safety boundary is onboarding preflight and sandbox GPU passthrough suppression.
- Suggested test: Add a hermetic E2E negative preflight spoof test that injects a fake nvidia-smi returning JMJWOA-Generic-* on simulated ARM64 Linux with generic firmware and no /proc/driver/nvidia, then asserts nemoclaw onboard reports no trusted NVIDIA GPU and does not pass gateway/sandbox GPU flags.
jetson-tegra-gpu-fallback (medium): The trust gate intentionally bypasses Jetson/Tegra firmware and device-node fallback paths, but there is no clear E2E coverage proving hosts without nvidia-smi still onboard with NVIDIA unified-memory GPU detection.
- Suggested test: Add an E2E or scenario fixture for Jetson/Tegra-style preflight that stubs devicetree/device-node detection without nvidia-smi and verifies sandbox GPU config remains in the intended auto/Jetson mode.

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: gpu-e2e,cloud-onboard-e2e

github-actions · 2026-05-28T09:16:29Z

E2E Scenario Advisor Recommendation

Required scenario E2E: gpu-repo-local-ollama-openclaw, ubuntu-repo-cloud-openclaw
Optional scenario E2E: wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

gpu-repo-local-ollama-openclaw: Changes affect NVIDIA GPU trust/detection and NIM GPU handling; this is the only routed scenario with a real NVIDIA GPU/CDI runner and local Ollama GPU inference coverage.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw
ubuntu-repo-cloud-openclaw: Exercises the standard repo onboarding/preflight path on Ubuntu and helps catch regressions in default cloud onboarding after the GPU detection/preflight changes.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

wsl-repo-cloud-openclaw: Adjacent coverage for WSL onboarding, relevant because the change targets Windows-on-ARM/WSL-style nvidia-smi shim false positives, though the routed WSL runner is a special platform scenario and not the primary GPU path.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

src/lib/inference/gpu-trust.ts
src/lib/inference/nim.ts

github-actions · 2026-05-28T09:16:31Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26565918524
Target ref: 90b7f3ef08d7d2cc298490809ae20fde8f5c54a3
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

github-actions · 2026-05-28T09:18:09Z

PR Review Advisor

Findings: 1 needs attention, 2 worth checking, 0 nice ideas
Since last review: 4 prior items resolved, 3 still apply, 0 new items found

Review findings

🛠️ Needs attention

N1X acceptance is still not proven at the user-facing preflight boundary (src/lib/onboard/machine/handlers/preflight.test.ts:82): The linked issue is about `nemoclaw onboard` preflight printing `✓ NVIDIA GPU detected` for Snapdragon/N1X hardware and then proceeding down GPU/CDI paths as though NVIDIA hardware exists. This PR now has strong `detectGpu()` unit coverage and a handler-level test showing that a mocked null GPU disables sandbox GPU, but it still does not assert the real `onboard.ts` preflight output and sandbox GPU decision for the N1X spoof fixture.
- Recommendation: Add a targeted preflight-level regression that feeds the N1X `nvidia-smi`/generic-firmware/no-`/proc/driver/nvidia` fixture through the actual preflight output path and asserts that `✓ NVIDIA GPU detected` is absent, the no-GPU or GPU-disabled line is present, CDI/GPU passthrough is opted out, and sandbox GPU remains disabled unless explicitly requested.
- Evidence: Issue #3988 ("[WSL2][Onboard] preflight false-positive: Snapdragon iGPU reported as "NVIDIA GPU detected" on Windows ARM") says reporting `✓ NVIDIA GPU detected` for `JMJWOA-Generic-GPU` is wrong and misleading. The diff adds `nimModule.detectGpu()` rejection tests and a `handlePreflightState` test with `runPreflight: vi.fn(async () => null)`, while `src/lib/onboard.ts` still owns `formatNvidiaGpuPreflightLines()`, `Local NIM unavailable — no GPU detected`, and `Sandbox GPU` output.

🔎 Worth checking

Generic x86_64 Linux still trusts plausible nvidia-smi names without a driver signal (src/lib/inference/gpu-trust.ts:45): The new trust gate intentionally keeps historical behavior for non-ARM64 Linux by returning true before checking `/proc/driver/nvidia`. That addresses the observed WoA/ARM64 shim, but on generic x86_64 Linux or WSL a spoofed or shimmed `nvidia-smi` that returns a plausible NVIDIA product name can still enable gateway/sandbox GPU passthrough without an independent kernel-driver signal.
- Recommendation: Either require an additional real-driver signal for generic Linux/WSL before enabling GPU passthrough, or explicitly accept this as a security tradeoff and add a negative regression for any x86_64 spoof shape that should be rejected. Keep the current denylist as a universal reject.
- Evidence: `nvidiaHostLooksGenuine()` returns true when `process.arch !== "arm64"`, and the test `trusts x86_64 generic firmware even when /proc/driver/nvidia/ is absent` locks in the permissive behavior.
Inference test monolith grows substantially (src/lib/inference/nim.test.ts:1): The already-large NIM inference test file gains another broad GPU spoofing and compatibility matrix, including process platform/architecture monkeypatching and fs monkeypatching. Keeping primary trust checks, unified-memory fallback checks, Spark regression guards, and unrelated NIM tests together makes future security-sensitive regressions harder to review.
- Recommendation: Move the GPU spoofing, generic-firmware, ARM64 kernel-interface, and unified-memory fallback cases into a focused GPU detection test file, or extract shared fixtures/helpers and offset the growth so `nim.test.ts` does not keep expanding.
- Evidence: Deterministic monolith analysis reports `src/lib/inference/nim.test.ts` baseLines 1361, headLines 1698, delta +337. Although `gpu-trust.test.ts` was added, most new detection matrix coverage remains in `nim.test.ts`.

🌱 Nice ideas

None.

This is an automated advisory review. A human maintainer must make the final merge decision.

coderabbitai

🧹 Nitpick comments (1)

src/lib/inference/nim.test.ts (1)

431-461: ⚡ Quick win

Add vendor-prefixed denylist cases to the forged-strict-path safety-net test.

This table only asserts bare JMJWOA-Generic-* names. Add prefixed variants (for example, NVIDIA JMJWOA-Generic-GPU) so the denylist guard is also covered when strict identity is forged-valid.

♻️ Proposed test expansion

     it.each([
       "JMJWOA-Generic-GPU",
       "JMJWOA-Generic-NPU",
       "JMJWOA-Generic-Future",
+      "NVIDIA JMJWOA-Generic-GPU",
+      "NVIDIA JMJWOA-Generic-NPU",
+      "NVIDIA JMJWOA-Generic-Future",
     ])(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/nim.test.ts` around lines 431 - 461, Update the table in
the test "denylist rejects %s even when strict probe somehow validates (`#3988`)"
to include vendor-prefixed variants of the denylisted names (e.g., "NVIDIA
JMJWOA-Generic-GPU", "NVIDIA JMJWOA-Generic-NPU", "NVIDIA
JMJWOA-Generic-Future") alongside the existing bare names so the denylist guard
is exercised even when isStrictNvidiaIdentityProbe and strictNvidiaIdentitiesCsv
return a forged-valid identity; modify the array passed to it.each (the list
consumed by the test using runCapture, loadNimWithMockedRunner, and
nimModule.detectGpu()) to add these prefixed strings.

📥 Commits

Reviewing files that changed from the base of the PR and between 0c108ae and 90b7f3e.

📒 Files selected for processing (2)

src/lib/inference/nim.test.ts
src/lib/inference/nim.ts

github-actions · 2026-05-28T09:20:25Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26566101305
Target ref: 03a59efa16a6c624a83ed032ee834eeac8cc15de
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

…prerequisite Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions · 2026-05-28T09:35:31Z

🌿 Preview your docs: https://nvidia-preview-pr-4424.docs.buildwithfern.com/nemoclaw

github-actions · 2026-05-28T09:37:55Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26566937322
Target ref: f34aad30881999804a5fcd2bc02f38d2db56badd
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

github-actions · 2026-05-28T13:28:37Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26577649867
Target ref: 0d95658a20bd5b92fa8e51d6af60d1519fbe6aa7
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

coderabbitai · 2026-05-28T13:32:56Z

Actionable comments posted: 0

github-actions · 2026-05-28T13:47:49Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26578700615
Target ref: 8edf5d9e5aed4bd2d911b8671b95a323f6feb392
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

coderabbitai · 2026-05-28T13:50:53Z

Actionable comments posted: 0

github-actions · 2026-05-28T14:31:21Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26581181276
Target ref: b3c4c5b897db146cc197009631c9abc5d408da18
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

coderabbitai · 2026-05-28T14:36:10Z

Actionable comments posted: 0

…omments Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions · 2026-05-28T14:45:18Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26581982181
Target ref: 3db4ebd2426baa44c543962fe0c538e6f61f3cd8
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

coderabbitai · 2026-05-28T14:46:30Z

Actionable comments posted: 0

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/commands.mdx`:
- Around line 285-286: Rewrite the passive/incomplete sentence into active voice
and place each sentence on its own line: replace "NVIDIA GPU drivers installed
and working." with "Ensure NVIDIA GPU drivers are installed and working." and
move the remainder ("On generic NVIDIA hosts this means `nvidia-smi` must
succeed; on Jetson/Tegra hosts shipping without `nvidia-smi`, the devicetree
firmware fallback substitutes.") onto a separate line so each sentence occupies
its own source line.

📥 Commits

Reviewing files that changed from the base of the PR and between 3db4ebd and b7c035d.

📒 Files selected for processing (2)

docs/reference/commands.mdx
src/lib/inference/nim.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/inference/nim.test.ts

… bullet Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai · 2026-05-28T15:18:35Z

Actionable comments posted: 0

… tests Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai · 2026-05-28T15:32:25Z

Actionable comments posted: 0

github-actions · 2026-05-28T15:32:40Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26584692989
Target ref: cc30e1c0e2ccda2a3b23d4cdb64e8a6707579967
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

…path Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

github-actions · 2026-05-29T16:00:35Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26647769480
Target ref: 3ed93b3be22699adf87e148eb657e8b77fd85bb0
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

github-actions · 2026-05-30T00:21:56Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26668800443
Target ref: 4218159b9ca885871c4cee618827060c47847603
Workflow ref: main
Requested jobs: gpu-e2e,cloud-onboard-e2e
Summary: 1 passed, 0 failed, 1 skipped

Job	Result
cloud-onboard-e2e	✅ success
gpu-e2e	⏭️ skipped

… proof state

Two grouped GPU trust/proof/status fixes, rebased onto current main.

NVIDIA#4565: accept real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs without reopening the Snapdragon false positive (NVIDIA#3988/NVIDIA#4424). detectGpu() still rejects a denylisted JMJWOA-Generic-* name by default; the only escape is the ARM64 WSL Docker Desktop prover, which runs one bounded Docker --gpus CUDA workload. The proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0 (a genuine aarch64 binary running a real CUDA kernel) instead of cuda-sample:nbody, whose arm64 manifest entry actually ships an x86-64 ELF and therefore fails with "exec format error" on the very N1X target this feature accepts. An explicit exec-format-error diagnostic now distinguishes an image-architecture problem from a missing GPU. A real GPU passes; the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed.

NVIDIA#4231: nemoclaw status reflects CUDA proof, not just config. The direct sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed) keyed on cuInit(0)=0, persisted to the registry and rendered by status as "(CUDA verified)" / "(CUDA unverified)" / "(last CUDA proof failed: …)". A zero exit that printed a non-zero cuInit code (swallowed exit) is treated as failed, not verified. The proof is captured by the verifyGpuSandboxAfterReady wrapper (net-zero onboard.ts) and cleared on snapshot clone so a restored sandbox cannot inherit another sandbox's "CUDA verified" state. CUDA failures print Jetson /dev/nvmap + video/render group remediation.

Fail-closed CPU fallback with explicit --no-gpu guidance is preserved on every proof-failure path. Captured stderr in runCaptureEx so Docker/CUDA diagnostics are no longer dropped. The default ARM64 prover only swallows MODULE_NOT_FOUND and rethrows internal initialization errors.

Fixes NVIDIA#4565
Fixes NVIDIA#4231

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

… proof state (#4599)

## Summary

Two grouped GPU trust/proof/status fixes. `nemoclaw` now accepts real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs when a bounded Docker `--gpus` CUDA proof succeeds (#4565), and `nemoclaw status` reports proven CUDA usability instead of treating any configured GPU as healthy (#4231).

## Related Issue

Fixes #4565
Fixes #4231

## Changes

- **#4565 - accept N1X without reopening the Snapdragon false positive (#3988/#4424):** `detectGpu()` still rejects a denylisted `JMJWOA-Generic-*` name by default; the only escape is `createArm64WslDockerDesktopGpuProver`, which runs one bounded `docker run --gpus all …` CUDA workload on ARM64 Docker Desktop WSL hosts. **The proof image is `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0`** (a genuine aarch64 binary running a real CUDA kernel: device alloc + add + result verification). The previous `cuda-sample:nbody` image was wrong for this ARM64-only path: its arm64 manifest entry actually ships an **x86-64 ELF**, so it fails with `exec format error` on the exact N1X hardware this feature targets (reported in-thread). Only a real GPU passes, so N1X is accepted while the Snapdragon nvidia-smi shim (no usable CUDA device) stays fail-closed. The proof timeout is bounded (default 180s, `NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS` override) and failures keep the CPU fallback with `--no-gpu` guidance. An explicit `exec format error` diagnostic now distinguishes an image-architecture problem from a missing GPU.
- **#4231 - status reflects CUDA proof, not just config:** the direct sandbox GPU verifier returns a `SandboxGpuProofResult` (`verified`/`unverified`/`failed`) keyed on the `cuInit(0)=0` usability proof instead of silently swallowing optional-proof failures. A zero exit that still printed a non-zero `cuInit(0)` code (a wrapper that swallowed the real exit) is treated as **failed**, not verified. The result is persisted to the sandbox registry and rendered by `nemoclaw status` as `(CUDA verified)` / `(CUDA unverified)` / `(last CUDA proof failed: …)`. CUDA failures print Jetson `/dev/nvmap` + `video`/`render` group remediation. The proof is captured by the existing `verifyGpuSandboxAfterReady` wrapper (so `src/lib/onboard.ts` is unchanged / net-zero), and **cleared on snapshot clone** so a restored sandbox cannot inherit another sandbox's `CUDA verified` state.
- Fail-closed CPU fallback and explicit `--no-gpu` guidance preserved on every proof-failure path.
- Captured stderr in `runCaptureEx` so Docker/CUDA failure diagnostics are no longer dropped.
- The default ARM64 prover only swallows `MODULE_NOT_FOUND` and rethrows internal initialization errors (earlier CodeRabbit nit).

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] Rebased onto current `upstream/main`; resolved conflicts in `status.ts`/`status-snapshot.ts`/`status.test.ts` (upstream extracted the snapshot/report code into `status-snapshot.ts`) and threaded the proof result through the `#4509` `verifyGpuSandboxAfterReady` wrapper.
- [x] Targeted GPU/status/registry/snapshot suites green (`wsl-docker-desktop-gpu`, `nim`, `sandbox-gpu-preflight`, `docker-gpu-local-inference`, `status`, `registry`, `snapshot*`).
- [x] `npm test` (cli project): only pre-existing, environment-only failures remain (`test/cli.test.ts`, `test/ssrf-parity.test.ts`, `config-sync`/`nemoclaw-start` root-ownership tests: file-mode/ownership/network checks unrelated to this change; none touch the modified files).
- [x] `codex review --base upstream/main` clean after addressing two P2 findings (stale proof on snapshot clone; require `cuInit(0)=0` before verifying).
- [x] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.
- [x] `npx prek` pre-commit/pre-push hooks pass (format, lint, typecheck).

## Notes

- The proof-image bug was diagnosed from the image manifest + `file` on the extracted binary (the `nbody` arm64 tag contains an x86-64 ELF; the `vectoradd-cuda12.5.0` arm64 tag contains a real aarch64 binary). No live Windows-ARM/WSL GPU hardware was available on the triage host, so the N1X run was not reproduced live; see the in-thread reply for the exact commands and evidence.
- Both issues were reproduced hermetically (no GPU hardware): `detectGpu` proof gating via injected prover, and the verifier/status proof classification via fixtures, confirming the pre-fix reject (#4565) and misleading "enabled" (#4231) before fixing.

---

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

## Summary by CodeRabbit

* **New Features**
  * Persistent per-sandbox CUDA proof tracking and reporting (verified / unverified / failed) with human-readable status lines and platform-specific remediation guidance.
  * ARM64 WSL Docker Desktop GPU verification path with configurable timeout and clearer diagnostics.
* **Bug Fixes**
  * Snapshot restore no longer inherits a source sandbox’s GPU proof status.
* **Tests**
  * Updated unit and E2E GPU tests to validate CUDA usability states instead of a generic GPU-enabled marker.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
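The three-state proof classification described for #4231 might look something like the sketch below. Only the state names (`verified`/`unverified`/`failed`), the `cuInit(0)=0` criterion, and the three status strings come from the description above; the field layout of the result type, the `classifyCudaProof` and `renderProof` helpers, the `cuInit(0)=<code>` output format being parsed, and the choice to report a clean exit with no `cuInit` line as `unverified` are all assumptions made for illustration.

```typescript
// Sketch only: the field layout is assumed; the real SandboxGpuProofResult may differ.
type SandboxGpuProofResult =
  | { state: "verified" }
  | { state: "unverified" }
  | { state: "failed"; reason: string };

// Hypothetical classifier over the proof command's exit code and captured output.
function classifyCudaProof(exitCode: number, output: string): SandboxGpuProofResult {
  // Assumed output format: a line containing "cuInit(0)=<integer>".
  const match = /cuInit\(0\)\s*=\s*(-?\d+)/.exec(output);
  const cuInitCode = match ? Number(match[1]) : undefined;

  // A non-zero cuInit code fails the proof even when the process exited 0
  // (for example, a wrapper that swallowed the real exit status).
  if (cuInitCode !== undefined && cuInitCode !== 0) {
    return { state: "failed", reason: `cuInit(0)=${cuInitCode}` };
  }
  if (exitCode !== 0) {
    return { state: "failed", reason: `proof exited with code ${exitCode}` };
  }
  // Verification requires an explicit cuInit(0)=0; a clean exit alone is not enough.
  return cuInitCode === 0 ? { state: "verified" } : { state: "unverified" };
}

// Render the status suffix using the strings quoted in the PR description.
function renderProof(result: SandboxGpuProofResult): string {
  switch (result.state) {
    case "verified":
      return "(CUDA verified)";
    case "unverified":
      return "(CUDA unverified)";
    case "failed":
      return `(last CUDA proof failed: ${result.reason})`;
  }
}
```

Checking the printed `cuInit` code before the exit code is what makes the swallowed-exit case land in `failed` rather than `verified`.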

fix(preflight): cross-check NVIDIA GPU via strict-identity probe

90b7f3e

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

test(nim): obfuscate sample uuid/vbios in strict-identity gate tests

03a59ef

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the fix label May 28, 2026

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

fix(preflight): reject mixed-row denylist spoof + clarify nvidia-smi …

f34aad3

…prerequisite Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(preflight): gate NVIDIA detection on /proc/driver/nvidia presence

0d95658

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(preflight): extend kernel-interface gate to unified-memory fallback

8edf5d9

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(preflight): scope kernel-interface check to ARM64 Linux only

b3c4c5b

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(preflight): pin process.platform in gate tests, refine docs and c…

3db4ebd

…omments Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(preflight): add Linux-x64 gate helper, refine docs and test wording

b7c035d

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread docs/reference/commands.mdx Outdated

laitingsheng changed the title ~~fix(preflight): cross-check NVIDIA GPU via strict-identity probe~~ fix(preflight): gate NVIDIA detection on JMJWOA denylist + ARM64 kernel-interface check May 28, 2026

fix(preflight): cover fail-closed kernel-interface probe, refine docs…

7c2542b

… bullet Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

test(preflight): pin default /proc/driver/nvidia present for non-gate…

cc30e1c

… tests Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the v0.0.55 label May 28, 2026

docs(commands): note Tegra device-node fallback alongside devicetree …

3ed93b3

…path Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

jyaunches added R2 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026

cv approved these changes May 29, 2026

View reviewed changes

fix(preflight): isolate NVIDIA GPU trust gate

4218159

cv merged commit 0d7c48b into main May 30, 2026
29 checks passed

cv deleted the fix/3988-strict-nvidia-identity-gate branch May 30, 2026 01:04

laitingsheng mentioned this pull request May 30, 2026

[Windows ARM][Onboard] NemoClaw preflight reports 'no NVIDIA GPU detected' despite nvidia-smi and Docker --gpus confirming 65GB GPU #4565

Closed

yimoj mentioned this pull request Jun 1, 2026

fix(inference): prove WSL Docker Desktop GPUs and report sandbox CUDA proof state #4599

Merged

8 tasks

wscurran added the bug-fix PR fixes a bug or regression label Jun 3, 2026

wscurran removed the fix label Jun 3, 2026

Conversation

laitingsheng commented May 28, 2026 • edited by coderabbitai Bot

Summary

Related Issue

Trust-tier gate

Changes

Why not WMI?

Type of Change

Verification

Summary by CodeRabbit

coderabbitai Bot commented May 28, 2026 • edited

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

github-actions Bot commented May 28, 2026 • edited

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

github-actions Bot commented May 28, 2026 • edited

E2E Scenario Advisor Recommendation

E2E Scenario Advisor

Required scenario E2E

Optional scenario E2E

Relevant changed files

github-actions Bot commented May 28, 2026

Selective E2E Results — ⚠️ No requested jobs ran

github-actions Bot commented May 28, 2026 • edited

PR Review Advisor

🛠️ Needs attention

🔎 Worth checking

🌱 Nice ideas

coderabbitai Bot left a comment

github-actions Bot commented May 28, 2026

Selective E2E Results — ⚠️ No requested jobs ran

github-actions Bot commented May 28, 2026

github-actions Bot commented May 28, 2026

Selective E2E Results — ⚠️ No requested jobs ran

github-actions Bot commented May 28, 2026

Selective E2E Results — ⚠️ No requested jobs ran

coderabbitai Bot commented May 28, 2026

github-actions Bot commented May 28, 2026

Selective E2E Results — ⚠️ No requested jobs ran

coderabbitai Bot commented May 28, 2026

github-actions Bot commented May 28, 2026

Selective E2E Results — ⚠️ No requested jobs ran

coderabbitai Bot commented May 28, 2026

github-actions Bot commented May 28, 2026

Selective E2E Results — ⚠️ No requested jobs ran
