fix(preflight): gate NVIDIA detection on JMJWOA denylist + ARM64 kernel-interface check #4424

Merged
cv merged 12 commits into main from fix/3988-strict-nvidia-identity-gate
May 30, 2026

Conversation

laitingsheng commented May 28, 2026

Summary

The observed Windows-on-ARM (WoA) WSL2 nvidia-smi shim fakes the name and memory.total fields of a real NVIDIA card. It also emits format-valid uuid/compute_cap/vbios_version triples and a Windows-side Win32_VideoController.AdapterCompatibility = "NVIDIA", so it passes every userland check (QA-confirmed on the affected WoA host; see the #3988 comment). The shim does not, however, ship an NVIDIA kernel module, so the kernel-side /proc/driver/nvidia/ interface that a real driver populates is absent. The observed JMJWOA-Generic-* shim profile is also WoA/ARM64-only: Microsoft's WoA platform is ARM-only by spec, so a non-ARM64 Linux host that exposes nvidia-smi cannot be the observed shim. (Broader WSL2 GPU-PV / D3D12 plumbing ships on x86_64 too; the constraint applies specifically to this shim profile, not to all WSL2 GPU acceleration.) The detection gate now composes those signals into a trust-tier check on hosts whose firmware does not vouch for Spark/Station/Jetson. The same gate also applies to the unified-memory fallback path, so a shim cannot side-step the primary --query-gpu=memory.total probe.

Related Issue

Fixes #3988.

Trust-tier gate

Without a firmware vouch (i.e. when detectNvidiaPlatform() does not return "spark"/"station"/"jetson"), the checks apply in order:

  1. Denylist (universal reject) — any GPU name matching \bJMJWOA-Generic- rejects the whole probe regardless of architecture or kernel-interface state. Catches the GPU and NPU placeholder variants QA observed plus any future suffix from this shim family without a code change.
  2. /proc/driver/nvidia/ exists — definite NVIDIA: a real kernel driver is bound, and the shim never creates this path. Trusted.
  3. process.arch !== "arm64" — trusted: the observed JMJWOA-Generic-* shim profile is WoA/ARM64-only. A Linux x86_64 host that exposes nvidia-smi cannot be this shim.
  4. Otherwise (ARM64 Linux + no /proc/driver/nvidia/ + denylist clean) — WoA shim profile, rejected.

Firmware-vouched platforms (Spark, Station, Jetson) continue to bypass the gate entirely so real DGX Spark with the legitimate JMJWOA-Generic-GPU placeholder name keeps working (#3510).
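
The tiers above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in src/lib/inference/nim.ts: `HostFacts`, `hostLooksGenuine`, and `acceptProbe` are hypothetical names, the unvouched firmware state is modelled as `null`, and the host facts are passed in rather than read from `process` and the filesystem so that the logic stays testable.

```typescript
// Hypothetical sketch of the trust-tier gate; only the rules come from the PR text.
type FirmwarePlatform = "spark" | "station" | "jetson" | null; // null = no firmware vouch (assumption)

interface HostFacts {
  platform: string;            // e.g. process.platform
  arch: string;                // e.g. process.arch
  hasKernelInterface: boolean; // e.g. whether /proc/driver/nvidia/ exists
  firmware: FirmwarePlatform;  // e.g. what a firmware probe reported
}

// Family-prefix denylist as described in the PR (no trailing \b, so any suffix matches).
const DENYLIST = /\bJMJWOA-Generic-/;

// Tiers 2-4: kernel interface present, or not ARM64 Linux, means trusted.
function hostLooksGenuine(host: HostFacts): boolean {
  if (host.platform !== "linux") return true;
  if (host.arch !== "arm64") return true;
  return host.hasKernelInterface;
}

// Returns true when the whole probe may be trusted, false when it must be rejected.
function acceptProbe(gpuNames: string[], host: HostFacts): boolean {
  if (gpuNames.length === 0) return false;
  // Firmware-vouched platforms bypass the gate entirely.
  if (host.firmware !== null) return true;
  // Tier 1: one denylisted row rejects the whole probe (no partial slicing).
  if (gpuNames.some((name) => DENYLIST.test(name))) return false;
  return hostLooksGenuine(host);
}

// Usage: the WoA shim profile (ARM64 Linux, no kernel interface, generic firmware).
const shimHost: HostFacts = {
  platform: "linux",
  arch: "arm64",
  hasKernelInterface: false,
  firmware: null,
};
console.log(acceptProbe(["NVIDIA GeForce RTX 4090"], shimHost));
```

Passing the host facts in as data keeps the decision independent of the machine running the tests, which is the same concern the test helpers in this PR address by overriding process.arch.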

Changes

  • src/lib/inference/nim.ts:
    • NVIDIA_GPU_NAME_DENYLIST_PATTERN widens from the literal \bJMJWOA-Generic-GPU\b to the family prefix \bJMJWOA-Generic-.
    • New nvidiaHostLooksGenuine() helper applies the trust-tier check: returns true when the platform is not Linux, or when the architecture is not arm64, or when /proc/driver/nvidia/ exists. The remaining ARM64-Linux-without-kernel-interface case returns false and is rejected by the caller.
    • detectGpu() primary path: on non-firmware-vouched hosts, any GPU row matching the widened denylist rejects the whole probe (no partial slicing — a mixed-row spoof must not let one normal row through), and the host is additionally rejected when nvidiaHostLooksGenuine() returns false.
    • detectGpu() unified-memory fallback: same denylist + trust-tier gate on non-firmware-vouched hosts so the names-only fallback cannot be used to side-step the primary-path probe.
  • docs/reference/commands.mdx: the GPU passthrough section now documents the trust-tier rule and the JMJWOA-Generic-* denylist for non-firmware-vouched hosts.
  • src/lib/inference/nim.test.ts:
    • New withProcessArch(arch, fn) helper temporarily overrides process.arch so tests that exercise the trust-tier gate can simulate an ARM64 host on x64 CI runners.
    • it.each over the denylisted name family on generic firmware now covers JMJWOA-Generic-GPU, JMJWOA-Generic-NPU, JMJWOA-Generic-Future, plus the vendor-prefixed NVIDIA JMJWOA-Generic-{GPU,NPU,Future} variants.
    • Mixed-row spoof on generic firmware (one denylisted row alongside a normal NVIDIA row) is rejected as a whole.
    • Primary path on ARM64 generic firmware rejects a plausibly-named NVIDIA GPU when /proc/driver/nvidia/ is absent.
    • Primary path on ARM64 generic firmware accepts a plausibly-named NVIDIA GPU when /proc/driver/nvidia/ is present.
    • Primary path on x86_64 generic firmware trusts a plausibly-named NVIDIA GPU even when /proc/driver/nvidia/ is absent.
    • Primary path on x86_64 generic firmware still rejects denylisted names.
    • Spark firmware continues to vouch even with /proc/driver/nvidia/ absent and a JMJWOA-Generic-GPU placeholder name on ARM64 (#3510 regression guard).
    • Unified-memory fallback rejects a denylisted name on generic firmware.
    • Unified-memory fallback rejects a tagged name (e.g. NVIDIA Jetson AGX Orin) on ARM64 generic firmware when /proc/driver/nvidia/ is absent.
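
A helper like withProcessArch(arch, fn) can be sketched as a temporary property override that is always restored. The version below is a hedged, generic illustration rather than the helper in nim.test.ts: `withOverriddenProperty` is a hypothetical name, it handles synchronous callbacks only, and it is demonstrated on a plain stand-in object instead of the real `process`.

```typescript
// Hypothetical generic form of a withProcessArch-style test helper.
// Overrides one property for the duration of fn, then restores the original
// descriptor even if fn throws. Synchronous callbacks only.
function withOverriddenProperty<T extends object, K extends keyof T, R>(
  target: T,
  key: K,
  value: T[K],
  fn: () => R,
): R {
  const original = Object.getOwnPropertyDescriptor(target, key);
  Object.defineProperty(target, key, {
    value,
    configurable: true, // must stay configurable so the override can be undone
    enumerable: true,
    writable: false,
  });
  try {
    return fn();
  } finally {
    if (original) {
      Object.defineProperty(target, key, original);
    } else {
      // The property was inherited, so drop the own-property override.
      Reflect.deleteProperty(target, key);
    }
  }
}

// Usage on a stand-in object; a Node test would pass `process` and "arch" instead.
const demoHost: { arch: string } = { arch: "x64" };
const archSeenInside = withOverriddenProperty(demoHost, "arch", "arm64", () => demoHost.arch);
console.log(archSeenInside, demoHost.arch);
```

Restoring the saved descriptor in `finally` matters for a shared global such as `process`: a failing assertion inside the callback would otherwise leak the simulated architecture into later tests.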

Why not WMI?

WMI / Win32_VideoController.AdapterCompatibility is not a usable discriminator here. The issue evidence shows the affected driver self-reports as NVIDIA at the Windows WMI layer (see the issue body), so a positive AdapterCompatibility = "NVIDIA" does not prove a real NVIDIA device. Adding a WMI veto would only catch a hypothetical "lazy shim" that skips WMI spoofing — the actually observed shim would still slip past it — at the cost of a powershell.exe interop spawn (~200–500 ms) on every WSL2 GPU detection, plus a new interop / appendWindowsPath dependency. The trust-tier gate above covers the observed cases without that overhead.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

  • New Features

    • Stricter NVIDIA host validation: require kernel-driver evidence on non-firmware-vouched hosts and tighten unified-memory fallback checks.
  • Bug Fixes

    • Broadened placeholder denylist to wildcard JMJWOA-Generic-*, reject probes with mixed spoofed rows, and enforce ARM64/Linux-specific gating to avoid false positives.
  • Tests

    • Expanded coverage for placeholder families, kernel-driver presence/absence, mixed-row spoofing, firmware gates, and unified-memory fallback.
  • Documentation

    • Updated onboard passthrough docs to reflect the stricter detection rules.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
coderabbitai Bot commented May 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Broaden the WDDM/WSL2 placeholder denylist to JMJWOA-Generic-*, add a Linux /proc/driver/nvidia kernel-interface check, and make detectGpu() early-reject probes with denylisted names or missing kernel driver when firmware doesn't vouch; update tests and docs to cover these gates and unified-memory fallback paths.

Changes

NVIDIA placeholder hardening

| Layer / File(s) | Summary |
| --- | --- |
| Denylist and kernel-interface helper (src/lib/inference/nim.ts, docs/reference/commands.mdx) | Replace the single JMJWOA-Generic-GPU string with a broader JMJWOA-Generic-* denylist and add a Linux nvidiaHostLooksGenuine() check for /proc/driver/nvidia (non-Linux/non-arm64 returns true); update onboarding docs to describe the gate. |
| detectGpu firmware-unconfirmed gate changes (src/lib/inference/nim.ts) | When firmware does not confirm NVIDIA, detectGpu() now rejects the entire probe if any parsed GPU row matches the denylist or if the kernel-interface check fails; unified-memory fallback gains analogous denylist and kernel-interface gates for tagged names. |
| Test harness: kernel-interface & arch helpers (src/lib/inference/nim.test.ts) | Add helpers to mock fs.existsSync('/proc/driver/nvidia') and to temporarily override process.platform/process.arch so ARM64/x86 tests exercise the trust-tier gates deterministically; default shim makes kernel interface appear present unless overridden. |
| Parameterized denylist regression tests (src/lib/inference/nim.test.ts) | Replace single-placeholder regression test with it.each covering multiple JMJWOA-Generic-* variants (including NVIDIA-prefixed forms) and stub nvidia-smi outputs accordingly. |
| Mixed-row and kernel-interface gate tests (src/lib/inference/nim.test.ts) | Add a mixed-row spoof test asserting any denylisted row rejects the probe; add tests verifying rejection when kernel interface is absent and acceptance when present for known NVIDIA names; include a Spark firmware bypass test. |
| Unified-memory fallback tests (src/lib/inference/nim.test.ts) | Add fallback tests ensuring denylist and kernel-interface gates also apply to unified-memory fallback paths on generic firmware (reject placeholders and reject tagged names when kernel interface absent). |

Sequence Diagram(s)

sequenceDiagram
  participant Client as detectGpu()
  participant SMI as nvidia-smi
  participant FW as firmware detection
  participant Kernel as nvidiaHostLooksGenuine()
  Client->>SMI: run nvidia-smi probe (parse rows)
  SMI-->>Client: CSV rows / names
  Client->>FW: is platform vouched? (spark/station/jetson)
  FW-->>Client: vouched | unvouched
  alt unvouched
    Client->>Kernel: check /proc/driver/nvidia on linux/arm64
    Kernel-->>Client: present | absent
    Client->>Client: if any row matches denylist -> return null
  else vouched
    Client->>Client: bypass denylist/kernel gate, apply plausibility filter
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4062: Similar changes expanding JMJWOA-Generic-* denylist and adding kernel-interface early-reject logic; overlaps in detection logic and tests.

Suggested labels

Platform: Windows/WSL, v0.0.53

Suggested reviewers

  • ericksoa

Poem

🐰 I sniff the names beneath system logs so wide,
I hop through kernels to see if drivers hide.
Placeholders tremble when my regex is near,
Real cards step forward, the fakes disappear.
I nibble tests and docs until the logic’s clear.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the primary changes: hardening NVIDIA detection via a JMJWOA denylist and ARM64 kernel-interface check. |
| Linked Issues check | ✅ Passed | Code changes directly address all linked issue #3988 objectives: denylist rejects JMJWOA-Generic-* placeholders, kernel-interface gate distinguishes real NVIDIA hosts, and non-firmware-vouched hosts require both signals. |
| Out of Scope Changes check | ✅ Passed | All changes scope to NVIDIA detection hardening: test harness for reproducible behavior, denylist logic and kernel-interface gate in detection, and documentation of the trust-tier rule. |

github-actions Bot commented May 28, 2026

E2E Advisor Recommendation

Required E2E: gpu-e2e, cloud-onboard-e2e
Optional E2E: gpu-double-onboard-e2e, issue-3600-gpu-proof-optional-e2e, wsl-repo-cloud-openclaw

Dispatch hint: gpu-e2e,cloud-onboard-e2e

Auto-dispatched E2E: gpu-e2e, cloud-onboard-e2e via nightly-e2e.yaml at 4218159b9ca885871c4cee618827060c47847603 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Validates the highest-risk runtime path changed by this PR: a real NVIDIA GPU host is still trusted, onboard enables GPU/local Ollama flow, sandbox creation succeeds, and inference works through the sandbox.
  • cloud-onboard-e2e (medium): Validates standard Ubuntu cloud onboarding on a non-GPU runner after the preflight/GPU-detection changes, including sandbox health and security checks with GPU passthrough left disabled unless a trusted GPU is found.

Optional E2E

  • gpu-double-onboard-e2e (high): Additional confidence that repeated GPU-provider onboarding still handles GPU detection, gateway/sandbox reuse, and local Ollama proxy state consistently after the trust-gate refactor.
  • issue-3600-gpu-proof-optional-e2e (low): Adjacent GPU preflight guard that checks optional direct sandbox GPU proof handling; useful because this PR changes sandbox GPU preflight behavior, but it does not exercise the new nvidia-smi trust decision end-to-end.
  • wsl-repo-cloud-openclaw (high): Adjacent WSL onboarding scenario because the motivating spoof source is WSL-related. Current runner is Windows/x64 and is unlikely to reproduce Windows-on-ARM nvidia-smi spoofing, so this is confidence-only rather than merge-blocking.

New E2E recommendations

  • gpu-detection-trust-security-boundary (high): No existing E2E appears to exercise an ARM64 Linux or Windows-on-ARM WSL environment where nvidia-smi emits JMJWOA-Generic-* while /proc/driver/nvidia is absent. Unit tests cover this, but the user-visible safety boundary is onboarding preflight and sandbox GPU passthrough suppression.
    • Suggested test: Add a hermetic E2E negative preflight spoof test that injects a fake nvidia-smi returning JMJWOA-Generic-* on simulated ARM64 Linux with generic firmware and no /proc/driver/nvidia, then asserts nemoclaw onboard reports no trusted NVIDIA GPU and does not pass gateway/sandbox GPU flags.
  • jetson-tegra-gpu-fallback (medium): The trust gate intentionally bypasses Jetson/Tegra firmware and device-node fallback paths, but there is no clear E2E coverage proving hosts without nvidia-smi still onboard with NVIDIA unified-memory GPU detection.
    • Suggested test: Add an E2E or scenario fixture for Jetson/Tegra-style preflight that stubs devicetree/device-node detection without nvidia-smi and verifies sandbox GPU config remains in the intended auto/Jetson mode.

Dispatch hint

  • Workflow: .github/workflows/nightly-e2e.yaml
  • jobs input: gpu-e2e,cloud-onboard-e2e

github-actions Bot commented May 28, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: gpu-repo-local-ollama-openclaw, ubuntu-repo-cloud-openclaw
Optional scenario E2E: wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • gpu-repo-local-ollama-openclaw: Changes affect NVIDIA GPU trust/detection and NIM GPU handling; this is the only routed scenario with a real NVIDIA GPU/CDI runner and local Ollama GPU inference coverage.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw
  • ubuntu-repo-cloud-openclaw: Exercises the standard repo onboarding/preflight path on Ubuntu and helps catch regressions in default cloud onboarding after the GPU detection/preflight changes.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • wsl-repo-cloud-openclaw: Adjacent coverage for WSL onboarding, relevant because the change targets Windows-on-ARM/WSL-style nvidia-smi shim false positives, though the routed WSL runner is a special platform scenario and not the primary GPU path.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

  • src/lib/inference/gpu-trust.ts
  • src/lib/inference/nim.ts

@github-actions

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26565918524
Target ref: 90b7f3ef08d7d2cc298490809ae20fde8f5c54a3
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
github-actions Bot commented May 28, 2026

PR Review Advisor

Findings: 1 needs attention, 2 worth checking, 0 nice ideas
Since last review: 4 prior items resolved, 3 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • N1X acceptance is still not proven at the user-facing preflight boundary (src/lib/onboard/machine/handlers/preflight.test.ts:82): The linked issue is about `nemoclaw onboard` preflight printing `✓ NVIDIA GPU detected` for Snapdragon/N1X hardware and then proceeding down GPU/CDI paths as though NVIDIA hardware exists. This PR now has strong `detectGpu()` unit coverage and a handler-level test showing that a mocked null GPU disables sandbox GPU, but it still does not assert the real `onboard.ts` preflight output and sandbox GPU decision for the N1X spoof fixture.
    • Recommendation: Add a targeted preflight-level regression that feeds the N1X `nvidia-smi`/generic-firmware/no-`/proc/driver/nvidia` fixture through the actual preflight output path and asserts that `✓ NVIDIA GPU detected` is absent, the no-GPU or GPU-disabled line is present, CDI/GPU passthrough is opted out, and sandbox GPU remains disabled unless explicitly requested.
    • Evidence: Issue [WSL2][Onboard] preflight false-positive: Snapdragon iGPU reported as "NVIDIA GPU detected" on Windows ARM #3988 says reporting `✓ NVIDIA GPU detected` for `JMJWOA-Generic-GPU` is wrong and misleading. The diff adds `nimModule.detectGpu()` rejection tests and a `handlePreflightState` test with `runPreflight: vi.fn(async () => null)`, while `src/lib/onboard.ts` still owns `formatNvidiaGpuPreflightLines()`, `Local NIM unavailable — no GPU detected`, and `Sandbox GPU` output.

🔎 Worth checking

  • Generic x86_64 Linux still trusts plausible nvidia-smi names without a driver signal (src/lib/inference/gpu-trust.ts:45): The new trust gate intentionally keeps historical behavior for non-ARM64 Linux by returning true before checking `/proc/driver/nvidia`. That addresses the observed WoA/ARM64 shim, but on generic x86_64 Linux or WSL a spoofed or shimmed `nvidia-smi` that returns a plausible NVIDIA product name can still enable gateway/sandbox GPU passthrough without an independent kernel-driver signal.
    • Recommendation: Either require an additional real-driver signal for generic Linux/WSL before enabling GPU passthrough, or explicitly accept this as a security tradeoff and add a negative regression for any x86_64 spoof shape that should be rejected. Keep the current denylist as a universal reject.
    • Evidence: `nvidiaHostLooksGenuine()` returns true when `process.arch !== "arm64"`, and the test `trusts x86_64 generic firmware even when /proc/driver/nvidia/ is absent` locks in the permissive behavior.
  • Inference test monolith grows substantially (src/lib/inference/nim.test.ts:1): The already-large NIM inference test file gains another broad GPU spoofing and compatibility matrix, including process platform/architecture monkeypatching and fs monkeypatching. Keeping primary trust checks, unified-memory fallback checks, Spark regression guards, and unrelated NIM tests together makes future security-sensitive regressions harder to review.
    • Recommendation: Move the GPU spoofing, generic-firmware, ARM64 kernel-interface, and unified-memory fallback cases into a focused GPU detection test file, or extract shared fixtures/helpers and offset the growth so `nim.test.ts` does not keep expanding.
    • Evidence: Deterministic monolith analysis reports `src/lib/inference/nim.test.ts` baseLines 1361, headLines 1698, delta +337. Although `gpu-trust.test.ts` was added, most new detection matrix coverage remains in `nim.test.ts`.

🌱 Nice ideas

  • None.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/lib/inference/nim.test.ts (1)

431-461: ⚡ Quick win

Add vendor-prefixed denylist cases to the forged-strict-path safety-net test.

This table only asserts bare JMJWOA-Generic-* names. Add prefixed variants (for example, NVIDIA JMJWOA-Generic-GPU) so the denylist guard is also covered when strict identity is forged-valid.

♻️ Proposed test expansion
     it.each([
       "JMJWOA-Generic-GPU",
       "JMJWOA-Generic-NPU",
       "JMJWOA-Generic-Future",
+      "NVIDIA JMJWOA-Generic-GPU",
+      "NVIDIA JMJWOA-Generic-NPU",
+      "NVIDIA JMJWOA-Generic-Future",
     ])(
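
The vendor-prefixed names proposed above should still hit the denylist, because the widened pattern quoted in the PR description starts at a word boundary and has no trailing anchor. The snippet below is a small hedged check of that claim using only the two patterns quoted in the description; `matchesDenylist` and the sample list are illustrative, not part of the test file.

```typescript
// Patterns as quoted in the PR description: the old literal and the new family prefix.
const OLD_PATTERN = /\bJMJWOA-Generic-GPU\b/;
const NEW_PATTERN = /\bJMJWOA-Generic-/;

// Illustrative wrapper; patterns carry no `g` flag, so test() is stateless.
function matchesDenylist(pattern: RegExp, name: string): boolean {
  return pattern.test(name);
}

const sampleNames = [
  "JMJWOA-Generic-GPU",
  "JMJWOA-Generic-NPU",
  "NVIDIA JMJWOA-Generic-Future",
  "NVIDIA GeForce RTX 4090", // ordinary name, should match neither pattern
];

for (const name of sampleNames) {
  console.log(name, matchesDenylist(OLD_PATTERN, name), matchesDenylist(NEW_PATTERN, name));
}
```

The space between the vendor prefix and the placeholder name provides the word boundary that `\b` needs, so prefixed and bare variants are treated alike.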
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/inference/nim.test.ts` around lines 431 - 461, Update the table in
the test "denylist rejects %s even when strict probe somehow validates (`#3988`)"
to include vendor-prefixed variants of the denylisted names (e.g., "NVIDIA
JMJWOA-Generic-GPU", "NVIDIA JMJWOA-Generic-NPU", "NVIDIA
JMJWOA-Generic-Future") alongside the existing bare names so the denylist guard
is exercised even when isStrictNvidiaIdentityProbe and strictNvidiaIdentitiesCsv
return a forged-valid identity; modify the array passed to it.each (the list
consumed by the test using runCapture, loadNimWithMockedRunner, and
nimModule.detectGpu()) to add these prefixed strings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5ca5e453-ef2b-451d-bc73-36026f077f40

📥 Commits

Reviewing files that changed from the base of the PR and between 0c108ae and 90b7f3e.

📒 Files selected for processing (2)
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts

@github-actions

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26566101305
Target ref: 03a59efa16a6c624a83ed032ee834eeac8cc15de
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

…prerequisite

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26566937322
Target ref: f34aad30881999804a5fcd2bc02f38d2db56badd
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26577649867
Target ref: 0d95658a20bd5b92fa8e51d6af60d1519fbe6aa7
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26578700615
Target ref: 8edf5d9e5aed4bd2d911b8671b95a323f6feb392
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26581181276
Target ref: b3c4c5b897db146cc197009631c9abc5d408da18
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

…omments

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26581982181
Target ref: 3db4ebd2426baa44c543962fe0c538e6f61f3cd8
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/commands.mdx`:
- Around line 285-286: Rewrite the passive/incomplete sentence into active voice
and place each sentence on its own line: replace "NVIDIA GPU drivers installed
and working." with "Ensure NVIDIA GPU drivers are installed and working." and
move the remainder ("On generic NVIDIA hosts this means `nvidia-smi` must
succeed; on Jetson/Tegra hosts shipping without `nvidia-smi`, the devicetree
firmware fallback substitutes.") onto a separate line so each sentence occupies
its own source line.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f65c37f8-8b42-4c7e-8ad0-29f10779182f

📥 Commits

Reviewing files that changed from the base of the PR and between 3db4ebd and b7c035d.

📒 Files selected for processing (2)
  • docs/reference/commands.mdx
  • src/lib/inference/nim.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/inference/nim.test.ts

Comment thread docs/reference/commands.mdx Outdated
@laitingsheng laitingsheng changed the title fix(preflight): cross-check NVIDIA GPU via strict-identity probe fix(preflight): gate NVIDIA detection on JMJWOA denylist + ARM64 kernel-interface check May 28, 2026
… bullet

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 28, 2026

Contributor

Actionable comments posted: 0

… tests

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 28, 2026

Contributor

Actionable comments posted: 0

@github-actions

Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26584692989
Target ref: cc30e1c0e2ccda2a3b23d4cdb64e8a6707579967
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| gpu-e2e | ⏭️ skipped |

…path

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions

Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26647769480
Target ref: 3ed93b3be22699adf87e148eb657e8b77fd85bb0
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| gpu-e2e | ⏭️ skipped |

@jyaunches jyaunches added R2 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26668800443
Target ref: 4218159b9ca885871c4cee618827060c47847603
Workflow ref: main
Requested jobs: gpu-e2e,cloud-onboard-e2e
Summary: 1 passed, 0 failed, 1 skipped

| Job | Result |
| --- | --- |
| cloud-onboard-e2e | ✅ success |
| gpu-e2e | ⏭️ skipped |

@cv cv merged commit 0d7c48b into main May 30, 2026
29 checks passed
@cv cv deleted the fix/3988-strict-nvidia-identity-gate branch May 30, 2026 01:04
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 3, 2026
yimoj added a commit to yimoj/NemoClaw that referenced this pull request Jun 3, 2026
… proof state

Two grouped GPU trust/proof/status fixes, rebased onto current main.

NVIDIA#4565 — accept real Windows-ARM N1X (WSL2 + Docker Desktop) GPUs without
reopening the Snapdragon false positive (NVIDIA#3988/NVIDIA#4424). detectGpu() still
rejects a denylisted JMJWOA-Generic-* name by default; the only escape is the
ARM64 WSL Docker Desktop prover, which runs one bounded Docker --gpus CUDA
workload. The proof image is now the arch-correct cuda-sample:vectoradd-cuda12.5.0
(a genuine aarch64 binary running a real CUDA kernel) instead of cuda-sample:nbody,
whose arm64 manifest entry actually ships an x86-64 ELF and therefore fails with
"exec format error" on the very N1X target this feature accepts. An explicit
exec-format-error diagnostic now distinguishes an image-architecture problem
from a missing GPU. A real GPU passes; the Snapdragon nvidia-smi shim (no usable
CUDA device) stays fail-closed.

NVIDIA#4231 — nemoclaw status reflects CUDA proof, not just config. The direct
sandbox GPU verifier returns a SandboxGpuProofResult (verified/unverified/failed)
keyed on cuInit(0)=0, persisted to the registry and rendered by status as
"(CUDA verified)" / "(CUDA unverified)" / "(last CUDA proof failed: …)". A zero
exit that printed a non-zero cuInit code (swallowed exit) is treated as failed,
not verified. The proof is captured by the verifyGpuSandboxAfterReady wrapper
(net-zero onboard.ts) and cleared on snapshot clone so a restored sandbox cannot
inherit another sandbox's "CUDA verified" state. CUDA failures print Jetson
/dev/nvmap + video/render group remediation.

Fail-closed CPU fallback with explicit --no-gpu guidance is preserved on every
proof-failure path. Captured stderr in runCaptureEx so Docker/CUDA diagnostics
are no longer dropped. The default ARM64 prover only swallows MODULE_NOT_FOUND
and rethrows internal initialization errors.

Fixes NVIDIA#4565
Fixes NVIDIA#4231

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
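
The proof classification described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the field layout of `SandboxGpuProofResult`, the helper names `classifyCudaProof` and `renderCudaProof`, and the assumption that the probe prints a literal `cuInit(0)=<code>` line are all hypothetical. Only the three states and the rendered status strings come from the text.

```typescript
// Hypothetical shape: the PR names SandboxGpuProofResult but does not show its fields.
type SandboxGpuProofResult =
  | { state: "verified" }
  | { state: "unverified" }
  | { state: "failed"; reason: string };

// Assumption: the probe prints a line such as "cuInit(0)=0".
const CU_INIT_LINE = /cuInit\(0\)=(-?\d+)/;

function classifyCudaProof(
  exitCode: number | null, // null = the probe never ran (assumption)
  output: string,
): SandboxGpuProofResult {
  const match = CU_INIT_LINE.exec(output);
  if (match) {
    const code = Number(match[1]);
    // A printed non-zero code wins over a zero exit (the "swallowed exit" case).
    if (code !== 0) return { state: "failed", reason: `cuInit(0)=${code}` };
    if (exitCode === 0) return { state: "verified" };
    return { state: "failed", reason: `exit ${exitCode} despite cuInit(0)=0` };
  }
  // Without an explicit cuInit(0)=0 line, never report "verified".
  if (exitCode === null || exitCode === 0) return { state: "unverified" };
  return { state: "failed", reason: `exit ${exitCode}` };
}

// Status suffixes as quoted in the commit message.
function renderCudaProof(result: SandboxGpuProofResult): string {
  switch (result.state) {
    case "verified":
      return "(CUDA verified)";
    case "unverified":
      return "(CUDA unverified)";
    case "failed":
      return `(last CUDA proof failed: ${result.reason})`;
  }
}
```

The key design point is that a zero exit code alone is never enough: the explicit `cuInit(0)=0` marker must be present, and a non-zero printed code overrides the exit status.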
@wscurran wscurran removed the fix label Jun 3, 2026
cv pushed a commit that referenced this pull request Jun 3, 2026
… proof state (#4599)

## Summary

Two grouped GPU trust/proof/status fixes. `nemoclaw` now accepts real
Windows-ARM N1X (WSL2 + Docker Desktop) GPUs when a bounded Docker
`--gpus` CUDA proof succeeds (#4565), and `nemoclaw status` reports
proven CUDA usability instead of treating any configured GPU as healthy
(#4231).

## Related Issue

Fixes #4565
Fixes #4231

## Changes

- **#4565 — accept N1X without reopening the Snapdragon false positive
(#3988/#4424):** `detectGpu()` still rejects a denylisted
`JMJWOA-Generic-*` name by default; the only escape is
`createArm64WslDockerDesktopGpuProver`, which runs one bounded `docker
run --gpus all …` CUDA workload on ARM64 Docker Desktop WSL hosts. **The
proof image is `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0`**
(a genuine aarch64 binary running a real CUDA kernel — device alloc +
add + result verification). The previous `cuda-sample:nbody` image was
wrong for this ARM64-only path: its arm64 manifest entry actually ships
an **x86-64 ELF**, so it fails with `exec format error` on the exact N1X
hardware this feature targets (reported in-thread). Only a real GPU
passes, so N1X is accepted while the Snapdragon nvidia-smi shim (no
usable CUDA device) stays fail-closed. The proof timeout is bounded
(default 180s, `NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS` override) and
failures keep the CPU fallback with `--no-gpu` guidance. An explicit
`exec format error` diagnostic now distinguishes an image-architecture
problem from a missing GPU.
- **#4231 — status reflects CUDA proof, not just config:** the direct
sandbox GPU verifier returns a `SandboxGpuProofResult`
(`verified`/`unverified`/`failed`) keyed on the `cuInit(0)=0` usability
proof instead of silently swallowing optional-proof failures. A zero
exit that still printed a non-zero `cuInit(0)` code (a wrapper that
swallowed the real exit) is treated as **failed**, not verified. The
result is persisted to the sandbox registry and rendered by `nemoclaw
status` as `(CUDA verified)` / `(CUDA unverified)` / `(last CUDA proof
failed: …)`. CUDA failures print Jetson `/dev/nvmap` + `video`/`render`
group remediation. The proof is captured by the existing
`verifyGpuSandboxAfterReady` wrapper (so `src/lib/onboard.ts` is
unchanged / net-zero), and **cleared on snapshot clone** so a restored
sandbox cannot inherit another sandbox's `CUDA verified` state.
- Fail-closed CPU fallback and explicit `--no-gpu` guidance preserved on
every proof-failure path.
- Captured stderr in `runCaptureEx` so Docker/CUDA failure diagnostics
are no longer dropped.
- The default ARM64 prover only swallows `MODULE_NOT_FOUND` and rethrows
internal initialization errors (earlier CodeRabbit nit).
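
As a rough illustration of the bounded proof in the first bullet, the sketch below resolves the timeout from `NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS`, runs one Docker workload through an injected runner, and separates an `exec format error` from a missing GPU. The `Runner` and `ProofOutcome` types and the function names are hypothetical, and the exact `docker run` arguments are an assumption because the PR text elides them. This is not the body of `createArm64WslDockerDesktopGpuProver`.

```typescript
const DEFAULT_PROOF_TIMEOUT_MS = 180_000; // default 180s, per the PR description
const PROOF_IMAGE = "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0";

// Hypothetical runner result; not the shape returned by the project's runCaptureEx.
interface RunResult {
  status: number | null;
  stdout: string;
  stderr: string;
  timedOut: boolean;
}
type Runner = (cmd: string, args: string[], timeoutMs: number) => RunResult;

type ProofOutcome =
  | { ok: true }
  | { ok: false; kind: "image-arch" | "timeout" | "no-gpu"; detail: string };

// Fall back to the default unless the override is a positive integer.
function resolveProofTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["NEMOCLAW_WSL_GPU_PROOF_TIMEOUT_MS"];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_PROOF_TIMEOUT_MS;
}

function proveGpuWithDocker(
  run: Runner,
  env: Record<string, string | undefined>,
): ProofOutcome {
  const timeoutMs = resolveProofTimeoutMs(env);
  // Assumed arguments: the PR only shows "docker run --gpus all …".
  const result = run("docker", ["run", "--rm", "--gpus", "all", PROOF_IMAGE], timeoutMs);
  if (result.timedOut) {
    return { ok: false, kind: "timeout", detail: `no result within ${timeoutMs} ms` };
  }
  if (result.status === 0) return { ok: true };
  const stderr = result.stderr.trim();
  // Wrong-architecture image, not a missing GPU.
  if (/exec format error/i.test(stderr)) {
    return { ok: false, kind: "image-arch", detail: stderr };
  }
  return { ok: false, kind: "no-gpu", detail: stderr || `exit ${result.status}` };
}
```

Every non-`ok` outcome would keep the CPU fallback. Injecting the runner keeps the logic testable without Docker; a real runner could wrap `spawnSync` from `node:child_process`, which accepts a `timeout` option.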

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] Rebased onto current `upstream/main`; resolved conflicts in
`status.ts`/`status-snapshot.ts`/`status.test.ts` (upstream extracted
the snapshot/report code into `status-snapshot.ts`) and threaded the
proof result through the `#4509` `verifyGpuSandboxAfterReady` wrapper.
- [x] Targeted GPU/status/registry/snapshot suites green
(`wsl-docker-desktop-gpu`, `nim`, `sandbox-gpu-preflight`,
`docker-gpu-local-inference`, `status`, `registry`, `snapshot*`).
- [x] `npm test` (cli project): only pre-existing, environment-only
failures remain (`test/cli.test.ts`, `test/ssrf-parity.test.ts`,
`config-sync`/`nemoclaw-start` root-ownership tests —
file-mode/ownership/network checks unrelated to this change; none touch
the modified files).
- [x] `codex review --base upstream/main` clean after addressing two P2
findings (stale proof on snapshot clone; require `cuInit(0)=0` before
verifying).
- [x] Tests added or updated for new or changed behavior.
- [x] No secrets, API keys, or credentials committed.
- [x] `npx prek` pre-commit/pre-push hooks pass (format, lint,
typecheck).

## Notes

- The proof-image bug was diagnosed from the image manifest + `file` on
the extracted binary (the `nbody` arm64 tag contains an x86-64 ELF; the
`vectoradd-cuda12.5.0` arm64 tag contains a real aarch64 binary). No
live Windows-ARM/WSL GPU hardware was available on the triage host, so
the N1X run was not reproduced live — see the in-thread reply for the
exact commands and evidence.
- Both issues were reproduced hermetically (no GPU hardware):
`detectGpu` proof gating via injected prover, and the verifier/status
proof classification via fixtures, confirming the pre-fix reject (#4565)
and misleading "enabled" (#4231) before fixing.
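
The hermetic `detectGpu` reproduction mentioned above could look something like the following sketch, which gates a denylisted name on an injected prover. The function name `detectGpuSketch`, the `GpuProver` and `Detection` types, and the reason strings are hypothetical; only the `\bJMJWOA-Generic-` pattern (from the #4424 description) and the reject-by-default behaviour come from the PR. The real gate composes further signals, such as firmware vouching and the kernel-interface check, which are omitted here.

```typescript
// Denylist pattern quoted in the PR #4424 description.
const SHIM_NAME_DENYLIST = /\bJMJWOA-Generic-/;

// Hypothetical prover signature: returns true only when a real CUDA workload ran.
type GpuProver = () => boolean;

interface Detection {
  accepted: boolean;
  reason: string;
}

function detectGpuSketch(gpuNames: string[], prover?: GpuProver): Detection {
  if (gpuNames.length === 0) {
    return { accepted: false, reason: "no GPU reported" };
  }
  const denylisted = gpuNames.some((name) => SHIM_NAME_DENYLIST.test(name));
  if (!denylisted) {
    return { accepted: true, reason: "name not denylisted" };
  }
  // Reject by default; the prover is the only escape from the denylist.
  if (prover !== undefined && prover()) {
    return { accepted: true, reason: "denylisted name, CUDA proof succeeded" };
  }
  return { accepted: false, reason: "denylisted name, no CUDA proof" };
}

// Hermetic usage: no GPU hardware needed, the prover is a stub.
const shimOnly = detectGpuSketch(["JMJWOA-Generic-GPU"]);
const provenN1x = detectGpuSketch(["JMJWOA-Generic-GPU"], () => true);
console.log(shimOnly.accepted, provenN1x.accepted);
```

Stubbing the prover to return `false` models the Snapdragon shim (no usable CUDA device), while returning `true` models a real N1X GPU that passes the Docker proof.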

---
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Persistent per-sandbox CUDA proof tracking and reporting (verified /
    unverified / failed) with human-readable status lines and
    platform-specific remediation guidance.
  * ARM64 WSL Docker Desktop GPU verification path with configurable
    timeout and clearer diagnostics.
* **Bug Fixes**
  * Snapshot restore no longer inherits a source sandbox’s GPU proof
    status.
* **Tests**
  * Updated unit and E2E GPU tests to validate CUDA usability states
    instead of a generic GPU-enabled marker.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
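
The snapshot-restore fix listed above amounts to dropping the persisted proof when a registry entry is cloned. A minimal sketch, assuming a hypothetical `SandboxRecord` shape and a hypothetical `cloneForRestore` helper (neither appears in the PR text):

```typescript
// Hypothetical registry entry; the real registry schema is not shown in the PR.
interface SandboxRecord {
  name: string;
  gpuEnabled: boolean;
  cudaProof?: "verified" | "unverified" | "failed";
}

function cloneForRestore(source: SandboxRecord, newName: string): SandboxRecord {
  // Copy first so the source record is never mutated.
  const clone: SandboxRecord = { ...source, name: newName };
  // The restored sandbox must earn its own proof, so the inherited one is dropped.
  delete clone.cudaProof;
  return clone;
}
```

With the proof cleared, a status renderer would show the restored sandbox as not yet proven until its own CUDA check runs.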

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

Labels

bug-fix PR fixes a bug or regression v0.0.56 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[WSL2][Onboard] preflight false-positive: Snapdragon iGPU reported as "NVIDIA GPU detected" on Windows ARM

4 participants