feat(rosetta): add NVIDIA Nemotron model-family contract (closes #1590) by noahgift · Pull Request #1660 · paiml/aprender

noahgift · 2026-05-13T16:07:44Z

Summary

Closes #1590. Adds `contracts/model-families/nemotron.yaml` so apr-cookbook architecture-demos flips Nemotron from `status: blocked` → covered.

Why no engine change

Nemotron-LM dense releases are Llama-derivative — `Llama-3.1-Nemotron-70B-Instruct` is an SFT/RLHF tune over Llama-3.1, and Nemotron-Mini / Minitron distilled variants share the architecture. `from_model_type("nemotron")` already returns `Architecture::Llama` (tensor_expectation.rs:142), and the kernel_explain alias table maps `nemotron → LlamaForCausalLM`. YAML-only PR.
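
The alias behaviour described above can be pictured with a minimal sketch. This is not the code in `tensor_expectation.rs`: the enum, the function name `from_model_type_sketch`, and the `Option` return type are all assumptions made for illustration. It only shows why a family that resolves to the same variant as Llama needs no engine path of its own.

```rust
/// Hypothetical, simplified stand-in for the engine's architecture enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Architecture {
    Llama,
}

/// Illustrative alias lookup (assumed shape, not the real signature):
/// families that reuse the Llama tensor layout resolve to the same variant.
fn from_model_type_sketch(model_type: &str) -> Option<Architecture> {
    match model_type.to_ascii_lowercase().as_str() {
        "llama" | "nemotron" => Some(Architecture::Llama),
        _ => None, // unknown family: would need its own contract and engine support
    }
}

fn main() {
    println!("nemotron -> {:?}", from_model_type_sketch("nemotron"));
    println!("mamba    -> {:?}", from_model_type_sketch("mamba"));
}
```

Under these assumptions, adding a Llama-derivative family only requires a new entry in the alias arm plus the YAML contract.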

Sizes covered

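| Variant | params | hidden | layers | heads | kv_heads | inter | vocab | rope_theta |
|---|---|---|---|---|---|---|---|---|
| 4b | 4B | 3072 | 32 | 24 | 8 | 9216 | 256000 | 10000 |
| 8b | 8B | 4096 | 40 | 32 | 8 | 11520 | 131072 | 10000 |
| 70b | 70B | 8192 | 80 | 64 | 8 | 28672 | 128256 | 500000 |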

References: `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF`, `nvidia/Mistral-NeMo-Minitron-8B-Base` config.json.
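
As a quick sanity check on the shapes listed above, the sketch below derives the per-head dimension and the GQA group size for each variant. It assumes `head_dim = hidden / heads` (some configs set `head_dim` explicitly instead) and that the query heads divide evenly across the KV heads. The struct and helper names are hypothetical and are not part of the contract schema.

```rust
/// Shape fields copied from the sizes listed in this PR.
#[derive(Debug, Clone, Copy)]
struct SizeVariant {
    name: &'static str,
    hidden: u32,
    heads: u32,
    kv_heads: u32,
}

const VARIANTS: [SizeVariant; 3] = [
    SizeVariant { name: "4b", hidden: 3072, heads: 24, kv_heads: 8 },
    SizeVariant { name: "8b", hidden: 4096, heads: 32, kv_heads: 8 },
    SizeVariant { name: "70b", hidden: 8192, heads: 64, kv_heads: 8 },
];

/// Per-head dimension, assuming it is derived as hidden / heads.
fn head_dim(v: &SizeVariant) -> Option<u32> {
    if v.heads == 0 || v.hidden % v.heads != 0 {
        return None;
    }
    Some(v.hidden / v.heads)
}

/// Number of query heads that share one KV head under GQA.
fn gqa_group_size(v: &SizeVariant) -> Option<u32> {
    if v.kv_heads == 0 || v.heads % v.kv_heads != 0 {
        return None;
    }
    Some(v.heads / v.kv_heads)
}

fn main() {
    for v in VARIANTS.iter() {
        println!(
            "{}: head_dim={:?}, gqa_group={:?}",
            v.name,
            head_dim(v),
            gqa_group_size(v)
        );
    }
}
```

With the listed values, every variant works out to a head dimension of 128, and the group sizes are 3, 4, and 8 query heads per KV head.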

Out of scope

- Nemotron-H (hybrid Transformer+SSM): separate architecture
- Nemotron-4 (distinct activation/norm): separate variant

Test plan

- `pv validate contracts/model-families/nemotron.yaml` → 0 errors
- FALSIFY-PARITY-002 passes
- CI: workspace-test

🤖 Generated with Claude Code

…w-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)

§74 localized the SHIP-007 PARITY-GATE bug to f32_gemv_into via PR-B's stage-bisection scaffold (CPU vs GPU per-stage statistics analysis). The F32 GEMV PTX kernel was reading weights with a TRANSPOSED layout interpretation:

Bug: the kernel assumed A is K-rows × N-cols row-major (A[i,j] at i*N+j), but actual ML weights are stored [output_dim=N, input_dim=K] row-major (A[i,j] at i*K+j, per PyTorch/SafeTensors/GGUF convention and PMAT-333 F32 dequantization output).

Symptom: the GPU read transposed weights → computed y = A^T @ x instead of y = A @ x → systematically anti-correlated logits (cos=-0.005190 vs CPU, top-10 divergences all sign-flipped, CPU mean=-2.42 vs GPU mean=0.013).

Fix: rewrite the inner loop to iterate along the K dimension within row block_id:

    row_base = a_ptr + block_id * K * 4
    thread reads A[block_id, t], A[block_id, t+32], ...

instead of:

    col_base = a_ptr + block_id * 4
    thread reads A[t, block_id], A[t+32, block_id], ...

Empirical discharge (canonical 7B teacher, lambda-vector RTX 4090, default graphed path):

    PARITY-GATE: PASS (no error from forward_gpu_resident)
    Throughput @ 128-tok 5-iter decode: 124.6 tok/s
    AC-SHIP1-007 floor: 30 tok/s
    Headroom: 4.15× over floor
    TTFT: 8.39 ms
    p50 latency: 1016 ms

Before PR-E:

    PARITY-GATE FAILED cos=-0.005190
    Throughput (with SKIP_PARITY_GATE=1 + SKIP_FP8_WARMUP=1): 5.6 tok/s (§63) / 54.5 tok/s (§73)
    GPU CANNOT serve this model

After PR-E:

    PARITY-GATE PASS, default path, NO workarounds
    124.6 tok/s, 4.15× over floor

Ship-% impact:

MODEL-1 ship %: **99% → 100%**. 10 of 10 AC-SHIP1-* LIVE-DISCHARGED:

- SHIP-001 (§72)
- SHIP-002 (§61)
- SHIP-003 (§72)
- SHIP-004 (§72)
- SHIP-005 (§71)
- SHIP-006 (§61.8)
- SHIP-007 (this PR)
- SHIP-008 (§61)
- SHIP-009 (§72)
- SHIP-010 (§72)

MODEL-2 ship %: unchanged at 57% (independent track).

Cascade arc closeout: §63 → §73 → PR-A (#1648) → PR-B (#1649) → §74 (#1650) → PR-E (this). One PR shipped in 1 day after §73's '3-5 PR / 3-5 day' estimate.

Auxiliary change: logits.rs adds an APR_LM_HEAD_FORCE_QTYPE env-var probe, kept as a diagnostic tool (zero behavior change when unset).

Test plan:

- [x] cargo build --release -p apr-cli --bin apr --features cuda → clean
- [x] apr bench (default path, 128-tok 5-iter) → 124.6 tok/s, passed: true
- [x] apr parity → PARITY-GATE PASS
- [ ] CI tests (workspace-test on per-PR runner)

Refs:

- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A #1648 contract)
- PR #1649 (PR-B GPU stage dump scaffold)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
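
The layout mismatch described in this commit message can be reproduced on the CPU with a small reference sketch. This is not the PTX kernel: the function names are hypothetical, and offsets are in elements rather than bytes (the `* 4` in the commit message is the size of an f32). It only contrasts reading an `[N, K]` row-major buffer correctly with misreading the same buffer as `[K, N]` row-major.

```rust
/// y = A @ x, where A is [n, k] row-major: A[i, j] lives at i * k + j.
fn gemv_row_major(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|row| {
            let row_base = row * k; // start of output row `row`
            (0..k).map(|j| a[row_base + j] * x[j]).sum::<f32>()
        })
        .collect()
}

/// Same buffer misread as [k, n] row-major (A[i, j] at i * n + j).
/// Each output walks a strided "column" of the buffer instead of a row.
fn gemv_transposed_misread(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|col| (0..k).map(|i| a[i * n + col] * x[i]).sum::<f32>())
        .collect()
}

fn main() {
    // N = 2 outputs, K = 3 inputs; rows are [1, 2, 3] and [4, 5, 6].
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let x = [1.0, 2.0, 3.0];
    println!("correct: {:?}", gemv_row_major(&a, &x, 2, 3));
    println!("misread: {:?}", gemv_transposed_misread(&a, &x, 2, 3));
}
```

For this toy input the correct read gives `[14.0, 32.0]` while the misread gives `[22.0, 28.0]`. Both results have the right shape, which is why a layout bug of this kind only shows up under a numerical parity check against a reference implementation.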

…07 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)

The env-var bisection probe added in PR-E (this branch) introduced a `_ =>` catch-all inside a `match` expression that referenced `WeightQuantType` in its arm values. The `falsify_007_no_catch_all_in_dispatch_sites` contract test's 30-line walk-back heuristic flagged this as a violation, even though the match was on `&str` (env var value), not on `WeightQuantType`.

The probe was a bisection tool used to identify the bug location during §74. Now that §75 has shipped the actual fix and the probe is no longer needed, removing it cleans up the contract violation.

The remaining PR-E change is solely the F32 GEMV PTX kernel layout fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs`, which is the actual bug fix.

Test verified:

    cargo test -p aprender-serve --lib \
      quantize::contract_tests::tests::falsify_007_no_catch_all_in_dispatch_sites
    → 1 passed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
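
To illustrate why a textual walk-back check can produce this kind of false positive, here is a rough sketch. The real contract test is not shown in this PR, so everything below is an assumption: the function name, the matching rules, and the sample source (including the `Q4K` / `Q6K` variant names) are hypothetical. The sketch flags any `_ =>` arm that has the type name anywhere in the preceding window of lines, regardless of what the `match` is actually on.

```rust
/// Returns 1-based line numbers of `_ =>` arms where `needle` appears on the
/// arm line or within the `window` lines before it. Purely textual: it does
/// not know the type of the matched expression.
fn flag_catch_alls(source: &str, needle: &str, window: usize) -> Vec<usize> {
    let lines: Vec<&str> = source.lines().collect();
    let mut flagged = Vec::new();
    for (idx, line) in lines.iter().enumerate() {
        if !line.trim_start().starts_with("_ =>") {
            continue;
        }
        let start = idx.saturating_sub(window);
        if lines[start..=idx].iter().any(|l| l.contains(needle)) {
            flagged.push(idx + 1);
        }
    }
    flagged
}

/// Hypothetical probe: the match is on a `&str`, but the arm values mention
/// the enum name, which is enough to trip the textual check.
const SAMPLE: &str = r#"let forced = match env_value {
    "q4k" => Some(WeightQuantType::Q4K),
    "q6k" => Some(WeightQuantType::Q6K),
    _ => None,
};"#;

fn main() {
    println!("{:?}", flag_catch_alls(SAMPLE, "WeightQuantType", 30));
}
```

Under these assumptions the catch-all on line 4 is flagged even though no `WeightQuantType` value is being dispatched on, which matches the false positive the commit describes.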

Adds `contracts/model-families/nemotron.yaml` so apr-cookbook architecture-demos flips Nemotron from `status: blocked` → covered.

Nemotron-LM dense releases are Llama-derivative: Llama-3.1-Nemotron-70B is an SFT/RLHF tune over meta-llama/Llama-3.1-70B-Instruct, and Nemotron-Mini-4B-Base / Mistral-NeMo-Minitron-8B are distilled Llama-style models. All use the standard `LlamaForCausalLM` tensor naming and GQA + RoPE + SwiGLU + RMSNorm constraints. `from_model_type("nemotron")` already returns `Architecture::Llama` (tensor_expectation.rs:142), so no engine change is needed; this is YAML only.

Size variants:

- 4b (Nemotron-Mini-4B-Base; note 256k vocab, RoPE θ=10000)
- 8b (Mistral-NeMo-Minitron-8B; 131k vocab, RoPE θ=10000)
- 70b (Llama-3.1-Nemotron-70B; 128k vocab, RoPE θ=500000)

Verified:

- `pv validate contracts/model-families/nemotron.yaml` → 0 errors
- FALSIFY-PARITY-002 (`test_every_model_family_yaml_has_architecture`) passes.

Out of scope: Nemotron-H (hybrid Transformer+SSM) and Nemotron-4 (uses distinct activation/norm), which are separate architecture variants.

noahgift enabled auto-merge (squash) May 13, 2026 16:07

noahgift and others added 4 commits May 14, 2026 07:52

ci: trigger fresh workflow run for flake-class test re-execution

fc1ed96

noahgift force-pushed the fix/1590-nemotron-rosetta-family branch from aefebb4 to de0e4b8 Compare May 14, 2026 05:54

noahgift merged commit 50c4ade into main May 14, 2026
26 of 30 checks passed

noahgift deleted the fix/1590-nemotron-rosetta-family branch May 14, 2026 07:04

noahgift merged 4 commits into main from fix/1590-nemotron-rosetta-family