docs(spec): SHIP-TWO-001 §74 — SHIP-007 bug LOCALIZED to LM head F32 GEMV via PR-B stage bisection #1650

Merged
noahgift merged 4 commits into main from docs/section-74-ship-007-bug-localized
May 13, 2026
@noahgift

Summary

§74 reports the empirical localization of SHIP-007's PARITY-GATE bug to the LM head F32 GEMV dispatch, using PR-B's APR_GPU_STAGE_DUMP scaffold (PR #1649).

Empirical localization

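| Stage | mean | rms / stdev | Verdict |
|---|---|---|---|
| GPU post_ffn_residual @ L27 | 0.022 | 26.12 (rms) | Sane |
| GPU final_norm | 0.037 | 2.84 (rms) | Sane |
| GPU lm_head | 0.013 | 2.40 (stdev) | Mean-centered (suspicious) |
| CPU lm_head | -2.42 | 2.11 (stdev) | Qwen's negative-bias signature |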

Mean differs by 2.43. Cosine(CPU, GPU) = -0.005190. Top-10 divergences ALL sign-flipped.

Bug is in dispatch_lm_head_and_download → f32_gemv_into.
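
The per-stage comparison above needs only a few reductions over each dumped tensor. A minimal sketch in Rust, assuming each stage has already been loaded into a flat `&[f32]`; the names `StageStats`, `stage_stats`, and `cosine` are illustrative and are not taken from the `apr` codebase:

```rust
/// Summary statistics for one dumped stage (illustrative type).
#[derive(Debug, Clone, Copy)]
pub struct StageStats {
    pub mean: f64,
    pub rms: f64,
    pub stdev: f64,
}

/// Mean, root-mean-square, and population standard deviation of a tensor.
/// An empty slice yields NaN fields, since the sums are divided by zero.
pub fn stage_stats(values: &[f32]) -> StageStats {
    let n = values.len() as f64;
    let sum: f64 = values.iter().map(|&v| f64::from(v)).sum();
    let sum_sq: f64 = values.iter().map(|&v| f64::from(v) * f64::from(v)).sum();
    let mean = sum / n;
    let mean_sq = sum_sq / n;
    StageStats {
        mean,
        rms: mean_sq.sqrt(),
        // Var = E[x^2] - E[x]^2; clamp tiny negative rounding error to zero.
        stdev: (mean_sq - mean * mean).max(0.0).sqrt(),
    }
}

/// Cosine similarity between two equally sized vectors.
pub fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}
```

Because rms can never be smaller than |mean|, a stage whose mean is -2.42 and whose spread is 2.11 must be reporting a standard deviation, which is why the sketch returns both.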

Why F32 (not Q6K)?

PMAT-333 dequantizes ALL weights to F32 on GPU upload (28.3 GB total). WeightQuantType::from_size(2,179,989,504, 152064, 3584) returns F32 (matches 152064 × 3584 × 4 exactly). GPU dispatches f32_gemv_into; CPU dispatches fused_matmul_into on Q6K bytes.
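
The size arithmetic behind that detection can be checked directly. The sketch below is a hypothetical stand-in for the size-based check, not the real `WeightQuantType::from_size` (which presumably recognizes more formats); `SizeGuess` and `guess_from_size` are names invented for illustration:

```rust
/// Hypothetical result of a size-based format guess.
#[derive(Debug, PartialEq, Eq)]
pub enum SizeGuess {
    F32,
    Other,
}

/// Returns `F32` when the byte length equals rows * cols * 4 exactly.
/// Checked multiplication guards against overflow on very large shapes.
pub fn guess_from_size(byte_len: u64, rows: u64, cols: u64) -> SizeGuess {
    const F32_BYTES: u64 = 4;
    match rows.checked_mul(cols).and_then(|n| n.checked_mul(F32_BYTES)) {
        Some(expected) if expected == byte_len => SizeGuess::F32,
        _ => SizeGuess::Other,
    }
}
```

For the LM head shape quoted above, 152064 × 3584 = 544,997,376 elements, and 544,997,376 × 4 = 2,179,989,504 bytes, so the F32 branch is taken.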

Cascade arc

§73 (reduced) → PR-A #1648 (contract) → PR-B #1649 (stage scaffold) → §74 (localized) → PR-E (fix + discharge → MODEL-1 100%)

Total scope: "5-10 PR / 1-2 week" (§63) → "1 PR / 1-3 days" (PR-E).

Methodology lesson #21 NEW

Stage-by-stage numerical analysis can localize a bug class without per-element diffing. Per-stage stats (rms, mean) are the scalpel; per-element diff is the heavy hammer.

Ship-%

  • MODEL-1: unchanged at 99% (Layer 2 localized; PR-E for fix)
  • MODEL-2: unchanged at 57%

Test plan

  • Empirical run on lambda-vector RTX 4090
  • CPU + GPU lm_head + intermediate dumps captured
  • Numerical analysis localizes to LM head F32 GEMV
  • Spec v3.19.0 → v3.20.0
  • PR-E (fix + discharge proof) — next session

Refs

🤖 Generated with Claude Code

…GEMV via PR-B stage bisection (PMAT-CODE-SHIP-TWO-SECTION-74)

§73 reduced the SHIP-007 cascade to 1 layer (Layer 2 parity). PR-A
(#1648) shipped the contract scaffold. PR-B (#1649) shipped the
APR_GPU_STAGE_DUMP diagnostic surface. §74 reports the empirical
localization result on lambda-vector RTX 4090.

Bisection method:
  SKIP_CUDA_GRAPH=1 APR_GPU_STAGE_DUMP=/tmp/ship-007-gpu-stages \
    apr parity <canonical 7B Q4_K_M GGUF>

Captures (single BOS token, position 0):
  - GPU embedding (host-side embed_into output)
  - GPU post_ffn_residual @ layer 27 (end of 28-layer stack)
  - GPU final_norm (post-output-RMSNorm)
  - GPU lm_head (logits)
  - CPU lm_head (logits, from parity_gate)

Empirical values (canonical 7B teacher):
  GPU post_ffn_residual L27: rms=26.12, mean=0.022 → sane
  GPU final_norm:            rms=2.84,  mean=0.037 → sane
  GPU lm_head:               mean=0.013, stdev=2.40 → mean-centered (suspicious)
  CPU lm_head:               mean=-2.42, stdev=2.11 → Qwen's negative-bias signature

  cos(CPU, GPU) = -0.005190 (byte-identical to §73's signature)
  Top-10 divergences ALL sign-flipped

Localization: bug is in LM head dispatch
  (dispatch_lm_head_and_download → f32_gemv_into).
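
The "top-10 divergences all sign-flipped" observation can be reproduced with a short check over the two logit vectors. This is a sketch under the assumption that both vectors are available as flat slices; `top_k_divergences` and `all_sign_flipped` are hypothetical helper names, not functions from parity_gate:

```rust
/// Indices of the `k` largest |cpu[i] - gpu[i]|, largest first.
/// The sort is stable, so ties keep their original index order.
pub fn top_k_divergences(cpu: &[f32], gpu: &[f32], k: usize) -> Vec<usize> {
    assert_eq!(cpu.len(), gpu.len(), "logit vectors must have equal length");
    let mut indices: Vec<usize> = (0..cpu.len()).collect();
    indices.sort_by(|&i, &j| {
        let di = (cpu[i] - gpu[i]).abs();
        let dj = (cpu[j] - gpu[j]).abs();
        dj.total_cmp(&di) // descending by divergence
    });
    indices.truncate(k);
    indices
}

/// True when every selected logit has opposite signs on CPU and GPU.
pub fn all_sign_flipped(cpu: &[f32], gpu: &[f32], indices: &[usize]) -> bool {
    indices.iter().all(|&i| cpu[i] * gpu[i] < 0.0)
}
```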

Why F32 (not Q6K)?
  PMAT-333 dequantizes ALL weights to F32 on GPU upload (28.3 GB).
  WeightQuantType::from_size(2,179,989,504, 152064, 3584) → F32
  (matches 152064 × 3584 × 4 bytes exactly).
  GPU dispatches f32_gemv_into; CPU dispatches fused_matmul_into.

The F32 GEMV PTX kernel is the localized bug surface. Per
memory project_ship_007_attention_parity_investigation.md:
"bug is layout/stride/buffer, NOT arithmetic. Negative cosine
-0.005 = systematic anti-correlation." Matches.

Cascade arc closeout:
  §73 (cascade reduced) → PR-A #1648 (contract) → PR-B #1649
  (stage scaffold + dumps) → §74 (bug localized) → PR-E (fix +
  discharge).

Total scope reduction: "5-10 PR / 1-2 week" (§63) → "1 PR /
1-3 days" (PR-E).

Methodology lesson #21 NEW: stage-by-stage numerical analysis
can localize a bug class without per-element diffing. Per-stage
stats (rms, mean) are the scalpel; per-element diff is the heavy
hammer.

Spec v3.19.0 → v3.20.0.

Ship-% impact: MODEL-1 unchanged at 99%. PR-E remaining.
MODEL-2 unchanged at 57%.

Refs:
- contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A scaffold)
- §73 SHIP-007 cascade scope reduction
- PR #1649 (PR-B GPU stage dump scaffold)
- evidence/section-74-ship-007-bisection-2026-05-13/findings.json
- crates/aprender-serve/src/cuda/executor/weight.rs:724 (f32_gemv_into entry)
- crates/aprender-serve/src/cuda/executor/layers/logits.rs:30 (qtype detection)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 13, 2026 07:21
@noahgift noahgift merged commit 8246932 into main May 13, 2026
10 checks passed
@noahgift noahgift deleted the docs/section-74-ship-007-bug-localized branch May 13, 2026 09:05
noahgift added a commit that referenced this pull request May 13, 2026
…P-TWO-SECTION-75)

PR-E (#1651) shipped the single-file F32 GEMV PTX layout fix. SHIP-007
LIVE-DISCHARGED. All 10 AC-SHIP1-* now LIVE on canonical 7B Qwen2.5-
Coder-Instruct Q4_K_M teacher.

10/10 LIVE-discharge table:
  SHIP-001  §72  apr run <safetensors> exit 0
  SHIP-002  §61  apr run "def fib(n):" valid Python (#1609)
  SHIP-003  §72  apr diff 20 tensors at cos_sim=1.000000
  SHIP-004  §72  llama-cli exit 0, 133.1 gen tok/s
  SHIP-005  §71  HumanEval pass@1 = 86.59% (gx10 164-run)
  SHIP-006  §61.8 apr qa 12-gate aggregate PASS (#1615)
  SHIP-007  §75  PARITY-GATE PASS + 124.6 tok/s @ 128-tok (this section)
  SHIP-008  §61  apr run SHIP-008 USER → 256-token ChatML (#1614)
  SHIP-009  §72  apr inspect license/provenance fields
  SHIP-010  §72  sha256 match 0a854098…

Empirical discharge proof for SHIP-007:
  apr bench <canonical 7B APR> --iterations 5 --max-tokens 128
  → tokens_per_second: 124.6
  → AC-SHIP1-007 floor: 30 → headroom 4.15×
  → PARITY-GATE: PASS (no error)
  → Default path (CUDA graphed), no SKIP_PARITY_GATE, no APR_SKIP_FP8_WARMUP

Cascade arc closeout:
  §63 2026-05-11 → SHIP-007 framed as 3-layer cascade
  §73 2026-05-12 → re-measurement: only parity layer blocks
  §74 2026-05-13 → bug LOCALIZED to F32 GEMV via PR-B stage bisection
  §75 2026-05-13 → PR-E layout fix → MODEL-1 100%

§73's '3-5 PR / 3-5 day' estimate. Actual: 4 PRs (#1648 contract,

Methodology lesson #22 NEW: symptom analysis (sign-flipped top-K
divergences + CPU/GPU mean mismatch + sane intermediates) →
bug class localization in O(1). Methodology lessons compose;
each makes the next cheaper.

Ship-% movement:
  MODEL-1 ship %: 99% → 100% 🎉
  MODEL-2 ship %: unchanged at 57% (independent track,
    gated on step 5g.3 val_loss < 9.38).

Spec version: 3.19.0 → 3.21.0 (post-§72/73 stack at 3.18.0;
§74 at 3.20.0; §75 here at 3.21.0).

Out of scope (future work):
- MODEL-2 ship % path (independent track, separate cascade)
- Publish-readiness gates (GATE-SHIP-001/002/003 still need green CI +
  post-publish QA per feedback_post_publish_qa_required.md)
- HumanEval/MBPP benchmark improvements beyond §71's 86.59%

Refs:
- §74 SHIP-007 localization (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- PR #1648 (contract scaffold), #1649 (PR-B stage dump)
- PR #1651 (PR-E F32 GEMV layout fix)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…w-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)

§74 localized the SHIP-007 PARITY-GATE bug to f32_gemv_into via PR-B's
stage-bisection scaffold (CPU vs GPU per-stage statistics analysis).
The F32 GEMV PTX kernel was reading weights with TRANSPOSED layout
interpretation:

Bug: kernel assumed A is K-rows × N-cols row-major (A[i,j] at i*N+j),
     but actual ML weights are stored [output_dim=N, input_dim=K]
     row-major (A[i,j] at i*K+j per PyTorch/SafeTensors/GGUF convention
     and PMAT-333 F32 dequantization output).

Symptom: GPU read transposed weights → computed y = A^T @ x instead
         of y = A @ x → systematically anti-correlated logits
         (cos=-0.005190 vs CPU, top-10 divergences all sign-flipped,
         CPU mean=-2.42 vs GPU mean=0.013).

Fix: rewrite the inner loop to iterate along the K dimension within
     row block_id:
       row_base = a_ptr + block_id * K * 4
       thread reads A[block_id, t], A[block_id, t+32], ...
     instead of:
       col_base = a_ptr + block_id * 4
       thread reads A[t, block_id], A[t+32, block_id], ...
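
The two indexing schemes can be compared on the CPU with a small reference. This is a sketch for illustration, not the PTX kernel; the function names are hypothetical, `n` is the output dimension and `k` the input dimension as in the text above:

```rust
/// Reference GEMV for weights stored [N, K] row-major:
/// y[row] = sum over t of a[row * k + t] * x[t].
pub fn gemv_nk_row_major(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|row| {
            let row_base = row * k; // start of this output row's weights
            (0..k).map(|t| a[row_base + t] * x[t]).sum::<f32>()
        })
        .collect()
}

/// The pre-fix interpretation: the same buffer read as [K, N] row-major,
/// so element (t, col) is fetched from a[t * n + col].
pub fn gemv_kn_misread(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|col| (0..k).map(|t| a[t * n + col] * x[t]).sum::<f32>())
        .collect()
}
```

With `a = [1, 2, 3, 4, 5, 6]` (two rows of three), `x = [1, 1, 0]`, `n = 2`, `k = 3`, the row-major read gives `[3, 9]` while the misread gives `[4, 6]`: the buffer is identical, only the stride differs.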

Empirical discharge (canonical 7B teacher, lambda-vector RTX 4090,
default graphed path):

  PARITY-GATE: PASS (no error from forward_gpu_resident)
  Throughput @ 128-tok 5-iter decode: 124.6 tok/s
  AC-SHIP1-007 floor: 30 tok/s
  Headroom: 4.15× over floor
  TTFT: 8.39 ms
  p50 latency: 1016 ms

Before PR-E:
  PARITY-GATE FAILED cos=-0.005190
  Throughput (with SKIP_PARITY_GATE=1 + SKIP_FP8_WARMUP=1): 5.6 tok/s (§63) / 54.5 tok/s (§73)
  GPU CANNOT serve this model

After PR-E:
  PARITY-GATE PASS, default path, NO workarounds
  124.6 tok/s, 4.15× over floor

Ship-% impact:
  MODEL-1 ship %: **99% → 100%**
  10 of 10 AC-SHIP1-* LIVE-DISCHARGED:
    SHIP-001 (§72)  SHIP-002 (§61)  SHIP-003 (§72)
    SHIP-004 (§72)  SHIP-005 (§71)  SHIP-006 (§61.8)
    SHIP-007 (this PR)  SHIP-008 (§61)  SHIP-009 (§72)
    SHIP-010 (§72)

  MODEL-2 ship %: unchanged at 57% (independent track).

Cascade arc closeout: §63 → §73 → PR-A (#1648) → PR-B (#1649)
→ §74 (#1650) → PR-E (this). One PR shipped in 1 day after §73's
'3-5 PR / 3-5 day' estimate.

Auxiliary change: logits.rs adds APR_LM_HEAD_FORCE_QTYPE env-var
probe kept as a diagnostic tool (zero behavior change when unset).

Test plan:
- [x] cargo build --release -p apr-cli --bin apr --features cuda → clean
- [x] apr bench (default path, 128-tok 5-iter) → 124.6 tok/s, passed: true
- [x] apr parity → PARITY-GATE PASS
- [ ] CI tests (workspace-test on per-PR runner)

Refs:
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A #1648 contract)
- PR #1649 (PR-B GPU stage dump scaffold)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…loses #1595) (#1657)

* fix(aprender-gpu): SHIP-007 PR-E — F32 GEMV PTX kernel reads [N,K] row-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)


* ci: trigger fresh workflow run for flake-class test re-execution

* fix(aprender-serve): remove APR_LM_HEAD_FORCE_QTYPE probe — FALSIFY-007 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)

The env-var bisection probe added in PR-E (this branch) introduced a
`_ =>` catch-all inside a `match` expression that referenced
`WeightQuantType` in its arm values. The `falsify_007_no_catch_all_
in_dispatch_sites` contract test's 30-line walk-back heuristic flagged
this as a violation, even though the match was on `&str` (env var
value), not on `WeightQuantType`.
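
The distinction the heuristic could not see is sketched below, with a hypothetical `QuantKind` enum standing in for `WeightQuantType` and made-up kernel names. A `match` on `&str` needs a fallback arm because strings are open-ended, whereas a `match` on the enum can list every variant, so adding a variant later becomes a compile error at each dispatch site:

```rust
/// Hypothetical stand-in for the real quantization enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantKind {
    F32,
    Q4K,
    Q6K,
}

/// Parsing an env-var value: the `_` arm is unavoidable for `&str` input.
pub fn parse_override(value: &str) -> Option<QuantKind> {
    match value {
        "f32" => Some(QuantKind::F32),
        "q4k" => Some(QuantKind::Q4K),
        "q6k" => Some(QuantKind::Q6K),
        _ => None,
    }
}

/// Dispatch on the enum: no catch-all, every variant is named explicitly.
/// The returned kernel names are placeholders for illustration only.
pub fn kernel_name(kind: QuantKind) -> &'static str {
    match kind {
        QuantKind::F32 => "f32_gemv",
        QuantKind::Q4K => "q4k_gemv",
        QuantKind::Q6K => "q6k_gemv",
    }
}
```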

The probe was a bisection tool used to identify the bug location
during §74. Now that §75 has shipped the actual fix and the probe is
no longer needed, removing it cleans up the contract violation.

The remaining PR-E change is solely the F32 GEMV PTX kernel layout
fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs` — that's the
actual bug fix.

Test verified:
  cargo test -p aprender-serve --lib \
      quantize::contract_tests::tests::falsify_007_no_catch_all_in_dispatch_sites
  → 1 passed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(aprender-contracts): add actually_verified field on KaniHarness (closes #1595)

When `kani_harnesses[].actually_verified: true`, `pv score` D3 lifts the
strategy weight to 1.0 regardless of strategy (bounded_int / stub_float /
compositional). Rationale: the static-readiness 0.9 cap reflects
uncertainty about whether the harness actually proves anything; once CI
runs `cargo kani` green (e.g. apr-cookbook PR #421's kani-gate), the
runtime witness supplants the static signal.

Schema change:
  KaniHarness gets `actually_verified: Option<bool>` (default None;
  back-compat with existing contracts).

Scoring change:
  scoring::mod::strategy_weight() short-circuits to 1.0 when
  actually_verified == Some(true), before the strategy table lookup.
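
A sketch of that short-circuit, assuming a three-variant strategy enum. The default weights in the table are placeholders chosen for illustration (only the 1.0 lift and the 0.9 cap are mentioned above); the real values live in the scoring module:

```rust
/// Hypothetical mirror of the harness strategies named above.
#[derive(Debug, Clone, Copy)]
pub enum Strategy {
    BoundedInt,
    StubFloat,
    Compositional,
}

/// Returns 1.0 when a runtime witness exists, otherwise the static default.
pub fn strategy_weight(strategy: Strategy, actually_verified: Option<bool>) -> f64 {
    // Short-circuit before the table lookup, as described in the commit.
    if actually_verified == Some(true) {
        return 1.0;
    }
    // Placeholder static-readiness weights, not the project's real table.
    match strategy {
        Strategy::BoundedInt => 0.9,
        Strategy::StubFloat => 0.9,
        Strategy::Compositional => 0.9,
    }
}
```

Treating `None` and `Some(false)` identically is what keeps existing contracts, which omit the field, scoring as before.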

Tests:
  - kani_actually_verified_lifts_bounded_int_to_full_score
  - kani_actually_verified_false_keeps_strategy_default
  Both pass; 1392 prior tests unaffected.

Updates the explicit `KaniHarness { ... }` literal in
gates_extended_tests.rs to include the new field (None).

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP)

🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001.

All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical
7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090,
--features cuda).

This release prep PR ships:
1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights:
   - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE)
   - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75) — 124.6 tok/s
   - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71) — pass@1 86.59%
   - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634)
   - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649)
   - Added: MBPP harness H4 fix (PR #1645)
   - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness-
     invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0)
   - Methodology lessons #16-22 captured in MEMORY.md
   - Spec: v3.13.0 → v3.21.0 across §67-§75

2. Workspace version bump:
   - [workspace.package].version: 0.32.0 → 0.33.0
   - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0
   - 28 sub-crate version literals: 0.32.0 → 0.33.0

3. `cargo check -p aprender` → clean (workspace builds at 0.33.0).

Out of scope for this PR (separate steps after #1651/1652 land + this
PR lands):
- Tag release `v0.33.0` on main
- Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md
  — 15 user-facing crates + 7 internal-tier in topological dependency
  order; uses `make publish CRATE=<name>`)
- Post-publish QA per `feedback_post_publish_qa_required.md` —
  `cargo install aprender --force` + `/dogfood` GO verdict required
  before declaring release done (v0.31.1 was yanked for skipping this)
- GitHub Release with §75 narrative
- HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256
  already verified by §72 SHIP-010 LIVE evidence; double-check before
  release announcement)

This PR ships ONLY the version-bump + CHANGELOG. Publishing is the
next step after merge.

Refs:
- §75 MODEL-1 100% (PR #1652)
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- §72 5-AC LIVE cascade (PR #1646)
- §71 SHIP-005 LIVE-DISCHARGED (PR #1642)
- §70 RC3 fix (PR #1636)
- §69 Q4K hypothesis falsified (PR #1633)
- PR #1635 RC3 prepend
- PR #1634 diagnostic surface + contract
- PR #1648 SHIP-007 contract scaffold
- PR #1649 SHIP-007 PR-B stage dump
- PR #1651 SHIP-007 PR-E F32 GEMV layout fix

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…1592, #1594) (#1662)

* fix(aprender-gpu): SHIP-007 PR-E — F32 GEMV PTX kernel reads [N,K] row-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)


* ci: trigger fresh workflow run for flake-class test re-execution

* fix(aprender-serve): remove APR_LM_HEAD_FORCE_QTYPE probe — FALSIFY-007 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)


* feat(rosetta): OLMo + StableLM + GPTBigCode model-family contracts (closes #1591, #1592, #1594)

Three Llama-derivative / GPT-2-derivative families share an `Architecture`
variant with their parent — none need a new variant or a custom tensor
mapper. Engine change is a single match arm extension in
`from_model_type`:

- OLMo / OLMo-2 (allenai/OLMo*) → `Architecture::Llama`
- StableLM (stabilityai/stablelm*) → `Architecture::Llama`
- GPTBigCode (StarCoder1 / SantaCoder / tiny_starcoder_py) →
  `Architecture::Gpt2`

OLMo and OLMo-2 share `LlamaForCausalLM` tensor naming. StableLM
likewise — partial-RoPE and per-checkpoint norm variation are runtime
concerns, not tensor-name concerns. GPTBigCode uses GPT-2 Conv1D layout
with Multi-Query Attention (single shared K/V head); MQA semantics
affect cache shape and inference dispatch but not tensor-name
resolution, so the Gpt2 mapper handles names.

Three YAMLs added:
- `contracts/model-families/olmo.yaml` (1B / 7B / OLMo-2 7B / OLMo-2 13B)
- `contracts/model-families/stablelm.yaml` (1.6B / 3B / Zephyr-3B)
- `contracts/model-families/gpt_bigcode.yaml` (tiny / SantaCoder / StarCoder1 15.5B)

`from_model_type` extended:
- `"olmo" | "olmo2" | "stablelm" | "stablelm_epoch" | "stablelm_alpha"`
  → `Self::Llama` (joins existing smollm / granite / nemotron list)
- `"gpt_bigcode" | "gpt-bigcode"` → `Self::Gpt2` (joins existing
  starcoder / starcoder2 / bigcode list)
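
The shape of that match-arm extension can be sketched as follows. This is a reduced illustration: the real `from_model_type` handles many more model types and may have a different signature, and the two-variant `Architecture` enum and `Option` return type here are assumptions:

```rust
/// Reduced stand-in for the real `Architecture` enum.
#[derive(Debug, PartialEq, Eq)]
pub enum Architecture {
    Llama,
    Gpt2,
}

/// Maps a Hugging Face `model_type` string to the architecture whose
/// tensor-name mapper it reuses. Only the newly added aliases are shown.
pub fn from_model_type(model_type: &str) -> Option<Architecture> {
    match model_type {
        "olmo" | "olmo2" | "stablelm" | "stablelm_epoch" | "stablelm_alpha" => {
            Some(Architecture::Llama)
        }
        "gpt_bigcode" | "gpt-bigcode" => Some(Architecture::Gpt2),
        _ => None,
    }
}
```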

Verified:
- `pv validate` clean on all three YAMLs
- FALSIFY-PARITY-002 (`test_every_model_family_yaml_has_architecture`)
  passes

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP) (#1653)

noahgift added a commit that referenced this pull request May 13, 2026
…1593) (#1661)

* fix(aprender-gpu): SHIP-007 PR-E — F32 GEMV PTX kernel reads [N,K] row-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)


* ci: trigger fresh workflow run for flake-class test re-execution

* fix(aprender-serve): remove APR_LM_HEAD_FORCE_QTYPE probe — FALSIFY-007 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)


* feat(rosetta): add BigCode StarCoder2 model-family contract (closes #1593)

Adds `contracts/model-families/starcoder2.yaml` so apr-cookbook
architecture-demos flips StarCoder2 from `status: blocked` → covered.

StarCoder2 is mapped to `Architecture::Gpt2` in `from_model_type`
(tensor_expectation.rs:130) and aliased in
`kernel_explain/resolve.rs:28`, mirroring the GPT-2 Conv1D tensor
naming. Runtime details differ (RoPE / GQA / sliding-window / GELU+LN
vs GPT-2 absolute / MHA), but tensor names follow the existing pattern,
so the existing GPT-2 mapper handles names correctly. Engine support
for the RoPE+GQA bits on the GPT-2 path is gated separately. YAML-only PR.

Size variants from HF config.json (`bigcode/starcoder2-{3b,7b,15b}`):
- 3b:  hidden=3072  layers=30  heads=24  kv=2   inter=12288
- 7b:  hidden=4608  layers=32  heads=36  kv=4   inter=18432
- 15b: hidden=6144  layers=40  heads=48  kv=4   inter=24576

All sizes share the 49152-token BigCode vocab and 16k context.
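
The size table can be sanity-checked with a little arithmetic. The sketch below encodes the three variants listed above; the `SizeVariant` type is hypothetical, and the derived quantities assume the usual conventions that head dimension is `hidden / heads` and the GQA group size is `heads / kv`:

```rust
/// Hypothetical record for one StarCoder2 size variant.
pub struct SizeVariant {
    pub name: &'static str,
    pub hidden: usize,
    pub layers: usize,
    pub heads: usize,
    pub kv_heads: usize,
    pub intermediate: usize,
}

impl SizeVariant {
    /// Per-head dimension, assuming hidden is split evenly across heads.
    pub fn head_dim(&self) -> usize {
        self.hidden / self.heads
    }

    /// Number of query heads sharing each K/V head under GQA.
    pub fn gqa_group(&self) -> usize {
        self.heads / self.kv_heads
    }
}

/// Values copied from the size list in the commit message.
pub const STARCODER2_VARIANTS: [SizeVariant; 3] = [
    SizeVariant { name: "3b", hidden: 3072, layers: 30, heads: 24, kv_heads: 2, intermediate: 12288 },
    SizeVariant { name: "7b", hidden: 4608, layers: 32, heads: 36, kv_heads: 4, intermediate: 18432 },
    SizeVariant { name: "15b", hidden: 6144, layers: 40, heads: 48, kv_heads: 4, intermediate: 24576 },
];
```

Under those assumptions every variant has a head dimension of 128 and an intermediate size of four times the hidden size, which is a quick consistency check on the transcribed numbers.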

Verified:
- `pv validate contracts/model-families/starcoder2.yaml` → 0 errors
- FALSIFY-PARITY-002 passes.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 13, 2026
…1659)

* fix(aprender-gpu): SHIP-007 PR-E — F32 GEMV PTX kernel reads [N,K] row-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)

§74 localized the SHIP-007 PARITY-GATE bug to f32_gemv_into via PR-B's
stage-bisection scaffold (CPU vs GPU per-stage statistics analysis).
The F32 GEMV PTX kernel was reading weights with TRANSPOSED layout
interpretation:

Bug: kernel assumed A is K-rows × N-cols row-major (A[i,j] at i*N+j),
     but actual ML weights are stored [output_dim=N, input_dim=K]
     row-major (A[i,j] at i*K+j per PyTorch/SafeTensors/GGUF convention
     and PMAT-333 F32 dequantization output).

Symptom: GPU read transposed weights → computed y = A^T @ x instead
         of y = A @ x → logits essentially uncorrelated with the CPU
         reference (cos=-0.005190 vs CPU, top-10 divergences all
         sign-flipped, CPU mean=-2.42 vs GPU mean=0.013).

Fix: rewrite the inner loop to iterate along the K dimension within
     row block_id:
       row_base = a_ptr + block_id * K * 4
       thread reads A[block_id, t], A[block_id, t+32], ...
     instead of:
       col_base = a_ptr + block_id * 4
       thread reads A[t, block_id], A[t+32, block_id], ...
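
The two index maps can be compared on the CPU. The following is a minimal sketch, not the PTX kernel: the function names are hypothetical, and it assumes a dense `[N, K]` row-major `f32` buffer as described above.

```rust
/// Correct read: A is [n, k] row-major, so A[i, j] lives at i * k + j.
/// Each output row i is the dot product of row i with x.
fn gemv_row_major(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|i| (0..k).map(|j| a[i * k + j] * x[j]).sum::<f32>())
        .collect()
}

/// Buggy read (illustration only): the same buffer is indexed as if it
/// were [k, n] row-major, so element (j, i) is fetched from j * n + i.
fn gemv_transposed_read(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|i| (0..k).map(|j| a[j * n + i] * x[j]).sum::<f32>())
        .collect()
}
```

For any non-square weight the two functions read different elements for the same output row, which is consistent with sane intermediate stages followed by a wrong LM head output.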

Empirical discharge (canonical 7B teacher, lambda-vector RTX 4090,
default graphed path):

  PARITY-GATE: PASS (no error from forward_gpu_resident)
  Throughput @ 128-tok 5-iter decode: 124.6 tok/s
  AC-SHIP1-007 floor: 30 tok/s
  Headroom: 4.15× over floor
  TTFT: 8.39 ms
  p50 latency: 1016 ms

Before PR-E:
  PARITY-GATE FAILED cos=-0.005190
  Throughput (with SKIP_PARITY_GATE=1 + SKIP_FP8_WARMUP=1): 5.6 tok/s (§63) / 54.5 tok/s (§73)
  GPU CANNOT serve this model

After PR-E:
  PARITY-GATE PASS, default path, NO workarounds
  124.6 tok/s, 4.15× over floor

Ship-% impact:
  MODEL-1 ship %: **99% → 100%**
  10 of 10 AC-SHIP1-* LIVE-DISCHARGED:
    SHIP-001 (§72)  SHIP-002 (§61)  SHIP-003 (§72)
    SHIP-004 (§72)  SHIP-005 (§71)  SHIP-006 (§61.8)
    SHIP-007 (this PR)  SHIP-008 (§61)  SHIP-009 (§72)
    SHIP-010 (§72)

  MODEL-2 ship %: unchanged at 57% (independent track).

Cascade arc closeout: §63 → §73 → PR-A (#1648) → PR-B (#1649)
→ §74 (#1650) → PR-E (this). One PR shipped in 1 day after §73's
'3-5 PR / 3-5 day' estimate.

Auxiliary change: logits.rs adds APR_LM_HEAD_FORCE_QTYPE env-var
probe kept as a diagnostic tool (zero behavior change when unset).

Test plan:
- [x] cargo build --release -p apr-cli --bin apr --features cuda → clean
- [x] apr bench (default path, 128-tok 5-iter) → 124.6 tok/s, passed: true
- [x] apr parity → PARITY-GATE PASS
- [ ] CI tests (workspace-test on per-PR runner)

Refs:
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A #1648 contract)
- PR #1649 (PR-B GPU stage dump scaffold)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: trigger fresh workflow run for flake-class test re-execution

* fix(aprender-serve): remove APR_LM_HEAD_FORCE_QTYPE probe — FALSIFY-007 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)

The env-var bisection probe added in PR-E (this branch) introduced a
`_ =>` catch-all inside a `match` expression that referenced
`WeightQuantType` in its arm values. The
`falsify_007_no_catch_all_in_dispatch_sites` contract test's 30-line
walk-back heuristic flagged this as a violation, even though the match
was on `&str` (env var value), not on `WeightQuantType`.
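
A minimal sketch of the shape that can trip such a heuristic. The two-variant enum and the parser function are hypothetical stand-ins; the removed probe's actual code is not reproduced here.

```rust
// Hypothetical stand-in for the real enum; only two variants are shown.
#[derive(Debug, PartialEq, Clone, Copy)]
enum WeightQuantType {
    F32,
    Q6K,
}

/// Parses an env-var style string. The scrutinee is a `&str`, so a
/// wildcard arm is required by the compiler, yet the arm values mention
/// `WeightQuantType`, which a textual walk-back check can mistake for a
/// catch-all over the enum itself.
fn parse_forced_qtype(value: &str) -> Option<WeightQuantType> {
    match value {
        "f32" => Some(WeightQuantType::F32),
        "q6k" => Some(WeightQuantType::Q6K),
        _ => None,
    }
}
```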

The probe was a bisection tool used to identify the bug location
during §74. Now that §75 has shipped the actual fix and the probe is
no longer needed, removing it cleans up the contract violation.

The remaining PR-E change is solely the F32 GEMV PTX kernel layout
fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs` — that's the
actual bug fix.

Test verified:
  cargo test -p aprender-serve --lib \
      quantize::contract_tests::tests::falsify_007_no_catch_all_in_dispatch_sites
  → 1 passed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(rosetta): add IBM Granite model-family contract (closes #1588)

Adds `contracts/model-families/granite.yaml` so apr-cookbook's
architecture-demos spec flips Granite from `status: blocked` to live.

Granite 3.x dense models follow LLaMA-3 architecture (GQA + RoPE + SwiGLU
+ RMSNorm) with the IBM 49152-token vocab and tied embeddings. No engine
change needed — `from_model_type("granite" | "granite3")` already returns
`Architecture::Llama`, and `kernel_explain/resolve.rs` already aliases
`granite → GraniteForCausalLM`.

Size variants: 2b (granite-3.1-2b-base) and 8b (granite-3.1-8b-base).
MoE variants (granite-3.0-3b-a800m-*) use a separate
GraniteMoeForCausalLM architecture and are out of scope.

Verified:
- `pv validate contracts/model-families/granite.yaml` → 0 errors
- FALSIFY-PARITY-002 (`test_every_model_family_yaml_has_architecture`)
  passes — the family is recognized by `from_model_type` → `Self::Llama`.

References: granite-3.1-2b-base / granite-3.1-8b-base HF config.json.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 14, 2026
…closes #1623 part 2) (#1658)

* fix(aprender-serve): prepare_tokens_apr — no chat-wrap on base models (closes #1623 part 2)

`prepare_tokens_apr` was auto-wrapping ALL APR models with a chat template
when the model:
  - had a known architecture (qwen2 / llama / mistral / phi), OR
  - had `<|im_start|>` in vocab (ChatML special tokens), OR
  - had `instruct` in filename

That's too broad. Base completion models like qwen2.5-coder-0.5b (base,
not instruct) carry the Qwen tokenizer — which includes ChatML special
tokens in vocab — but should NOT be chat-wrapped. The over-trigger
produced garbage-looking output for base models.

Fix mirrors the GGUF path (GH-278): only wrap when the model actually has
a `tokenizer.chat_template` in metadata, OR when filename hints
`instruct` / `-chat`. Architecture and vocab-token heuristics removed.
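
The narrowed rule can be sketched as a single predicate. This is an illustration under stated assumptions, not the body of `prepare_tokens_apr`: the function name and parameters are hypothetical, and the case-insensitive filename match is an assumption.

```rust
/// Returns true only when the model carries an explicit chat template
/// or the filename hints at an instruct/chat fine-tune.
/// Architecture and vocab contents are deliberately not consulted.
fn should_chat_wrap(chat_template: Option<&str>, filename: &str) -> bool {
    // Assumption: filename hints are matched case-insensitively.
    let name = filename.to_ascii_lowercase();
    let has_template = chat_template.map_or(false, |t| !t.trim().is_empty());
    has_template || name.contains("instruct") || name.contains("-chat")
}
```

Under this rule a base model whose vocab merely contains `<|im_start|>` is left unwrapped, because vocab membership is no longer an input.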

Reported in #1623 (the Coursera capstone investigation) — confirmed
`apr run ... '2+2=' --temperature 0 --no-gpu` produces coherent output
on base qwen2.5-coder-0.5b after this fix.

All 6 prepare_tokens tests still pass.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 14, 2026
) (#1656)

* fix(gguf): Q5_0/Q5_1 dequant layout matches GGML reference (closes #1623)

The Q5_0 and Q5_1 dequantizers in aprender-core were emitting values in
interleaved order [v0, v1, v0, v1, ...] and using wrong high-bit indices
(i*2 / i*2+1). GGML / llama.cpp layout is:

  for j in 0..16:
    y[j]       = low0 (qs[j] & 0x0F) | (qh bit j      << 4)
    y[j + 16]  = low1 (qs[j] >> 4)   | (qh bit j+16   << 4)

Two halves, NOT interleaved. High bit for element j uses bit j; for
element j+16 uses bit j+16.
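
A minimal sketch of the two-halves layout for one 32-element Q5_0 block. The function name is hypothetical and this is not the aprender-core implementation; it assumes the scale `d` has already been decoded to `f32` and that the four `qh` bytes are packed into a little-endian `u32`. The `- 16` offset is the Q5_0 zero point from the GGML reference (Q5_1 adds a per-block minimum instead).

```rust
/// Dequantizes one Q5_0 block: 16 packed bytes `qs`, 32 high bits `qh`,
/// and a scale `d`. Output is two halves, not interleaved:
/// y[j] comes from the low nibble, y[j + 16] from the high nibble.
fn dequantize_q5_0_block(d: f32, qh: u32, qs: &[u8; 16]) -> [f32; 32] {
    let mut y = [0.0f32; 32];
    for j in 0..16 {
        // Fifth bit for element j is bit j; for element j + 16 it is bit j + 16.
        let hi0 = (((qh >> j) & 1) as u8) << 4;
        let hi1 = (((qh >> (j + 16)) & 1) as u8) << 4;
        let q0 = ((qs[j] & 0x0F) | hi0) as i32 - 16;
        let q1 = ((qs[j] >> 4) | hi1) as i32 - 16;
        y[j] = q0 as f32 * d;
        y[j + 16] = q1 as f32 * d;
    }
    y
}
```

A layout test along these lines distinguishes the interleaved ordering from the reference ordering, which length and finiteness checks cannot do.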

Existing tests only checked length and finite-ness — never the layout.
Adds two GGML-reference layout tests (`test_dequantize_q5_0_ggml_layout`,
`test_dequantize_q5_1_ggml_layout`) that fail under the buggy code and
pass under the fix.

Reported in #1623 from a Coursera capstone using mixed-quant GGUF.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 14, 2026
… (10/10 AC-SHIP1-* LIVE-DISCHARGED) (#1651)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 14, 2026
…P-TWO-SECTION-75) (#1652)

PR-E (#1651) shipped the single-file F32 GEMV PTX layout fix. SHIP-007
LIVE-DISCHARGED. All 10 AC-SHIP1-* now LIVE on canonical 7B Qwen2.5-
Coder-Instruct Q4_K_M teacher.

10/10 LIVE-discharge table:
  SHIP-001  §72  apr run <safetensors> exit 0
  SHIP-002  §61  apr run "def fib(n):" valid Python (#1609)
  SHIP-003  §72  apr diff 20 tensors at cos_sim=1.000000
  SHIP-004  §72  llama-cli exit 0, 133.1 gen tok/s
  SHIP-005  §71  HumanEval pass@1 = 86.59% (gx10 164-run)
  SHIP-006  §61.8 apr qa 12-gate aggregate PASS (#1615)
  SHIP-007  §75  PARITY-GATE PASS + 124.6 tok/s @ 128-tok (this section)
  SHIP-008  §61  apr run SHIP-008 USER → 256-token ChatML (#1614)
  SHIP-009  §72  apr inspect license/provenance fields
  SHIP-010  §72  sha256 match 0a854098…

Empirical discharge proof for SHIP-007:
  apr bench <canonical 7B APR> --iterations 5 --max-tokens 128
  → tokens_per_second: 124.6
  → AC-SHIP1-007 floor: 30 → headroom 4.15×
  → PARITY-GATE: PASS (no error)
  → Default path (CUDA graphed), no SKIP_PARITY_GATE, no APR_SKIP_FP8_WARMUP

Cascade arc closeout:
  §63 2026-05-11 → SHIP-007 framed as 3-layer cascade
  §73 2026-05-12 → re-measurement: only parity layer blocks
  §74 2026-05-13 → bug LOCALIZED to F32 GEMV via PR-B stage bisection
  §75 2026-05-13 → PR-E layout fix → MODEL-1 100%

§73's '3-5 PR / 3-5 day' estimate. Actual: 4 PRs (#1648 contract,

Methodology lesson #22 NEW: symptom analysis (sign-flipped top-K
divergences + CPU/GPU mean mismatch + sane intermediates) →
bug class localization in O(1). Methodology lessons compose;
each makes the next cheaper.
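
The per-stage signals named above (mean, rms, cosine between CPU and GPU vectors) can be computed with a few lines. This is a sketch with hypothetical helper names, assuming non-empty, non-zero `f32` slices of equal length; it is not the PR-B stage-dump code.

```rust
/// Arithmetic mean, accumulated in f64 to limit rounding error.
fn mean(v: &[f32]) -> f64 {
    v.iter().map(|&x| x as f64).sum::<f64>() / v.len() as f64
}

/// Root mean square of the vector.
fn rms(v: &[f32]) -> f64 {
    (v.iter().map(|&x| (x as f64).powi(2)).sum::<f64>() / v.len() as f64).sqrt()
}

/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let dot: f64 = a.iter().zip(b).map(|(&x, &y)| x as f64 * y as f64).sum();
    let na = a.iter().map(|&x| (x as f64).powi(2)).sum::<f64>().sqrt();
    let nb = b.iter().map(|&x| (x as f64).powi(2)).sum::<f64>().sqrt();
    dot / (na * nb)
}
```

Comparing these three numbers stage by stage is cheap, and a stage whose mean or cosine departs from the CPU reference while its inputs look sane is the first candidate to inspect.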

Ship-% movement:
  MODEL-1 ship %: 99% → 100% 🎉
  MODEL-2 ship %: unchanged at 57% (independent track,
    gated on step 5g.3 val_loss < 9.38).

Spec version: 3.19.0 → 3.21.0 (post-§72/73 stack at 3.18.0;
§74 at 3.20.0; §75 here at 3.21.0).

Out of scope (future work):
- MODEL-2 ship % path (independent track, separate cascade)
- Publish-readiness gates (GATE-SHIP-001/002/003 still need green CI +
  post-publish QA per feedback_post_publish_qa_required.md)
- HumanEval/MBPP benchmark improvements beyond §71's 86.59%

Refs:
- §74 SHIP-007 localization (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- PR #1648 (contract scaffold), #1649 (PR-B stage dump)
- PR #1651 (PR-E F32 GEMV layout fix)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 14, 2026
…w-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)

noahgift added a commit that referenced this pull request May 14, 2026
… (#1660)

* feat(rosetta): add NVIDIA Nemotron model-family contract (closes #1590)

Adds `contracts/model-families/nemotron.yaml` so apr-cookbook
architecture-demos flips Nemotron from `status: blocked` → covered.

Nemotron-LM dense releases are Llama-derivative — Llama-3.1-Nemotron-70B
is an SFT/RLHF tune over meta-llama/Llama-3.1-70B-Instruct, and
Nemotron-Mini-4B-Base / Mistral-NeMo-Minitron-8B are distilled
Llama-style models. All use the standard `LlamaForCausalLM` tensor
naming and GQA + RoPE + SwiGLU + RMSNorm constraints.

`from_model_type("nemotron")` already returns `Architecture::Llama`
(tensor_expectation.rs:142), so no engine change needed — YAML only.

Size variants:
- 4b (Nemotron-Mini-4B-Base — note 256k vocab, RoPE θ=10000)
- 8b (Mistral-NeMo-Minitron-8B — 131k vocab, RoPE θ=10000)
- 70b (Llama-3.1-Nemotron-70B — 128k vocab, RoPE θ=500000)

Verified:
- `pv validate contracts/model-families/nemotron.yaml` → 0 errors
- FALSIFY-PARITY-002 (`test_every_model_family_yaml_has_architecture`)
  passes.

Out of scope: Nemotron-H (hybrid Transformer+SSM) and Nemotron-4 (uses
distinct activation/norm) — separate architecture variants.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>