fix(aprender-gpu): SHIP-007 PR-E — F32 GEMV layout fix → MODEL-1 100% (10/10 AC-SHIP1-* LIVE-DISCHARGED) #1651
Merged
…w-major (was [K,N]); MODEL-1 → 100% (PMAT-CODE-SHIP-007-F32-GEMV-LAYOUT-FIX)
§74 localized the SHIP-007 PARITY-GATE bug to f32_gemv_into via PR-B's
stage-bisection scaffold (CPU vs GPU per-stage statistics analysis).
The F32 GEMV PTX kernel was reading weights with TRANSPOSED layout
interpretation:
Bug: kernel assumed A is K-rows × N-cols row-major (A[i,j] at i*N+j),
but actual ML weights are stored [output_dim=N, input_dim=K]
row-major (A[i,j] at i*K+j per PyTorch/SafeTensors/GGUF convention
and PMAT-333 F32 dequantization output).
Symptom: GPU read transposed weights → computed y = A^T @ x instead
of y = A @ x → systematically anti-correlated logits
(cos=-0.005190 vs CPU, top-10 divergences all sign-flipped,
CPU mean=-2.42 vs GPU mean=0.013).
Fix: rewrite the inner loop to iterate along the K dimension within
row block_id:
row_base = a_ptr + block_id * K * 4
thread reads A[block_id, t], A[block_id, t+32], ...
instead of:
col_base = a_ptr + block_id * 4
thread reads A[t, block_id], A[t+32, block_id], ...
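The two indexing schemes above can be compared on the CPU. The following is a minimal sketch, not the PTX kernel itself: the function names, the `WARP` constant, and the lane-by-lane simulation are illustrative assumptions. It only shows that reading an `[N, K]` row-major buffer as if it were `[K, N]` yields a different product.

```rust
/// Simulated warp width (assumption taken from the `t, t+32, ...` stride above).
const WARP: usize = 32;

/// Post-fix interpretation: weights stored [N, K] row-major, A[i,j] at i*K + j.
/// Lane `t` accumulates elements t, t+32, ... of one row, then lanes are summed.
fn gemv_row_major(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|row| {
            // Element offset of the row; the byte offset is row * K * 4 for f32.
            let row_base = row * k;
            let mut lanes = [0.0f32; WARP];
            for (t, lane) in lanes.iter_mut().enumerate() {
                for j in (t..k).step_by(WARP) {
                    *lane += a[row_base + j] * x[j];
                }
            }
            lanes.iter().sum::<f32>()
        })
        .collect()
}

/// Pre-fix interpretation: the same buffer read as [K, N] row-major,
/// i.e. A[t, col] at t*N + col, which computes A^T-style output.
fn gemv_transposed_read(a: &[f32], x: &[f32], n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), n * k);
    assert_eq!(x.len(), k);
    (0..n)
        .map(|col| (0..k).map(|t| a[t * n + col] * x[t]).sum::<f32>())
        .collect()
}
```

For a 2×3 weight buffer `[1, 2, 3, 4, 5, 6]` and `x = [1, 0, -1]`, the row-major read gives `[-2, -2]` while the transposed read gives `[-4, -4]`, so the two interpretations disagree even on a tiny case.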
Empirical discharge (canonical 7B teacher, lambda-vector RTX 4090,
default graphed path):
PARITY-GATE: PASS (no error from forward_gpu_resident)
Throughput @ 128-tok 5-iter decode: 124.6 tok/s
AC-SHIP1-007 floor: 30 tok/s
Headroom: 4.15× over floor
TTFT: 8.39 ms
p50 latency: 1016 ms
Before PR-E:
PARITY-GATE FAILED cos=-0.005190
Throughput (with SKIP_PARITY_GATE=1 + SKIP_FP8_WARMUP=1): 5.6 tok/s (§63) / 54.5 tok/s (§73)
GPU CANNOT serve this model
After PR-E:
PARITY-GATE PASS, default path, NO workarounds
124.6 tok/s, 4.15× over floor
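For readers unfamiliar with the `cos=` figure, a cosine-similarity check between CPU and GPU logits can be sketched as below. This is an illustration only: the function names, the error string layout, and the `min_cos` threshold parameter are assumptions, not the actual `forward_gpu_resident` gate.

```rust
/// Cosine similarity of two equally sized logit vectors, accumulated in f64.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&p, &q) in a.iter().zip(b) {
        let (p, q) = (f64::from(p), f64::from(q));
        dot += p * q;
        norm_a += p * p;
        norm_b += q * q;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // treat an all-zero vector as "no agreement"
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Hypothetical gate: pass when the similarity reaches `min_cos`,
/// otherwise return an error carrying the measured value.
fn parity_gate(cpu: &[f32], gpu: &[f32], min_cos: f64) -> Result<f64, String> {
    let cos = cosine_similarity(cpu, gpu);
    if cos >= min_cos {
        Ok(cos)
    } else {
        Err(format!("PARITY-GATE FAILED cos={cos:.6}"))
    }
}
```

A value near 1.0 means the GPU logits point in the same direction as the CPU reference; a value near 0, such as the -0.005190 reported before PR-E, means the two outputs are essentially unrelated.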
Ship-% impact:
MODEL-1 ship %: **99% → 100%**
10 of 10 AC-SHIP1-* LIVE-DISCHARGED:
SHIP-001 (§72) SHIP-002 (§61) SHIP-003 (§72)
SHIP-004 (§72) SHIP-005 (§71) SHIP-006 (§61.8)
SHIP-007 (this PR) SHIP-008 (§61) SHIP-009 (§72)
SHIP-010 (§72)
MODEL-2 ship %: unchanged at 57% (independent track).
Cascade arc closeout: §63 → §73 → PR-A (#1648) → PR-B (#1649)
→ §74 (#1650) → PR-E (this). One PR shipped in 1 day after §73's
'3-5 PR / 3-5 day' estimate.
Auxiliary change: logits.rs adds APR_LM_HEAD_FORCE_QTYPE env-var
probe kept as a diagnostic tool (zero behavior change when unset).
Test plan:
- [x] cargo build --release -p apr-cli --bin apr --features cuda → clean
- [x] apr bench (default path, 128-tok 5-iter) → 124.6 tok/s, passed: true
- [x] apr parity → PARITY-GATE PASS
- [ ] CI tests (workspace-test on per-PR runner)
Refs:
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- contracts/apr-ship-007-gpu-stage-bisection-v1.yaml (PR-A #1648 contract)
- PR #1649 (PR-B GPU stage dump scaffold)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 13, 2026
…P-TWO-SECTION-75)
PR-E (#1651) shipped the single-file F32 GEMV PTX layout fix. SHIP-007 LIVE-DISCHARGED. All 10 AC-SHIP1-* now LIVE on canonical 7B Qwen2.5-Coder-Instruct Q4_K_M teacher.
10/10 LIVE-discharge table:
SHIP-001  §72    apr run <safetensors> exit 0
SHIP-002  §61    apr run "def fib(n):" valid Python (#1609)
SHIP-003  §72    apr diff 20 tensors at cos_sim=1.000000
SHIP-004  §72    llama-cli exit 0, 133.1 gen tok/s
SHIP-005  §71    HumanEval pass@1 = 86.59% (gx10 164-run)
SHIP-006  §61.8  apr qa 12-gate aggregate PASS (#1615)
SHIP-007  §75    PARITY-GATE PASS + 124.6 tok/s @ 128-tok (this section)
SHIP-008  §61    apr run SHIP-008 USER → 256-token ChatML (#1614)
SHIP-009  §72    apr inspect license/provenance fields
SHIP-010  §72    sha256 match 0a854098…
Empirical discharge proof for SHIP-007:
apr bench <canonical 7B APR> --iterations 5 --max-tokens 128
→ tokens_per_second: 124.6
→ AC-SHIP1-007 floor: 30 → headroom 4.15×
→ PARITY-GATE: PASS (no error)
→ Default path (CUDA graphed), no SKIP_PARITY_GATE, no APR_SKIP_FP8_WARMUP
Cascade arc closeout:
§63 2026-05-11 → SHIP-007 framed as 3-layer cascade
§73 2026-05-12 → re-measurement: only parity layer blocks
§74 2026-05-13 → bug LOCALIZED to F32 GEMV via PR-B stage bisection
§75 2026-05-13 → PR-E layout fix → MODEL-1 100%
§73's '3-5 PR / 3-5 day' estimate. Actual: 4 PRs (#1648 contract,
Methodology lesson #22 NEW: symptom analysis (sign-flipped top-K divergences + CPU/GPU mean mismatch + sane intermediates) → bug class localization in O(1). Methodology lessons compose; each makes the next cheaper.
Ship-% movement:
MODEL-1 ship %: 99% → 100% 🎉
MODEL-2 ship %: unchanged at 57% (independent track, gated on step 5g.3 val_loss < 9.38).
Spec version: 3.19.0 → 3.21.0 (post-§72/73 stack at 3.18.0; §74 at 3.20.0; §75 here at 3.21.0).
Out of scope (future work):
- MODEL-2 ship % path (independent track, separate cascade)
- Publish-readiness gates (GATE-SHIP-001/002/003 still need green CI + post-publish QA per feedback_post_publish_qa_required.md)
- HumanEval/MBPP benchmark improvements beyond §71's 86.59%
Refs:
- §74 SHIP-007 localization (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- PR #1648 (contract scaffold), #1649 (PR-B stage dump)
- PR #1651 (PR-E F32 GEMV layout fix)
- AC-SHIP1-007 (spec §5)
- evidence/section-75-ship-007-discharged-2026-05-13/
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…07 contract violation (PMAT-CODE-SHIP-007-PR-E-FALSIFY-007-CLEAN)
The env-var bisection probe added in PR-E (this branch) introduced a
`_ =>` catch-all inside a `match` expression that referenced
`WeightQuantType` in its arm values. The
`falsify_007_no_catch_all_in_dispatch_sites` contract test's 30-line
walk-back heuristic flagged this as a violation, even though the match
was on `&str` (env var value), not on `WeightQuantType`.
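The distinction the heuristic missed can be illustrated with a small sketch. Everything here is hypothetical: the enum variants, the accepted strings, and both function names are invented for illustration, and only the type name `WeightQuantType` comes from the text above.

```rust
/// Hypothetical stand-in for the real quantization enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WeightQuantType {
    F32,
    Q4K,
    Q6K,
}

/// Dispatch on the enum itself: no `_ =>` arm, so adding a variant
/// becomes a compile error until every dispatch site handles it.
fn label(q: WeightQuantType) -> &'static str {
    match q {
        WeightQuantType::F32 => "f32",
        WeightQuantType::Q4K => "q4k",
        WeightQuantType::Q6K => "q6k",
    }
}

/// Parsing an env-var value: the match is on `&str`, which has no finite
/// set of cases, so a `_ =>` arm is required here. A text-based scan that
/// sees `WeightQuantType` near `_ =>` cannot tell this apart from the
/// enum dispatch above.
fn parse_forced_qtype(value: &str) -> Option<WeightQuantType> {
    match value {
        "f32" => Some(WeightQuantType::F32),
        "q4k" => Some(WeightQuantType::Q4K),
        "q6k" => Some(WeightQuantType::Q6K),
        _ => None,
    }
}
```

Under these assumptions, `parse_forced_qtype("q4k")` returns `Some(WeightQuantType::Q4K)` and any unrecognized string returns `None`, leaving behavior unchanged when no override is supplied.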
The probe was a bisection tool used to identify the bug location
during §74. Now that §75 has shipped the actual fix and the probe is
no longer needed, removing it cleans up the contract violation.
The remaining PR-E change is solely the F32 GEMV PTX kernel layout
fix in `crates/aprender-gpu/src/kernels/gemv/mod.rs` — that's the
actual bug fix.
Test verified:
cargo test -p aprender-serve --lib \
quantize::contract_tests::tests::falsify_007_no_catch_all_in_dispatch_sites
→ 1 passed
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP)
🎉 v0.33.0 marks **MODEL-1 SHIP % = 100%** for SHIP-TWO-001. All 10 AC-SHIP1-* falsifiers are LIVE-discharged on the canonical 7B Qwen2.5-Coder-Instruct Q4_K_M teacher (lambda-vector RTX 4090, --features cuda).
This release prep PR ships:
1. CHANGELOG.md [0.33.0] entry with §69-§75 highlights:
   - 🎉 MODEL-1 SHIP % = 100% (all 10 AC-SHIP1-* LIVE)
   - Fixed: SHIP-007 F32 GEMV PTX layout (PR #1651, §75): 124.6 tok/s
   - Fixed: SHIP-005 HumanEval RC3 (PR #1635, §70/§71): pass@1 86.59%
   - Added: APR_EVAL_DEBUG=1 diagnostic surface (PR #1634)
   - Added: APR_GPU_STAGE_DUMP=<dir> diagnostic surface (PR #1649)
   - Added: MBPP harness H4 fix (PR #1645)
   - Added: 2 new falsifiable contracts (apr-eval-humaneval-harness-invariant v1.1.0, apr-ship-007-gpu-stage-bisection v1.0.0)
   - Methodology lessons #16-22 captured in MEMORY.md
   - Spec: v3.13.0 → v3.21.0 across §67-§75
2. Workspace version bump:
   - [workspace.package].version: 0.32.0 → 0.33.0
   - Root [package].version (aprender facade crate): 0.32.0 → 0.33.0
   - 28 sub-crate version literals: 0.32.0 → 0.33.0
3. `cargo check -p aprender` → clean (workspace builds at 0.33.0).
Out of scope for this PR (separate steps after #1651/1652 land + this PR lands):
- Tag release `v0.33.0` on main
- Cascade publish to crates.io (per memory project_ship_two_001_v0_32_0_release.md; 15 user-facing crates + 7 internal-tier in topological dependency order; uses `make publish CRATE=<name>`)
- Post-publish QA per `feedback_post_publish_qa_required.md`: `cargo install aprender --force` + `/dogfood` GO verdict required before declaring release done (v0.31.1 was yanked for skipping this)
- GitHub Release with §75 narrative
- HF artifact verification (paiml/qwen2.5-coder-7b-apache-q4k-v1 sha256 already verified by §72 SHIP-010 LIVE evidence; double-check before release announcement)
This PR ships ONLY the version-bump + CHANGELOG. Publishing is the next step after merge.
Refs:
- §75 MODEL-1 100% (PR #1652)
- §74 SHIP-007 bug localized (PR #1650)
- §73 SHIP-007 cascade reduction (PR #1647)
- §72 5-AC LIVE cascade (PR #1646)
- §71 SHIP-005 LIVE-DISCHARGED (PR #1642)
- §70 RC3 fix (PR #1636)
- §69 Q4K hypothesis falsified (PR #1633)
- PR #1635 RC3 prepend
- PR #1634 diagnostic surface + contract
- PR #1648 SHIP-007 contract scaffold
- PR #1649 SHIP-007 PR-B stage dump
- PR #1651 SHIP-007 PR-E F32 GEMV layout fix
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 13, 2026
…T-CODE-V0-33-0-RELEASE-PREP) (#1653)
noahgift added a commit that referenced this pull request on May 14, 2026
…P-TWO-SECTION-75) (#1652)
🎉 MODEL-1 SHIP-007 LIVE-DISCHARGED → MODEL-1 99% → 100%
After §74's empirical localization to F32 GEMV, this single-file PTX layout fix discharges SHIP-007 and completes MODEL-1.
Root cause
F32 GEMV kernel assumed weight matrix is `[K rows × N cols]` row-major (`A[i,j]` at `i*N+j`). Actual ML weight convention is `[output_dim=N, input_dim=K]` row-major (`A[i,j]` at `i*K+j`).
Kernel was reading TRANSPOSED weights → computed `y = A^T @ x` instead of `y = A @ x` → cos = -0.005, top-10 divergences all sign-flipped.
Fix
`crates/aprender-gpu/src/kernels/gemv/mod.rs`: rewrite inner loop to iterate K within row `block_id`:
- `row_base = a_ptr + block_id * K * 4`
- `A[block_id, t]`, `A[block_id, t+32]`, …
instead of the column-iteration interpretation that assumed `[K,N]` layout.
Empirical discharge
Default path, no `SKIP_PARITY_GATE`, no `APR_SKIP_FP8_WARMUP`. `apr bench` returns `passed: true`.
§17.5 chain post-PR-E
10 of 10 AC-SHIP1-* LIVE-DISCHARGED.
Ship-% movement
Cascade arc
§63 (3-layer cascade framing) → §73 (cascade reduced to 1) → PR-A #1648 (contract) → PR-B #1649 (stage scaffold + dumps) → §74 #1650 (bug LOCALIZED) → PR-E (this) (1-file fix + discharge)
Per §73's original estimate, this was "3-5 PR / 3-5 days". Actual: 1 PR in 1 day after the bisection scaffolding shipped.
Auxiliary change
`crates/aprender-serve/src/cuda/executor/layers/logits.rs` adds `APR_LM_HEAD_FORCE_QTYPE` env-var probe (kept as a diagnostic tool; zero behavior change when unset).
Test plan
- `cargo build --release -p apr-cli --bin apr --features cuda` → clean
- `apr bench` default path, 128-tok 5-iter → 124.6 tok/s, `passed: true`
- `apr parity` → PARITY-GATE PASS (no error)
- `evidence/section-75-ship-007-discharged-2026-05-13/`
Refs
- `contracts/apr-ship-007-gpu-stage-bisection-v1.yaml` (PR-A #1648 contract)
- `evidence/section-75-ship-007-discharged-2026-05-13/findings.json`
🤖 Generated with Claude Code