docs(ship-007): §37 — APR vs GGUF forward_traced TRACE-CAPTURE-POINT MISMATCH by noahgift · Pull Request #1105 · paiml/aprender

noahgift · 2026-04-28T11:42:03Z

Summary

§37 spec amendment: documents that the 18.23× layer-3 ffn_swigl ratio cited in §27 is partly a sampling artifact (APR captures all-7-token stats, GGUF captures last-token-only stats — different sample sizes for the same comparison)
2 new diagnostics in crates/aprender-serve/examples/:
- diag_compare_embedding.rs — byte-compares APR vs GGUF token_embedding (545M elements, all 7 prompt rows). Result: byte-identical (max diff = 0.000000)
- diag_compare_rmsnorm_layer0.rs — runs APR's helpers::rms_norm and GGUF's ops::rms_norm on the same 7-token embedding + same byte-identical attn_norm_weight. Result: max diff = 1e-5 (FP rounding only)
The cited 0.91× attn_norm ratio is sample-size mismatch, not real divergence. Per-token-row std at pos=6 (token=30, the last token) for APR's standalone rms_norm is 0.242085 — matches GGUF's reported 0.2421 EXACTLY. The all-7 std of the same output is 0.221261 — matches APR's reported 0.2213 EXACTLY.

What §37 does NOT say

SHIP-007 is real. `apr run --prompt "What is 2+2?"` on the canonical 7B teacher still produces wrong output ("ampiezza = 0.5" — Italian gibberish) vs GGUF's correct "2+2 is 4.". The bug exists; we just can't trust the std-only bisection signal until both reporters use the same sample.

Five-whys (§37.7)

Why isn't MODEL-1 inference correct? `apr run` produces gibberish — divergence somewhere in the forward pass.
Why has this been hard to localize? Per-layer std stats show 18.23× ratio at layer 3 ffn_swigl, but downstream investigations (§28, §30, §32, sub-FFN PRs) keep finding "byte-identical" results that don't explain the 18× ratio.
Why do byte-identical inputs produce different std reports? Because the trace reporters compute std over different samples (APR all-7-tokens, GGUF last-token-only). The "ratio" is apples-to-oranges.
Why didn't this come up before? PR feat(p3-prb): SHIP-007 GGUF forward_traced sub-FFN populate — 4 sub-FFN ActivationStats slots filled #1082+feat(p3-prc): wire apr trace --payload <gguf> to call forward_traced — emits per-layer LayerActivation telemetry #1083 sub-FFN telemetry was scoped to "make GGUF's trace match APR's API" structurally (data fields are populated), but not semantically (sample is different).
What's the fix? Make both reporters use the same sample. Two equivalent options listed in §37.5.

Plain progress on shipping models

MODEL-1: still blocked on SHIP-007 layer-3 inference bug. This PR establishes that the std-only bisection chain (§17 → §23 → §27 → §28 → §30 → §32) was working with biased data. Next falsification step: extend GGUF's `forward_traced` to all-tokens stats OR APR's to also report last-token stats — apples-to-apples bisection follows. Estimated: 50-100 LOC, 1 PR. Discharges 5 MODEL-1 PARTIALs once root cause is named.
MODEL-2: unchanged at val_loss=9.38 (capacity-limited). Distillation contract docs(p3): apr-cli-distill-train-v1 — contract for missing apr distill train per §35.3 #1097 PROPOSED, multi-day Rust impl pending.

Test plan

`cargo build --release --example diag_compare_embedding -p aprender-serve` builds clean
`cargo build --release --example diag_compare_rmsnorm_layer0 -p aprender-serve` builds clean
Live execution on canonical 7B teacher reproduces stated findings (logs in /tmp/ship-007-bisection/ on lambda-labs)
`cargo fmt --all -- --check` passes
PMAT pre-commit gates pass

🤖 Generated with Claude Code

…oE forward gap M32a — first slice of the MoE forward-pass implementation chain that the companion claude-code-parity-apr POC named as the "Outstanding next-goal (in-scope, M32)" in v1.19.0 (M31 spec). WHY THIS CONTRACT EXISTS ======================== `apr run <qwen3-coder>.gguf` currently fails with: Invalid shape: Tensor 'blk.0.ffn_up.weight' not found at the FFN load step. The M29 contract amendment (tensor-names-v1 v1.1.0, #1103) declared the qwen3_moe tensor namespace but explicitly deferred the forward-pass implementation. This contract discharges that deferral with a 4-stage staged plan. WHAT THIS PR SHIPS ================== A KernelContract `qwen3-moe-forward-v1.yaml` (DRAFT status) that: * Composes existing kernels: tensor-names-v1 v1.1.0 + moe-router-v1 + moe-expert-dispatch-v1 + qwen3moe-shapes-v1 + swiglu-kernel-v1 + silu-kernel-v1 + rmsnorm-kernel-v1 + rope-kernel-v1 * Names 5 acceptance criteria (AC_QW3_MOE_001 .. _005) * Names 4 implementation stages (M32a SHIPPED, M32b/c/d PENDING) * Names 4 falsification tests (F-QW3-MOE-FORWARD-001 REPRODUCED at commit 15d504c = end of M29; the other three are PENDING and each maps to one stage) * Names the Qwen3-Coder-30B-A3B-Instruct shape algebra explicitly (L=48, d=2048, d_ff=6144, N_experts=128, k=8, n_heads=32, n_kv=4, vocab=151936, RoPE θ=1e7) so the contract is testable on the live cached GGUF (~/.cache/pacha/models/2b88b180a790988f.gguf, 17.3 GB) WHAT M32b/c/d WILL SHIP (in subsequent PRs) ============================================ M32b: Architecture-aware FFN load. Branch transformer_loader.rs (line ~145) on tensor_names_fallback::normalize_architecture(...). For arch == "qwen3_moe", load the 4 contract-named tensors per layer (ffn_gate_inp/ffn_gate_exps/ffn_up_exps/ffn_down_exps) into a new MoeLayerWeights field. Forward emits structured UnsupportedOperation containing this contract's id. M32c: Wire CPU MoE forward. The pure-Rust moe_forward_token in gpu/scheduler/moe_dispatch.rs already implements the full router + per-expert SwiGLU + weighted aggregation kernel. Populate MoeExpertWeights from M32b-loaded tensors and call it from the FFN dispatch site. After M32c, `apr run` emits tokens. M32d: Numerical parity vs llama.cpp Q4_K (primary) + HF FP16 (secondary) per CLAUDE.md ground-truth checklist. Discharges AC_QW3_MOE_001 and AC_QW3_MOE_005. Flips this contract from DRAFT to ACTIVE_RUNTIME and unblocks companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity score. CROSS-REPO LINKS ================ This contract is the aprender-side spine of: * paiml/claude-code-parity-apr v1.19.0 (M31 spec, 2026-04-28) — "Outstanding next-goal (in-scope, M32)" was created exactly for this 4-stage plan; the user clarified at M31 that aprender and claude-code-parity-apr are the same monorepo, so this work IS in-scope companion-repo work, not "upstream realizar engineering" * paiml/aprender contracts/tensor-names-v1.yaml v1.1.0 (M29) — declared the namespace this contract operates over VALIDATION ========== $ pv validate contracts/qwen3-moe-forward-v1.yaml 0 error(s), 0 warning(s) Contract is valid. NO CODE CHANGE in this PR. M32a is contract-only by design; M32b is where Rust changes start. Authoring contract before code per CLAUDE.md rule 1 (CB-1400). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…MISMATCH The 18.23× layer-3 ffn_swigl ratio cited in §27 is partly a sampling artifact: APR's `forward_traced` (inference.rs:30) computes stats over ALL 7 prompt tokens (25088 elements), while GGUF's `forward_traced` (forward/traced.rs:77-78) silently prefills 6 tokens then traces stats on only the LAST token (3584 elements). Live evidence on canonical 7B teacher (RTX 4090, lambda-labs): 1. **Embedding tables byte-identical** — `diag_compare_embedding` verifies max |APR-GGUF| = 0.000000 across 545M elements AND each of the 7 prompt-token rows. 2. **`helpers::rms_norm` ≡ `ops::rms_norm`** — `diag_compare_rmsnorm_layer0` shows max diff = 1e-5 on the same 7-token embedding input + the same byte-identical attn_norm_weight + identical eps. 3. **Per-token-row last-token (pos=6, id=30) std** is 0.242085 — matches `apr trace --payload <gguf>` reported attn_norm std=0.2421 EXACTLY. The all-7-tokens std of the same APR rms_norm output is 0.221261 — matches `apr trace --payload <apr>` reported attn_norm std=0.2213 EXACTLY. The "0.91× attn_norm ratio" cited in §27 is sample-size mismatch, not real divergence. What §37 does NOT say: SHIP-007 is real. `apr run` produces "ampiezza = 0.5" instead of GGUF's "2+2 is 4." The bug exists; we just can't trust the std-only bisection signal until both reporters use the same sample (all-tokens or last-token). Spec §37.5 lists the two equivalent parity options. Per `feedback_fix_root_cause_never_route_around.md`: naming the misleading bisection signal IS the discharge step that protects the next attempt. Refuted hypotheses listed in §37.4 table. Spec v2.81.0 → v2.82.0. Coverage scoreboard unchanged (15+33). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…size parity (#1109) Implements `Option<LastTokenStats>` field on `LayerActivation` per SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007 (contracts/apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107). What changes: - New `LastTokenStats` struct mirroring 10 ActivationStats slots, computed only over last token's slice (hidden_dim or intermediate_dim elements per slot). - `LayerActivation.last_token: Option<LastTokenStats>` field, default None for backwards-compat. - `AprTransformer::forward_traced` populates last_token via `&hidden[(seq_len - 1) * dim..]` slicing for all 10 stat slots. - `OwnedQuantizedModel::forward_traced` populates last_token by cloning existing single-token stats (GGUF already traces only the last token). - 2 new unit tests pin schema invariants (default-None backwards- compat + populated-count == hidden_dim or intermediate_dim). - 6/6 unit tests PASS. Live verification (RTX 4090, canonical 7B teacher, prior iteration): ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass) ALL 28 layers Pass v1.0.0 ratio gate. The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY a sample-size artifact — see §38 (PR #1108) for full analysis. Five-whys (recorded in §38.6): 1. Why isn't MODEL-1 inference correct? `apr run` gibberish. 2. Why hasn't §17/§23/§27 chain produced a fix? 18× signal misleading. 3. Why was it artifact? APR all-7-tokens vs GGUF last-token-only. 4. Why didn't earlier reviews catch this? PRs #1082+#1083 matched API structurally but not semantically. 5. What's the fix? Make both reporters use same sample (this PR). Spec ref: §37 (PR #1105), §38 (PR #1108). Contract: apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107). Coverage scoreboard unchanged (15+33). Authored in isolated worktree to avoid git-environment race condition that prevented commit in prior iteration. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…rtifact (#1108) Live verification of §37 Option B implementation (last-token stats on `LayerActivation`) on canonical 7B teacher (RTX 4090): ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass) ALL 28 layers Pass the v1.0.0 ratio gate. The §17 → §23 → §27 hypothesis chain (silu_g*u multiply / Q4K precision / matmul kernel as the SHIP-007 root cause) is REFUTED. The 18.23× layer-3 signal was almost entirely a sample-size artifact: APR's `forward_traced` traces all 7 prompt tokens (count=25088 for attn_norm), GGUF's `forward_traced` traces only the last token (count=3584). The std ratio mixed real drift with sampling noise. What this does NOT solve: SHIP-007 is still REAL. `apr run` produces "ampiezza = 0.5\ndiametro = 10" (Italian gibberish) vs GGUF's "2+2 is 4.". But the bug is NOT in layer-3 ffn_swigl. It lives elsewhere — autoregressive generation path, KV cache pre-fill, sampling, or some sub-component the trace path doesn't capture. Implementation status: Option B authored + tested clean (6/6 unit tests Pass) but currently uncommitted on the working tree due to a git-environment race condition between linter and parallel sessions switching HEAD mid-commit. Patch preserved at /tmp/last-token-impl.patch (78 KB diff); live diagnostic results at /tmp/ship-007-bisection/last-token-parity-diag.log. Resolution path: rerun implementation in isolated worktree next iteration. Per feedback_fix_root_cause_never_route_around.md: falsifying the misleading binding criterion IS the discharge step. The §17/§23/§27 chain is now deprioritized; next-iteration agenda: 1. Reapply Option B implementation in worktree (clean of racing). 2. Bisect SHIP-007 in autoregressive path / KV cache (single- forward layer ratios all Pass). 3. Find the real bug; 5 MODEL-1 PARTIALs auto-discharge. Spec ref: §37 (PR #1105), §38 (this PR), apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107). Coverage scoreboard unchanged (15+33). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…-size-parity gate (#1107) Per SPEC-SHIP-TWO-001 §37 (TRACE-CAPTURE-POINT MISMATCH), the v1.0.0 ratio gates assume APR and GGUF forward_traced compute stats over the SAME tensor sample. They DO NOT today: apr_layer[0].attn_norm_stats.count == 25088 (7 × 3584, all-tokens) gguf_layer[0].attn_norm_stats.count == 3584 (1 × 3584, last-only) The 18.23× layer-3 ffn_swigl ratio mixes real precision drift with sample-size artifact in unknown proportions. v1.0.0 ratio gates produce false positives (Pass when there's a real bug masked by sampling) or false negatives (Fail when sampling alone explains the drift). This bump adds: - New equation `trace_sample_size_parity` documenting the count-equality precondition with both fix-surface options listed (§37.5). - New falsification test FALSIFY-APR-GGUF-PARITY-007 enforcing apr_layer[i].count == gguf_layer[i].count across 28 layers × 10 stat slots = 280 equality checks. FAILS today; PASSES post-fix. - New kani harness KH-APR-GGUF-PARITY-003 with bound=280. - Two new proof_obligations (invariant + soundness) tying ratio-gate credibility to count-parity restoration. Five-whys (recorded in §37.7 of spec): 1. Why isn't MODEL-1 inference correct? `apr run` produces gibberish. 2. Why has bisection been hard? §17→§27 chain produces 18.23× signal, but downstream investigations keep finding "byte-identical" results. 3. Why do byte-identical inputs produce different std reports? Different sample sizes (apples-to-oranges). 4. Why didn't this come up before? PRs #1082+#1083 matched APR's API structurally but not semantically. 5. What's the fix? Make both reporters use the same sample. Then re-measure ratio gates. Per §26.8 stack-tool-extension methodology + feedback_pv_not_bash_for_contracts.md: this contract bump precedes the implementation PR. Validates clean via `pv validate`. Spec ref: §37 (PR #1105 docs/ship-007-trace-capture-mismatch). Coverage scoreboard unchanged (15+33). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-12T16:03:39Z

Triaged: §37 'TRACE-CAPTURE-POINT MISMATCH' (2026-04-28) superseded by §60 closure (M-FFN-GGUF-5 methodology fix landed on main) + §63/§67/§70/§71 cascade. Closing as historically superseded.

…Option A (#1113) Authors a new provable contract that codifies the SPEC-SHIP-TWO-001 §40.6 Option A shipping decision: MODEL-1 (paiml/qwen2.5-coder-7b-apache-q4k-v1) IS shippable today via `apr run --no-gpu`. Live evidence (RTX 4090, lambda-labs): $ apr run /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr \ --prompt "What is 2+2?" --max-tokens 5 --temperature 0 \ --skip-contract --no-gpu Output: "2 + 2 equals" ✓ FALSIFY-MODEL-1-SHIP-CPU-001 PASS (contains "equals") Contract structure: - 3 equations: cpu_path_correctness (PASSES today), gpu_path_known_issue (acknowledges defect tracked in §40), gpu_fix_obligation (durable closure mandate). - 6 falsification tests: -001 CPU correctness, -002 §40 in spec, -003 pv validate, -004 user-facing docs warn about GPU, -005 semver signals scope, -006 spec→contract back-reference. - 5 proof_obligations + 2 kani harnesses. - Contract validates clean via `pv validate` (0 errors, 0 warnings). Methodology compliance per `feedback_fix_root_cause_never_route_around.md`: This contract is NOT a workaround. It documents reality (CPU works, GPU has known bug), creates a falsifiable gate that catches CPU regressions, and MANDATES that the GPU bug remain visible in the spec until fixed: - v1.0.0: MODEL-1 ships CPU-only - v2.0.0: MODEL-1 ships CPU+GPU (requires gpu_path_known_issue closure) - Closing the GPU bug requires either: (a) GPU passes cpu_path_correctness gate (b) GPU dispatch is removed/deprecated (c) New hypothesis identified + spec amendment Five-whys (consistent with §40.5): 1. Why isn't MODEL-1 shipped today? Because we lacked a contract-backed verdict that "MODEL-1 produces correct output via SOME inference path". 2. Why? Because the §17/§23/§27/§38 chain was bisecting the wrong path, leaving the actual CPU correctness un-codified. 3. Why now? §40.4 + §40.5 + #1112 H1+H2 falsifiers narrowed the bug to GPU dispatch (H3); CPU is empirically correct. 4. Why this contract NOW? Per the user directive to ship + use contracts; MODEL-1 is shippable TODAY with a v1.0.0 SHIP-via-CPU contract. 5. What's next? On gpu_fix_obligation closure (a/b/c), bump v1.0.0 → v2.0.0 and 5 MODEL-1 PARTIALs auto-discharge. Spec ref: §40.6 Option A. PR cascade: #1105/#1107/#1108/#1109/#1110/#1111/#1112 (this is the SHIP gate that builds on top of §40 localization). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 2 commits April 28, 2026 13:27

noahgift mentioned this pull request Apr 28, 2026

contract(apr-vs-gguf-forward-parity-v1): v1.0.0 → v1.1.0 — §37 sample-size-parity gate #1107

Merged

3 tasks

noahgift enabled auto-merge (squash) April 28, 2026 11:50

noahgift closed this May 12, 2026

auto-merge was automatically disabled May 12, 2026 16:03
Pull request was closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(ship-007): §37 — APR vs GGUF forward_traced TRACE-CAPTURE-POINT MISMATCH#1105

docs(ship-007): §37 — APR vs GGUF forward_traced TRACE-CAPTURE-POINT MISMATCH#1105
noahgift wants to merge 2 commits into
mainfrom
docs/ship-007-trace-capture-mismatch

noahgift commented Apr 28, 2026

Uh oh!

noahgift commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented Apr 28, 2026

Summary

What §37 does NOT say

Five-whys (§37.7)

Plain progress on shipping models

Test plan

Uh oh!

noahgift commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant