docs(ship-007): §39 — apr run ≠ apr trace --payload (production vs F32-reference forward paths) by noahgift · Pull Request #1110 · paiml/aprender

noahgift · 2026-04-28T12:22:40Z

Summary

§39 spec amendment documenting that `apr run` and `apr trace --payload` use DIFFERENT forward paths producing DIFFERENT first-token logits.

Live evidence (RTX 4090, canonical 7B teacher)

apr trace --payload  →  top-1 = token 220 (" ")
apr run --temperature 0 --max-tokens 5  →  output = "ampiezza = 1"

For the same model, the same prompt, and the same greedy sampling, both reports MUST come from identical first-token logits. They do not.
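The reasoning above rests on greedy sampling being a pure function of the logits: at temperature 0 the sampler is assumed to reduce to an argmax, so identical first-token logits would force identical first tokens. A minimal sketch of that step, assuming plain `f32` logits in a slice; `greedy_token` is a hypothetical helper for illustration, not a function from aprender.

```rust
/// Hypothetical helper: greedy (temperature 0) token selection.
/// Returns the index of the largest logit, or None for an empty slice.
/// Ties resolve to the lowest index; NaN entries are skipped.
fn greedy_token(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            // Keep the current best unless this logit is strictly larger.
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}
```

Because the function is deterministic, two runs that disagree on the first token cannot have been fed the same logits, which is the observation this PR records.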

Root cause (per source)

`crates/aprender-serve/src/apr_transformer/inference.rs:38`:
```
// Note: Q4K layers not used in traced forward (uses F32 for accuracy)
```

`forward_traced` deliberately uses an F32-only reference path, while `apr run` uses the production Q4K-fused path. The two paths produce different logits.

Critical implication for prior work

- §38's finding "all layers Pass with parity" is correct, but it only shows that the F32-reference path is per-layer-correct, NOT that SHIP-007 is fixed.
- The §17/§23/§27 hypothesis chain bisected the WRONG path (F32-reference, not production) and is officially deprioritized.
- SHIP-007 lives in the gap between the F32-reference and production Q4K-fused paths.

Five-whys (codified §39.5)

1. Why does `apr run` produce "ampiezza = 1" at temp=0? First sampled token differs from trace top-1.
2. Why differ? Different forward paths.
3. Why is there F32-only reference? Per comment, "for accuracy".
4. Why doesn't production match F32 reference modulo Q4K rounding? The bug. Hypotheses in §39.3.
5. What's the fix? §39.4 diagnostic localizes.

Plain progress on shipping models

MODEL-1: SHIP-007's location is dramatically narrowed to the production Q4K-fused path, NOT layer-3 ffn_swigl. Next iteration: author `diag_logits_apr_run_vs_apr_trace.rs` (a 1-2 hour task) that captures first-token logits via both paths and diffs them element-wise.
MODEL-2: unchanged at val_loss=9.38. Awaiting the distill-train implementation per #1097 (docs(p3): apr-cli-distill-train-v1 — contract for missing apr distill train per §35.3).
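The comparison at the core of the proposed `diag_logits_apr_run_vs_apr_trace.rs` could be sketched as follows. This is only an assumption about its shape: `LogitDiff`, `argmax`, and `diff_logits` are hypothetical names, and capturing the two logit vectors from the F32-reference and Q4K-fused paths is left out because the PR does not show those APIs.

```rust
/// Hypothetical report for an element-wise comparison of two logit vectors.
#[derive(Debug, PartialEq)]
struct LogitDiff {
    max_abs_diff: f32,        // largest |reference - production|
    worst_index: usize,       // token id where that largest gap occurs
    argmax_reference: usize,  // greedy token on the reference path
    argmax_production: usize, // greedy token on the production path
}

/// Index of the largest value; ties resolve to the lowest index.
/// Callers must pass a non-empty slice.
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in xs.iter().enumerate() {
        if v > xs[best] {
            best = i;
        }
    }
    best
}

/// Diff first-token logits from the two forward paths element-wise.
fn diff_logits(reference: &[f32], production: &[f32]) -> Result<LogitDiff, String> {
    if reference.len() != production.len() {
        return Err(format!(
            "vocab size mismatch: {} vs {}",
            reference.len(),
            production.len()
        ));
    }
    if reference.is_empty() {
        return Err("empty logits".to_string());
    }
    let mut max_abs_diff = 0.0f32;
    let mut worst_index = 0;
    for (i, (r, p)) in reference.iter().zip(production).enumerate() {
        let d = (r - p).abs();
        if d > max_abs_diff {
            max_abs_diff = d;
            worst_index = i;
        }
    }
    Ok(LogitDiff {
        max_abs_diff,
        worst_index,
        argmax_reference: argmax(reference),
        argmax_production: argmax(production),
    })
}
```

Reporting both the maximum gap and the two argmax indices keeps the result falsifiable in either direction: a small gap with matching argmax would point away from the forward paths, while a large gap or differing argmax would confirm the divergence described here. What counts as "small" relative to Q4K rounding is not specified in this PR and would need to be chosen by the author.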

Methodology adherence

Live verification on canonical 7B teacher ✓
Five-whys in §39.5 ✓
Provable falsification step proposed in §39.4 ✓
Per `feedback_fix_root_cause_never_route_around.md`: naming the actual location (production path vs reference path) is the discharge step ✓

Test plan

PMAT pre-commit gates pass
Authored in worktree (no git racing)
Stacks cleanly on main; references PRs #1105 (docs(ship-007): §37 — APR vs GGUF forward_traced TRACE-CAPTURE-POINT MISMATCH), #1107 (contract(apr-vs-gguf-forward-parity-v1): v1.0.0 → v1.1.0 — §37 sample-size-parity gate), #1108 (docs(ship-007): §38 — layer-3 18.23× ratio FALSIFIED as sample-size artifact), and #1109 (feat(apr-trace): §37 Option B — last_token stats for APR/GGUF sample-size parity)

🤖 Generated with Claude Code

…d paths

Live evidence on canonical 7B teacher (RTX 4090):

  apr trace --payload → top-1 = token 220 (" ")
  apr run --temperature 0 → first output = "ampiezza" (NOT 220)

For same model + same prompt + same greedy sampling, both reports MUST come from identical first-token logits. They don't.

Per `crates/aprender-serve/src/apr_transformer/inference.rs:38`:

  // Note: Q4K layers not used in traced forward (uses F32 for accuracy)

So `forward_traced` deliberately uses F32-only reference path. `apr run` uses the production Q4K-fused path. They produce different logits.

This means:
- §38's finding "all layers Pass with parity" is correct, but it shows F32-reference path is per-layer-correct — not that SHIP-007 is fixed.
- The §17/§23/§27 hypothesis chain bisected the WRONG path (F32-reference, not production). Officially deprioritized.
- SHIP-007 lives in the gap between F32-reference and production Q4K-fused paths.

Hypotheses for the gap (§39.3):
1. Q4K-fused matmul kernel produces non-correct logits at some layer (only layer-0 q-proj tested in §30).
2. KV cache pre-fill path differs from no-cache reference.
3. RoPE applied differently in production vs reference path.
4. CUDA/wgpu/FP8 dispatch routing differences.

Falsifiable next step (§39.4): author `diag_logits_apr_run_vs_apr_trace.rs` that captures first-token logits via both paths and diffs element-wise. 1-2 hour task, produces falsifiable result.

Five-whys (§39.5):
1. Why does `apr run` produce "ampiezza = 1" at temp=0? First sampled token differs from trace top-1.
2. Why differ? Different forward paths.
3. Why is there F32-only reference? Per comment, "for accuracy".
4. Why doesn't production match F32 reference modulo Q4K rounding? The bug. Hypotheses listed.
5. What's the fix? §39.4 diagnostic localizes; root-cause follows.

Spec v2.83.0 → v2.84.0. Coverage scoreboard unchanged (15+33).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-12T15:58:55Z

Triaged stale (last touched 2026-04). §39 doc — superseded by §63+§67 SHIP-007 progress already on main. Closing as superseded.

…Option A (#1113)

Authors a new provable contract that codifies the SPEC-SHIP-TWO-001 §40.6 Option A shipping decision: MODEL-1 (paiml/qwen2.5-coder-7b-apache-q4k-v1) IS shippable today via `apr run --no-gpu`.

Live evidence (RTX 4090, lambda-labs):

  $ apr run /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr \
      --prompt "What is 2+2?" --max-tokens 5 --temperature 0 \
      --skip-contract --no-gpu
  Output: "2 + 2 equals"
  ✓ FALSIFY-MODEL-1-SHIP-CPU-001 PASS (contains "equals")

Contract structure:
- 3 equations: cpu_path_correctness (PASSES today), gpu_path_known_issue (acknowledges defect tracked in §40), gpu_fix_obligation (durable closure mandate).
- 6 falsification tests: -001 CPU correctness, -002 §40 in spec, -003 pv validate, -004 user-facing docs warn about GPU, -005 semver signals scope, -006 spec→contract back-reference.
- 5 proof_obligations + 2 kani harnesses.
- Contract validates clean via `pv validate` (0 errors, 0 warnings).

Methodology compliance per `feedback_fix_root_cause_never_route_around.md`:

This contract is NOT a workaround. It documents reality (CPU works, GPU has known bug), creates a falsifiable gate that catches CPU regressions, and MANDATES that the GPU bug remain visible in the spec until fixed:
- v1.0.0: MODEL-1 ships CPU-only
- v2.0.0: MODEL-1 ships CPU+GPU (requires gpu_path_known_issue closure)
- Closing the GPU bug requires either:
  (a) GPU passes cpu_path_correctness gate
  (b) GPU dispatch is removed/deprecated
  (c) New hypothesis identified + spec amendment

Five-whys (consistent with §40.5):
1. Why isn't MODEL-1 shipped today? Because we lacked a contract-backed verdict that "MODEL-1 produces correct output via SOME inference path".
2. Why? Because the §17/§23/§27/§38 chain was bisecting the wrong path, leaving the actual CPU correctness un-codified.
3. Why now? §40.4 + §40.5 + #1112 H1+H2 falsifiers narrowed the bug to GPU dispatch (H3); CPU is empirically correct.
4. Why this contract NOW? Per the user directive to ship + use contracts; MODEL-1 is shippable TODAY with a v1.0.0 SHIP-via-CPU contract.
5. What's next? On gpu_fix_obligation closure (a/b/c), bump v1.0.0 → v2.0.0 and 5 MODEL-1 PARTIALs auto-discharge.

Spec ref: §40.6 Option A.

PR cascade: #1105/#1107/#1108/#1109/#1110/#1111/#1112 (this is the SHIP gate that builds on top of §40 localization).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
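The CPU falsifier and the semver scoping described in this commit message can be illustrated with a small sketch. Everything here is an assumption made for illustration: the contract's real format and harness are not shown in this PR, and `ship_cpu_001_passes` and `gpu_in_scope` are hypothetical names. Only the "contains \"equals\"" criterion and the v1.0.0 / v2.0.0 split come from the text above.

```rust
/// Hypothetical mirror of FALSIFY-MODEL-1-SHIP-CPU-001:
/// the CPU-path output for "What is 2+2?" must contain "equals".
fn ship_cpu_001_passes(output: &str) -> bool {
    output.contains("equals")
}

/// Hypothetical reading of the semver scope: v1.x ships CPU-only,
/// v2.x and later also cover GPU (after gpu_path_known_issue closure).
fn gpu_in_scope(contract_major: u64) -> bool {
    contract_major >= 2
}
```

Under this reading, the CPU output quoted above ("2 + 2 equals") passes the gate, while the earlier `apr run` output "ampiezza = 1" would fail it, so the same check could catch a CPU regression.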

noahgift mentioned this pull request Apr 28, 2026

docs(ship-007): §40 — SHIP-007 root cause LOCALIZED to GPU path (CPU is correct, MODEL-1 shippable today) #1111

Merged

4 tasks

noahgift enabled auto-merge (squash) April 28, 2026 13:20

noahgift added 3 commits April 28, 2026 15:22

Merge branch 'main' into docs/ship-007-apr-run-vs-trace-divergence

7b52783

Merge branch 'main' into docs/ship-007-apr-run-vs-trace-divergence

c641258

Merge branch 'main' into docs/ship-007-apr-run-vs-trace-divergence

fe6a96b

noahgift closed this May 12, 2026

auto-merge was automatically disabled May 12, 2026 15:58
Pull request was closed

noahgift wants to merge 4 commits into main from docs/ship-007-apr-run-vs-trace-divergence
