docs(ship-007): §15.4 falsifier RESULT — attention kernel ruled out as root cause (spec v2.59.0 → v2.60.0) by noahgift · Pull Request #1062 · paiml/aprender

noahgift · 2026-04-26T06:12:51Z

Summary

Records the result of the §15.4 falsifier test (PR #1061) in spec §15:

The GQA-7:1 incremental_attention_gpu kernel is NOT the SHIP-007 root cause.

3/3 tests pass on noah-Lambda-Vector RTX 4090:

test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token  ... ok
test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
test ship_007_qwen2_gqa_7_1_head_mapping_property       ... ok

Both first-token (no cache) and second-token (1-position populated cache) CPU/GPU outputs agree within FP rounding tolerance for the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584).
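For readers who want to see what "within FP rounding tolerance" can mean in practice, here is a minimal sketch of an element-wise CPU/GPU comparison. The helper names (`max_abs_diff`, `within_tolerance`) and the combined absolute/relative tolerance form are assumptions made for illustration; the actual tests from PR #1061 may compare outputs differently.

```rust
/// Largest absolute element-wise difference between two equally sized slices.
/// Returns `None` when the lengths differ, since parity is then undefined.
/// Note: `f32::max` skips NaN, so NaN differences are not reported here.
fn max_abs_diff(cpu: &[f32], gpu: &[f32]) -> Option<f32> {
    if cpu.len() != gpu.len() {
        return None;
    }
    Some(
        cpu.iter()
            .zip(gpu)
            .map(|(a, b)| (a - b).abs())
            .fold(0.0_f32, f32::max),
    )
}

/// True when every GPU element is within `atol + rtol * |cpu|` of the CPU value.
/// Any NaN makes the comparison false, so NaN outputs fail parity.
fn within_tolerance(cpu: &[f32], gpu: &[f32], atol: f32, rtol: f32) -> bool {
    cpu.len() == gpu.len()
        && cpu
            .iter()
            .zip(gpu)
            .all(|(a, b)| (a - b).abs() <= atol + rtol * a.abs())
}
```

A parity test would typically call `within_tolerance` on the two attention outputs and print `max_abs_diff` on failure to show how far apart they are.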

Eliminated suspects

✅ Q/K/V head-mapping arithmetic (TinyLlama 8:1 + Qwen 7:1 both pass — distinct ratios, distinct head_dim)
✅ Q × K^T per-head correctness
✅ Softmax-weighted V aggregation
✅ Scale factor 1/√head_dim at head_dim=128
✅ Single-position KV cache state-management
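The head-mapping and scale-factor items above can be illustrated with a short sketch. It assumes the common contiguous grouping convention for grouped-query attention (query head `q` shares KV head `q / (num_heads / num_kv_heads)`); the function names are hypothetical and are not taken from the `incremental_attention_gpu` kernel.

```rust
// Canonical Qwen2.5-Coder-7B attention shape quoted in this PR.
const NUM_HEADS: usize = 28;
const NUM_KV_HEADS: usize = 4;
const HEAD_DIM: usize = 128;

/// Maps a query head to the KV head it shares under grouped-query attention.
/// Assumes `num_heads` is a non-zero multiple of `num_kv_heads` and that
/// query heads are grouped contiguously (an assumption, see lead-in).
fn kv_head_for(q_head: usize, num_heads: usize, num_kv_heads: usize) -> usize {
    let group = num_heads / num_kv_heads; // 7 for the 28:4 shape
    q_head / group
}

/// Attention scale factor 1/sqrt(head_dim).
fn attention_scale(head_dim: usize) -> f32 {
    1.0 / (head_dim as f32).sqrt()
}
```

Under this convention, query heads 0 to 6 read KV head 0, heads 7 to 13 read KV head 1, and so on, and the scale at `HEAD_DIM = 128` is roughly 0.0884.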

Surviving candidates (new §15.5)

The next falsifier target is Q/K/V projection matmul (before attention). After that: o_proj, RMSNorm, FFN, LM head, multi-layer KV cache layout, residual stream.

Section 15 renumbering

| Old | New |
|-----|-----|
| §15.4 (planned test) | §15.4 (RESULT recorded) |
| §15.4 footer (next step) | §15.5 (Next Investigation Step, full subsection) |
| §15.5 (Side-Bug) | §15.6 |
| §15.6 (Blast Radius) | §15.7 |
| §15.7 (Methodology) | §15.8 |

Spec progression

v2.59.0 → v2.60.0. No coverage tally change (no new discharge).

The remaining 5 MODEL-1 PARTIALs (SHIP-002/005/006/007/008) still transitively block on the eventual SHIP-007 fix, but the root-cause search has materially narrowed.

Test plan

Spec text reads cleanly (manual review)
§15 subsections renumbered consistently
No SATD violations
CI workspace-test green (auto)
ci / gate green (auto)

Files changed

| File | Change |
|------|--------|
| `docs/specifications/aprender-train/ship-two-models-spec.md` | v2.59.0 → v2.60.0; §15.4/15.5 rewrite + §15.6/7/8 renumber |

🤖 Generated with Claude Code

…led out (spec v2.59.0 → v2.60.0)

Updates spec §15 with the result of the §15.4 falsifier test (PR #1061): three CPU vs GPU GQA parity tests on the canonical Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128, HIDDEN=3584) all PASS on noah-Lambda-Vector RTX 4090.

Result documented in §15.4 (now titled "Falsifier Run + RESULT"):

test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token ... ok
test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
test ship_007_qwen2_gqa_7_1_head_mapping_property ... ok
test result: ok. 3 passed; 0 failed; 0 ignored;

This conclusively rules out the GQA-7:1 incremental_attention_gpu kernel as the SHIP-007 root cause.

Eliminated suspects:

- Q/K/V head-mapping arithmetic (TinyLlama 8:1 + Qwen 7:1 both pass)
- Q × K^T per-head correctness
- Softmax-weighted V aggregation
- Scale factor 1/√head_dim at head_dim=128
- Per-head accumulation across 28 Q heads / 4 KV heads
- Single-position KV cache state-management

Surviving SHIP-007 root-cause candidates (per new §15.5):

- Q/K/V projection matmul (BEFORE attention) ← next falsifier target
- o_proj (AFTER attention)
- RMSNorm before/after attention or FFN
- FFN (gate/up/down + swiglu)
- LM head projection
- Multi-layer KV cache *layout* (across-layer indexing)
- Layer composition / residual stream propagation

Section 15 renumbering:

- §15.4: Falsifier Run + RESULT (was: planned test)
- §15.5: Next Investigation Step (was: §15.4 footer; now a full subsection naming Q/K/V projection matmul as the target)
- §15.6: Side-Bug Surfaced During Investigation (was: §15.5)
- §15.7: Blast Radius Inventory (was: §15.6)
- §15.8: Methodological Note (was: §15.7)

Spec v2.59.0 → v2.60.0. No coverage tally change (no new discharge); this is investigation-result recording.

The remaining 5 MODEL-1 PARTIALs still transitively block on the eventual SHIP-007 fix, but the root-cause search has been materially narrowed.

The §15.4 attention parity test (PR #1061) is now a durable regression guard against the GQA-7:1 attention kernel proper: any future refactor that breaks 7:1-specific behavior flips these tests red.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…cause — spec v2.59.0 → v2.61.0

Live `apr trace --payload` on the canonical paiml/qwen2.5-coder-7b-apache-q4k-v1 teacher (noah-Lambda-Vector RTX 4090, 2026-04-26) ran twice on CPU with the same prompt "What is 2+2?", same encoded tokens [3838, 374, 220, 17, 10, 17, 30], same embedded BPE tokenizer:

APR teacher → top-1 token=220 (" "), logit=16.7368 ← WRONG
GGUF teacher → " 2+2 is 4." ← CORRECT

Combined with §15.4 (PR #1061: GPU GQA-7:1 attention parity tests all PASS), this eliminates: GPU stack, GQA attention kernel, tokenizer, loader-side data layout, Q4K dequantization, RMSNorm, embedding lookup.

Surviving suspects are all in the APR-format CPU forward path:

- Layer-composition glue in forward_single_with_scratch
- Multi-layer KV cache layout (across-layer indexing)
- Position embedding (RoPE) layout / sin/cos cache
- LM head projection

§16.4 specifies the falsifiable next investigation step: `apr trace --payload --layer 0` bisection across 28 layers. 1-2 sessions task, not multi-PR.

Whatever fix lands also discharges all 5 transitively-blocked MODEL-1 PARTIALs (SHIP-002/005/006/007/008) per §15.7's blast-radius inventory.

Spec v2.59.0 → v2.61.0 (jumps v2.60.0; reserved for #1062 conflict-merge). No coverage tally change: investigation-recording amendment, not rule promotion.

Methodological continuation per feedback_apr_trace_not_eprintln.md: zero eprintln! added, exact same `apr trace --payload` primitive used in §15.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
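The layer bisection named in §16.4 can be sketched as a binary search over layer indices. This is an illustration only: `first_divergent_layer` and its `diverges` callback are hypothetical names, the callback stands in for whatever per-layer APR-vs-GGUF comparison the trace output supports, and the search assumes that once a layer diverges every later layer also diverges. If the divergence can recover in later layers, a linear scan over all 28 layers is the safer fallback.

```rust
/// Binary-searches for the first layer whose output diverges.
/// Assumes monotonicity: if layer `i` diverges, every later layer does too.
/// `diverges(i)` is a hypothetical predicate comparing layer `i` across formats.
fn first_divergent_layer<F>(num_layers: usize, diverges: F) -> Option<usize>
where
    F: Fn(usize) -> bool,
{
    if num_layers == 0 || !diverges(num_layers - 1) {
        return None; // final layer matches, so there is nothing to bisect
    }
    // Invariant: `diverges(hi)` is true and all layers below `lo` match.
    let (mut lo, mut hi) = (0, num_layers - 1);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if diverges(mid) {
            hi = mid; // first divergence is at `mid` or earlier
        } else {
            lo = mid + 1; // first divergence is strictly after `mid`
        }
    }
    Some(lo)
}
```

For 28 layers this needs only a handful of per-layer comparisons instead of 28, which is the appeal of bisecting rather than scanning.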

…irmed APR-side at inference.rs:160-164 — spec v2.71.0 → v2.72.0 (#1084)

Live evidence on noah-Lambda-Vector RTX 4090 2026-04-27. Built apr from PR #1083 branch (commits 77c016b + c657968 + f249464 from PR A+B+C cascade). Ran `apr trace --payload` on canonical 7B teacher in BOTH formats with identical prompt + tokenizer.

Result:

| Layer | APR ffn_swigl std | GGUF ffn_swigl std | Ratio |
|------:|------------------:|-------------------:|------:|
| 3 | 1.2216 | 0.0670 | 18.23x |

§26.4 binding criterion threshold: ≥10x → APR-side bug. **Observed 18.23x (8.23x above the 10x threshold): decisive verdict.**

The investigation chain that started in §15.4 (GPU GQA elimination) has reached its conclusion at §27:

§15.4 → §16 → §17 → §23 → §27 (this)
"Whole forward path" → "GPU eliminated" → "(layer=3, FFN sub-block)" → "(layer=3, ffn_swigl)" → "**APR-side at inference.rs:160-164**"

Cascade-damping signature confirmed:

- Layers 0-2: ratio ~1.1x (normal)
- Layer 3: 18.23x (anomaly)
- Layers 4-5: 3.3-4.5x (cascade)
- Layer 6+: ~1x (recovered)

This is consistent with a localized perturbation (off-by-one, buffer aliasing, or F32-vs-Q4K dequant defect at layer 3 specifically) rather than persistent residual-stream corruption.

Per §17.5, SHIP-007 fix discharges 5 MODEL-1 PARTIALs at once (SHIP-002/005/006/007/008). §26.5 expected coverage flip: 33+12 → 28+17 when fix lands. §27 does NOT discharge by itself; it locates the bug for fixing.

Next investigation reads `inference.rs:160-164` and tests 4 hypotheses:

1. Off-by-one slice indexing
2. Buffer aliasing (scratch reuse pattern)
3. F32-vs-Q4K dequant defect at layer-3 input range
4. Activation overflow (SiLU saturation amplifies multiply)

Methodology held throughout: zero eprintln!, zero route-arounds, apr is canonical (§26.8), all instrumentation via `apr trace --payload`. Lambda-labs lane pre-authorized.

Evidence persisted to evidence/ship-007-apr-vs-gguf-2026-04-27/:

- apr-trace.txt (13.5 KB)
- gguf-trace.txt (13.7 KB)
- binding-criterion-summary.json

Note: §27 reproduction requires PR #1081 + #1082 + #1083 cascade to merge first (the apr trace --payload <gguf> wiring is in PR C). Evidence was generated with a local build of PR #1083 branch.

Spec v2.71.0 → v2.72.0. Coverage flip pending fix.

Spec: SPEC-SHIP-TWO-001 §26.4 P3 verdict

References:

- §15.4 (PR #1062): GPU GQA eliminated
- §16 (PR #1063): APR CPU isolated
- §17 (PR #1064): layer-3 FFN sub-block
- §23 (PR #1075): layer-3 ffn_swigl named
- §26.8 (PR #1079): apr-is-canonical methodology rule
- PR #1081 (P3 PR A scaffold)
- PR #1082 (P3 PR B sub-FFN populate)
- PR #1083 (P3 PR C CLI wiring)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
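The ratio check behind the §26.4 binding criterion can be reproduced with a few lines of arithmetic. The sketch below is an assumption about how such a check might be coded: `std_ratio`, `classify`, and the `Verdict` enum are hypothetical names, and the source only states what a ratio at or above 10x means, so the below-threshold case is left deliberately neutral.

```rust
/// Outcome of a ratio-threshold check (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Verdict {
    AprSideBug,     // ratio >= threshold, per the §26.4 criterion
    BelowThreshold, // the source does not say what this case implies
}

/// Ratio of the APR standard deviation to the GGUF standard deviation.
/// Returns `None` for a non-positive or non-finite denominator or numerator.
fn std_ratio(apr_std: f64, gguf_std: f64) -> Option<f64> {
    if gguf_std > 0.0 && gguf_std.is_finite() && apr_std.is_finite() {
        Some(apr_std / gguf_std)
    } else {
        None
    }
}

/// Applies the threshold to a precomputed ratio.
fn classify(ratio: f64, threshold: f64) -> Verdict {
    if ratio >= threshold {
        Verdict::AprSideBug
    } else {
        Verdict::BelowThreshold
    }
}
```

With the layer-3 values from the table, `std_ratio(1.2216, 0.0670)` comes out at about 18.23, which `classify(ratio, 10.0)` places on the APR-side-bug side of the threshold.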

noahgift enabled auto-merge (squash) April 26, 2026 06:12

noahgift mentioned this pull request Apr 26, 2026

docs(ship-007): §16 APR forward CPU path isolated as root cause #1063

Merged

4 tasks

Merge branch 'main' into docs/ship-007-15-4-falsifier-result

1825753

noahgift mentioned this pull request Apr 26, 2026

contract(trace-ffn-sub-block-v1): pre-commit schema for sub-FFN telemetry (SHIP-007 load-bearing) #1065

Merged

4 tasks

noahgift merged commit 9908b06 into main Apr 26, 2026
10 checks passed

noahgift deleted the docs/ship-007-15-4-falsifier-result branch April 26, 2026 07:23

noahgift merged 2 commits into main from docs/ship-007-15-4-falsifier-result
