docs(ship-007): layer-3 sub-FFN bisection — divergence pinned at ffn_gate matmul (first statistical site) by noahgift · Pull Request #1099 · paiml/aprender

noahgift · 2026-04-28T07:58:29Z

Summary

After PR #1082 (sub-FFN populate) + PR #1083 (CLI wiring) merged today, ran apr trace --payload on the canonical 7B teacher in both APR and GGUF formats. First time we have side-by-side per-layer sub-FFN stats.

Why this matters for shipping MODEL-1

paiml/qwen2.5-coder-7b-apache-q4k-v1 is published to HuggingFace but its APR backend produces wrong outputs. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on this fix. This bisection narrows the bug surface from "(layer 3, FFN sub-block)" to "(layer 3, ffn_gate matmul output)" — the first statistical divergence point.

Layer-3 result

Stat	APR	GGUF	Ratio
ffn_norm (input)	0.995	1.035	0.96×
ffn_gate (post-matmul)	1.924	1.413	1.36× ← divergence starts
ffn_up	1.335	1.456	0.92×
ffn_silu	0.168	0.037	4.59× silu amp
ffn_swigl	1.222	0.067	18.23× compound
ffn_out	11.459	0.191	60.0× cascade

Paradox

Layer-3 ffn_gate weights are byte-identical APR ≡ GGUF (verified earlier today). Inputs (ffn_norm) agree within 5%. Yet outputs diverge 36%.

Remaining hypothesis

Per-element values of ffn_norm input differ (despite similar aggregate std), produced by cumulative F32 precision drift through layers 0-2 residual connections. Per-element diff at this stage is the next investigation step.

Path to shipping MODEL-1

Once per-element source identified and fixed, the 5 PARTIALs promote to DISCHARGED → MODEL-1 ships cleanly through both APR and GGUF backends → SHIP-TWO-001 MODEL-1 row complete.

Files

evidence/ship-007-layer3-bisection-2026-04-28/findings.md — full analysis
evidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txt
evidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt

🤖 Generated with Claude Code

…gate matmul output After PR #1082 (sub-FFN populate) and #1083 (CLI wiring) merged today, ran `apr trace --payload` on canonical 7B teacher in both APR and GGUF formats. First time we have side-by-side per-layer sub-FFN stats. Layer-3 result (1.36× ratio at ffn_gate, amplifies to 60× at ffn_out): | Stat | APR | GGUF | Ratio | |------|----:|-----:|------:| | ffn_norm (input) | 0.995 | 1.035 | 0.96× | | ffn_gate (post-matmul) | 1.924 | 1.413 | 1.36× ← divergence | | ffn_up | 1.335 | 1.456 | 0.92× | | ffn_silu | 0.168 | 0.037 | 4.59× silu amp | | ffn_swigl | 1.222 | 0.067 | 18.23× compound | | ffn_out | 11.459 | 0.191 | 60.0× cascade | Layer-3 ffn_gate is the FIRST sub-FFN site where APR and GGUF aggregate stats diverge significantly. Yet: - Layer-3 ffn_gate weights byte-identical APR ≡ GGUF (verified earlier via diag_compare_layer3_ffn.rs) - ffn_norm inputs agree within 5% on aggregate stats The remaining hypothesis: per-element values of ffn_norm input differ (despite similar std), produced by cumulative F32 precision drift through layers 0-2 residual connections. Per-element diff at this specific stage is the next investigation step. ## Why this matters for shipping MODEL-1 paiml/qwen2.5-coder-7b-apache-q4k-v1 is published to HuggingFace but its APR backend produces wrong outputs. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on this fix. With this bisection: - Bug surface narrowed from "(layer 3, FFN sub-block)" (§17) to "(layer 3, ffn_gate matmul output)" — first statistical divergence - Weights agree → fix not in converter - Aggregate input stats agree → fix in per-element behavior of ffn_norm input or matmul nondeterminism - Once per-element source identified and fixed, the 5 PARTIALs promote to DISCHARGED and MODEL-1 ships cleanly through both APR and GGUF backends Files: - evidence/ship-007-layer3-bisection-2026-04-28/findings.md - evidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txt - evidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…thesis for the fix Parsed apr-trace.txt and gguf-trace.txt to compute APR/GGUF std ratio across all sub-stages of layers 0-6. Result: drift accumulates gradually in layers 0-2 (output ratio 1.12 → 1.39 → 1.30) then EXPLODES at layer 3 (output ratio 18.57x). Layer-3 ffn_gate matmul (byte-identical weights) produces 36% wider output distribution than GGUF, despite ffn_norm input agreeing within 5% on aggregate stats. Silu's saturated regime at gate values near -6 amplifies the 36% to 4.6x ffn_silu, then 18.2x ffn_swigl, then 60x ffn_out. The bug is CUMULATIVE per-element F32 precision drift through layers 0-2 residual connections. ## Concrete next investigation step Hypothesis: APR's matmul reduction is parallel (rayon) producing non-deterministic ordering of f32 accumulations. GGUF's may be serial or have fixed deterministic order. F32 accumulation is non-associative; different orders → different per-element results. Test: run APR forward twice with same input, element-wise compare layer-3 ffn_swigl. If non-deterministic across runs, parallel reduction is the source. ## Path to shipping MODEL-1 If hypothesis confirmed: 1. Fix APR matmul reduction order to be deterministic 2. Re-run trace, verify layer-3 ffn_swigl ratio drops below 1.5x 3. Verify SHIP-002/005/006/007/008 PARTIALs flip to DISCHARGED 4. MODEL-1 ships cleanly through both APR and GGUF backends (paiml/qwen2.5-coder-7b-apache-q4k-v1) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…r capture (unblocks SHIP-007 layer-0 bisection) Triggering observation 2026-04-28: SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward. Aggregate stats — already emitted by `apr trace --payload` — are insufficient since they can hide per-element drift behind similar std values. This contract defines the missing infrastructure: `--save-tensor <stage>` flag that captures raw F32 tensor values at chosen forward-pass stages, written as APRT-magic-prefixed binaries that `apr diff --values` can load directly. ## Stages enumerated (19 total) embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output, final_norm, lm_head ## Falsification tests (8) - 001: --save-tensor flag recognized - 002: determinism (byte-identical across runs) - 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099) - 004: APRT header format self-describing - 005: multi-stage comma-list works - 006: NaN preservation - 007: --layer subset compatible - 008: pv validates `pv validate` exits 0 (verified). ## Implementation cost 400-600 LOC + 8 tests, multi-day Rust task. ## Linkage to shipping MODEL-1 Once shipped, the SHIP-007 layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, apr diff at each stage, pinpoint the first divergent stage as the actual bug surface. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix. With this tooling, the fix is unblocked. paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR and GGUF backends → MODEL-1 completes. Status: PROPOSED. Implementation deferred to multi-day Rust task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…007 layer-0 bisection (ships MODEL-1) (#1102) * feat(apr-code): add --emit-trace flag (M28 — ccpa-trace.jsonl emission) Adds `apr code --emit-trace <path>` flag — when set, after the agent loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file to `<path>` describing the run. Format mirrors the schema at https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml § trace_schema. The companion-repo `ccpa measure` subcommand (M26) consumes this file to score apr-code against canonical Claude Code reference fixtures. Records emitted: 1. session_start — synthetic UUIDv7-shaped session_id derived from the start ts; ts is a timestamp string; cwd_sha256 is a 64-char placeholder (the companion-repo differ normalizes these at compare time). 2. user_prompt — turn 0, verbatim text. 3. assistant_turn — turn 1, single Block::Text carrying the agent's final response text. Tool dispatch / hook / skill records are M29+ enrichment follow-ups. 4. session_end — real elapsed_ms + token counts from AgentLoopResult.usage (input_tokens / output_tokens). Real metadata, not stubbed. Plumbing: - commands_enum.rs — new `emit_trace: Option<PathBuf>` field on the Code variant. - dispatch.rs — threads it into batuta::agent::code::cmd_code. - code.rs cmd_code — accepts the new param + plumbs to run_single_prompt. - code.rs run_single_prompt — captures `Instant::now()` at start; after the agent loop returns Ok(r), if the caller passed --emit-trace, calls the new emit_ccpa_trace helper. On write-failure eprintln! a warning but DO NOT fail the agent run. - code.rs emit_ccpa_trace — new helper (~85 LOC) that hand-rolls JSONL via serde_json::json! macros (no new dependency on ccpa_trace types). Tests (4 new in code_tests.rs::emit_trace_tests): - emit_writes_4_jsonl_records_with_correct_kinds - emit_carries_prompt_and_response_text - emit_carries_token_counts_and_elapsed - emit_each_record_has_v1_envelope (per-record back-compat invariant from the ccpa-trace v2 schema) Total in agent::code: 50 → 54 tests passing. Live dogfood: $ apr code --emit-trace /tmp/measured.jsonl \ -p "Show me which CLAUDE.md takes precedence right now" $ cat /tmp/measured.jsonl | jq -r '.kind' session_start user_prompt assistant_turn session_end $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")' {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn", "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024} Real elapsed_ms / token counts populated correctly. Note: the response text from Qwen3-1.7B in the dogfood was gibberish (<think>-loop pre-existing aprender concern, see PMAT-190). The trace format is correct; the model behavior is a separate workstream. The emit-trace flag works regardless of model quality. Refs: - paiml/claude-code-parity-apr#31 (M26 — ccpa measure subcommand that consumes this file) - paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml § trace_schema (the canonical schema) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(apr-code): default to Qwen3-Coder-30B-A3B-Instruct on 24 GB GPUs Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the default model for `apr code` when present. Aligned with the research write-up at paiml/claude-code-parity-apr / 2026-04-28. What ships: configs/aliases.yaml + new short name `qwen3-coder` → hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF Now `apr pull qwen3-coder` works. crates/aprender-registry/src/aliases.rs + matching entry in the in-memory AliasRegistry (kept in sync with configs/aliases.yaml). crates/aprender-orchestrate/src/agent/manifest.rs + `~/.cache/pacha/models/` added to model_search_dirs so `apr pull`-cached files (content-hashed names) are visible to discovery; pair with a friendly symlink in `~/.apr/models/` for the preferred-name filter to recognize. + new module-level helper `is_preferred_default_model(path)`: case-insensitive substring match against a short list of recommended-default model names. Order: 1. qwen3-coder-30b-a3b 2. qwen3-coder-next 3. qwen2.5-coder-32b 4. qwen2.5-coder-14b + discover_model + sort_candidates updated to insert preferred-name as a sort key BETWEEN validity (still wins overall) and newest-mtime. So when a small recently-pulled model exists alongside the recommended default, the recommended default is selected — fixing the failure mode where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits gibberish) was being auto-picked over a known-good 30B model. Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest): - preferred_default_recognises_qwen3_coder_30b_a3b (any-case, any-quant matching) - preferred_default_rejects_small_fallbacks (1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is still useful but we don't anchor it as the recommended-default family for 24 GB GPUs) - sort_candidates_promotes_preferred_over_newer (preferred-name beats newer-but-smaller mtime) - sort_candidates_newer_preferred_beats_older_preferred (within preferred-names, mtime still tiebreaks) - sort_candidates_validity_outranks_preference (Jidoka — invalid preferred loses to valid non-preferred) Live verification (this PR): $ apr pull qwen3-coder ✓ Downloaded successfully Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf Size: 17.3 GB $ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \ /home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf $ apr code -p "ping" --max-turns 1 Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered) ↑ default-model preference picked correctly. Known gap (NOT addressed by this PR): After auto-discovery picks the model, both apr-serve subprocess and embedded inference fail with: Error: driver error: inference failed: Invalid shape: Tensor 'blk.0.ffn_up.weight' not found Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.), not the dense `ffn_up.weight` the current realizar GGUF loader expects. qwen3moe architecture support is upstream realizar work — separate from this PR. The discovery / alias / preferred- name selection mechanism is fully ready for when that lands. In the interim users hitting the inference error should fall back to a dense model — either Qwen2.5-Coder-32B-Instruct (also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B. Refs: - Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28 - Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF - aprender CLAUDE.md § Claude Messages-API proxy spec — same model is already declared as the default for `apr serve anthropic` Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(p3): apr-cli-trace-save-tensor-v1 — contract for per-stage tensor capture (unblocks SHIP-007 layer-0 bisection) Triggering observation 2026-04-28: SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward. Aggregate stats — already emitted by `apr trace --payload` — are insufficient since they can hide per-element drift behind similar std values. This contract defines the missing infrastructure: `--save-tensor <stage>` flag that captures raw F32 tensor values at chosen forward-pass stages, written as APRT-magic-prefixed binaries that `apr diff --values` can load directly. ## Stages enumerated (19 total) embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output, final_norm, lm_head ## Falsification tests (8) - 001: --save-tensor flag recognized - 002: determinism (byte-identical across runs) - 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099) - 004: APRT header format self-describing - 005: multi-stage comma-list works - 006: NaN preservation - 007: --layer subset compatible - 008: pv validates `pv validate` exits 0 (verified). ## Implementation cost 400-600 LOC + 8 tests, multi-day Rust task. ## Linkage to shipping MODEL-1 Once shipped, the SHIP-007 layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, apr diff at each stage, pinpoint the first divergent stage as the actual bug surface. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix. With this tooling, the fix is unblocked. paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR and GGUF backends → MODEL-1 completes. Status: PROPOSED. Implementation deferred to multi-day Rust task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…SIFY-QW3-MOE-FORWARD-003 (#1127) ## What ships Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` — F-QW3-MOE-C22214-001, an integration test that invokes the user-facing `apr` binary as a subprocess and asserts: 1. exit 0 2. stdout contains ≥1 non-whitespace character against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf with a fresh date-tagged prompt. This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126, squash a902eea) in CI / regression-prevention. Without it, a future regression that re-routed qwen3_moe back to the dense `run_gguf_generate` path (which produces garbage on MoE weights) would slip through CI silently — there'd be no signal at the `apr run` user-facing surface. ## Live evidence (lambda-vector RTX 4090, 2026-04-29) ``` running 1 test test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ... F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf F-QW3-MOE-C22214-001: elapsed = 130.945370974s stdout (first 200B): === APR Run === Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf Output: . Completed in 130.83s (cached) stderr (first 200B): [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe' [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe' F-QW3-MOE-C22214-001: PASS ok test result: ok. 1 passed; 0 failed; 0 ignored ``` Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This test asserts ONLY emit/exit-0 — the discharge gate for FALSIFY-QW3-MOE-FORWARD-003. ## Skip path CI runners (and any host without the cached GGUF) print: F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...] and return success. Same skip pattern as `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (M32c.2.2.2.1.1 in-process forward primitive). ## Contract chain status M32a qwen3-moe-forward-v1 contract scaffold SHIPPED (#1099) M32b arch-aware FFN load refuses qwen3_moe SHIPPED (#1100) M32c.1+ MoE descriptor load + per-expert byte slicer SHIPPED M32c.2.2.2.1.1 forward_qwen3_moe method SHIPPED (#1124) M32c.2.2.2.1.2 run_qwen3_moe_generate function SHIPPED (#1125) M32c.2.2.2.1.3 dispatch flip + Q4_K_M qtype dispatch SHIPPED (#1126) M32c.2.2.2.1.4 live `apr run` falsifier THIS PR M32d numerical parity vs llama.cpp PENDING After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity gate. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) April 28, 2026 07:58

noahgift and others added 2 commits April 28, 2026 10:17

noahgift force-pushed the docs/ship-007-layer3-bisection-evidence branch from 01842c8 to 904b15b Compare April 28, 2026 08:17

noahgift merged commit f33c8bb into main Apr 28, 2026
10 checks passed

noahgift deleted the docs/ship-007-layer3-bisection-evidence branch April 28, 2026 08:38

noahgift mentioned this pull request Apr 28, 2026

docs(ship-007): determinism test FALSIFIES parallel-reduction-nondeterminism hypothesis #1101

Merged

3 tasks

noahgift mentioned this pull request Apr 29, 2026

test(realizar): M32c.2.2.2.1.4 — live apr run falsifier pinning FALSIFY-QW3-MOE-FORWARD-003 #1127

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(ship-007): layer-3 sub-FFN bisection — divergence pinned at ffn_gate matmul (first statistical site)#1099

docs(ship-007): layer-3 sub-FFN bisection — divergence pinned at ffn_gate matmul (first statistical site)#1099
noahgift merged 2 commits into
mainfrom
docs/ship-007-layer3-bisection-evidence

noahgift commented Apr 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented Apr 28, 2026

Summary

Why this matters for shipping MODEL-1

Layer-3 result

Paradox

Remaining hypothesis

Path to shipping MODEL-1

Files

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant