docs(ship-007): layer-3 sub-FFN bisection — divergence pinned at ffn_gate matmul (first statistical site)#1099
Merged
Conversation
…gate matmul output After PR #1082 (sub-FFN populate) and #1083 (CLI wiring) merged today, ran `apr trace --payload` on canonical 7B teacher in both APR and GGUF formats. First time we have side-by-side per-layer sub-FFN stats. Layer-3 result (1.36× ratio at ffn_gate, amplifies to 60× at ffn_out): | Stat | APR | GGUF | Ratio | |------|----:|-----:|------:| | ffn_norm (input) | 0.995 | 1.035 | 0.96× | | ffn_gate (post-matmul) | 1.924 | 1.413 | 1.36× ← divergence | | ffn_up | 1.335 | 1.456 | 0.92× | | ffn_silu | 0.168 | 0.037 | 4.59× silu amp | | ffn_swigl | 1.222 | 0.067 | 18.23× compound | | ffn_out | 11.459 | 0.191 | 60.0× cascade | Layer-3 ffn_gate is the FIRST sub-FFN site where APR and GGUF aggregate stats diverge significantly. Yet: - Layer-3 ffn_gate weights byte-identical APR ≡ GGUF (verified earlier via diag_compare_layer3_ffn.rs) - ffn_norm inputs agree within 5% on aggregate stats The remaining hypothesis: per-element values of ffn_norm input differ (despite similar std), produced by cumulative F32 precision drift through layers 0-2 residual connections. Per-element diff at this specific stage is the next investigation step. ## Why this matters for shipping MODEL-1 paiml/qwen2.5-coder-7b-apache-q4k-v1 is published to HuggingFace but its APR backend produces wrong outputs. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on this fix. With this bisection: - Bug surface narrowed from "(layer 3, FFN sub-block)" (§17) to "(layer 3, ffn_gate matmul output)" — first statistical divergence - Weights agree → fix not in converter - Aggregate input stats agree → fix in per-element behavior of ffn_norm input or matmul nondeterminism - Once per-element source identified and fixed, the 5 PARTIALs promote to DISCHARGED and MODEL-1 ships cleanly through both APR and GGUF backends Files: - evidence/ship-007-layer3-bisection-2026-04-28/findings.md - evidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txt - evidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…thesis for the fix Parsed apr-trace.txt and gguf-trace.txt to compute APR/GGUF std ratio across all sub-stages of layers 0-6. Result: drift accumulates gradually in layers 0-2 (output ratio 1.12 → 1.39 → 1.30) then EXPLODES at layer 3 (output ratio 18.57x). Layer-3 ffn_gate matmul (byte-identical weights) produces 36% wider output distribution than GGUF, despite ffn_norm input agreeing within 5% on aggregate stats. Silu's saturated regime at gate values near -6 amplifies the 36% to 4.6x ffn_silu, then 18.2x ffn_swigl, then 60x ffn_out. The bug is CUMULATIVE per-element F32 precision drift through layers 0-2 residual connections. ## Concrete next investigation step Hypothesis: APR's matmul reduction is parallel (rayon) producing non-deterministic ordering of f32 accumulations. GGUF's may be serial or have fixed deterministic order. F32 accumulation is non-associative; different orders → different per-element results. Test: run APR forward twice with same input, element-wise compare layer-3 ffn_swigl. If non-deterministic across runs, parallel reduction is the source. ## Path to shipping MODEL-1 If hypothesis confirmed: 1. Fix APR matmul reduction order to be deterministic 2. Re-run trace, verify layer-3 ffn_swigl ratio drops below 1.5x 3. Verify SHIP-002/005/006/007/008 PARTIALs flip to DISCHARGED 4. MODEL-1 ships cleanly through both APR and GGUF backends (paiml/qwen2.5-coder-7b-apache-q4k-v1) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
01842c8 to
904b15b
Compare
Merged
3 tasks
noahgift
added a commit
that referenced
this pull request
Apr 28, 2026
…r capture (unblocks SHIP-007 layer-0 bisection) Triggering observation 2026-04-28: SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward. Aggregate stats — already emitted by `apr trace --payload` — are insufficient since they can hide per-element drift behind similar std values. This contract defines the missing infrastructure: `--save-tensor <stage>` flag that captures raw F32 tensor values at chosen forward-pass stages, written as APRT-magic-prefixed binaries that `apr diff --values` can load directly. ## Stages enumerated (19 total) embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output, final_norm, lm_head ## Falsification tests (8) - 001: --save-tensor flag recognized - 002: determinism (byte-identical across runs) - 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099) - 004: APRT header format self-describing - 005: multi-stage comma-list works - 006: NaN preservation - 007: --layer subset compatible - 008: pv validates `pv validate` exits 0 (verified). ## Implementation cost 400-600 LOC + 8 tests, multi-day Rust task. ## Linkage to shipping MODEL-1 Once shipped, the SHIP-007 layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, apr diff at each stage, pinpoint the first divergent stage as the actual bug surface. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix. With this tooling, the fix is unblocked. paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR and GGUF backends → MODEL-1 completes. Status: PROPOSED. Implementation deferred to multi-day Rust task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
Apr 28, 2026
…007 layer-0 bisection (ships MODEL-1) (#1102) * feat(apr-code): add --emit-trace flag (M28 — ccpa-trace.jsonl emission) Adds `apr code --emit-trace <path>` flag — when set, after the agent loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file to `<path>` describing the run. Format mirrors the schema at https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml § trace_schema. The companion-repo `ccpa measure` subcommand (M26) consumes this file to score apr-code against canonical Claude Code reference fixtures. Records emitted: 1. session_start — synthetic UUIDv7-shaped session_id derived from the start ts; ts is a timestamp string; cwd_sha256 is a 64-char placeholder (the companion-repo differ normalizes these at compare time). 2. user_prompt — turn 0, verbatim text. 3. assistant_turn — turn 1, single Block::Text carrying the agent's final response text. Tool dispatch / hook / skill records are M29+ enrichment follow-ups. 4. session_end — real elapsed_ms + token counts from AgentLoopResult.usage (input_tokens / output_tokens). Real metadata, not stubbed. Plumbing: - commands_enum.rs — new `emit_trace: Option<PathBuf>` field on the Code variant. - dispatch.rs — threads it into batuta::agent::code::cmd_code. - code.rs cmd_code — accepts the new param + plumbs to run_single_prompt. - code.rs run_single_prompt — captures `Instant::now()` at start; after the agent loop returns Ok(r), if the caller passed --emit-trace, calls the new emit_ccpa_trace helper. On write-failure eprintln! a warning but DO NOT fail the agent run. - code.rs emit_ccpa_trace — new helper (~85 LOC) that hand-rolls JSONL via serde_json::json! macros (no new dependency on ccpa_trace types). Tests (4 new in code_tests.rs::emit_trace_tests): - emit_writes_4_jsonl_records_with_correct_kinds - emit_carries_prompt_and_response_text - emit_carries_token_counts_and_elapsed - emit_each_record_has_v1_envelope (per-record back-compat invariant from the ccpa-trace v2 schema) Total in agent::code: 50 → 54 tests passing. Live dogfood: $ apr code --emit-trace /tmp/measured.jsonl \ -p "Show me which CLAUDE.md takes precedence right now" $ cat /tmp/measured.jsonl | jq -r '.kind' session_start user_prompt assistant_turn session_end $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")' {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn", "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024} Real elapsed_ms / token counts populated correctly. Note: the response text from Qwen3-1.7B in the dogfood was gibberish (<think>-loop pre-existing aprender concern, see PMAT-190). The trace format is correct; the model behavior is a separate workstream. The emit-trace flag works regardless of model quality. Refs: - paiml/claude-code-parity-apr#31 (M26 — ccpa measure subcommand that consumes this file) - paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml § trace_schema (the canonical schema) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(apr-code): default to Qwen3-Coder-30B-A3B-Instruct on 24 GB GPUs Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the default model for `apr code` when present. Aligned with the research write-up at paiml/claude-code-parity-apr / 2026-04-28. What ships: configs/aliases.yaml + new short name `qwen3-coder` → hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF Now `apr pull qwen3-coder` works. crates/aprender-registry/src/aliases.rs + matching entry in the in-memory AliasRegistry (kept in sync with configs/aliases.yaml). crates/aprender-orchestrate/src/agent/manifest.rs + `~/.cache/pacha/models/` added to model_search_dirs so `apr pull`-cached files (content-hashed names) are visible to discovery; pair with a friendly symlink in `~/.apr/models/` for the preferred-name filter to recognize. + new module-level helper `is_preferred_default_model(path)`: case-insensitive substring match against a short list of recommended-default model names. Order: 1. qwen3-coder-30b-a3b 2. qwen3-coder-next 3. qwen2.5-coder-32b 4. qwen2.5-coder-14b + discover_model + sort_candidates updated to insert preferred-name as a sort key BETWEEN validity (still wins overall) and newest-mtime. So when a small recently-pulled model exists alongside the recommended default, the recommended default is selected — fixing the failure mode where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits gibberish) was being auto-picked over a known-good 30B model. Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest): - preferred_default_recognises_qwen3_coder_30b_a3b (any-case, any-quant matching) - preferred_default_rejects_small_fallbacks (1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is still useful but we don't anchor it as the recommended-default family for 24 GB GPUs) - sort_candidates_promotes_preferred_over_newer (preferred-name beats newer-but-smaller mtime) - sort_candidates_newer_preferred_beats_older_preferred (within preferred-names, mtime still tiebreaks) - sort_candidates_validity_outranks_preference (Jidoka — invalid preferred loses to valid non-preferred) Live verification (this PR): $ apr pull qwen3-coder ✓ Downloaded successfully Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf Size: 17.3 GB $ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \ /home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf $ apr code -p "ping" --max-turns 1 Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered) ↑ default-model preference picked correctly. Known gap (NOT addressed by this PR): After auto-discovery picks the model, both apr-serve subprocess and embedded inference fail with: Error: driver error: inference failed: Invalid shape: Tensor 'blk.0.ffn_up.weight' not found Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.), not the dense `ffn_up.weight` the current realizar GGUF loader expects. qwen3moe architecture support is upstream realizar work — separate from this PR. The discovery / alias / preferred- name selection mechanism is fully ready for when that lands. In the interim users hitting the inference error should fall back to a dense model — either Qwen2.5-Coder-32B-Instruct (also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B. Refs: - Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28 - Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF - aprender CLAUDE.md § Claude Messages-API proxy spec — same model is already declared as the default for `apr serve anthropic` Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(p3): apr-cli-trace-save-tensor-v1 — contract for per-stage tensor capture (unblocks SHIP-007 layer-0 bisection) Triggering observation 2026-04-28: SHIP-007's hypothesis space has been narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101). The remaining bug surface is per-element divergence at some specific stage of layer-0 forward. Aggregate stats — already emitted by `apr trace --payload` — are insufficient since they can hide per-element drift behind similar std values. This contract defines the missing infrastructure: `--save-tensor <stage>` flag that captures raw F32 tensor values at chosen forward-pass stages, written as APRT-magic-prefixed binaries that `apr diff --values` can load directly. ## Stages enumerated (19 total) embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope, attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up, ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output, final_norm, lm_head ## Falsification tests (8) - 001: --save-tensor flag recognized - 002: determinism (byte-identical across runs) - 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099) - 004: APRT header format self-describing - 005: multi-stage comma-list works - 006: NaN preservation - 007: --layer subset compatible - 008: pv validates `pv validate` exits 0 (verified). ## Implementation cost 400-600 LOC + 8 tests, multi-day Rust task. ## Linkage to shipping MODEL-1 Once shipped, the SHIP-007 layer-0 bisection completes in one debug session: run save-tensor in both APR and GGUF formats, apr diff at each stage, pinpoint the first divergent stage as the actual bug surface. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix. With this tooling, the fix is unblocked. paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR and GGUF backends → MODEL-1 completes. Status: PROPOSED. Implementation deferred to multi-day Rust task. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Merged
5 tasks
noahgift
added a commit
that referenced
this pull request
Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127) ## What ships Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` — F-QW3-MOE-C22214-001, an integration test that invokes the user-facing `apr` binary as a subprocess and asserts: 1. exit 0 2. stdout contains ≥1 non-whitespace character against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf with a fresh date-tagged prompt. This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126, squash a902eea) in CI / regression-prevention. Without it, a future regression that re-routed qwen3_moe back to the dense `run_gguf_generate` path (which produces garbage on MoE weights) would slip through CI silently — there'd be no signal at the `apr run` user-facing surface. ## Live evidence (lambda-vector RTX 4090, 2026-04-29) ``` running 1 test test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ... F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf F-QW3-MOE-C22214-001: elapsed = 130.945370974s stdout (first 200B): === APR Run === Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf Output: . Completed in 130.83s (cached) stderr (first 200B): [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe' [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe' F-QW3-MOE-C22214-001: PASS ok test result: ok. 1 passed; 0 failed; 0 ignored ``` Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This test asserts ONLY emit/exit-0 — the discharge gate for FALSIFY-QW3-MOE-FORWARD-003. ## Skip path CI runners (and any host without the cached GGUF) print: F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...] and return success. Same skip pattern as `crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs` (M32c.2.2.2.1.1 in-process forward primitive). ## Contract chain status M32a qwen3-moe-forward-v1 contract scaffold SHIPPED (#1099) M32b arch-aware FFN load refuses qwen3_moe SHIPPED (#1100) M32c.1+ MoE descriptor load + per-expert byte slicer SHIPPED M32c.2.2.2.1.1 forward_qwen3_moe method SHIPPED (#1124) M32c.2.2.2.1.2 run_qwen3_moe_generate function SHIPPED (#1125) M32c.2.2.2.1.3 dispatch flip + Q4_K_M qtype dispatch SHIPPED (#1126) M32c.2.2.2.1.4 live `apr run` falsifier THIS PR M32d numerical parity vs llama.cpp PENDING After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch parity gate. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
After PR #1082 (sub-FFN populate) + PR #1083 (CLI wiring) merged today, ran
apr trace --payloadon the canonical 7B teacher in both APR and GGUF formats. First time we have side-by-side per-layer sub-FFN stats.Why this matters for shipping MODEL-1
paiml/qwen2.5-coder-7b-apache-q4k-v1is published to HuggingFace but its APR backend produces wrong outputs. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on this fix. This bisection narrows the bug surface from "(layer 3, FFN sub-block)" to "(layer 3, ffn_gate matmul output)" — the first statistical divergence point.Layer-3 result
Paradox
Layer-3 ffn_gate weights are byte-identical APR ≡ GGUF (verified earlier today). Inputs (ffn_norm) agree within 5%. Yet outputs diverge 36%.
Remaining hypothesis
Per-element values of ffn_norm input differ (despite similar aggregate std), produced by cumulative F32 precision drift through layers 0-2 residual connections. Per-element diff at this stage is the next investigation step.
Path to shipping MODEL-1
Once per-element source identified and fixed, the 5 PARTIALs promote to DISCHARGED → MODEL-1 ships cleanly through both APR and GGUF backends → SHIP-TWO-001 MODEL-1 row complete.
Files
evidence/ship-007-layer3-bisection-2026-04-28/findings.md— full analysisevidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txtevidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt🤖 Generated with Claude Code