
docs(ship-007): layer-3 sub-FFN bisection — divergence pinned at ffn_gate matmul (first statistical site) #1099

Merged
noahgift merged 2 commits into main from docs/ship-007-layer3-bisection-evidence
Apr 28, 2026
Conversation

@noahgift

Summary

After PR #1082 (sub-FFN populate) and PR #1083 (CLI wiring) merged today, we ran `apr trace --payload` on the canonical 7B teacher in both APR and GGUF formats. This is the first time we have side-by-side per-layer sub-FFN stats.

Why this matters for shipping MODEL-1

paiml/qwen2.5-coder-7b-apache-q4k-v1 is published to HuggingFace but its APR backend produces wrong outputs. SHIP-002/005/006/007/008 (5 PARTIALs) all depend on this fix. This bisection narrows the bug surface from "(layer 3, FFN sub-block)" to "(layer 3, ffn_gate matmul output)" — the first statistical divergence point.

Layer-3 result

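| Stage | APR std | GGUF std | Ratio (APR/GGUF) | Note |
|-------|--------:|---------:|-----------------:|------|
| ffn_norm (input) | 0.995 | 1.035 | 0.96× | |
| ffn_gate (post-matmul) | 1.924 | 1.413 | 1.36× | divergence starts |
| ffn_up | 1.335 | 1.456 | 0.92× | |
| ffn_silu | 0.168 | 0.037 | 4.59× | silu amp |
| ffn_swigl | 1.222 | 0.067 | 18.23× | compound |
| ffn_out | 11.459 | 0.191 | 60.0× | cascade |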
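
As a quick arithmetic check, the ratio column can be recomputed from the two std columns. This is only a sketch: the `ratio` helper is a hypothetical name, and the inputs are the rounded values from the table, so a recomputed ratio can differ slightly from a reported one that was presumably derived from unrounded stds (ffn_silu is the visible case).

```rust
/// Ratio of the APR std to the GGUF std for one sub-FFN stage.
fn ratio(apr: f64, gguf: f64) -> f64 {
    apr / gguf
}

fn main() {
    // (stage, APR std, GGUF std) copied from the layer-3 table.
    let rows = [
        ("ffn_norm", 0.995, 1.035),
        ("ffn_gate", 1.924, 1.413),
        ("ffn_up", 1.335, 1.456),
        ("ffn_silu", 0.168, 0.037),
        ("ffn_swigl", 1.222, 0.067),
        ("ffn_out", 11.459, 0.191),
    ];
    for (stage, apr, gguf) in rows {
        println!("{:<10} {:.2}x", stage, ratio(apr, gguf));
    }
}
```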

Paradox

Layer-3 ffn_gate weights are byte-identical APR ≡ GGUF (verified earlier today). Inputs (ffn_norm) agree within 5% on aggregate std. Yet the ffn_gate outputs diverge by 36%.

Remaining hypothesis

Per-element values of ffn_norm input differ (despite similar aggregate std), produced by cumulative F32 precision drift through layers 0-2 residual connections. Per-element diff at this stage is the next investigation step.
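
A toy illustration of why matching aggregate stats do not rule this out: two inputs with identical std can still produce different matmul outputs against the same weights. The vectors below are made-up values, not trace data, and `std_dev` / `dot` are hypothetical helper names.

```rust
/// Population standard deviation of a slice.
fn std_dev(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    var.sqrt()
}

/// Dot product, i.e. one output element of a matmul.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let w = [1.0_f32, 2.0, 3.0, 4.0]; // shared weights (synthetic)
    let x_a = [1.0_f32, -1.0, 2.0, -2.0]; // input as seen by backend A
    let x_b = [-2.0_f32, 2.0, -1.0, 1.0]; // same values, different positions

    // The aggregate stat cannot tell the two inputs apart...
    println!("std A = {}, std B = {}", std_dev(&x_a), std_dev(&x_b));
    // ...but the matmul output element differs.
    println!("dot A = {}, dot B = {}", dot(&w, &x_a), dot(&w, &x_b));
}
```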

Path to shipping MODEL-1

Once the per-element source is identified and fixed, the 5 PARTIALs promote to DISCHARGED, MODEL-1 ships cleanly through both APR and GGUF backends, and the SHIP-TWO-001 MODEL-1 row is complete.

Files

  • evidence/ship-007-layer3-bisection-2026-04-28/findings.md — full analysis
  • evidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txt
  • evidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) April 28, 2026 07:58
noahgift and others added 2 commits April 28, 2026 10:17
…gate matmul output

After PR #1082 (sub-FFN populate) and #1083 (CLI wiring) merged today,
ran `apr trace --payload` on canonical 7B teacher in both APR and GGUF
formats. First time we have side-by-side per-layer sub-FFN stats.

Layer-3 result (1.36× ratio at ffn_gate, amplifies to 60× at ffn_out):

| Stat | APR | GGUF | Ratio |
|------|----:|-----:|------:|
| ffn_norm (input) | 0.995 | 1.035 | 0.96× |
| ffn_gate (post-matmul) | 1.924 | 1.413 | 1.36× ← divergence |
| ffn_up | 1.335 | 1.456 | 0.92× |
| ffn_silu | 0.168 | 0.037 | 4.59× silu amp |
| ffn_swigl | 1.222 | 0.067 | 18.23× compound |
| ffn_out | 11.459 | 0.191 | 60.0× cascade |

Layer-3 ffn_gate is the FIRST sub-FFN site where APR and GGUF aggregate
stats diverge significantly. Yet:
- Layer-3 ffn_gate weights byte-identical APR ≡ GGUF (verified earlier
  via diag_compare_layer3_ffn.rs)
- ffn_norm inputs agree within 5% on aggregate stats

The remaining hypothesis: per-element values of ffn_norm input differ
(despite similar std), produced by cumulative F32 precision drift
through layers 0-2 residual connections. Per-element diff at this
specific stage is the next investigation step.

## Why this matters for shipping MODEL-1

paiml/qwen2.5-coder-7b-apache-q4k-v1 is published to HuggingFace but
its APR backend produces wrong outputs. SHIP-002/005/006/007/008
(5 PARTIALs) all depend on this fix. With this bisection:

- Bug surface narrowed from "(layer 3, FFN sub-block)" (§17) to
  "(layer 3, ffn_gate matmul output)" — first statistical divergence
- Weights agree → fix not in converter
- Aggregate input stats agree → fix in per-element behavior of
  ffn_norm input or matmul nondeterminism
- Once per-element source identified and fixed, the 5 PARTIALs
  promote to DISCHARGED and MODEL-1 ships cleanly through both
  APR and GGUF backends

Files:
- evidence/ship-007-layer3-bisection-2026-04-28/findings.md
- evidence/ship-007-layer3-bisection-2026-04-28/apr-trace.txt
- evidence/ship-007-layer3-bisection-2026-04-28/gguf-trace.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…thesis for the fix

Parsed apr-trace.txt and gguf-trace.txt to compute APR/GGUF std ratio
across all sub-stages of layers 0-6. Result: drift accumulates gradually
in layers 0-2 (output ratio 1.12 → 1.39 → 1.30) then EXPLODES at layer 3
(output ratio 18.57x).

Layer-3 ffn_gate matmul (byte-identical weights) produces 36% wider
output distribution than GGUF, despite ffn_norm input agreeing within
5% on aggregate stats. Silu's saturated regime at gate values near -6
amplifies the 36% to 4.6x ffn_silu, then 18.2x ffn_swigl, then 60x ffn_out.
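
One way to see the amplification, as a sketch only: SiLU is nearly flat far into the negative tail, so gate values that move from around -6 toward zero produce outputs an order of magnitude larger in magnitude. The standard definitions silu(x) = x * sigmoid(x) and swiglu = silu(gate) * up are assumed here; the sample gate values are illustrative, not taken from the traces.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU combines the activated gate with the up projection.
fn swiglu(gate: f32, up: f32) -> f32 {
    silu(gate) * up
}

fn main() {
    // Illustrative gate values walking out of the saturated tail.
    for gate in [-6.0_f32, -4.5, -3.0, -1.0] {
        println!(
            "gate = {:>5.2}  silu = {:>9.5}  swiglu(up = 1.4) = {:>9.5}",
            gate,
            silu(gate),
            swiglu(gate, 1.4)
        );
    }
}
```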

The working diagnosis is CUMULATIVE per-element F32 precision drift
through the residual connections of layers 0-2. It is not yet
confirmed; the next section proposes a test for it.

## Concrete next investigation step

Hypothesis: APR's matmul reduction is parallel (rayon) producing
non-deterministic ordering of f32 accumulations. GGUF's may be serial
or have fixed deterministic order. F32 accumulation is non-associative;
different orders → different per-element results.

Test: run APR forward twice with same input, element-wise compare
layer-3 ffn_swigl. If non-deterministic across runs, parallel reduction
is the source.
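
A minimal sketch of both halves of that test, using only std: f32 addition giving different results under different accumulation orders, and a bitwise element-wise comparison of two runs. `sum_in_order` and `first_mismatch` are hypothetical helper names, not functions from the APR codebase.

```rust
/// Sequential f32 accumulation in the order given.
fn sum_in_order(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0_f32, |acc, x| acc + x)
}

/// Index of the first element whose bit pattern differs, if any.
/// Compares only the common prefix if the lengths differ.
fn first_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    a.iter().zip(b).position(|(x, y)| x.to_bits() != y.to_bits())
}

fn main() {
    // Same three values, two accumulation orders.
    let order_a = [1.0e8_f32, -1.0e8, 1.0];
    let order_b = [1.0_f32, -1.0e8, 1.0e8];
    println!("order A sum = {}", sum_in_order(&order_a));
    println!("order B sum = {}", sum_in_order(&order_b));

    // Stand-ins for two captures of the same stage from two runs.
    let run_1 = [0.25_f32, 1.5, -3.0];
    let run_2 = [0.25_f32, 1.5, -3.0000002];
    match first_mismatch(&run_1, &run_2) {
        Some(i) => println!("runs differ first at element {}", i),
        None => println!("runs are bit-identical"),
    }
}
```

If two APR runs on the same input are bit-identical at every element, this particular hypothesis is falsified and the reduction order is not the source.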

## Path to shipping MODEL-1

If hypothesis confirmed:
1. Fix APR matmul reduction order to be deterministic
2. Re-run trace, verify layer-3 ffn_swigl ratio drops below 1.5x
3. Verify SHIP-002/005/006/007/008 PARTIALs flip to DISCHARGED
4. MODEL-1 ships cleanly through both APR and GGUF backends
   (paiml/qwen2.5-coder-7b-apache-q4k-v1)
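
For step 1, one possible shape of a deterministic reduction is sketched below: split the input into fixed-size chunks, sum each chunk sequentially, then combine the partials in chunk-index order. This is an assumption about what a fix could look like, not the APR matmul; `CHUNK`, `chunk_partials` and `deterministic_sum` are hypothetical, and the `visit_order` argument merely simulates workers finishing chunks in an arbitrary order.

```rust
const CHUNK: usize = 4; // fixed chunk size, independent of thread count

/// Partial sum per chunk, stored by chunk index regardless of visit order.
/// Every index in `visit_order` must be a valid chunk index.
fn chunk_partials(xs: &[f32], visit_order: &[usize]) -> Vec<f32> {
    let chunks: Vec<&[f32]> = xs.chunks(CHUNK).collect();
    let mut partials = vec![0.0_f32; chunks.len()];
    for &i in visit_order {
        partials[i] = chunks[i].iter().sum::<f32>();
    }
    partials
}

/// Combine the partials in index order, so the result does not depend
/// on which chunk happened to be computed first.
fn deterministic_sum(xs: &[f32], visit_order: &[usize]) -> f32 {
    chunk_partials(xs, visit_order).iter().sum::<f32>()
}

fn main() {
    let xs = [1.0e8_f32, 1.0, -1.0e8, 1.0, 0.5, 0.25, 3.0, 7.0, 9.0];
    let forward = deterministic_sum(&xs, &[0, 1, 2]);
    let shuffled = deterministic_sum(&xs, &[2, 0, 1]);
    assert_eq!(forward.to_bits(), shuffled.to_bits());
    println!("sum = {}", forward);
}
```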

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/ship-007-layer3-bisection-evidence branch from 01842c8 to 904b15b April 28, 2026 08:17
@noahgift noahgift merged commit f33c8bb into main Apr 28, 2026
10 checks passed
@noahgift noahgift deleted the docs/ship-007-layer3-bisection-evidence branch April 28, 2026 08:38
noahgift added a commit that referenced this pull request Apr 28, 2026
…r capture (unblocks SHIP-007 layer-0 bisection)

Triggering observation 2026-04-28: SHIP-007's hypothesis space has been
narrowed by 5 falsified hypotheses (§28, §28.4(a), §31, §32, #1101).
The remaining bug surface is per-element divergence at some specific
stage of layer-0 forward. Aggregate stats — already emitted by
`apr trace --payload` — are insufficient since they can hide per-element
drift behind similar std values.

This contract defines the missing infrastructure: `--save-tensor <stage>`
flag that captures raw F32 tensor values at chosen forward-pass stages,
written as APRT-magic-prefixed binaries that `apr diff --values` can
load directly.
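
To make the idea concrete, here is a round-trip sketch of a magic-prefixed raw F32 dump. The contract text here does not spell out the APRT layout, so the layout below (4-byte magic, little-endian u64 element count, little-endian f32 payload) is purely hypothetical, as are the `encode` / `decode` names. The point is only that a raw dump keeps every bit pattern, including NaN payloads.

```rust
const MAGIC: &[u8; 4] = b"APRT";
const HEADER_LEN: usize = 12; // hypothetical: 4-byte magic + u64 count

/// Serialize values as magic + element count + raw little-endian f32s.
fn encode(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + values.len() * 4);
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&(values.len() as u64).to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Parse the hypothetical layout; None on bad magic or length mismatch.
fn decode(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() < HEADER_LEN || !bytes.starts_with(MAGIC) {
        return None;
    }
    let mut count = [0u8; 8];
    count.copy_from_slice(&bytes[4..HEADER_LEN]);
    let n = u64::from_le_bytes(count) as usize;
    let body = &bytes[HEADER_LEN..];
    if body.len() != n.checked_mul(4)? {
        return None;
    }
    Some(
        body.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

fn main() {
    // A quiet NaN with a payload, to check that bits survive the trip.
    let values = [1.0_f32, f32::from_bits(0x7fc0_1234), -0.0];
    let decoded = decode(&encode(&values)).expect("round trip");
    for (a, b) in values.iter().zip(&decoded) {
        assert_eq!(a.to_bits(), b.to_bits());
    }
    println!("round trip preserved {} values bit-for-bit", decoded.len());
}
```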

## Stages enumerated (19 total)

embedding, attn_norm, qkv_matmul, qkv_bias, q_post_rope, k_post_rope,
attention, attn_out, post_attn_residual, ffn_norm, ffn_gate, ffn_up,
ffn_silu, ffn_swigl, ffn_out, post_ffn_residual, layer_output,
final_norm, lm_head

## Falsification tests (8)

- 001: --save-tensor flag recognized
- 002: determinism (byte-identical across runs)
- 003: ffn_gate stage produces expected APR-vs-GGUF diff (corroborates #1099)
- 004: APRT header format self-describing
- 005: multi-stage comma-list works
- 006: NaN preservation
- 007: --layer subset compatible
- 008: pv validates

`pv validate` exits 0 (verified).
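
A sketch of how the comma-list form from test 005 could be validated against the 19 enumerated stages. The stage names come from the list above; `parse_stages` and its error text are hypothetical and do not describe the actual `apr` argument parser.

```rust
/// The 19 stage names enumerated in the contract.
const STAGES: [&str; 19] = [
    "embedding", "attn_norm", "qkv_matmul", "qkv_bias", "q_post_rope",
    "k_post_rope", "attention", "attn_out", "post_attn_residual",
    "ffn_norm", "ffn_gate", "ffn_up", "ffn_silu", "ffn_swigl", "ffn_out",
    "post_ffn_residual", "layer_output", "final_norm", "lm_head",
];

/// Split a comma-separated list and reject any unknown stage name.
fn parse_stages(arg: &str) -> Result<Vec<&'static str>, String> {
    arg.split(',')
        .map(|raw| {
            let name = raw.trim();
            STAGES
                .iter()
                .copied()
                .find(|s| *s == name)
                .ok_or_else(|| format!("unknown stage: {:?}", name))
        })
        .collect()
}

fn main() {
    println!("{:?}", parse_stages("ffn_norm,ffn_gate, ffn_up"));
    println!("{:?}", parse_stages("ffn_gate,bogus"));
}
```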

## Implementation cost

400-600 LOC + 8 tests, multi-day Rust task.

## Linkage to shipping MODEL-1

Once shipped, the SHIP-007 layer-0 bisection completes in one debug
session: run save-tensor in both APR and GGUF formats, apr diff at each
stage, pinpoint the first divergent stage as the actual bug surface.
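
The bisection loop described above can be sketched as a scan for the first stage whose captured tensors differ by more than a tolerance. Everything here is illustrative: the tensors are synthetic, the tolerance is arbitrary, and `max_abs_diff` / `first_divergent_stage` are hypothetical names rather than part of `apr diff`.

```rust
/// Largest absolute element-wise difference over the common prefix.
/// Note: f32::max skips NaN, so NaN differences are not flagged here;
/// a real tool would need to handle them separately.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// First stage (in forward-pass order) whose two captures differ by
/// more than `tol`.
fn first_divergent_stage<'a>(
    stages: &[(&'a str, Vec<f32>, Vec<f32>)],
    tol: f32,
) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, apr, gguf)| max_abs_diff(apr, gguf) > tol)
        .map(|(name, _, _)| *name)
}

fn main() {
    // Synthetic (stage, APR capture, GGUF capture) triples.
    let stages = vec![
        ("ffn_norm", vec![0.10, 0.20], vec![0.10, 0.20]),
        ("ffn_gate", vec![1.90, -6.00], vec![1.40, -6.00]),
        ("ffn_up", vec![1.30, 0.70], vec![1.45, 0.70]),
    ];
    println!("{:?}", first_divergent_stage(&stages, 1e-3));
}
```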

SHIP-002/005/006/007/008 (5 PARTIALs) all depend on the SHIP-007 fix.
With this tooling, the fix is unblocked.
paiml/qwen2.5-coder-7b-apache-q4k-v1 ships cleanly through both APR
and GGUF backends → MODEL-1 completes.

Status: PROPOSED. Implementation deferred to multi-day Rust task.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 28, 2026
…007 layer-0 bisection (ships MODEL-1) (#1102)

* feat(apr-code): add --emit-trace flag (M28 — ccpa-trace.jsonl emission)

Adds `apr code --emit-trace <path>` flag — when set, after the agent
loop completes the runtime writes a 4-record `ccpa-trace.jsonl` file
to `<path>` describing the run.

Format mirrors the schema at
https://github.com/paiml/claude-code-parity-apr/blob/main/contracts/claude-code-parity-apr-v1.yaml
§ trace_schema. The companion-repo `ccpa measure` subcommand (M26)
consumes this file to score apr-code against canonical Claude Code
reference fixtures.

Records emitted:

  1. session_start  — synthetic UUIDv7-shaped session_id derived from
                      the start ts; ts is a timestamp string;
                      cwd_sha256 is a 64-char placeholder (the
                      companion-repo differ normalizes these at compare
                      time).
  2. user_prompt    — turn 0, verbatim text.
  3. assistant_turn — turn 1, single Block::Text carrying the agent's
                      final response text. Tool dispatch / hook /
                      skill records are M29+ enrichment follow-ups.
  4. session_end    — real elapsed_ms + token counts from
                      AgentLoopResult.usage (input_tokens /
                      output_tokens). Real metadata, not stubbed.

Plumbing:
  - commands_enum.rs   — new `emit_trace: Option<PathBuf>` field on
                         the Code variant.
  - dispatch.rs        — threads it into batuta::agent::code::cmd_code.
  - code.rs cmd_code   — accepts the new param + plumbs to
                         run_single_prompt.
  - code.rs run_single_prompt — captures `Instant::now()` at start;
                         after the agent loop returns Ok(r), if the
                         caller passed --emit-trace, calls the new
                         emit_ccpa_trace helper. On write-failure
                         eprintln! a warning but DO NOT fail the
                         agent run.
  - code.rs emit_ccpa_trace — new helper (~85 LOC) that hand-rolls
                         JSONL via serde_json::json! macros (no new
                         dependency on ccpa_trace types).
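
A simplified sketch of the emit path described in the plumbing notes: capture `Instant::now()` at the start, build the `session_end` record, and treat a write failure as a warning rather than an error. The real helper uses `serde_json::json!`; this version hand-formats the string only to stay std-only, hard-codes `turn` and `stop_reason` to the values shown in the dogfood output, and uses hypothetical names (`session_end_record`, `emit_best_effort`).

```rust
use std::fs;
use std::path::Path;
use std::time::Instant;

/// Build the session_end JSONL record (field set taken from the
/// dogfood output; turn and stop_reason are fixed for this sketch).
fn session_end_record(elapsed_ms: u128, tokens_in: u64, tokens_out: u64) -> String {
    format!(
        "{{\"v\":1,\"kind\":\"session_end\",\"turn\":1,\"stop_reason\":\"end_turn\",\"elapsed_ms\":{},\"tokens_in\":{},\"tokens_out\":{}}}",
        elapsed_ms, tokens_in, tokens_out
    )
}

/// Write one record per line. On failure, warn on stderr and return
/// false instead of propagating the error, so the agent run still succeeds.
fn emit_best_effort(path: &Path, records: &[String]) -> bool {
    let mut body = records.join("\n");
    body.push('\n');
    match fs::write(path, body) {
        Ok(()) => true,
        Err(e) => {
            eprintln!("warning: could not write trace to {}: {}", path.display(), e);
            false
        }
    }
}

fn main() {
    let start = Instant::now();
    // ... the agent loop would run here ...
    let record = session_end_record(start.elapsed().as_millis(), 44, 1024);
    let path = std::env::temp_dir().join("ccpa-trace-sketch.jsonl");
    emit_best_effort(&path, &[record]);
}
```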

Tests (4 new in code_tests.rs::emit_trace_tests):
  - emit_writes_4_jsonl_records_with_correct_kinds
  - emit_carries_prompt_and_response_text
  - emit_carries_token_counts_and_elapsed
  - emit_each_record_has_v1_envelope (per-record back-compat
    invariant from the ccpa-trace v2 schema)

Total in agent::code: 50 → 54 tests passing.

Live dogfood:
  $ apr code --emit-trace /tmp/measured.jsonl \
      -p "Show me which CLAUDE.md takes precedence right now"
  $ cat /tmp/measured.jsonl | jq -r '.kind'
    session_start
    user_prompt
    assistant_turn
    session_end
  $ cat /tmp/measured.jsonl | jq -r 'select(.kind=="session_end")'
    {"v":1,"kind":"session_end","turn":1,"stop_reason":"end_turn",
     "elapsed_ms":3295,"tokens_in":44,"tokens_out":1024}

Real elapsed_ms / token counts populated correctly.

Note: the response text from Qwen3-1.7B in the dogfood was gibberish
(<think>-loop pre-existing aprender concern, see PMAT-190). The trace
format is correct; the model behavior is a separate workstream. The
emit-trace flag works regardless of model quality.

Refs:
  - paiml/claude-code-parity-apr#31 (M26 — ccpa measure subcommand
    that consumes this file)
  - paiml/claude-code-parity-apr/contracts/claude-code-parity-apr-v1.yaml
    § trace_schema (the canonical schema)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-code): default to Qwen3-Coder-30B-A3B-Instruct on 24 GB GPUs

Recommends and auto-discovers Qwen3-Coder-30B-A3B-Instruct as the
default model for `apr code` when present. Aligned with the
research write-up at paiml/claude-code-parity-apr / 2026-04-28.

What ships:

  configs/aliases.yaml
    + new short name `qwen3-coder` →
      hf://unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
    Now `apr pull qwen3-coder` works.

  crates/aprender-registry/src/aliases.rs
    + matching entry in the in-memory AliasRegistry
      (kept in sync with configs/aliases.yaml).

  crates/aprender-orchestrate/src/agent/manifest.rs
    + `~/.cache/pacha/models/` added to model_search_dirs so
      `apr pull`-cached files (content-hashed names) are visible
      to discovery; pair with a friendly symlink in
      `~/.apr/models/` for the preferred-name filter to recognize.
    + new module-level helper `is_preferred_default_model(path)`:
      case-insensitive substring match against a short list of
      recommended-default model names. Order:
        1. qwen3-coder-30b-a3b
        2. qwen3-coder-next
        3. qwen2.5-coder-32b
        4. qwen2.5-coder-14b
    + discover_model + sort_candidates updated to insert
      preferred-name as a sort key BETWEEN validity (still wins
      overall) and newest-mtime. So when a small recently-pulled
      model exists alongside the recommended default, the
      recommended default is selected — fixing the failure mode
      where Qwen3-1.7B (PMAT-190 thinking-loop bug, emits
      gibberish) was being auto-picked over a known-good 30B model.
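
The three-level ordering (validity, then preferred name, then newest mtime) can be sketched as a single composite sort key. The `Candidate` struct, `is_preferred` and this `sort_candidates` are simplified stand-ins written for illustration; they assume a boolean preferred flag and are not the code in manifest.rs.

```rust
use std::cmp::Reverse;

/// Recommended-default name fragments, from the list above.
const PREFERRED: [&str; 4] = [
    "qwen3-coder-30b-a3b",
    "qwen3-coder-next",
    "qwen2.5-coder-32b",
    "qwen2.5-coder-14b",
];

#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: String,
    valid: bool,
    mtime: u64, // seconds since the epoch
}

/// Case-insensitive substring match against the preferred names.
fn is_preferred(name: &str) -> bool {
    let lower = name.to_lowercase();
    PREFERRED.iter().any(|p| lower.contains(*p))
}

/// Valid first, then preferred, then newest mtime.
fn sort_candidates(cands: &mut [Candidate]) {
    cands.sort_by_key(|c| Reverse((c.valid, is_preferred(&c.name), c.mtime)));
}

fn main() {
    let mut cands = vec![
        Candidate { name: "Qwen3-1.7B-Q4_K_M.gguf".to_string(), valid: true, mtime: 300 },
        Candidate {
            name: "Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf".to_string(),
            valid: true,
            mtime: 100,
        },
    ];
    sort_candidates(&mut cands);
    println!("selected: {}", cands[0].name);
}
```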

Tests (5 new in manifest_tests_discovery.rs, 49 → 54 in agent::manifest):
  - preferred_default_recognises_qwen3_coder_30b_a3b
    (any-case, any-quant matching)
  - preferred_default_rejects_small_fallbacks
    (1.7B / 1.5B / 1.1B / 7B all rejected — the 7B Qwen2.5-Coder is
    still useful but we don't anchor it as the recommended-default
    family for 24 GB GPUs)
  - sort_candidates_promotes_preferred_over_newer
    (preferred-name beats newer-but-smaller mtime)
  - sort_candidates_newer_preferred_beats_older_preferred
    (within preferred-names, mtime still tiebreaks)
  - sort_candidates_validity_outranks_preference
    (Jidoka — invalid preferred loses to valid non-preferred)

Live verification (this PR):

  $ apr pull qwen3-coder
    ✓ Downloaded successfully
      Path: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
      Size: 17.3 GB

  $ ln -s /home/noah/.cache/pacha/models/2b88b180a790988f.gguf \
         /home/noah/.apr/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf

  $ apr code -p "ping" --max-turns 1
    Model: Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf (auto-discovered)
    ↑ default-model preference picked correctly.

Known gap (NOT addressed by this PR):

  After auto-discovery picks the model, both apr-serve subprocess
  and embedded inference fail with:

    Error: driver error: inference failed:
           Invalid shape: Tensor 'blk.0.ffn_up.weight' not found

  Qwen3-Coder-30B-A3B is a Mixture-of-Experts model that uses
  per-expert tensor names (`ffn_up_exps`, `ffn_gate_exps`, etc.),
  not the dense `ffn_up.weight` the current realizar GGUF loader
  expects. qwen3moe architecture support is upstream realizar
  work — separate from this PR. The discovery / alias / preferred-
  name selection mechanism is fully ready for when that lands.
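
  To illustrate the naming mismatch, a loader could classify a layer's FFN layout from its tensor names before trying to load it. This is a hedged sketch, not realizar code: `FfnLayout` and `classify_ffn` are hypothetical, and the full MoE tensor name used in `main` (with the `blk.0.` prefix and `.weight` suffix) is an assumption extrapolated from the dense name in the error message.

```rust
#[derive(Debug, PartialEq)]
enum FfnLayout {
    Dense,
    MoE,
    Unknown,
}

/// Classify one layer by looking for per-expert or dense FFN tensor names.
fn classify_ffn(tensor_names: &[&str], layer: usize) -> FfnLayout {
    let prefix = format!("blk.{}.", layer);
    let in_layer = |needle: &str| {
        tensor_names
            .iter()
            .any(|n| n.starts_with(&prefix) && n.contains(needle))
    };
    if in_layer("ffn_up_exps") {
        FfnLayout::MoE
    } else if in_layer("ffn_up.weight") {
        FfnLayout::Dense
    } else {
        FfnLayout::Unknown
    }
}

fn main() {
    // Assumed full names; only the `ffn_up_exps` fragment is from the text.
    let moe = ["blk.0.ffn_up_exps.weight", "blk.0.ffn_gate_exps.weight"];
    let dense = ["blk.0.ffn_up.weight"];
    println!("{:?} {:?}", classify_ffn(&moe, 0), classify_ffn(&dense, 0));
}
```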

  In the interim users hitting the inference error should fall
  back to a dense model — either Qwen2.5-Coder-32B-Instruct
  (also recognized by is_preferred_default_model) or Qwen2.5-Coder-7B.

Refs:
  - Research write-up: paiml/claude-code-parity-apr / chat 2026-04-28
  - Hugging Face: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
  - aprender CLAUDE.md § Claude Messages-API proxy spec — same model
    is already declared as the default for `apr serve anthropic`

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(p3): apr-cli-trace-save-tensor-v1 — contract for per-stage tensor capture (unblocks SHIP-007 layer-0 bisection)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 29, 2026
…SIFY-QW3-MOE-FORWARD-003 (#1127)

## What ships

Adds `crates/apr-cli/tests/qwen3_moe_apr_run_live_falsifier.rs` —
F-QW3-MOE-C22214-001, an integration test that invokes the user-facing
`apr` binary as a subprocess and asserts:

  1. exit 0
  2. stdout contains ≥1 non-whitespace character

against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf
with a fresh date-tagged prompt.

This pins the M32c.2.2.2.1.3 dispatch flip (PR #1126,
squash a902eea) in CI / regression-prevention. Without it, a
future regression that re-routed qwen3_moe back to the dense
`run_gguf_generate` path (which produces garbage on MoE weights)
would slip through CI silently — there'd be no signal at the
`apr run` user-facing surface.

## Live evidence (lambda-vector RTX 4090, 2026-04-29)

```
running 1 test
test f_qw3_moe_c22214_001_apr_run_emits_at_least_one_non_whitespace_char ...
F-QW3-MOE-C22214-001: live `apr run` against /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
F-QW3-MOE-C22214-001: elapsed = 130.945370974s
  stdout (first 200B): === APR Run ===

Source: /home/noah/.cache/pacha/models/2b88b180a790988f.gguf

Output:
.

Completed in 130.83s (cached)

  stderr (first 200B): [BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'
[BOS-FALLBACK] No tokenizer.ggml.bos_token_id in GGUF — using architecture default for 'qwen3moe'

F-QW3-MOE-C22214-001: PASS
ok

test result: ok. 1 passed; 0 failed; 0 ignored
```

Token quality vs llama.cpp Q4_K (cosine on logits) is M32d. This
test asserts ONLY emit/exit-0 — the discharge gate for
FALSIFY-QW3-MOE-FORWARD-003.

## Skip path

CI runners (and any host without the cached GGUF) print:

  F-QW3-MOE-C22214-001: SKIP — no cached Qwen3-Coder GGUF at any of [...]

and return success. Same skip pattern as
`crates/aprender-serve/tests/qwen3_moe_forward_one_token.rs`
(M32c.2.2.2.1.1 in-process forward primitive).
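
The assert-and-skip shape described above can be sketched roughly as follows. This is not the test file in this PR: `Outcome`, `has_non_whitespace`, `first_existing` and `run_falsifier` are hypothetical names, the model path in `main` is a placeholder, and the arguments passed to the binary are left to the caller because the exact `apr run` invocation is not shown here.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

#[derive(Debug, PartialEq)]
enum Outcome {
    Skip,
    Pass,
    Fail(String),
}

/// True if stdout contains at least one non-whitespace character.
fn has_non_whitespace(stdout: &[u8]) -> bool {
    String::from_utf8_lossy(stdout).chars().any(|c| !c.is_whitespace())
}

/// First candidate model path that exists on this host, if any.
fn first_existing(candidates: &[PathBuf]) -> Option<&PathBuf> {
    candidates.iter().find(|p| p.exists())
}

/// Skip when no cached model exists; otherwise run the binary and
/// require exit 0 plus non-empty stdout.
fn run_falsifier(bin: &Path, args: &[String], models: &[PathBuf]) -> Outcome {
    if first_existing(models).is_none() {
        return Outcome::Skip;
    }
    match Command::new(bin).args(args).output() {
        Err(e) => Outcome::Fail(format!("spawn failed: {}", e)),
        Ok(out) if !out.status.success() => Outcome::Fail(format!("{}", out.status)),
        Ok(out) if !has_non_whitespace(&out.stdout) => Outcome::Fail("empty stdout".to_string()),
        Ok(_) => Outcome::Pass,
    }
}

fn main() {
    // Placeholder path: on a host without the model this takes the skip path.
    let models = [PathBuf::from("/nonexistent/qwen3-coder.gguf")];
    let outcome = run_falsifier(Path::new("apr"), &[], &models);
    println!("{:?}", outcome);
}
```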

## Contract chain status

  M32a    qwen3-moe-forward-v1 contract scaffold        SHIPPED (#1099)
  M32b    arch-aware FFN load refuses qwen3_moe          SHIPPED (#1100)
  M32c.1+ MoE descriptor load + per-expert byte slicer   SHIPPED
  M32c.2.2.2.1.1 forward_qwen3_moe method                SHIPPED (#1124)
  M32c.2.2.2.1.2 run_qwen3_moe_generate function         SHIPPED (#1125)
  M32c.2.2.2.1.3 dispatch flip + Q4_K_M qtype dispatch   SHIPPED (#1126)
  M32c.2.2.2.1.4 live `apr run` falsifier               THIS PR
  M32d           numerical parity vs llama.cpp           PENDING

After M32d the contract flips DRAFT → ACTIVE_RUNTIME, which
unblocks the companion-repo FALSIFY-CCPA-013 measured tool-dispatch
parity gate.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>