feat(ship-003): FALSIFY-SHIP-003 DISCHARGED via apr diff 339-tensor cosine sweep (5th MODEL-1 of cycle, depends on PR #1058)#1059

Merged
noahgift merged 1 commit into main from feat/falsify-ship-003-full-discharge on Apr 25, 2026
@noahgift

Contributor

Summary

Critical dependency

This PR depends on PR #1058 (perf fix) being on main. Before #1058, apr diff --values --limit N for N>10 called std::fs::read on the 8 GB APR file per tensor → 339 × 8 GB ≈ 2.7 TB total read traffic → infeasible. The mmap fix delivered 13× speedup on limit=50 and made the full 339-tensor sweep complete in 192 s.

If #1058 hasn't merged when this PR is reviewed, please merge #1058 first.
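
The read-traffic estimate above is simple arithmetic and can be checked directly. The sketch below is illustrative only: `total_read_bytes` and `to_terabytes` are hypothetical helpers (not part of `apr`), and it assumes decimal units and a file size of exactly 8 GB.

```rust
/// Bytes read when the whole file is re-read once per tensor,
/// which is the pre-#1058 behavior described above.
fn total_read_bytes(tensors: u64, file_bytes: u64) -> u64 {
    tensors * file_bytes
}

/// Convert bytes to decimal terabytes (1 TB = 10^12 bytes).
fn to_terabytes(bytes: u64) -> f64 {
    bytes as f64 / 1e12
}
```

With 339 tensors and an 8 GB file this gives about 2.7 TB, which matches the figure quoted above.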

Live evidence (noah-Lambda-Vector RTX 4090)

apr diff /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.safetensors \
         /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr \
         --values --transpose-aware --json --limit 339

Worst 5 tensors (still passing):

| Tensor | Cosine | max_diff |
|---|---|---|
| model.layers.0.mlp.down_proj.weight | 0.9999999403953552 | 4.81e-4 |
| model.layers.0.mlp.gate_proj.weight | 0.9999999403953552 | 4.43e-4 |
| model.layers.0.mlp.up_proj.weight | 0.9999999403953552 | 2.39e-4 |
| model.layers.0.self_attn.o_proj.weight | 0.9999999403953552 | 2.37e-4 |
| model.layers.1.mlp.down_proj.weight | 0.9999999403953552 | 3.59e-4 |

Drift-prevention test added

falsify_ship_003_yaml_binding_pins_discharged_status parses qwen2-e2e-verification-v1.yaml, locates the FALSIFY-QW2E-SHIP-003 block, and asserts:

  • discharge_status == "DISCHARGED"
  • discharged_evidence.host == "noah-Lambda-Vector"
  • discharged_evidence.aggregate_verdict == "Pass"
  • discharged_evidence.tensors_compared == 339
  • discharged_evidence.cosine_summary.below_threshold_count == 0
  • evidence_discharged_by_live non-empty
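
The assertions listed above could be expressed roughly as follows. This is a minimal sketch, not the test in `ship_003.rs`: the `DischargeBinding` struct and `drifted_fields` helper are hypothetical, and YAML parsing is left out so the example stays dependency-free.

```rust
/// Hypothetical in-memory view of the pinned contract fields.
/// The real test reads these from the contract YAML; parsing is omitted here.
struct DischargeBinding {
    discharge_status: String,
    host: String,
    aggregate_verdict: String,
    tensors_compared: u64,
    below_threshold_count: u64,
    evidence_discharged_by_live: Vec<String>,
}

/// Returns the names of pinned fields whose values drifted.
/// An empty result means the binding still holds.
fn drifted_fields(b: &DischargeBinding) -> Vec<&'static str> {
    let mut drift = Vec::new();
    if b.discharge_status != "DISCHARGED" {
        drift.push("discharge_status");
    }
    if b.host != "noah-Lambda-Vector" {
        drift.push("discharged_evidence.host");
    }
    if b.aggregate_verdict != "Pass" {
        drift.push("discharged_evidence.aggregate_verdict");
    }
    if b.tensors_compared != 339 {
        drift.push("discharged_evidence.tensors_compared");
    }
    if b.below_threshold_count != 0 {
        drift.push("discharged_evidence.cosine_summary.below_threshold_count");
    }
    if b.evidence_discharged_by_live.is_empty() {
        drift.push("evidence_discharged_by_live");
    }
    drift
}
```

Returning the list of drifted fields, rather than panicking on the first mismatch, makes a failing run report every pinned value that changed.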

Test plan

  • cargo test -p aprender-core --lib ship_003 — 4/4 PASS (3 existing verdict + 1 gate + 1 new YAML binding)
  • pv validate contracts/qwen2-e2e-verification-v1.yaml — PASS (0 errors)
  • Live apr diff --values --limit 339 --json exit 0, 339 results, all cos ≥ 0.9999999
  • CI workspace-test green (auto)
  • ci / gate green (auto)

Files changed

| File | Change |
|---|---|
| contracts/qwen2-e2e-verification-v1.yaml | v1.9.0 → v1.10.0; FALSIFY-QW2E-SHIP-003 PARTIAL → DISCHARGED + discharged_evidence |
| crates/aprender-core/src/format/ship_003.rs | Added drift-prevention YAML binding test |
| docs/specifications/aprender-train/ship-two-models-spec.md | v2.56.0 → v2.57.0 |
| evidence/ship-003-full-discharge/discharge-evidence-v1.json | NEW: discharge summary |
| evidence/ship-003-full-discharge/apr-diff-339.json | NEW (164 KB): raw apr diff --json output |

Methodology

Pure stack tooling: apr diff --values --transpose-aware end-to-end on the 15 GB / 8 GB SHIP-TWO-001 teacher pair. No eprintln!, no bash workaround, no parallel implementation. Honors feedback_apr_trace_not_eprintln.md.

🤖 Generated with Claude Code

…osine sweep (mmap-enabled)

SHIP-TWO-001 spec v2.56.0 → v2.57.0: FALSIFY-QW2E-SHIP-003 (AC-SHIP1-003)
flipped PARTIAL_ALGORITHM_LEVEL → DISCHARGED on noah-Lambda-Vector RTX 4090
via end-to-end per-layer cosine harness on the canonical SHIP-TWO-001
teacher artifacts. Fifth MODEL-1 PARTIAL → DISCHARGED of the cycle (after
SHIP-009 PR #1054 + SHIP-001 PR #1056 + SHIP-004 PR #1057 + SHIP-010 PR #1055).

Live discharge command:
  apr diff /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.safetensors \
           /mnt/nvme-raid0/models/ship-two-001/qwen2.5-coder-7b-instruct-q4k.apr \
           --values --transpose-aware --json --limit 339

Results:
  - Tensors compared:        339
  - Min cosine similarity:   0.9999999403953552 (1 − cos ≈ 6.0e-8, about
                              four orders of magnitude inside the 1e-3
                              margin allowed by the 0.999 floor)
  - Max cosine similarity:   1.0
  - Below-threshold count:   0
  - Aggregate verdict:       Pass (verdict_from_per_layer_cosines)
  - Run-time:                192 s
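
The aggregate verdict above reduces per-tensor cosines to a single Pass/Fail. The sketch below shows one way such a reduction could look; it is not the body of `verdict_from_per_layer_cosines`, and the names `cosine_similarity` and `aggregate_verdict` are hypothetical. It assumes f64 accumulation and treats NaN as below the floor.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// Cosine similarity of two equal-length tensors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm operand.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Pass only if at least one cosine was supplied and none falls below `floor`.
/// Also returns the below-threshold count; NaN counts as below.
fn aggregate_verdict(cosines: &[f64], floor: f64) -> (Verdict, usize) {
    let below = cosines.iter().filter(|&&c| !(c >= floor)).count();
    let verdict = if !cosines.is_empty() && below == 0 {
        Verdict::Pass
    } else {
        Verdict::Fail
    };
    (verdict, below)
}
```

With the 0.999 floor, the reported minimum of 0.9999999403953552 and maximum of 1.0 both pass, giving a below-threshold count of 0.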

Worst 5 tensors (still passing):
  - model.layers.0.mlp.down_proj.weight  cos=0.9999999403953552 max_diff=4.81e-4
  - model.layers.0.mlp.gate_proj.weight  cos=0.9999999403953552 max_diff=4.43e-4
  - model.layers.0.mlp.up_proj.weight    cos=0.9999999403953552 max_diff=2.39e-4
  - model.layers.0.self_attn.o_proj.weight cos=0.9999999403953552 max_diff=2.37e-4
  - model.layers.1.mlp.down_proj.weight  cos=0.9999999403953552 max_diff=3.59e-4

The worst 5 cluster in layers 0 and 1 (four MLP matrices plus the layer-0
o_proj), all with max_diff < 5e-4 (Q4K quantization noise within the ±5%
Q4_K spec tolerance). The contract's stated "196 tensor comparisons" is
exceeded: this evidence walks all 339 named common tensors (28 transformer
blocks × 7 projections = 196, plus embed_tokens, lm_head, layer-norms,
and biases).
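
One breakdown that reproduces the 339 total is sketched below. The per-block bias and norm counts are an assumption based on the usual Qwen2 layout (q/k/v biases, two RMSNorm weights per block, one final norm); they are not taken from the evidence file.

```rust
const BLOCKS: u32 = 28;

/// Projection weights only: q, k, v, o, gate, up, down per block.
fn projection_tensors() -> u32 {
    BLOCKS * 7
}

/// All named tensors under the assumed layout.
fn total_named_tensors() -> u32 {
    let per_block = 7 // projection weights
        + 3 // q/k/v biases (assumed)
        + 2; // input + post-attention norm weights (assumed)
    let global = 3; // embed_tokens, lm_head, final norm (assumed)
    BLOCKS * per_block + global
}
```

Under these assumptions the projection count is 196 and the full count is 339, consistent with the numbers in the commit message.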

Crucial dependency: PR #1058 (perf fix to RosettaStone::load_tensor_f32_apr)
unblocks this scan. Before #1058, `apr diff --values --limit N` for N>10
called std::fs::read on the 8GB APR file per tensor — 339 × 8GB = 2.7TB
total read traffic, infeasible. Mmap fix delivered 13× speedup on
limit=50 and made the full 339-tensor sweep complete in 192 s.

Files changed:
- contracts/qwen2-e2e-verification-v1.yaml v1.9.0 → v1.10.0
  FALSIFY-QW2E-SHIP-003 discharge_status: PARTIAL_ALGORITHM_LEVEL → DISCHARGED
  discharged_evidence block: host, command, artifacts (sha+size), 339-tensor
  cosine_summary (min/max/below_threshold), worst_5_tensors, aggregate_verdict,
  evidence_discharged_by_live array, runtime_seconds, runtime_note.

- crates/aprender-core/src/format/ship_003.rs
  Added drift-prevention YAML binding test
  `falsify_ship_003_yaml_binding_pins_discharged_status` parsing
  qwen2-e2e-verification-v1.yaml and asserting:
    * discharge_status == "DISCHARGED"
    * discharged_evidence.host == "noah-Lambda-Vector"
    * discharged_evidence.aggregate_verdict == "Pass"
    * discharged_evidence.tensors_compared == 339
    * discharged_evidence.cosine_summary.below_threshold_count == 0
    * evidence_discharged_by_live non-empty

- docs/specifications/aprender-train/ship-two-models-spec.md
  v2.56.0 → v2.57.0 with full atomic-next-action narrative.
  Coverage tally: 35 PARTIAL + 10 DISCHARGED → 34 + 11.

- evidence/ship-003-full-discharge/discharge-evidence-v1.json (NEW)
  Self-contained discharge summary with full artifact paths,
  cosine_summary, worst_5/best_5 tensors, verification_chain,
  tooling_chain_proof, discharge_rationale.

- evidence/ship-003-full-discharge/apr-diff-339.json (NEW, 164 KB)
  Raw apr diff --json output: 339 tensor comparisons with per-tensor
  cosine_similarity, element_count, identical_count, max_diff, mean_diff,
  rmse, shape_a/b, status. Reproducible from the local apr binary +
  canonical lambda-labs paths.

Verification (all green):
  - cargo test -p aprender-core --lib ship_003 — 4/4 PASS
    (3 existing verdict + 1 gate + 1 new YAML binding)
  - pv validate contracts/qwen2-e2e-verification-v1.yaml — PASS
  - Live `apr diff --values --limit 339 --json` exit 0, 339 results emitted

Methodological note: zero `eprintln!`, zero bash workaround, zero
parallel-implementation. Pure `apr diff --values --transpose-aware`
end-to-end on a 7.6B-param shipped teacher. Honors
`feedback_apr_trace_not_eprintln.md` and
`feedback_pv_not_bash_for_contracts.md`. Mirrors the
SHIP-001/004/009/010 closure pattern.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 25, 2026 14:27
@noahgift noahgift merged commit 893fdcf into main Apr 25, 2026
11 checks passed
@noahgift noahgift deleted the feat/falsify-ship-003-full-discharge branch April 25, 2026 14:53
noahgift added a commit that referenced this pull request Apr 25, 2026
….0 → v2.59.0 (#1060)

Records the SHIP-007 GQA-7:1 parity bug investigation thread captured
during the 2026-04-25 session as a new §15 of SPEC-SHIP-TWO-001 and
updates the atomic-next-action banner to reference it.

No discharge promotion (coverage tally unchanged at 33+12). This is
investigation-recording, not rule promotion.

What §15 contains:

15.1 Surface symptoms — the two independent observations:
  - apr bench parity gate: CPU argmax 334 vs GPU argmax 8127,
    cosine=−0.005 (anti-correlated, structural divergence)
  - apr qa --json on GGUF: format_parity reports
    GGUF argmax=17 != SafeTensors argmax=59260
  - 370M MODEL-2 training works on the same RTX 4090, so the bug
    is GQA-7:1-specific, not GPU-host-wide.

15.2 Five Whys — traces the surface symptom from the parity-gate
     failure down to the load-bearing edge case: GQA's
     num_heads ≠ num_kv_heads makes layout-then-reshape order
     non-commutative, while MHA (num_heads = num_kv_heads, where
     370M training lives) is invariant under either order.

15.3 Root-cause hypothesis: a GQA-7:1-specific layout-vs-reshape
     ordering bug on K and/or V projections that causes CPU and
     GPU forward to consume the same physical bytes with different
     effective head-axis interpretations, compounding through 28
     transformer blocks into anti-correlated logits.

15.4 Falsifiable next investigation step: a single-tensor Q × K^T
     element-by-element comparison on
     model.layers.0.self_attn.k_proj.weight from the row-major-
     guaranteed APR (SHIP-003 PR #1059), then iterate through V,
     attention scores, weights, output, and o_proj until the
     divergent stage is named. Per feedback_apr_trace_not_eprintln.md,
     this is the proper TraceStep-extension path, not eprintln!.

15.5 Side-bug noted: apr diff --transpose-aware appears not to apply
     the transpose before cosine computation when shapes are [a,b]
     vs [b,a]. Filed as a separate apr-cli ticket. Does not affect
     SHIP-007 root-cause analysis — SafeTensors↔APR same-shape
     comparison via SHIP-003 #1059 confirmed weight-byte parity at
     cos≥0.9999999.
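
For context on the 15.5 side-bug, the sketch below shows what transposing before the cosine computation could look like when shapes are [a,b] vs [b,a]. It is an illustration under assumptions (row-major f32 storage, hypothetical function names), not the `apr diff` implementation.

```rust
/// Transpose a row-major [rows, cols] matrix into row-major [cols, rows].
fn transpose(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(data.len(), rows * cols);
    let mut out = vec![0.0f32; data.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = data[r * cols + c];
        }
    }
    out
}

/// Cosine similarity over flat buffers of equal length, accumulated in f64.
fn flat_cosine(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Compare `a` with `b`, transposing `b` first when its shape is the
/// reverse of `a`'s. Returns None when the shapes are incompatible.
fn cosine_transpose_aware(
    a: &[f32],
    a_shape: (usize, usize),
    b: &[f32],
    b_shape: (usize, usize),
) -> Option<f64> {
    if a_shape == b_shape {
        return Some(flat_cosine(a, b));
    }
    if a_shape == (b_shape.1, b_shape.0) {
        let b_t = transpose(b, b_shape.0, b_shape.1);
        return Some(flat_cosine(a, &b_t));
    }
    None
}
```

Comparing the raw buffers without the transpose yields a cosine well below 1 even for identical weights, which is the symptom the side-bug describes.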

15.6 Blast radius inventory: the remaining 5 MODEL-1 PARTIALs
     (SHIP-002 / 005 / 006 / 007 / 008) all transitively block on
     this single fix. A single root-cause fix discharges all 5
     simultaneously — highest-leverage MODEL-1 work item remaining.

15.7 Methodological note: entire investigation conducted using only
     apr CLI tooling (apr diff, apr qa, apr bench, apr inspect).
     Zero eprintln! injected into forward.rs / ffn_block.rs / CUDA
     kernels. Honors feedback_apr_trace_not_eprintln.md.

Evidence chain:
1. apr diff --values --limit 339 (post-#1058 mmap fix, 192s on
   15GB safetensors / 8GB APR pair) — SafeTensors↔APR cos≥0.9999999.
2. apr diff --values --limit 3 on GGUF↔APR, SafeTensors↔GGUF —
   revealed shape asymmetry: GGUF [in,out] vs APR/SafeTensors
   [out,in].
3. apr qa --json on both APR and GGUF teachers — revealed cross-
   format argmax divergence.
4. SHIP-007 GPU parity gate's existing telemetry — confirmed
   structural divergence.

Methodological consistency with the 6-PR cascade preceding this
amendment: pure stack tooling, contract-backed numbers, drift-
prevention pattern. This commit is documentation only — no Rust
changes, no contract changes — but pins the investigation thread
durably in the spec where future investigators (and the next
multi-PR TraceStep extension effort) will find it.

Spec v2.58.0 → v2.59.0. Atomic-next-action banner updated to point
at §15 as the load-bearing investigation surface.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…sifier (#1061)

The §15.4 falsifier from spec v2.59.0 (PR #1060): a CPU vs GPU
incremental_attention_gpu kernel parity test on the canonical
Qwen2.5-Coder-7B shape (NUM_HEADS=28, NUM_KV_HEADS=4, HEAD_DIM=128,
HIDDEN=3584). The peer test `gqa_attention_parity.rs` covers TinyLlama's
GQA-8:1 (NUM_HEADS=32, head_dim=64) — that test passes on RTX 4090 but
doesn't exercise the 7:1 ratio (non-power-of-2 q_per_kv = 7) the
SHIP-007-blocking 7B teacher specifically uses.

Three tests added to
`crates/aprender-serve/tests/qwen2_gqa_7_1_attention_parity.rs`:

1. `ship_007_qwen2_gqa_7_1_head_mapping_property` — pure arithmetic
   check on the GQA-7:1 head mapping (q_head/q_per_kv) for all 28 q_heads.
   Verifies the kernel formula `(q_head * NUM_KV_HEADS) / NUM_HEADS`
   produces the same mapping as q_head / q_per_kv for the canonical 28:4 ratio.

2. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token` (#[ignore]) —
   first-token case (no cache). For attention over a single K/V
   position, softmax([single_score]) = [1.0] so output = current_v
   expanded across 4 KV heads to 28 Q heads. CPU reference uses the
   mirror of `cpu_gqa_attention` from the peer test parameterized at
   the Qwen shape. Tolerance: 1e-4 elementwise across 3584 outputs.

3. `ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token` (#[ignore]) —
   second-token case with one populated K/V cache position. Tests the
   full attention mechanism with KV cache state. Tolerance: 1e-3
   elementwise (slightly looser to accommodate cumulative FP rounding
   over the 2-position softmax + weighted-sum).
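
The head-mapping property and the first-token shortcut from tests 1 and 2 can be sketched on the CPU as below. The constants come from the commit message; the function names are hypothetical and this is not the code in `qwen2_gqa_7_1_attention_parity.rs`.

```rust
const NUM_HEADS: usize = 28;
const NUM_KV_HEADS: usize = 4;
const HEAD_DIM: usize = 128;

/// Kernel-style mapping quoted in the commit message.
fn kv_head_kernel(q_head: usize) -> usize {
    (q_head * NUM_KV_HEADS) / NUM_HEADS
}

/// Reference mapping: q_head / q_per_kv, with q_per_kv = 28 / 4 = 7.
fn kv_head_reference(q_head: usize) -> usize {
    q_head / (NUM_HEADS / NUM_KV_HEADS)
}

/// First-token attention output. Softmax over a single score is [1.0],
/// so each Q head copies the V slice of the KV head it maps to.
fn first_token_output(v: &[f32]) -> Vec<f32> {
    assert_eq!(v.len(), NUM_KV_HEADS * HEAD_DIM);
    let mut out = Vec::with_capacity(NUM_HEADS * HEAD_DIM);
    for q_head in 0..NUM_HEADS {
        let kv = kv_head_reference(q_head);
        out.extend_from_slice(&v[kv * HEAD_DIM..(kv + 1) * HEAD_DIM]);
    }
    out
}
```

The output has 28 × 128 = 3584 elements, matching the hidden size used for the elementwise tolerance check.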

Result on noah-Lambda-Vector RTX 4090 (CUDA compute capability 8.9):

  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_first_token  ... ok
  test ship_007_qwen2_gqa_7_1_cpu_gpu_parity_second_token ... ok
  test ship_007_qwen2_gqa_7_1_head_mapping_property       ... ok

All three pass. **The GQA-7:1 incremental_attention_gpu kernel is NOT
the SHIP-007 root cause.** CPU and GPU outputs agree within the stated
elementwise tolerances (1e-4 and 1e-3) for the canonical
Qwen2.5-Coder-7B shape on synthetic inputs.

Materially narrows §15.4's surviving suspect list:
- ✅ Q/K/V head-mapping arithmetic correct (8:1 + 7:1 both pass)
- ✅ Q × K^T per-head correct
- ✅ Softmax-weighted V aggregation correct
- ✅ Scale factor (1/√head_dim) at head_dim=128 correct
- ✅ KV cache state-management correct
- 🟡 Surviving suspects: Q/K/V projection matmul (BEFORE attention),
  o_proj (AFTER attention), RMSNorm, FFN, LM head, multi-layer KV
  cache layout, residual stream propagation.

These tests serve as a durable regression guard for the GQA-7:1
attention kernel proper: any future refactor of incremental
attention that breaks 7:1-specific behavior will flip them
red on `cargo test --features cuda --release -- --ignored`.

Spec §15.4 (PR #1060) anticipated this test and named the proper
follow-up: a single-tensor matmul parity test on Q/K/V projection
weights from the row-major-correct APR (sha256 a394dd28...0ddeb28,
verified by SHIP-003 PR #1059).

Verification:
  cargo test -p aprender-serve --test qwen2_gqa_7_1_attention_parity \
    --features cuda --release -- --ignored
  → 3 passed; 0 failed; 0 ignored

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…ent layer named (#1064)

* docs(ship-007): §17 — layer-3 ffn_out anomaly identified — spec v2.61.0 → v2.62.0

Executed §16.4's first iteration ("apr trace --payload --layer 0 on
both APR and GGUF teachers, bisect through 28 layers") against the
APR teacher's existing per-layer telemetry. The full 28-layer
ffn_out std progression on paiml/qwen2.5-coder-7b-apache-q4k-v1
(prompt "What is 2+2?") shows a 31× discontinuity at layer 3:

  Layer 2: ffn_out std=0.22
  Layer 3: ffn_out std=11.46  ← 31× spike
  Layer 4: ffn_out std=3.84   ← damps in 1 layer (one-off perturbation)
  Median:  ffn_out std=0.5–2.0

The residual stream's output std jumps 0.72 → 11.78 at layer 3 and
stays elevated. Three signals point at layer 3 ffn_out specifically:
(a) magnitude 31× isn't architecture-driven (SHIP-003 PR #1059's
339-tensor cosine sweep showed the underlying weights match
SafeTensors at cos ≥ 0.9999999); (b) it damps in one layer (one-off
perturbation pattern, not a stable feature); (c) the mean shift of
-0.082 is 100× the median magnitude, suggesting a sign-bias defect
rather than a magnitude defect.

§17.3 narrows §16.3's four candidates: layer-composition glue in
forward_single_with_scratch at layer 3 FFN is "most likely". Three
new §17.3 candidates added: Q4K dequant under load on 18944-dim FFN;
SiLU numerical stability under SwiGLU `gate * silu(up)`; fused
gate+up matvec dispatch defect (per CLAUDE.md FFN section).

§17.4 specifies sub-layer bisection: emit gate_proj_out, silu(up_proj_out),
gate_proj_out * silu(up_proj_out), down_proj_out separately. Whichever
sub-tensor first shows the 31× std discontinuity vs GGUF path is the
bug site. This requires the §15.5 TraceStep enum extension — now
load-bearing for the fix.
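
The bisection rule in §17.4 ("whichever sub-tensor first shows the discontinuity") can be sketched as a linear scan over per-stage statistics. This is an assumption-laden illustration: `first_divergent_stage` is a hypothetical helper, the threshold is arbitrary, and real inputs would come from the extended trace output, which does not exist yet.

```rust
/// Each entry is (stage name, APR-path std, reference-path std).
/// Returns the first stage whose APR std exceeds the reference std by
/// more than `max_ratio`, or None if every stage stays within bounds.
fn first_divergent_stage<'a>(
    stages: &[(&'a str, f64, f64)],
    max_ratio: f64,
) -> Option<&'a str> {
    for &(name, apr_std, ref_std) in stages {
        // Skip stages with a non-positive reference std to avoid dividing by zero.
        if ref_std > 0.0 && apr_std / ref_std > max_ratio {
            return Some(name);
        }
    }
    None
}
```

Stages must be listed in execution order (gate_proj_out, silu(up_proj_out), their product, down_proj_out) so that the first hit names the earliest divergent sub-tensor.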

Spec v2.61.0 → v2.62.0. No coverage tally change.

Methodologically: zero eprintln!, zero bash workarounds, third re-use
of `apr trace --payload` primitive without modification (after §15
and §16). Per feedback_apr_trace_not_eprintln.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence(ship-007): layer-3 ffn_out spike — raw apr trace evidence files

Captures the §17 falsifier evidence as raw artifacts:

  evidence/ship-007-layer-3-anomaly/
  ├── apr-trace-payload-7b-2026-04-26.txt    # 274 lines, all 28 layers
  ├── gguf-trace-payload-7b-2026-04-26.txt   # 34 lines, final decode only
  └── discharge-evidence-v1.json             # JSON summary

Precise measurement: layer-3 ffn_out std = 11.459 / layer-2 ffn_out
std = 0.216 → 53× spike (§17 stated 31×; actual ratio is even more
extreme).
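
The corrected ratio follows directly from the two std values. The sketch below is illustrative (`std_dev` and `spike_ratio` are hypothetical helpers, and population standard deviation is an assumption about how the trace computes std).

```rust
/// Population standard deviation, accumulated in f64.
/// Returns None for empty input.
fn std_dev(xs: &[f32]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().map(|&x| x as f64).sum::<f64>() / n;
    let var = xs.iter().map(|&x| (x as f64 - mean).powi(2)).sum::<f64>() / n;
    Some(var.sqrt())
}

/// Ratio of a layer's std to the previous layer's std.
fn spike_ratio(prev_std: f64, next_std: f64) -> f64 {
    next_std / prev_std
}
```

With the precise values, 11.459 / 0.216 is about 53.05. The rounded values from §17 (11.46 / 0.22) give about 52.09, so neither reproduces the 31× originally stated.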

The output residual stream's std jumps 0.7159 (layer 2) → 11.7756
(layer 3) → 25+ (layers 9-19) and never recovers below 13. This
matches the realizar/aprender-serve CLAUDE.md FFN verification
checklist note: "Verify FFN output doesn't cause catastrophic
cancellation" — the layer-3 spike IS that catastrophic
cancellation pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request Apr 26, 2026
…→ pass — spec §20 + #1059 evidence — v1.4.0 → v1.5.0 (#1071)

GATE-GPUTRAIN-004 (370M step-time budget < 500ms on RTX 4090) was
marked `verdict: pending` despite its paired falsification test
FALSIFY-GPUTRAIN-005 being DISCHARGED with median 101.30 ms
(20.3% of budget) since 2026-04-24.

This contract bump flips the gate to `verdict: pass` with a
`verdict_basis` field citing both:

1. **FALSIFY-GPUTRAIN-005 evidence** (canonical config seq_len=2048
   batch=1): median 101.30 ms across 25 steps on
   noah-Lambda-Vector RTX 4090 — `evidence/task-132/`.
2. **§20 evidence** (PR #1070, different config seq_len=512):
   median 264.74 ms across 100 steps — `evidence/task-132-residual-b/`.

Both well under the 500ms ceiling. Two evidence files at different
config bands demonstrate budget compliance is robust at this margin.
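
The budget shares quoted above can be recomputed from the medians. This is a small sketch with hypothetical helper names; it assumes the median is taken over per-step wall-clock times in milliseconds.

```rust
/// Median of a sample set; None for empty input.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Share of the step-time budget consumed, in percent.
fn budget_share_percent(median_ms: f64, budget_ms: f64) -> f64 {
    100.0 * median_ms / budget_ms
}
```

Against the 500 ms ceiling, 101.30 ms is about 20.3% of budget and 264.74 ms is about 52.9%.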

Contract version v1.4.0 → v1.5.0 (additive metadata, no rule
change). `pv validate`: 0 errors, 0 warnings.

This is a contract-cosmetic flip: GATE-GPUTRAIN-004's underlying
invariant has been satisfied since 2026-04-24; only the gate's own
pointer to the evidence was missing, which left the field at
`verdict: pending`.

References:
- spec §20 (PR #1070): live evidence capture 2026-04-26
- spec §19.4 Residual B: this is the contractual durable verdict
- evidence/task-132/rtx4090-370m-step-budget-and-repro.json
- evidence/task-132-residual-b/cuda-50step-2026-04-26.json

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…efined hypothesis ALSO FALSIFIED (#1536)

Authors a third lib-only falsifier (FALSIFY-FFN-GGUF-006) in
apr_transformer::helpers::determinism_tests:

  falsify_ffn_gguf_006_simd_vs_scalar_reduction_order_byte_identity

Test runs APR's simd_dot_f32_avx2 (AVX2 8-wide FMA) and APR's
scalar fallback (iter().zip().map(*).sum()) on the same canonical
synthetic input, compares bit patterns via f32::to_bits().

EMPIRICAL RESULT (2026-05-06): both paths produce BYTE-IDENTICAL
output 0x44191e70 = 612.4756. Asserted as regression-test
invariant.
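
The shape of such a bit-pattern comparison is sketched below. It is not the falsifier itself: `dot_8_lane` is a portable stand-in for an 8-wide FMA kernel (it does not call `simd_dot_f32_avx2` or any AVX2 intrinsic), and its final reduction order is an assumption.

```rust
/// Plain left-to-right scalar dot product.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Portable stand-in for an 8-wide kernel: eight independent accumulators
/// updated with fused multiply-add, reduced at the end, plus a scalar tail.
fn dot_8_lane(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunked = n / 8 * 8;
    let mut acc = [0.0f32; 8];
    for i in (0..chunked).step_by(8) {
        for lane in 0..8 {
            acc[lane] = a[i + lane].mul_add(b[i + lane], acc[lane]);
        }
    }
    let mut sum: f32 = acc.iter().sum();
    for i in chunked..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Bit-level equality, as opposed to `==` (which treats 0.0 and -0.0 as equal).
fn bit_identical(x: f32, y: f32) -> bool {
    x.to_bits() == y.to_bits()
}
```

In general the two summation orders are not guaranteed to agree bit-for-bit; they do whenever every partial sum is exactly representable, which is why a result of byte identity on a given input is an empirical finding worth pinning as a regression invariant.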

This FALSIFIES the refined H2a' hypothesis at the SIMD-vs-scalar
level. The cumulative APR↔GGUF drift cannot be explained by APR's
SIMD vs APR's scalar path differing on this class of f32 inputs.

SECOND HYPOTHESIS FALSIFICATION IN ONE SESSION:
- §28 (parallel-reduction non-determinism, M91): FALSIFIED
- H2a' (SIMD-vs-scalar reduction-order, this PR): FALSIFIED

NEW REFINED HYPOTHESIS H2d (post-second-falsification):
The bit-level difference between APR and GGUF must come from one
of:

H2d.1: Per-block dequant boundaries differ between APR's whole-row
       F32 reduction and GGUF's Q4K-super-block-wise reduction
H2d.2: APR's F32 weights differ at bit level from a true
       dequantization of the GGUF Q4K bytes (despite SHIP-003 PR
       #1059 cos≥0.9999999 weight invariance)
H2d.3: GGUF's intermediate Q8K activation quantization rounds
       activations to ~7-bit precision differently than APR's
       full-F32 path

Each H2d.x is a separate falsifier candidate.

Next M-FFN-GGUF-4 step (c) deliverable: H2d.2 is the most directly
testable autonomously. Load the APR F32 weights and the GGUF Q4K bytes
for the same tensor, dequantize Q4K via APR's dequant routine, and
compare element-wise. If they differ at the bit level, H2d.2 is confirmed.
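
The element-wise comparison step could be sketched as below, assuming both tensors are already available as f32 slices. The Q4K dequantization itself is out of scope here, and the helper names are hypothetical.

```rust
/// Count positions where two f32 slices differ in their bit patterns.
/// Returns None if the lengths differ.
fn count_bit_mismatches(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .filter(|(x, y)| x.to_bits() != y.to_bits())
            .count(),
    )
}

/// Largest absolute difference between corresponding elements.
/// Returns None if the lengths differ or the slices are empty.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0f32, f32::max),
    )
}
```

Reporting both numbers separates the two cases of interest: a nonzero mismatch count with a tiny max difference points at rounding, while a large max difference points at a layout or scale error.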

Contract amendment: trace-ffn-sub-block-gguf-v1 v1.2.0 → v1.3.0.

Status promotions:
- FALSIFY-FFN-GGUF-006: NEW → DISCHARGED (test passes after flip)
- M-FFN-GGUF-4 step (b): PENDING → SHIPPED

Step (c) remains PENDING — narrowed scope to H2d.{1,2,3}.

Production hot paths byte-unchanged.

`pv validate` 0/0; 3 lib tests pass.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>