Skip to content

feat(M-FFN-GGUF-4 step b): SIMD-vs-scalar byte-identity test — H2a' ALSO FALSIFIED#1536

Merged
noahgift merged 1 commit into
mainfrom
feat/m-ffn-gguf-4b-simd-vs-scalar-reduction-order
May 6, 2026
Merged

feat(M-FFN-GGUF-4 step b): SIMD-vs-scalar byte-identity test — H2a' ALSO FALSIFIED#1536
noahgift merged 1 commit into
mainfrom
feat/m-ffn-gguf-4b-simd-vs-scalar-reduction-order

Conversation

@noahgift

@noahgift noahgift commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary

Authors FALSIFY-FFN-GGUF-006: byte-identity test between APR's simd_dot_f32_avx2 (AVX2 8-wide FMA) and APR's scalar fallback (iter().zip().map(*).sum()) on canonical synthetic input.

Empirical result: both paths produce byte-identical output 0x44191e70 = 612.4756. Asserted as regression-test invariant.

This FALSIFIES the refined H2a' hypothesis at the SIMD-vs-scalar level — the cumulative APR↔GGUF drift cannot be explained by APR-internal reduction-order differences.

Two hypothesis falsifications in one session

Falsifier Hypothesis Result
FALSIFY-FFN-GGUF-005 (M91) §28 parallel-reduction non-determinism FALSIFIED — APR f32_matmul byte-deterministic
FALSIFY-FFN-GGUF-006 (this PR) H2a' SIMD-vs-scalar reduction-order FALSIFIED — AVX2 and scalar produce byte-identical output

Refined hypothesis H2d (post-second-falsification)

Pinned in v1.3.0 amendment. The bit-level APR↔GGUF difference must come from one of:

Hypothesis Description
H2d.1 Per-block dequant boundaries differ (whole-row F32 reduction vs Q4K-super-block-wise)
H2d.2 APR's F32 weights differ at bit level from dequantized GGUF Q4K bytes
H2d.3 GGUF's intermediate Q8K activation quantization rounds differently than APR's F32 path

Next M-FFN-GGUF-4 step (c) deliverable

H2d.2 is most directly testable autonomously — load APR F32 weights + GGUF Q4K bytes for same tensor, dequantize Q4K via APR's dequant routine, compare element-wise. If bit-level differs, H2d.2 confirmed and the SHIP-007 fix scope narrows to "fix dequant invariant".

Contract amendment (v1.2.0 → v1.3.0)

Field Before After
version 1.2.0 1.3.0
FALSIFY-FFN-GGUF-006 NEW DISCHARGED
M-FFN-GGUF-4 step (b) PENDING SHIPPED

Test plan

  • pv validate 0/0
  • 3 lib tests pass (FFN-GGUF-005a, 005b, 006)
  • No production hot path touched (additive #[cfg(test)] mod)
  • H2a' empirically falsified at SIMD-vs-scalar level
  • H2d hypothesis triplet authored for next-step falsification

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 6, 2026 19:49
@noahgift

noahgift commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Re-trigger CI

…efined hypothesis ALSO FALSIFIED

Authors a third lib-only falsifier (FALSIFY-FFN-GGUF-006) in
apr_transformer::helpers::determinism_tests:

  falsify_ffn_gguf_006_simd_vs_scalar_reduction_order_byte_identity

Test runs APR's simd_dot_f32_avx2 (AVX2 8-wide FMA) and APR's
scalar fallback (iter().zip().map(*).sum()) on the same canonical
synthetic input, compares bit patterns via f32::to_bits().

EMPIRICAL RESULT (2026-05-06): both paths produce BYTE-IDENTICAL
output 0x44191e70 = 612.4756. Asserted as regression-test
invariant.

This FALSIFIES the refined H2a' hypothesis at the SIMD-vs-scalar
level. The cumulative APR↔GGUF drift cannot be explained by APR's
SIMD vs APR's scalar path differing on this class of f32 inputs.

SECOND HYPOTHESIS FALSIFICATION IN ONE SESSION:
- §28 (parallel-reduction non-determinism, M91): FALSIFIED
- H2a' (SIMD-vs-scalar reduction-order, this PR): FALSIFIED

NEW REFINED HYPOTHESIS H2d (post-second-falsification):
The bit-level difference between APR and GGUF must come from one
of:

H2d.1: Per-block dequant boundaries differ between APR's whole-row
       F32 reduction and GGUF's Q4K-super-block-wise reduction
H2d.2: APR's F32 weights differ at bit level from a true
       dequantization of the GGUF Q4K bytes (despite SHIP-003 PR
       #1059 cos≥0.9999999 weight invariance)
H2d.3: GGUF's intermediate Q8K activation quantization rounds
       activations to ~7-bit precision differently than APR's
       full-F32 path

Each H2d.x is a separate falsifier candidate.

Next M-FFN-GGUF-4 step (c) deliverable: H2d.2 is most directly
testable autonomously — load APR F32 weights + GGUF Q4K bytes for
same tensor, dequantize Q4K via APR's dequant routine, compare
element-wise. If bit-level differ, H2d.2 confirmed.

Contract amendment: trace-ffn-sub-block-gguf-v1 v1.2.0 → v1.3.0.

Status promotions:
- FALSIFY-FFN-GGUF-006: NEW → DISCHARGED (test passes after flip)
- M-FFN-GGUF-4 step (b): PENDING → SHIPPED

Step (c) remains PENDING — narrowed scope to H2d.{1,2,3}.

Production hot paths byte-unchanged.

`pv validate` 0/0; 3 lib tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/m-ffn-gguf-4b-simd-vs-scalar-reduction-order branch from a6fa508 to f6fa816 Compare May 6, 2026 20:10
@noahgift noahgift merged commit 496e955 into main May 6, 2026
10 checks passed
@noahgift noahgift deleted the feat/m-ffn-gguf-4b-simd-vs-scalar-reduction-order branch May 6, 2026 20:39
noahgift added a commit that referenced this pull request May 7, 2026
…s decompose §27 1723% within rounding — fix scope EMPIRICALLY VALIDATED — spec v3.03.0 → v3.04.0 (#1546)

Two-day autonomous /loop session shipped 11 lib-test + 1 integration-test
falsifiers (M91-M101, aprender PRs #1535/#1536/#1537/#1538/#1540/#1541/
#1542/#1543/#1544/#1545) decomposing the §27 layer-3 ffn_swigl 18.23×
APR-vs-GGUF std-ratio.

Final empirical decomposition (2026-05-07):

  M94 mechanism × M95 compounding × M99 std-ratio × A5 real-teacher × residual
  = 0.077% × 5.70× × 50× × 5.56× × 14×
  ≈ 1715%   ≈   §27's 1723% (within rounding)

Six synthetic amplifier candidates resolved:
- A1 (RoPE phase, M98)        — FALSIFIED 1.00× UNITARY
- A2 (Softmax saturation, M97) — FALSIFIED 0.01× COMPRESSES
- A3 (Block-scale variance, M96) — FALSIFIED 1.00× SCALE-INVARIANT
- A4 (Multi-token batch, M99) — FALSIFIED 0.26× per-token + 50× std-ratio
- A5 (Real-weight non-uniformity, M100) — PARTIALLY CONFIRMED 5.56× LIVE
- A6 (RMSNorm rsqrt, M101)    — FALSIFIED 1.00× HOMOGENEOUS

14× residual is now attributed entirely to cumulative-layer interaction.

SHIP-007 §22 fix scope EMPIRICALLY VALIDATED as Option-A (PROMOTE
GGUF-PATH semantics into APR forward): switching APR's `f32_matmul`
to Q8K activation quant + fused matvec semantics will recover the
5.56× per-matvec amplification on every matmul, eliminating cumulative
APR-vs-GGUF drift. Estimated fix scope ~250-400 LOC; transitively
discharges 5 MODEL-1 PARTIALs (SHIP-002, SHIP-005, SHIP-006, SHIP-007,
SHIP-008) per §17.5.

Cascade methodology consolidated:
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_falsifier_cascade_decomposes_magnitude.md
- ~/.claude/projects/-home-noah-src-aprender/memory/feedback_falsifier_chain_assert_difference.md

Companion-spec entries M91-M101 in claude-code-parity-apr/docs/
specifications/claude-code-parity-apr-poc.md provide the full per-PR
narrative. Aprender contract `contracts/trace-ffn-sub-block-gguf-v1.yaml`
v1.0.0 → v1.12.0 across 12 amendments.

MODEL-1 ship %: unchanged at 91% until M-FFN-GGUF-5 (actual fix PR) lands.
MODEL-2 ship %: unchanged at 57% until step 5g.3 produces val_loss < 9.38.

Spec v3.03.0 → v3.04.0. Atomic next action banner only — full §59
narrative deferred to deliberate-session work alongside M-FFN-GGUF-5
fix PR.

Refs PMAT-CCPA, SHIP-007 §22, M91-M101 cascade.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant