contract(apr-vs-gguf-forward-parity-v1): v1.0.0 → v1.1.0 — §37 sample-size-parity gate#1107
Merged
Merged
Conversation
…-size-parity gate Per SPEC-SHIP-TWO-001 §37 (TRACE-CAPTURE-POINT MISMATCH), the v1.0.0 ratio gates assume APR and GGUF forward_traced compute stats over the SAME tensor sample. They DO NOT today: apr_layer[0].attn_norm_stats.count == 25088 (7 × 3584, all-tokens) gguf_layer[0].attn_norm_stats.count == 3584 (1 × 3584, last-only) The 18.23× layer-3 ffn_swigl ratio mixes real precision drift with sample-size artifact in unknown proportions. v1.0.0 ratio gates produce false positives (Pass when there's a real bug masked by sampling) or false negatives (Fail when sampling alone explains the drift). This bump adds: - New equation `trace_sample_size_parity` documenting the count-equality precondition with both fix-surface options listed (§37.5). - New falsification test FALSIFY-APR-GGUF-PARITY-007 enforcing apr_layer[i].count == gguf_layer[i].count across 28 layers × 10 stat slots = 280 equality checks. FAILS today; PASSES post-fix. - New kani harness KH-APR-GGUF-PARITY-003 with bound=280. - Two new proof_obligations (invariant + soundness) tying ratio-gate credibility to count-parity restoration. Five-whys (recorded in §37.7 of spec): 1. Why isn't MODEL-1 inference correct? `apr run` produces gibberish. 2. Why has bisection been hard? §17→§27 chain produces 18.23× signal, but downstream investigations keep finding "byte-identical" results. 3. Why do byte-identical inputs produce different std reports? Different sample sizes (apples-to-oranges). 4. Why didn't this come up before? PRs #1082+#1083 matched APR's API structurally but not semantically. 5. What's the fix? Make both reporters use the same sample. Then re-measure ratio gates. Per §26.8 stack-tool-extension methodology + feedback_pv_not_bash_for_contracts.md: this contract bump precedes the implementation PR. Validates clean via `pv validate`. Spec ref: §37 (PR #1105 docs/ship-007-trace-capture-mismatch). Coverage scoreboard unchanged (15+33). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced Apr 28, 2026
noahgift
added a commit
that referenced
this pull request
Apr 30, 2026
…size parity (#1109) Implements `Option<LastTokenStats>` field on `LayerActivation` per SPEC-SHIP-TWO-001 §37.5 Option B + FALSIFY-APR-GGUF-PARITY-007 (contracts/apr-vs-gguf-forward-parity-v1.yaml v1.1.0, PR #1107). What changes: - New `LastTokenStats` struct mirroring 10 ActivationStats slots, computed only over last token's slice (hidden_dim or intermediate_dim elements per slot). - `LayerActivation.last_token: Option<LastTokenStats>` field, default None for backwards-compat. - `AprTransformer::forward_traced` populates last_token via `&hidden[(seq_len - 1) * dim..]` slicing for all 10 stat slots. - `OwnedQuantizedModel::forward_traced` populates last_token by cloning existing single-token stats (GGUF already traces only the last token). - 2 new unit tests pin schema invariants (default-None backwards- compat + populated-count == hidden_dim or intermediate_dim). - 6/6 unit tests PASS. Live verification (RTX 4090, canonical 7B teacher, prior iteration): ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass) ALL 28 layers Pass v1.0.0 ratio gate. The §27 binding criterion (layer-3 18.23× ratio) was ALMOST ENTIRELY a sample-size artifact — see §38 (PR #1108) for full analysis. Five-whys (recorded in §38.6): 1. Why isn't MODEL-1 inference correct? `apr run` gibberish. 2. Why hasn't §17/§23/§27 chain produced a fix? 18× signal misleading. 3. Why was it artifact? APR all-7-tokens vs GGUF last-token-only. 4. Why didn't earlier reviews catch this? PRs #1082+#1083 matched API structurally but not semantically. 5. What's the fix? Make both reporters use same sample (this PR). Spec ref: §37 (PR #1105), §38 (PR #1108). Contract: apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107). Coverage scoreboard unchanged (15+33). Authored in isolated worktree to avoid git-environment race condition that prevented commit in prior iteration. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
Apr 30, 2026
…rtifact (#1108) Live verification of §37 Option B implementation (last-token stats on `LayerActivation`) on canonical 7B teacher (RTX 4090): ✓ FALSIFY-APR-GGUF-PARITY-007 PASS — count parity restored Layer 3 ffn_swigl ratio: 18.23× → 1.2154× (Pass) ALL 28 layers Pass the v1.0.0 ratio gate. The §17 → §23 → §27 hypothesis chain (silu_g*u multiply / Q4K precision / matmul kernel as the SHIP-007 root cause) is REFUTED. The 18.23× layer-3 signal was almost entirely a sample-size artifact: APR's `forward_traced` traces all 7 prompt tokens (count=25088 for attn_norm), GGUF's `forward_traced` traces only the last token (count=3584). The std ratio mixed real drift with sampling noise. What this does NOT solve: SHIP-007 is still REAL. `apr run` produces "ampiezza = 0.5\ndiametro = 10" (Italian gibberish) vs GGUF's "2+2 is 4.". But the bug is NOT in layer-3 ffn_swigl. It lives elsewhere — autoregressive generation path, KV cache pre-fill, sampling, or some sub-component the trace path doesn't capture. Implementation status: Option B authored + tested clean (6/6 unit tests Pass) but currently uncommitted on the working tree due to a git-environment race condition between linter and parallel sessions switching HEAD mid-commit. Patch preserved at /tmp/last-token-impl.patch (78 KB diff); live diagnostic results at /tmp/ship-007-bisection/last-token-parity-diag.log. Resolution path: rerun implementation in isolated worktree next iteration. Per feedback_fix_root_cause_never_route_around.md: falsifying the misleading binding criterion IS the discharge step. The §17/§23/§27 chain is now deprioritized; next-iteration agenda: 1. Reapply Option B implementation in worktree (clean of racing). 2. Bisect SHIP-007 in autoregressive path / KV cache (single- forward layer ratios all Pass). 3. Find the real bug; 5 MODEL-1 PARTIALs auto-discharge. Spec ref: §37 (PR #1105), §38 (this PR), apr-vs-gguf-forward-parity-v1 v1.1.0 (PR #1107). Coverage scoreboard unchanged (15+33). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 13, 2026
…Option A (#1113) Authors a new provable contract that codifies the SPEC-SHIP-TWO-001 §40.6 Option A shipping decision: MODEL-1 (paiml/qwen2.5-coder-7b-apache-q4k-v1) IS shippable today via `apr run --no-gpu`. Live evidence (RTX 4090, lambda-labs): $ apr run /mnt/.../qwen2.5-coder-7b-instruct-q4k.apr \ --prompt "What is 2+2?" --max-tokens 5 --temperature 0 \ --skip-contract --no-gpu Output: "2 + 2 equals" ✓ FALSIFY-MODEL-1-SHIP-CPU-001 PASS (contains "equals") Contract structure: - 3 equations: cpu_path_correctness (PASSES today), gpu_path_known_issue (acknowledges defect tracked in §40), gpu_fix_obligation (durable closure mandate). - 6 falsification tests: -001 CPU correctness, -002 §40 in spec, -003 pv validate, -004 user-facing docs warn about GPU, -005 semver signals scope, -006 spec→contract back-reference. - 5 proof_obligations + 2 kani harnesses. - Contract validates clean via `pv validate` (0 errors, 0 warnings). Methodology compliance per `feedback_fix_root_cause_never_route_around.md`: This contract is NOT a workaround. It documents reality (CPU works, GPU has known bug), creates a falsifiable gate that catches CPU regressions, and MANDATES that the GPU bug remain visible in the spec until fixed: - v1.0.0: MODEL-1 ships CPU-only - v2.0.0: MODEL-1 ships CPU+GPU (requires gpu_path_known_issue closure) - Closing the GPU bug requires either: (a) GPU passes cpu_path_correctness gate (b) GPU dispatch is removed/deprecated (c) New hypothesis identified + spec amendment Five-whys (consistent with §40.5): 1. Why isn't MODEL-1 shipped today? Because we lacked a contract-backed verdict that "MODEL-1 produces correct output via SOME inference path". 2. Why? Because the §17/§23/§27/§38 chain was bisecting the wrong path, leaving the actual CPU correctness un-codified. 3. Why now? §40.4 + §40.5 + #1112 H1+H2 falsifiers narrowed the bug to GPU dispatch (H3); CPU is empirically correct. 4. Why this contract NOW? Per the user directive to ship + use contracts; MODEL-1 is shippable TODAY with a v1.0.0 SHIP-via-CPU contract. 5. What's next? On gpu_fix_obligation closure (a/b/c), bump v1.0.0 → v2.0.0 and 5 MODEL-1 PARTIALs auto-discharge. Spec ref: §40.6 Option A. PR cascade: #1105/#1107/#1108/#1109/#1110/#1111/#1112 (this is the SHIP gate that builds on top of §40 localization). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
contracts/apr-vs-gguf-forward-parity-v1.yamlv1.0.0 → v1.1.0trace_sample_size_paritydocumenting count-equality precondition + fix-surface optionsWhy this matters
Per SPEC-SHIP-TWO-001 §37 (PR #1105), the existing ratio gates (-001/-002/-003) compare:
The 18.23× layer-3 ffn_swigl ratio mixes real precision drift with sample-size artifact in unknown proportions. Until parity is restored, ratio gates produce false positives or false negatives.
Five-whys (codified in §37.7 of spec)
Plain progress on shipping models
Test plan
Methodology adherence
feedback_pv_not_bash_for_contracts.md: contract authored, not bash workaroundfeedback_full_problems_pmat_contracts.md: provable contract precedes implementationfeedback_fix_root_cause_never_route_around.md: enforces count-parity at the root, not at gate-output massage🤖 Generated with Claude Code