docs(spec): SHIP-TWO-001 §61.8 — PRED-61-A/B fired, refined 3-way bug taxonomy#1611
Closed
noahgift wants to merge 2 commits into
noahgift added a commit that referenced this pull request May 11, 2026
…1 on 10-problem HumanEval sample (PMAT-CODE-SHIP-TWO-SECTION-62)
Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns" degenerate
output bug) across THREE same-class PRs, plus the LIVE 10-problem
HumanEval empirical signal for SHIP-005.
Branch A closure pattern (3 PRs, same defect class, 3 call sites):
- PR #1615: apr-cli/src/commands/output_verification.rs::golden_output_apr
  Reroute through realizar::run_inference + with_input_tokens.
  Discharge: SHIP-006 LIVE (apr qa 12/12 gates).
- PR #1616: apr-cli/src/commands/eval/inference.rs::run_humaneval_inference
  Reroute through same path. Model emits canonical solution structure
  but Python test FAILs on whitespace artifact.
- PR #1617: apr-cli/src/commands/eval/inference.rs::align_continuation_indent
  NEW post-processing fn: dedent over-indented body by N spaces; stop at
  first 0-indent non-empty line (preserve post-amble).
  Discharge: HumanEval/0 1/1 PASS post-fix.
LIVE 10-problem HumanEval sample (2026-05-11, lambda-vector RTX 4090):
- apr eval <canonical 7B APR teacher> --task humaneval --data <10> --samples 1 --temperature 0.0
- Result: passed = 8/10 = 80% pass@1
- Per-problem: HumanEval/0/1/3/4/5/7/8/9 PASS; /2 /6 FAIL
- 95% binomial CI on 8/10: [44%, 97%], so the 86% nominal SHIP-005
  floor is within statistical noise of the sample
- Full 164-problem run dispatched in background
  (`/tmp/he-164-result.json`, ~5h CPU wall, pre-authorized per
  feedback_compute_pre_authorized.md 48h ceiling)
Five-Whys for the §62 amendment:
1. Why §62 now and not wait for the 164 result? The 3-PR closure is a
   substantial cascade record that deserves spec-level permanence; the
   164 result is a separate "ship-%-flip" event that gets its own
   follow-up amendment when it lands.
2. Why 3 PRs for one bug class? The legacy AprTransformer path was wired
   in 3 distinct call sites (golden_output, humaneval, indent-residual
   post-processing). Each needs its own surgical reroute / post-process;
   fixing one doesn't fix the others.
3. Why is methodology lesson #10 worth recording? Prior methodology
   lessons (#6-#9) covered single-bug cascades. #10 generalises: a
   "single bug class" may need multi-PR surgical fixes when it manifests
   across multiple call sites.
4. Why is a 95% binomial CI enough confidence to dispatch the full 164?
   The contract floor (84.80% effective) lies well within the
   [44%, 97%] CI around the 10-problem sample's 80%, so the sample does
   not contradict the floor. The full 164 dispatch moves N=10 → N=164,
   which gives a much tighter CI.
5. Why bump spec v3.07.0 → v3.08.0 now? §62 is a substantive record of
   the 3-PR cascade closure + new empirical evidence; it warrants a
   minor version bump.
Changes (1 spec file + 1 evidence directory):
- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next action banner: v3.06.0 → v3.08.0 (skips v3.07.0, which
    was claimed by PR #1611 in queue; once that lands, rebase to
    renumber if needed)
  - New §62 sub-section ABOVE §61 (newest-first ordering), with 7
    sub-sub-sections: 62.1 3-PR cascade table, 62.2 10-problem LIVE
    evidence, 62.3 sample-vs-floor analysis, 62.4 164-run dispatch,
    62.5 methodology lesson #10, 62.6 ship-% movement, 62.7 what §62
    is NOT
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record + per-problem pass
    results + dispatch metadata)
Validation:
- Section format consistent with §61 (newest-first, dated, sub-sections
  numbered §62.X)
- All 3 cascade PRs referenced explicitly
- Empirical evidence reproducible via captured JSON
Spec movement:
- v3.06.0 → v3.08.0
- MODEL-1 ship %: stays at 94% pending 164-run completion
- MODEL-2 ship %: unchanged at 57%
Refs:
- evidence/section-62-branch-a-closure-2026-05-11/findings.json (LIVE evidence)
- PR #1615 (SHIP-006 fix + LIVE discharge, golden_output_apr)
- PR #1616 (HumanEval inference path fix)
- PR #1617 (HumanEval indent residual fix, align_continuation_indent)
- SPEC-SHIP-TWO-001 §61.8 (Branch A vs Branch B taxonomy)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- feedback_compute_pre_authorized.md (lambda-labs 48h ceiling)
Closes task #35 PMAT-CODE-SHIP-TWO-SECTION-62.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
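The [44%, 97%] interval quoted for 8/10 is consistent with an exact (Clopper-Pearson) binomial interval. The following is a minimal sketch of how such an interval could be recomputed with only the Rust standard library. The function names (`binom_pmf`, `binom_cdf`, `bisect`, `clopper_pearson`) are illustrative and are not taken from apr-cli, and the commit message does not state which interval method was actually used.

```rust
/// P(X = k) for X ~ Binomial(n, p).
fn binom_pmf(n: u32, k: u32, p: f64) -> f64 {
    // Build n-choose-k multiplicatively to avoid factorial overflow.
    let mut coeff = 1.0_f64;
    for i in 0..k {
        coeff *= f64::from(n - i) / f64::from(i + 1);
    }
    coeff * p.powi(k as i32) * (1.0 - p).powi((n - k) as i32)
}

/// P(X <= k) for X ~ Binomial(n, p).
fn binom_cdf(n: u32, k: u32, p: f64) -> f64 {
    (0..=k).map(|i| binom_pmf(n, i, p)).sum()
}

/// Bisection on [0, 1] for f(p) = target, where f is monotone in p.
fn bisect<F: Fn(f64) -> f64>(f: F, target: f64, increasing: bool) -> f64 {
    let (mut lo, mut hi) = (0.0_f64, 1.0_f64);
    for _ in 0..100 {
        let mid = 0.5 * (lo + hi);
        let below = f(mid) < target;
        // Move toward the side on which the root must lie.
        if below == increasing {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    0.5 * (lo + hi)
}

/// Exact two-sided (1 - alpha) interval for k successes out of n trials.
fn clopper_pearson(k: u32, n: u32, alpha: f64) -> (f64, f64) {
    // Lower bound: the p at which P(X >= k) equals alpha / 2.
    let lower = if k == 0 {
        0.0
    } else {
        bisect(|p| 1.0 - binom_cdf(n, k - 1, p), alpha / 2.0, true)
    };
    // Upper bound: the p at which P(X <= k) equals alpha / 2.
    let upper = if k == n {
        1.0
    } else {
        bisect(|p| binom_cdf(n, k, p), alpha / 2.0, false)
    };
    (lower, upper)
}
```

Calling `clopper_pearson(8, 10, 0.05)` yields bounds of roughly 0.44 and 0.97, which matches the interval in the commit message and contains both the 86% nominal and the 84.80% effective floor.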
… taxonomy (PMAT-CODE-SHIP-TWO-SECTION-61-8)
Same-day continuation of §61. Both falsifiable predictions fired on
noah-Lambda-Vector RTX 4090 (apr v0.32.0 post-e856eb91f).
PRED-61-B GREEN (predicted):
apr run <APR teacher> --prompt "What is 2+2? The answer is " → "4"
Wall: 79.09s. Confirms APR forward path under direct prompts is
semantically correct. Matches §60 closure.
PRED-61-A RED — but in an unexpected way:
apr run <GGUF teacher> emits byte-identical
"ampiezza = 0.5\ndiametro = 10\naltezza = 20\n# Calcolo del volume\nvolume = ("
across THREE distinct prompts:
1. "What is 2+2? The answer is " (direct continuation)
2. "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n" (ChatML)
3. "Hello, my name is" (conversational, no question)
Wall times: 48.73s / 48.68s / 39.65s — different (proving inference
IS running, not cached), but output text matches byte-for-byte.
This is a PROMPT-INSENSITIVE GGUF generation bug — input tokens are
dropped, ignored, or the model state is initialized to a fixed
configuration before forward pass starts.
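The check behind this conclusion (distinct prompts, byte-identical outputs) can be sketched as a small helper. This is an illustration only: `Capture` and `is_prompt_insensitive` are hypothetical names, and the sketch assumes the raw output bytes of each `apr run` have already been captured to memory (for example, read back from the evidence text files).

```rust
use std::collections::HashSet;

/// One captured generation: the prompt that was sent and the raw bytes returned.
struct Capture<'a> {
    prompt: &'a str,
    output: &'a [u8],
}

/// True when at least two distinct prompts were tried and every capture
/// produced exactly the same output bytes.
fn is_prompt_insensitive(captures: &[Capture<'_>]) -> bool {
    let distinct_prompts: HashSet<&str> = captures.iter().map(|c| c.prompt).collect();
    let distinct_outputs: HashSet<&[u8]> = captures.iter().map(|c| c.output).collect();
    distinct_prompts.len() >= 2 && distinct_outputs.len() == 1
}
```

Requiring at least two distinct prompts matters: with deterministic decoding, repeating the same prompt would trivially give identical output and prove nothing about whether the input is being read.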
Five-Whys for the §61.8 amendment:
1. Why §61.8? Both PRED-61-A and PRED-61-B fired; need durable record.
2. Why three prompts on GGUF? PRED-61-A's RED outcome in unexpected
shape required disambiguation — was it ChatML-specific or
structural? Three distinct prompts confirm structural.
3. Why does this matter? §61's 2-way picture (APR ChatML BROKEN /
APR direct WORKING) was incomplete. Reality is 3-way: APR direct
WORKING, APR ChatML BROKEN with \ns\ns repetition, GGUF any-prompt
BROKEN with prompt-insensitive canned output.
4. Why split into two branches? Branch A (APR ChatML) and Branch B
(GGUF prompt-insensitive) are independent — different code paths,
different failure modes, different fix scopes.
5. Why methodology lesson #8? PRED-61-A asked "is GGUF + ChatML clean?"
and the answer is "no, but for an entirely different reason than
ChatML special-token handling". Without the third-prompt control
(Hello), the §61.8 taxonomy would have collapsed into "all paths
broken under ChatML" which would mis-localize.
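The 3-way picture from item 3 and the branch split from item 4 can be written down as a small lookup, which makes the untested cell explicit. This is only a sketch: the type and function names (`Format`, `PromptStyle`, `Observed`, `observed`, `branch`) are hypothetical and do not correspond to anything in the spec or in findings.json.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Apr,
    Gguf,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PromptStyle {
    Direct,
    ChatMl,
    Conversational,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Observed {
    Working,
    /// "\ns\ns" repetition (Branch A).
    DegenerateRepetition,
    /// Same canned output regardless of prompt (Branch B).
    PromptInsensitive,
    /// Combination not reported in this commit message.
    NotReported,
}

/// Behaviour recorded in the commit message for each combination.
fn observed(format: Format, style: PromptStyle) -> Observed {
    match (format, style) {
        (Format::Apr, PromptStyle::Direct) => Observed::Working,
        (Format::Apr, PromptStyle::ChatMl) => Observed::DegenerateRepetition,
        (Format::Apr, PromptStyle::Conversational) => Observed::NotReported,
        (Format::Gguf, _) => Observed::PromptInsensitive,
    }
}

/// Investigation branch that owns a failure mode, if any.
fn branch(o: Observed) -> Option<char> {
    match o {
        Observed::DegenerateRepetition => Some('A'),
        Observed::PromptInsensitive => Some('B'),
        Observed::Working | Observed::NotReported => None,
    }
}
```

Keeping the GGUF arm as a wildcard over prompt style mirrors the finding that the failure there does not depend on the prompt at all, which is what separates Branch B from Branch A.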
§61.8 amendments to spec (1 file):
- Atomic next action banner: v3.06.0 → v3.07.0
- Add §61.8 sub-section above the closing --- divider of §61, with:
- 61.8.0: empirical PRED firing (apr run examples + outputs)
- 61.8.1: refined 3-way bug taxonomy (table)
- 61.8.2: Branch A vs Branch B independent investigation cascades
- 61.8.3: ship-% movement (stays 92%) + per-SHIP* blocker mapping
- 61.8.4: methodology lesson #8 (RED outcome may surface different bug)
Evidence (NEW directory):
- evidence/section-61-8-pred-fired-2026-05-10/
- pred-61-b-apr-direct.txt (29 lines, "4" output)
- pred-61-a-gguf-direct.txt (32 lines, Italian "ampiezza...")
- pred-61-a-gguf-chatml.txt (32 lines, byte-identical Italian)
- gguf-third-prompt.txt (28 lines, "Hello..." → byte-identical)
- findings.json (structured 3-way taxonomy + investigation branches)
Validation:
- Section format consistent with §61.1-61.7 (numbered §61.8.N
sub-sub-sections under §61.8).
- All evidence files referenced in spec body.
- Methodological alignment: zero eprintln!, all evidence via apr
run + tail to text files.
Spec movement:
- v3.06.0 → v3.07.0
- MODEL-1 ship %: stays at 92% (snapshot, not falsifier flip).
- MODEL-2 ship %: unchanged at 57%.
Refs:
- evidence/section-61-8-pred-fired-2026-05-10/findings.json
- SPEC-SHIP-TWO-001 §61.5 (PRED-61-A/B definitions)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)
Closes task #30 PMAT-CODE-SHIP-TWO-SECTION-61-8.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1982a6d to 503809c
Contributor
Author
Closing as superseded: the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.
auto-merge was automatically disabled
May 12, 2026 15:30
Pull request was closed
Summary
Same-day continuation of §61 (PR #1610). Both PRED-61-A and PRED-61-B fired live on canonical 7B teacher; surfaces a refined 3-way bug taxonomy.
Stacks on PR #1610 — this branch has §61 + §61.8 commits. When #1610 merges first, this PR will simplify to just §61.8.
Predictions Fired
PRED-61-B GREEN (predicted):
apr run <APR teacher> --prompt "What is 2+2? The answer is " --max-tokens 32 → 4 ✓
PRED-61-A RED, but in an unexpected way:
apr run <GGUF teacher> emits byte-identical "ampiezza = 0.5\ndiametro = 10\n..." Italian gibberish across THREE distinct prompts (direct continuation / ChatML wrapper / conversational).
Refined 3-Way Bug Taxonomy
APR + direct prompt: WORKING
APR + ChatML: BROKEN, "\ns\ns\ns…" degenerate
GGUF + any prompt: BROKEN, prompt-insensitive "ampiezza..."
Two Independent Investigation Branches
Branch A (APR ChatML): apr trace --payload on layer-0 attn_norm at first generated-token position.
Branch B (GGUF prompt-insensitive): realizar::inference::forward to log actual token IDs reaching embedding lookup.
§17.5 PARTIALs Per Branch
Methodology Lesson #8
A falsifier's RED outcome may surface a DIFFERENT bug class than the one being investigated. PRED-61-A asked "is GGUF + ChatML clean?" — the answer is "no, but for an entirely different reason than ChatML special-token handling". Without the third-prompt control ("Hello"), §61.8's 3-way taxonomy would have collapsed into "all paths broken under ChatML" — mis-localizing.
Ship-% Movement
🤖 Generated with Claude Code