docs(spec): SHIP-TWO-001 §61.8 — PRED-61-A/B fired, refined 3-way bug taxonomy by noahgift · Pull Request #1611 · paiml/aprender

noahgift · 2026-05-10T12:50:15Z

Summary

Same-day continuation of §61 (PR #1610). Both PRED-61-A and PRED-61-B fired live on the canonical 7B teacher, and together they surface a refined 3-way bug taxonomy.

Stacks on PR #1610: this branch carries both the §61 and §61.8 commits. Once #1610 merges, this PR reduces to just §61.8.

Predictions Fired

PRED-61-B GREEN (predicted):

apr run <APR teacher> --prompt "What is 2+2? The answer is " --max-tokens 32 → 4 ✓
Wall: 79.09s. Confirms APR direct path is semantically correct.

PRED-61-A RED — but in an unexpected way:

apr run <GGUF teacher> emits byte-identical "ampiezza = 0.5\ndiametro = 10\n..." Italian gibberish across THREE distinct prompts (direct continuation / ChatML wrapper / conversational).
Wall times: 48.73s / 48.68s / 39.65s — different (inference IS running, not cached), but text matches byte-for-byte.
This is a prompt-insensitive structural bug in the GGUF inference path.
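The reasoning behind the RED verdict (distinct prompts, distinct wall times, identical bytes) can be sketched as a small check. This is an illustrative sketch only: the helper names are hypothetical, the `\n` sequences in the reported output are assumed to be real newlines, and the wall times are assumed to be listed in the same order as the prompts.

```python
import hashlib

# Output text and wall times as reported for the GGUF teacher.
# Assumption: "\n" in the report denotes real newlines.
CANNED = "ampiezza = 0.5\ndiametro = 10\naltezza = 20\n# Calcolo del volume\nvolume = ("

# (prompt, output, wall_seconds); assumed to be in the reported order.
RUNS = [
    ("What is 2+2? The answer is ", CANNED, 48.73),
    ("<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n", CANNED, 48.68),
    ("Hello, my name is", CANNED, 39.65),
]


def output_digest(text: str) -> str:
    """Hash the raw bytes so the comparison is byte-for-byte, not visual."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_prompt_insensitive(runs) -> bool:
    """True when at least two distinct prompts all yield the same output bytes."""
    prompts = {prompt for prompt, _, _ in runs}
    digests = {output_digest(output) for _, output, _ in runs}
    return len(prompts) > 1 and len(digests) == 1


def wall_spread(runs) -> float:
    """Max minus min wall time; a clearly non-zero spread argues against a cache hit."""
    walls = [wall for _, _, wall in runs]
    return max(walls) - min(walls)


if __name__ == "__main__":
    print("prompt-insensitive:", is_prompt_insensitive(RUNS))
    print("wall spread (s):", round(wall_spread(RUNS), 2))
```

Hashing rather than eyeballing the text is what justifies the "byte-identical" wording; the wall-time spread is only a heuristic signal that a forward pass actually ran.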

Refined 3-Way Bug Taxonomy

| Path | Output | Verdict | Bug scope |
| --- | --- | --- | --- |
| APR + direct | Coherent, prompt-correlated | WORKING | Matches §60 |
| APR + ChatML | `"\ns\ns\ns…"` degenerate | BROKEN | APR-side ChatML special-token handling |
| GGUF + any prompt | Byte-identical `"ampiezza..."` | BROKEN | GGUF input-handling/state-init |

Two Independent Investigation Branches

Branch A: APR ChatML degenerate-output. Bisect via apr trace --payload on layer-0 attn_norm at first generated-token position.
Branch B: GGUF prompt-insensitive canned-output. Instrument realizar::inference::forward to log actual token IDs reaching embedding lookup.

§17.5 PARTIALs Per Branch

SHIP-006 (apr qa golden_output) co-blocked on Branch A AND Branch B
SHIP-008 (chat template render) blocked on Branch A
SHIP-005 (HumanEval) likely blocked on Branch B
SHIP-007 (decode tps ≥ 30) likely blocked on Branch B
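The co-blocking above can be read as a small dependency map: an item is unblocked only when every branch it depends on is closed. The sketch below encodes the mapping exactly as stated in this PR description; the "likely" entries are tentative in the source, and the function name is hypothetical.

```python
# Branch dependencies as listed in this PR description.
# SHIP-005 and SHIP-007 are marked "likely" in the source, so treat them as tentative.
BLOCKERS = {
    "SHIP-006": {"A", "B"},  # apr qa golden_output: co-blocked on both branches
    "SHIP-008": {"A"},       # chat template render
    "SHIP-005": {"B"},       # HumanEval (likely)
    "SHIP-007": {"B"},       # decode tps >= 30 (likely)
}


def unblocked(closed_branches) -> list[str]:
    """Return the SHIP items whose blocking branches are all closed."""
    closed = set(closed_branches)
    return sorted(item for item, needs in BLOCKERS.items() if needs <= closed)


if __name__ == "__main__":
    for closed in (set(), {"A"}, {"B"}, {"A", "B"}):
        print(sorted(closed), "->", unblocked(closed))
```

Under this reading, closing Branch A alone frees only SHIP-008, which is why the two branches are tracked as independent investigations.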

Methodology Lesson #8

A falsifier's RED outcome may surface a DIFFERENT bug class than the one being investigated. PRED-61-A asked "is GGUF + ChatML clean?" and the answer is "no, but for an entirely different reason than ChatML special-token handling". Without the third-prompt control ("Hello"), §61.8's 3-way taxonomy would have collapsed into "all paths broken under ChatML", which would mis-localize the GGUF bug.

Ship-% Movement

MODEL-1 ship %: stays at 92% (refines picture, does NOT ship a fix or LIVE-discharge).
MODEL-2 ship %: unchanged at 57% (gated on step 5g.3).

🤖 Generated with Claude Code

…1 on 10-problem HumanEval sample (PMAT-CODE-SHIP-TWO-SECTION-62)

Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns" degenerate output bug) across THREE same-class PRs, plus the LIVE 10-problem HumanEval empirical signal for SHIP-005.

Branch A closure pattern (3 PRs, same defect class, 3 call sites):
- PR #1615: apr-cli/src/commands/output_verification.rs::golden_output_apr
  Reroute through realizar::run_inference + with_input_tokens.
  Discharge: SHIP-006 LIVE (apr qa 12/12 gates).
- PR #1616: apr-cli/src/commands/eval/inference.rs::run_humaneval_inference
  Reroute through same path. Model emits canonical solution structure but Python test FAILs on whitespace artifact.
- PR #1617: apr-cli/src/commands/eval/inference.rs::align_continuation_indent
  NEW post-processing fn: dedent over-indented body by N spaces; stop at first 0-indent non-empty line (preserve post-amble).
  Discharge: HumanEval/0 1/1 PASS post-fix.

LIVE 10-problem HumanEval sample (2026-05-11, lambda-vector RTX 4090):
- apr eval <canonical 7B APR teacher> --task humaneval --data <10> --samples 1 --temperature 0.0
- Result: passed = 8/10 = 80% pass@1
- Per-problem: HumanEval/0/1/3/4/5/7/8/9 PASS; /2 /6 FAIL
- 95% binomial CI on 8/10: [44%, 97%], within statistical noise of 86% nominal SHIP-005 floor
- Full 164-problem run dispatched in background (`/tmp/he-164-result.json`, ~5h CPU wall, pre-authorized per feedback_compute_pre_authorized.md 48h ceiling)

Five-Whys for the §62 amendment:
1. Why §62 now and not wait for 164 result? The 3-PR closure is a substantial cascade record that deserves spec-level permanence; 164-result is a separate "ship-%-flip" event that gets its own follow-up amendment when it lands.
2. Why 3 PRs for one bug class? The legacy AprTransformer path was wired in 3 distinct callsites (golden_output, humaneval, indent-residual post-processing). Each needs its own surgical reroute / post-process; fixing one doesn't fix the others.
3. Why is methodology lesson #10 worth recording? Prior methodology lessons (#6-#9) covered single-bug cascades. #10 generalises: a "single bug class" may need multi-PR surgical fixes when it manifests across multiple call sites.
4. Why is a 95% binomial CI enough confidence to dispatch the full 164? The contract floor (84.80% effective) lies well within the 10-problem sample's 95% CI of [44%, 97%] around 80%, so the sample does not contradict the floor. The full 164 dispatch moves N=10 → N=164 and gives a much tighter CI.
5. Why bump spec v3.07.0 → v3.08.0 now? §62 is a substantive record of 3-PR cascade closure + new empirical evidence; it warrants a minor version bump.

Changes (1 spec file + 1 evidence directory):
- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next action banner: v3.06.0 → v3.08.0 (skips v3.07.0, which was claimed by PR #1611 in queue; once that lands, rebase to renumber if needed)
  - New §62 sub-section ABOVE §61 (newest-first ordering), with 7 sub-sub-sections: 62.1 3-PR cascade table, 62.2 10-problem LIVE evidence, 62.3 sample-vs-floor analysis, 62.4 164-run dispatch, 62.5 methodology lesson #10, 62.6 ship-% movement, 62.7 what §62 is NOT
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record + per-problem pass results + dispatch metadata)

Validation:
- Section format consistent with §61 (newest-first, dated, sub-sections numbered §62.X)
- All 3 cascade PRs referenced explicitly
- Empirical evidence reproducible via captured JSON

Spec movement:
- v3.06.0 → v3.08.0
- MODEL-1 ship %: stays at 94% pending 164-run completion
- MODEL-2 ship %: unchanged at 57%

Refs:
- evidence/section-62-branch-a-closure-2026-05-11/findings.json (LIVE evidence)
- PR #1615 (SHIP-006 fix + LIVE discharge: golden_output_apr)
- PR #1616 (HumanEval inference path fix)
- PR #1617 (HumanEval indent residual fix: align_continuation_indent)
- SPEC-SHIP-TWO-001 §61.8 (Branch A vs Branch B taxonomy)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- feedback_compute_pre_authorized.md (lambda-labs 48h ceiling)

Closes task #35 PMAT-CODE-SHIP-TWO-SECTION-62.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
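The post-processing step described for PR #1617 (dedent an over-indented body by N spaces, stop at the first 0-indent non-empty line) can be illustrated with a short Python sketch. This is not the Rust implementation from the PR: the function name, the `expected_indent` parameter, and the rule that N is the first non-empty line's indent minus the expected indent are all assumptions made for illustration.

```python
def align_continuation_indent_sketch(completion: str, expected_indent: int = 4) -> str:
    """Dedent an over-indented body; leave any 0-indent post-amble untouched.

    Assumption: N = (indent of first non-empty line) - expected_indent.
    """
    lines = completion.split("\n")
    first = next((ln for ln in lines if ln.strip()), None)
    if first is None:
        return completion  # nothing but blank lines

    excess = (len(first) - len(first.lstrip(" "))) - expected_indent
    if excess <= 0:
        return completion  # body is not over-indented

    out = []
    in_body = True
    for ln in lines:
        # The first non-empty line at column 0 marks the start of the post-amble.
        if in_body and ln.strip() and not ln.startswith(" "):
            in_body = False
        if in_body and ln.strip():
            indent = len(ln) - len(ln.lstrip(" "))
            out.append(ln[min(excess, indent):])  # never strip past the text
        else:
            out.append(ln)
    return "\n".join(out)


if __name__ == "__main__":
    raw = "        x = 1\n        return x\n\nprint('done')\n"
    print(align_continuation_indent_sketch(raw))
```

Stopping at the first 0-indent line keeps trailing top-level code intact, which matches the "preserve post-amble" note in the commit message.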

… taxonomy (PMAT-CODE-SHIP-TWO-SECTION-61-8)

Same-day continuation of §61. Both falsifiable predictions fired on noah-Lambda-Vector RTX 4090 (apr v0.32.0 post-e856eb91f).

PRED-61-B GREEN (predicted):
apr run <APR teacher> --prompt "What is 2+2? The answer is " → "4"
Wall: 79.09s. Confirms APR forward path under direct prompts is semantically correct. Matches §60 closure.

PRED-61-A RED, but in an unexpected way:
apr run <GGUF teacher> emits byte-identical "ampiezza = 0.5\ndiametro = 10\naltezza = 20\n# Calcolo del volume\nvolume = (" across THREE distinct prompts:
1. "What is 2+2? The answer is " (direct continuation)
2. "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n" (ChatML)
3. "Hello, my name is" (conversational, no question)
Wall times: 48.73s / 48.68s / 39.65s. They differ (proving inference IS running, not cached), but output text matches byte-for-byte.

This is a PROMPT-INSENSITIVE GGUF generation bug: input tokens are dropped, ignored, or the model state is initialized to a fixed configuration before forward pass starts.

Five-Whys for the §61.8 amendment:
1. Why §61.8? Both PRED-61-A and PRED-61-B fired; need durable record.
2. Why three prompts on GGUF? PRED-61-A's RED outcome in unexpected shape required disambiguation: was it ChatML-specific or structural? Three distinct prompts confirm structural.
3. Why does this matter? §61's 2-way picture (APR ChatML BROKEN / APR direct WORKING) was incomplete. Reality is 3-way: APR direct WORKING, APR ChatML BROKEN with \ns\ns repetition, GGUF any-prompt BROKEN with prompt-insensitive canned output.
4. Why split into two branches? Branch A (APR ChatML) and Branch B (GGUF prompt-insensitive) are independent: different code paths, different failure modes, different fix scopes.
5. Why methodology lesson #8? PRED-61-A asked "is GGUF + ChatML clean?" and the answer is "no, but for an entirely different reason than ChatML special-token handling". Without the third-prompt control (Hello), the §61.8 taxonomy would have collapsed into "all paths broken under ChatML" which would mis-localize.

§61.8 amendments to spec (1 file):
- Atomic next action banner: v3.06.0 → v3.07.0
- Add §61.8 sub-section above the closing --- divider of §61, with:
  - 61.8.0: empirical PRED firing (apr run examples + outputs)
  - 61.8.1: refined 3-way bug taxonomy (table)
  - 61.8.2: Branch A vs Branch B independent investigation cascades
  - 61.8.3: ship-% movement (stays 92%) + per-SHIP* blocker mapping
  - 61.8.4: methodology lesson #8 (RED outcome may surface different bug)

Evidence (NEW directory):
- evidence/section-61-8-pred-fired-2026-05-10/
  - pred-61-b-apr-direct.txt (29 lines, "4" output)
  - pred-61-a-gguf-direct.txt (32 lines, Italian "ampiezza...")
  - pred-61-a-gguf-chatml.txt (32 lines, byte-identical Italian)
  - gguf-third-prompt.txt (28 lines, "Hello..." → byte-identical)
  - findings.json (structured 3-way taxonomy + investigation branches)

Validation:
- Section format consistent with §61.1-61.7 (numbered §61.X.N sub-sub-sections under §61.8).
- All evidence files referenced in spec body.
- Methodological alignment: zero eprintln!, all evidence via apr run + tail to text files.

Spec movement:
- v3.06.0 → v3.07.0
- MODEL-1 ship %: stays at 92% (snapshot, not falsifier flip).
- MODEL-2 ship %: unchanged at 57%.

Refs:
- evidence/section-61-8-pred-fired-2026-05-10/findings.json
- SPEC-SHIP-TWO-001 §61.5 (PRED-61-A/B definitions)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #30 PMAT-CODE-SHIP-TWO-SECTION-61-8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-12T15:30:57Z

Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.
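For context on how this final figure relates to the earlier 10-problem sample, the [44%, 97%] interval quoted in the §62 commit message can be reproduced with an exact binomial interval. The sketch below assumes the Clopper-Pearson method; the source says only "95% binomial CI", so the choice of method and the helper names are assumptions. It uses only the standard library and solves for the bounds by bisection.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target: float) -> float:
    """Bisection on [0, 1] for g(p) = target, where g is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    lo, hi = clopper_pearson(8, 10)
    print(f"8/10 pass@1: 95% CI = [{lo:.1%}, {hi:.1%}]")
    for label, value in (("effective floor", 0.8480), ("final pass@1", 0.8659)):
        print(f"{label} {value:.2%} inside interval: {lo < value < hi}")
```

Under that assumption the 8/10 sample gives roughly 44% to 97%, an interval wide enough to contain both the 84.80% effective floor and the 86.59% result reported here; the sample was a go/no-go signal for the full run, not a measurement of the final score.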

noahgift enabled auto-merge (squash) May 10, 2026 12:50

noahgift force-pushed the docs/ship-two-spec-section-61-8-pred-fired branch from 1982a6d to 503809c May 11, 2026 14:40

Merge branch 'main' into docs/ship-two-spec-section-61-8-pred-fired

b246767

noahgift closed this May 12, 2026

auto-merge was automatically disabled May 12, 2026 15:30
Pull request was closed

noahgift wants to merge 2 commits into main from docs/ship-two-spec-section-61-8-pred-fired
