docs(spec): SHIP-TWO-001 §61.8 — PRED-61-A/B fired, refined 3-way bug taxonomy#1611

Closed
noahgift wants to merge 2 commits into main from docs/ship-two-spec-section-61-8-pred-fired

Conversation

@noahgift

Summary

Same-day continuation of §61 (PR #1610). Both PRED-61-A and PRED-61-B fired live on the canonical 7B teacher, and the results surface a refined 3-way bug taxonomy.

Stacks on PR #1610: this branch carries both the §61 and §61.8 commits. Once #1610 merges, this PR reduces to just §61.8.

Predictions Fired

PRED-61-B GREEN (predicted):

  • apr run <APR teacher> --prompt "What is 2+2? The answer is " --max-tokens 324
  • Wall: 79.09s. Confirms the APR direct path is semantically correct, matching the §60 closure.

PRED-61-A RED — but in an unexpected way:

  • apr run <GGUF teacher> emits byte-identical "ampiezza = 0.5\ndiametro = 10\n..." Italian gibberish across THREE distinct prompts (direct continuation / ChatML wrapper / conversational).
  • Wall times: 48.73s / 48.68s / 39.65s — different (inference IS running, not cached), but text matches byte-for-byte.
  • This is a prompt-insensitive structural bug in the GGUF inference path: the generated text does not depend on the input prompt.
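
The byte-identity observation above can be checked mechanically. The following is a minimal sketch, assuming the generated text for each prompt has already been captured as a string; the function names are hypothetical and the sample text is only the output prefix quoted in this PR, not the full evidence files.

```python
import hashlib


def output_fingerprints(outputs: dict[str, str]) -> dict[str, str]:
    """Map each prompt label to the SHA-256 of its generated text (UTF-8 bytes)."""
    return {
        label: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for label, text in outputs.items()
    }


def is_prompt_insensitive(outputs: dict[str, str]) -> bool:
    """True when two or more prompts produced byte-identical generated text."""
    digests = set(output_fingerprints(outputs).values())
    return len(outputs) >= 2 and len(digests) == 1


# Illustrative data only: the prefix of the canned GGUF output quoted above.
canned = "ampiezza = 0.5\ndiametro = 10\n"
gguf_outputs = {"direct": canned, "chatml": canned, "conversational": canned}
wall_times_s = [48.73, 48.68, 39.65]

print(is_prompt_insensitive(gguf_outputs))  # True: same bytes for all prompts
print(len(set(wall_times_s)) > 1)           # True: runs differ, so not a cache hit
```

Comparing hashes rather than eyeballing text makes the "byte-for-byte" claim exact, and the differing wall times are the separate signal that inference actually ran each time.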

Refined 3-Way Bug Taxonomy

| Path | Output | Verdict | Bug scope |
|---|---|---|---|
| APR + direct | Coherent, prompt-correlated | WORKING | Matches §60 |
| APR + ChatML | "\ns\ns\ns…" degenerate | BROKEN | APR-side ChatML special-token handling |
| GGUF + any prompt | Byte-identical "ampiezza..." | BROKEN | GGUF input-handling/state-init |
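
The "degenerate" verdict for the APR + ChatML row describes output that repeats a very short unit. As a rough sketch of how such output could be flagged automatically, the helper below looks for the smallest repeat period in a string. The function name and the `max_period` and minimum-repeat thresholds are assumptions for illustration, not part of `apr`.

```python
from typing import Optional


def smallest_period(text: str, max_period: int = 8) -> Optional[int]:
    """Return the smallest p <= max_period such that text repeats with period p.

    Requires at least three repetitions of the unit so that short, legitimate
    outputs are not flagged. Returns None when no such period exists.
    """
    for p in range(1, max_period + 1):
        if len(text) < 3 * p:
            continue
        if all(text[i] == text[i - p] for i in range(p, len(text))):
            return p
    return None


print(smallest_period("\ns\ns\ns\ns"))         # 2: the unit "\ns" repeats
print(smallest_period("The answer is 4."))     # None: not periodic
```

A check like this only catches exact repetition; it would not classify the GGUF row, whose output is fluent but unrelated to the prompt.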

Two Independent Investigation Branches

  • Branch A: APR ChatML degenerate-output. Bisect via apr trace --payload on layer-0 attn_norm at first generated-token position.
  • Branch B: GGUF prompt-insensitive canned-output. Instrument realizar::inference::forward to log actual token IDs reaching embedding lookup.

§17.5 PARTIALs Per Branch

  • SHIP-006 (apr qa golden_output) co-blocked on Branch A AND Branch B
  • SHIP-008 (chat template render) blocked on Branch A
  • SHIP-005 (HumanEval) likely blocked on Branch B
  • SHIP-007 (decode tps ≥ 30) likely blocked on Branch B

Methodology Lesson #8

A falsifier's RED outcome may surface a different bug class from the one under investigation. PRED-61-A asked "is GGUF + ChatML clean?", and the answer is "no, but for an entirely different reason than ChatML special-token handling". Without the third-prompt control ("Hello, my name is"), §61.8's 3-way taxonomy would have collapsed into "all paths broken under ChatML", mis-localizing the GGUF bug.
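
The localization reasoning in this lesson can be written down as a small decision rule. This is a sketch under stated assumptions: the style labels ("direct", "chatml", "control") and the `Localization` names are hypothetical, and the rule is a simplification of the judgment applied in §61.8.

```python
from enum import Enum


class Localization(Enum):
    NO_FAILURE = "no failure observed"
    CHATML_SPECIFIC = "ChatML-specific"
    STRUCTURAL = "structural, prompt-independent"
    INCONCLUSIVE = "inconclusive"


def localize(broken: dict[str, bool]) -> Localization:
    """Classify a failure from per-prompt-style results.

    `broken` maps a prompt style to whether its output was broken.
    """
    if not any(broken.values()):
        return Localization.NO_FAILURE
    non_chatml = [style for style in broken if style != "chatml"]
    if not non_chatml:
        # Only ChatML was tested: nothing to compare against.
        return Localization.INCONCLUSIVE
    if all(broken.values()):
        return Localization.STRUCTURAL
    if broken.get("chatml", False) and not any(broken[s] for s in non_chatml):
        return Localization.CHATML_SPECIFIC
    return Localization.INCONCLUSIVE


# APR path: direct works, ChatML breaks.
print(localize({"direct": False, "chatml": True}).value)
# GGUF path: every prompt style breaks, including the control.
print(localize({"direct": True, "chatml": True, "control": True}).value)
```

The `INCONCLUSIVE` branch for a ChatML-only test is the point of the lesson: a RED result on one prompt style cannot by itself attribute the failure to that style.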

Ship-% Movement

  • MODEL-1 ship %: stays at 92% (refines picture, does NOT ship a fix or LIVE-discharge).
  • MODEL-2 ship %: unchanged at 57% (gated on step 5g.3).

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 10, 2026 12:50
noahgift added a commit that referenced this pull request May 11, 2026
…1 on 10-problem HumanEval sample (PMAT-CODE-SHIP-TWO-SECTION-62)

Records the closure of §61.8 Branch A (APR + ChatML "\ns\ns"
degenerate output bug) across THREE same-class PRs, plus the LIVE
10-problem HumanEval empirical signal for SHIP-005.

Branch A closure pattern (3 PRs, same defect class, 3 call sites):
- PR #1615 — apr-cli/src/commands/output_verification.rs::golden_output_apr
  Reroute through realizar::run_inference + with_input_tokens.
  Discharge: SHIP-006 LIVE (apr qa 12/12 gates).
- PR #1616 — apr-cli/src/commands/eval/inference.rs::run_humaneval_inference
  Reroute through same path. Model emits canonical solution
  structure but Python test FAILs on whitespace artifact.
- PR #1617 — apr-cli/src/commands/eval/inference.rs::align_continuation_indent
  NEW post-processing fn: dedent over-indented body by N spaces;
  stop at first 0-indent non-empty line (preserve post-amble).
  Discharge: HumanEval/0 1/1 PASS post-fix.
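
The PR #1617 step is described above only in prose. A minimal Python sketch of that description follows; it is not the Rust function in apr-cli, and the function name, the `target_indent=4` default, and the spaces-only indentation handling are all assumptions made for illustration.

```python
def align_continuation_indent_sketch(continuation: str, target_indent: int = 4) -> str:
    """Dedent an over-indented body, leaving any 0-indent post-amble untouched.

    The excess is measured on the first non-empty line. Dedenting stops at the
    first non-empty line with zero indentation; that line and everything after
    it are preserved as-is.
    """
    lines = continuation.split("\n")
    first = next((ln for ln in lines if ln.strip()), None)
    if first is None:
        return continuation
    excess = (len(first) - len(first.lstrip(" "))) - target_indent
    if excess <= 0:
        return continuation  # already at or below the target indent

    out = []
    in_body = True
    for ln in lines:
        if in_body and ln.strip() and not ln.startswith(" "):
            in_body = False  # post-amble begins here
        if in_body and ln.strip():
            indent = len(ln) - len(ln.lstrip(" "))
            out.append(ln[min(excess, indent):])
        else:
            out.append(ln)
    return "\n".join(out)


sample = "        return x\n        y = 1\n\nprint('done')\n    keep"
print(align_continuation_indent_sketch(sample))
```

In the sample, the two body lines move from 8 to 4 spaces, while `print('done')` and the line after it are left unchanged.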

LIVE 10-problem HumanEval sample (2026-05-11, lambda-vector RTX 4090):
- apr eval <canonical 7B APR teacher> --task humaneval --data <10> --samples 1 --temperature 0.0
- Result: passed = 8/10 = 80% pass@1
- Per-problem: HumanEval/0/1/3/4/5/7/8/9 PASS; /2 /6 FAIL
- 95% binomial CI on 8/10: [44%, 97%]; the 86% nominal SHIP-005
  floor falls inside this interval, so the sample is statistically
  consistent with the floor
- Full 164-problem run dispatched in background
  (`/tmp/he-164-result.json`, ~5h CPU wall, pre-authorized per
  feedback_compute_pre_authorized.md 48h ceiling)
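
The commit message does not say which interval method produced [44%, 97%]. As a sketch, the exact (Clopper-Pearson) binomial interval reproduces those bounds for 8 passes out of 10; the function names below are hypothetical and only the Python standard library is used.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, iters: int = 60) -> float:
    """Bisection root of a function that is negative at 0 and positive at 1."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


lo, hi = clopper_pearson(8, 10)
print(f"[{lo:.0%}, {hi:.0%}]")  # [44%, 97%]
print(lo <= 0.86 <= hi)         # True: the 86% nominal floor is inside the interval
```

Calling the same function with `n=164` gives a much narrower interval for a comparable pass rate, which is the motivation for dispatching the full run.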

Five-Whys for the §62 amendment:
1. Why §62 now and not wait for 164 result? The 3-PR closure is
   a substantial cascade record that deserves spec-level
   permanence; 164-result is a separate "ship-%-flip" event that
   gets its own follow-up amendment when it lands.
2. Why 3 PRs for one bug class? The legacy AprTransformer path
   was wired in 3 distinct callsites (golden_output, humaneval,
   indent-residual post-processing). Each needs its own surgical
   reroute / post-process — fixing one doesn't fix the others.
3. Why is methodology lesson #10 worth recording? Prior
   methodology lessons (#6-#9) covered single-bug cascades. #10
   generalises: a "single bug class" may need multi-PR surgical
   fixes when it manifests across multiple call sites.
4. Why is a 95% binomial CI enough confidence to dispatch the full
   164? The contract floor (84.80% effective) lies well inside the
   10-problem sample's 95% CI of [44%, 97%] around its 80% pass
   rate, so the sample does not rule out meeting the floor. The
   full 164 dispatch raises N from 10 to 164 and gives a much
   tighter CI.
5. Why bump spec v3.07.0 → v3.08.0 now? §62 is a substantive
   record of 3-PR cascade closure + new empirical evidence; it
   warrants a minor version bump.

Changes (1 spec file + 1 evidence directory):
- docs/specifications/aprender-train/ship-two-models-spec.md:
  - Atomic next action banner: v3.06.0 → v3.08.0 (skips v3.07.0
    which was claimed by PR #1611 in queue — once that lands,
    rebase to renumber if needed)
  - New §62 sub-section ABOVE §61 (newest-first ordering), with
    7 sub-sub-sections: 62.1 3-PR cascade table, 62.2 10-problem
    LIVE evidence, 62.3 sample-vs-floor analysis, 62.4 164-run
    dispatch, 62.5 methodology lesson #10, 62.6 ship-% movement,
    62.7 what §62 is NOT
- evidence/section-62-branch-a-closure-2026-05-11/ (NEW):
  - humaneval-10-result.json (raw apr eval --json output)
  - findings.json (structured 3-PR cascade record + per-problem
    pass results + dispatch metadata)

Validation:
- Section format consistent with §61 (newest-first, dated, sub-
  sections numbered §62.X)
- All 3 cascade PRs referenced explicitly
- Empirical evidence reproducible via captured JSON

Spec movement:
- v3.06.0 → v3.08.0
- MODEL-1 ship %: stays at 94% pending 164-run completion
- MODEL-2 ship %: unchanged at 57%

Refs:
- evidence/section-62-branch-a-closure-2026-05-11/findings.json (LIVE evidence)
- PR #1615 (SHIP-006 fix + LIVE discharge — golden_output_apr)
- PR #1616 (HumanEval inference path fix)
- PR #1617 (HumanEval indent residual fix — align_continuation_indent)
- SPEC-SHIP-TWO-001 §61.8 (Branch A vs Branch B taxonomy)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- feedback_compute_pre_authorized.md (lambda-labs 48h ceiling)

Closes task #35 PMAT-CODE-SHIP-TWO-SECTION-62.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… taxonomy (PMAT-CODE-SHIP-TWO-SECTION-61-8)

Same-day continuation of §61. Both falsifiable predictions fired on
noah-Lambda-Vector RTX 4090 (apr v0.32.0 post-e856eb91f).

PRED-61-B GREEN (predicted):
  apr run <APR teacher> --prompt "What is 2+2? The answer is " → "4"
  Wall: 79.09s. Confirms APR forward path under direct prompts is
  semantically correct. Matches §60 closure.

PRED-61-A RED — but in an unexpected way:
  apr run <GGUF teacher> emits byte-identical
    "ampiezza = 0.5\ndiametro = 10\naltezza = 20\n# Calcolo del volume\nvolume = ("
  across THREE distinct prompts:
    1. "What is 2+2? The answer is " (direct continuation)
    2. "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n" (ChatML)
    3. "Hello, my name is" (conversational, no question)
  Wall times: 48.73s / 48.68s / 39.65s — different (proving inference
  IS running, not cached), but output text matches byte-for-byte.

This is a PROMPT-INSENSITIVE GGUF generation bug: either the input
tokens are dropped or ignored, or the model state is initialized to
a fixed configuration before the forward pass starts.

Five-Whys for the §61.8 amendment:
1. Why §61.8? Both PRED-61-A and PRED-61-B fired; need durable record.
2. Why three prompts on GGUF? PRED-61-A's RED outcome in unexpected
   shape required disambiguation — was it ChatML-specific or
   structural? Three distinct prompts confirm structural.
3. Why does this matter? §61's 2-way picture (APR ChatML BROKEN /
   APR direct WORKING) was incomplete. Reality is 3-way: APR direct
   WORKING, APR ChatML BROKEN with \ns\ns repetition, GGUF any-prompt
   BROKEN with prompt-insensitive canned output.
4. Why split into two branches? Branch A (APR ChatML) and Branch B
   (GGUF prompt-insensitive) are independent — different code paths,
   different failure modes, different fix scopes.
5. Why methodology lesson #8? PRED-61-A asked "is GGUF + ChatML clean?"
   and the answer is "no, but for an entirely different reason than
   ChatML special-token handling". Without the third-prompt control
   (Hello), the §61.8 taxonomy would have collapsed into "all paths
   broken under ChatML" which would mis-localize.

§61.8 amendments to spec (1 file):
- Atomic next action banner: v3.06.0 → v3.07.0
- Add §61.8 sub-section above the closing --- divider of §61, with:
  - 61.8.0: empirical PRED firing (apr run examples + outputs)
  - 61.8.1: refined 3-way bug taxonomy (table)
  - 61.8.2: Branch A vs Branch B independent investigation cascades
  - 61.8.3: ship-% movement (stays 92%) + per-SHIP* blocker mapping
  - 61.8.4: methodology lesson #8 (RED outcome may surface different bug)

Evidence (NEW directory):
- evidence/section-61-8-pred-fired-2026-05-10/
  - pred-61-b-apr-direct.txt (29 lines, "4" output)
  - pred-61-a-gguf-direct.txt (32 lines, Italian "ampiezza...")
  - pred-61-a-gguf-chatml.txt (32 lines, byte-identical Italian)
  - gguf-third-prompt.txt (28 lines, "Hello..." → byte-identical)
  - findings.json (structured 3-way taxonomy + investigation branches)

Validation:
- Section format consistent with §61.1-61.7 (numbered §61.X.N sub-
  sub-sections under §61.8).
- All evidence files referenced in spec body.
- Methodological alignment: zero eprintln!, all evidence via apr
  run + tail to text files.

Spec movement:
- v3.06.0 → v3.07.0
- MODEL-1 ship %: stays at 92% (snapshot, not falsifier flip).
- MODEL-2 ship %: unchanged at 57%.

Refs:
- evidence/section-61-8-pred-fired-2026-05-10/findings.json
- SPEC-SHIP-TWO-001 §61.5 (PRED-61-A/B definitions)
- SPEC-SHIP-TWO-001 §17.5 (5 MODEL-1 PARTIAL chain)
- SPEC-SHIP-TWO-001 §60 (SHIP-007 §22 closure)

Closes task #30 PMAT-CODE-SHIP-TWO-SECTION-61-8.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the docs/ship-two-spec-section-61-8-pred-fired branch from 1982a6d to 503809c Compare May 11, 2026 14:40
@noahgift

Closing as superseded — the §65→§71 cascade narrative is complete on main via PRs #1629/#1631/#1633/#1634/#1636/#1642 (and the in-tree §67/§68/§69/§70/§71 sections). SHIP-005 LIVE-DISCHARGED at 86.59% pass@1 (§71); see contracts/apr-eval-humaneval-harness-invariant-v1.yaml v1.1.0 for the empirical evidence and root cause.

@noahgift noahgift closed this May 12, 2026
auto-merge was automatically disabled May 12, 2026 15:30

Pull request was closed
