docs(spec): SHIP-TWO-001 §68 — R1+R2 robustness baseline shipped; failures are Class B (sampling/quantization) by noahgift · Pull Request #1631 · paiml/aprender

noahgift · 2026-05-12T06:25:54Z

Summary

Records empirical finding from the 3-problem LIVE smoke after PR #1630 (R1+R2 robustness refinement): R1+R2 is a robustness baseline, not a gap-closer.

The 3-problem smoke verdict

Task	Pre-fix	Post-R1+R2
HumanEval/1 (separate_paren_groups)	FAIL	FAIL (unchanged)
HumanEval/3 (below_zero)	FAIL	FAIL (unchanged)
HumanEval/6 (parse_nested_parens)	FAIL	FAIL (unchanged)

0/3 flipped. R1+R2 did NOT help these three problems.

Failure-class taxonomy (NEW)

Class A — multi-block / wrong-block failures: model emits explanatory snippet + real solution. R1+R2 fixes these.
Class B — model-quality failures: model emits single block with subtly-wrong solution at greedy temp=0. R1+R2 cannot fix; needs R3 (Q4K → FP16) or R4 (temperature sampling).

The 3 sampled failures appear to be Class B. The 4.31pp residual is predominantly Class B.

Refined R-candidate priorities

Candidate	Status
R1+R2	SHIPPED (PR #1630) — robustness baseline
R3 (Q4K → FP16)	Not yet attempted — needs FP16 safetensors
R4 (temp=0.2 + samples=3)	Not yet attempted — ~17h gx10 compute

Methodology Lesson #15 (NEW)

Smoke-test-driven scope reduction. A 3-problem smoke (~5 min) upper-bounds refinement gain BEFORE dispatching a 5h full rerun. The smoke saved a full rerun's worth of compute by revealing R1+R2 doesn't address the dominant failure class.

Generalises lesson #14: near-miss results need their refinements empirically calibrated, not assumed.

Ship-% Movement

MODEL-1 ship %: stays at 94% (bounded path to 95% requires R3 or R4 — multi-day work)
MODEL-2 ship %: unchanged at 57%

🤖 Generated with Claude Code

…1630); failures are sampling/quantization, not extraction (PMAT-CODE-SHIP-TWO-SECTION-68) §67 identified 4 refinement candidates (R1-R4) for the SHIP-005 4.31pp gap. PR #1630 ships R1 (multi-block extraction) + R2 (function-targeted, prefer `def {entry_point}(` block) as the cheapest 1-PR slice. §68 records the empirical finding from the 3-problem LIVE smoke. Verdict: R1+R2 is a ROBUSTNESS BASELINE, not a gap-closer. Empirical evidence (gx10 3-problem smoke, known-failed HumanEval /1/3/6 from §67 baseline): - HumanEval/1: FAIL → FAIL (unchanged) - HumanEval/3: FAIL → FAIL (unchanged) - HumanEval/6: FAIL → FAIL (unchanged) Per-problem inspection via manual apr run: the model emits a SINGLE fenced code block (not multiple). The block contains the expected function but the body is non-canonical at greedy temp=0. Failure class taxonomy: - Class A: multi-block / wrong-block. R1+R2 fixes these. - Class B: model-quality failure (single block, subtly wrong). R1+R2 cannot fix; needs R3 (Q4K → FP16) or R4 (temperature sampling). The 3 sampled failures are Class B. The 4.31pp residual is predominantly Class B; R3 or R4 are the next levers. Methodology lesson #15 NEW: smoke-test-driven scope reduction. A 3-problem smoke (~5 min) upper-bounds the achievable gain from a refinement candidate BEFORE dispatching the 5h full 164-rerun. The smoke saved a full rerun's worth of compute. Generalises lesson #14: near-miss results need their refinements empirically calibrated, not assumed. Changes (1 spec file): - docs/specifications/aprender-train/ship-two-models-spec.md - Atomic next action banner: v3.13.0 → v3.14.0 - New §68 section ABOVE §63 (newest-first), 8 sub-sections - Cumulative methodology lessons table (#6-#15) restated Refined R-candidate priorities post-§68: - R1+R2: SHIPPED (PR #1630) — robustness baseline - R3 (Q4K → FP16): NOT YET ATTEMPTED — needs FP16 safetensors - R4 (temp=0.2 + samples=3): NOT YET ATTEMPTED — 17h gx10 compute Spec movement: - v3.13.0 → v3.14.0 - MODEL-1 ship %: stays at 94% (bounded path to 95% requires R3 or R4) - MODEL-2 ship %: unchanged at 57% Closes task #45 PMAT-CODE-SHIP-TWO-SECTION-68. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 12, 2026 06:25

noahgift merged commit ef08e87 into main May 12, 2026
11 checks passed

noahgift deleted the docs/section-68-r1-r2-baseline branch May 12, 2026 07:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(spec): SHIP-TWO-001 §68 — R1+R2 robustness baseline shipped; failures are Class B (sampling/quantization)#1631

docs(spec): SHIP-TWO-001 §68 — R1+R2 robustness baseline shipped; failures are Class B (sampling/quantization)#1631
noahgift merged 1 commit into
mainfrom
docs/section-68-r1-r2-baseline

noahgift commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 12, 2026

Summary

The 3-problem smoke verdict

Failure-class taxonomy (NEW)

Refined R-candidate priorities

Methodology Lesson #15 (NEW)

Ship-% Movement

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant