docs(spec): SHIP-TWO-001 §68 — R1+R2 robustness baseline shipped; failures are Class B (sampling/quantization)#1631

Merged
noahgift merged 1 commit into main from docs/section-68-r1-r2-baseline on May 12, 2026

@noahgift


Summary

Records the empirical finding from the 3-problem LIVE smoke run after PR #1630 (R1+R2 robustness refinement): R1+R2 is a robustness baseline, not a gap-closer.

The 3-problem smoke verdict

| Task | Pre-fix | Post-R1+R2 |
| --- | --- | --- |
| HumanEval/1 (separate_paren_groups) | FAIL | FAIL (unchanged) |
| HumanEval/3 (below_zero) | FAIL | FAIL (unchanged) |
| HumanEval/6 (parse_nested_parens) | FAIL | FAIL (unchanged) |

0/3 flipped. R1+R2 did NOT help these three problems.

Failure-class taxonomy (NEW)

  • Class A — multi-block / wrong-block failures: model emits explanatory snippet + real solution. R1+R2 fixes these.
  • Class B — model-quality failures: model emits single block with subtly-wrong solution at greedy temp=0. R1+R2 cannot fix; needs R3 (Q4K → FP16) or R4 (temperature sampling).

The 3 sampled failures appear to be Class B. By extrapolation, the 4.31pp residual is likely predominantly Class B, although only these 3 failing problems were inspected.
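The Class A / Class B split above can be approximated mechanically by counting fenced code blocks in a failed completion. The following is a minimal sketch under that assumption; `classify_failure` is a hypothetical triage helper, not something taken from the spec or from PR #1630, and a single-block result still needs manual inspection to confirm the body is actually wrong.

```python
import re

# Build the fence marker programmatically so this example nests cleanly in markdown.
FENCE = "`" * 3
# A fenced block: opening fence, optional language tag, body, closing fence.
FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def classify_failure(completion: str) -> str:
    """Label a FAILED completion by how many fenced blocks it contains."""
    blocks = FENCE_RE.findall(completion)
    if len(blocks) > 1:
        return "A"  # multi-block: the extractor may have picked the wrong block
    if len(blocks) == 1:
        return "B"  # single block: extraction cannot change the outcome
    return "other"  # no fenced block at all; outside the two classes described
```

A triage pass over the known-failed tasks could call this on each raw completion and only send the `"A"` cases back through R1+R2.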

Refined R-candidate priorities

| Candidate | Status |
| --- | --- |
| R1+R2 | SHIPPED (PR #1630): robustness baseline |
| R3 (Q4K → FP16) | Not yet attempted: needs FP16 safetensors |
| R4 (temp=0.2 + samples=3) | Not yet attempted: ~17h gx10 compute |
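R4 pairs temperature sampling with 3 samples per problem. The text does not say how those samples would be scored; one common choice for HumanEval-style evaluation is the unbiased pass@k estimator, sketched here purely as an assumption about how an R4 run might be summarized.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn (3 under R4), c: samples that passed, k: budget being scored.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n=3`, a problem that passes in exactly one sample scores 1/3 at k=1 and 1.0 at k=3, so the choice of k materially changes how much of the residual R4 appears to recover.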

Methodology Lesson #15 (NEW)

Smoke-test-driven scope reduction. A 3-problem smoke (~5 min) upper-bounds the gain achievable from a refinement candidate BEFORE dispatching a 5h full rerun of all 164 problems. The smoke saved a full rerun's worth of compute by revealing that R1+R2 does not address the dominant failure class.

Generalises lesson #14: near-miss results need their refinements empirically calibrated, not assumed.
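The arithmetic behind the smoke verdict can be sketched as follows. This assumes the sampled failures are representative of the whole residual, which the text does not establish; both function names are hypothetical. The second function gives the exact one-sided binomial bound on the true flip rate for the zero-flip case, obtained by solving (1 - p)^n = alpha for p.

```python
def projected_gain_pp(flipped: int, sampled: int, residual_pp: float) -> float:
    """Point estimate: observed flip rate scaled to the whole residual gap."""
    return (flipped / sampled) * residual_pp


def zero_flip_rate_upper_bound(sampled: int, alpha: float = 0.05) -> float:
    """Exact one-sided upper bound on the flip rate when 0 of `sampled` flipped."""
    return 1.0 - alpha ** (1.0 / sampled)


# 0 of 3 known failures flipped against a 4.31pp residual.
print(projected_gain_pp(0, 3, 4.31))   # 0.0
print(zero_flip_rate_upper_bound(3))   # about 0.63
```

The point estimate is 0pp, which matches the verdict. The bound stays wide with only three samples, so the smoke works best as a cheap filter for deciding whether a full rerun is worth dispatching.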

Ship-% Movement

  • MODEL-1 ship %: stays at 94% (bounded path to 95% requires R3 or R4 — multi-day work)
  • MODEL-2 ship %: unchanged at 57%

🤖 Generated with Claude Code

…1630); failures are sampling/quantization, not extraction (PMAT-CODE-SHIP-TWO-SECTION-68)

§67 identified 4 refinement candidates (R1-R4) for the SHIP-005
4.31pp gap. PR #1630 ships R1 (multi-block extraction) + R2
(function-targeted, prefer `def {entry_point}(` block) as the
cheapest 1-PR slice. §68 records the empirical finding from the
3-problem LIVE smoke.
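
A minimal sketch of what R1 (multi-block extraction) and R2 (prefer the block containing `def {entry_point}(`) could look like. This is an illustration only, not the code shipped in PR #1630; the function name `extract_solution` and both fallback rules are assumptions.

```python
import re

# Build the fence marker programmatically so this example nests cleanly in markdown.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_solution(completion: str, entry_point: str) -> str:
    """Pick the code to execute from a raw model completion."""
    blocks = FENCE_RE.findall(completion)  # R1: collect every fenced block
    if not blocks:
        return completion  # assumption: unfenced output is treated as code
    target = f"def {entry_point}("
    for block in blocks:  # R2: prefer the block that defines the entry point
        if target in block:
            return block
    return blocks[0]  # assumption: fall back to the first block
```

When the completion holds a single block that already defines the entry point, this returns the same code a single-block extractor would, which is consistent with the smoke showing no change on the three sampled problems.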

Verdict: R1+R2 is a ROBUSTNESS BASELINE, not a gap-closer.

Empirical evidence (gx10 3-problem smoke, known-failed
HumanEval/1, /3, and /6 from the §67 baseline):
- HumanEval/1: FAIL → FAIL (unchanged)
- HumanEval/3: FAIL → FAIL (unchanged)
- HumanEval/6: FAIL → FAIL (unchanged)

Per-problem inspection via manual apr run: the model emits a
SINGLE fenced code block (not multiple). The block contains the
expected function but the body is non-canonical at greedy temp=0.

Failure class taxonomy:
- Class A: multi-block / wrong-block. R1+R2 fixes these.
- Class B: model-quality failure (single block, subtly wrong).
  R1+R2 cannot fix; needs R3 (Q4K → FP16) or R4 (temperature
  sampling).

The 3 sampled failures appear to be Class B, so the 4.31pp
residual is likely predominantly Class B; R3 and R4 are the
next levers.

Methodology lesson #15 (NEW): smoke-test-driven scope reduction.
A 3-problem smoke (~5 min) upper-bounds the achievable gain from
a refinement candidate BEFORE dispatching the 5h full rerun of
all 164 problems. The smoke saved a full rerun's worth of compute.

Generalises lesson #14: near-miss results need their refinements
empirically calibrated, not assumed.

Changes (1 spec file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.13.0 → v3.14.0
  - New §68 section ABOVE §63 (newest-first), 8 sub-sections
  - Cumulative methodology lessons table (#6-#15) restated

Refined R-candidate priorities post-§68:
- R1+R2: SHIPPED (PR #1630) — robustness baseline
- R3 (Q4K → FP16): NOT YET ATTEMPTED — needs FP16 safetensors
- R4 (temp=0.2 + samples=3): NOT YET ATTEMPTED — 17h gx10 compute

Spec movement:
- v3.13.0 → v3.14.0
- MODEL-1 ship %: stays at 94% (bounded path to 95% requires R3 or R4)
- MODEL-2 ship %: unchanged at 57%

Closes task #45 PMAT-CODE-SHIP-TWO-SECTION-68.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>