docs(spec): SHIP-TWO-001 §68 — R1+R2 robustness baseline shipped; failures are Class B (sampling/quantization)#1631
Merged
Conversation
…1630); failures are sampling/quantization, not extraction (PMAT-CODE-SHIP-TWO-SECTION-68) §67 identified 4 refinement candidates (R1-R4) for the SHIP-005 4.31pp gap. PR #1630 ships R1 (multi-block extraction) + R2 (function-targeted, prefer `def {entry_point}(` block) as the cheapest 1-PR slice. §68 records the empirical finding from the 3-problem LIVE smoke. Verdict: R1+R2 is a ROBUSTNESS BASELINE, not a gap-closer. Empirical evidence (gx10 3-problem smoke, known-failed HumanEval /1/3/6 from §67 baseline): - HumanEval/1: FAIL → FAIL (unchanged) - HumanEval/3: FAIL → FAIL (unchanged) - HumanEval/6: FAIL → FAIL (unchanged) Per-problem inspection via manual apr run: the model emits a SINGLE fenced code block (not multiple). The block contains the expected function but the body is non-canonical at greedy temp=0. Failure class taxonomy: - Class A: multi-block / wrong-block. R1+R2 fixes these. - Class B: model-quality failure (single block, subtly wrong). R1+R2 cannot fix; needs R3 (Q4K → FP16) or R4 (temperature sampling). The 3 sampled failures are Class B. The 4.31pp residual is predominantly Class B; R3 or R4 are the next levers. Methodology lesson #15 NEW: smoke-test-driven scope reduction. A 3-problem smoke (~5 min) upper-bounds the achievable gain from a refinement candidate BEFORE dispatching the 5h full 164-rerun. The smoke saved a full rerun's worth of compute. Generalises lesson #14: near-miss results need their refinements empirically calibrated, not assumed. Changes (1 spec file): - docs/specifications/aprender-train/ship-two-models-spec.md - Atomic next action banner: v3.13.0 → v3.14.0 - New §68 section ABOVE §63 (newest-first), 8 sub-sections - Cumulative methodology lessons table (#6-#15) restated Refined R-candidate priorities post-§68: - R1+R2: SHIPPED (PR #1630) — robustness baseline - R3 (Q4K → FP16): NOT YET ATTEMPTED — needs FP16 safetensors - R4 (temp=0.2 + samples=3): NOT YET ATTEMPTED — 17h gx10 compute Spec movement: - v3.13.0 → v3.14.0 - MODEL-1 ship %: stays at 94% (bounded path to 95% requires R3 or R4) - MODEL-2 ship %: unchanged at 57% Closes task #45 PMAT-CODE-SHIP-TWO-SECTION-68. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 12, 2026
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Records empirical finding from the 3-problem LIVE smoke after PR #1630 (R1+R2 robustness refinement): R1+R2 is a robustness baseline, not a gap-closer.
The 3-problem smoke verdict
0/3 flipped. R1+R2 did NOT help these three problems.
Failure-class taxonomy (NEW)
The 3 sampled failures appear to be Class B. The 4.31pp residual is predominantly Class B.
Refined R-candidate priorities
Methodology Lesson #15 (NEW)
Smoke-test-driven scope reduction. A 3-problem smoke (~5 min) upper-bounds refinement gain BEFORE dispatching a 5h full rerun. The smoke saved a full rerun's worth of compute by revealing R1+R2 doesn't address the dominant failure class.
Generalises lesson #14: near-miss results need their refinements empirically calibrated, not assumed.
Ship-% Movement
🤖 Generated with Claude Code