fix(apr-cli): function-targeted multi-block extraction in HumanEval (R1+R2 SHIP-005 refinement) #1630
Closed
noahgift wants to merge 3 commits into
…PMAT-CODE-SHIP-005-R1-R2-REFINEMENT)
§67 identified four refinement candidates for the SHIP-005 4.31pp
residual gap (80.49% → 84.80% floor). This PR ships R1+R2.
R1: multi-block extraction. The model sometimes emits an
explanatory snippet block BEFORE the actual solution block. The
prior first-block-wins extractor returned the snippet; this PR
scans ALL blocks.
R2: function-targeted extraction. When `entry_point` is supplied,
prefer the fenced block whose body contains `def {entry_point}(`.
This anchors extraction to the intended solution function rather
than relying on block ordering.
Fallback: when `entry_point` is not supplied, or no block contains
the target function, return the first non-empty block. This
preserves the legacy `extract_python_code_block` behaviour, so the
targeted extractor is a strict superset of it.
Implementation:
- NEW `extract_python_code_block_targeted(text, entry_point) -> Option<String>`
- `extract_python_code_block(text)` is now a thin wrapper that
calls the targeted variant with `None` (backwards-compatible)
- `run_humaneval_inference` passes `Some(entry)` so HumanEval
evaluations always use function-targeted extraction
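The selection rule described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code shipped in the PR: the parameter types (`&str`, `Option<&str>`), the helper `fenced_blocks`, and the line-based fence detection are all assumptions; only the two public function names and the `Option<String>` return type come from the description.

```rust
/// Hypothetical helper: collect the bodies of all fenced blocks in `text`.
/// Assumption: a fence is any line whose trimmed start is three backticks,
/// with or without an info string such as `python`.
/// An unclosed trailing block is discarded in this sketch.
fn fenced_blocks(text: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current: Option<Vec<&str>> = None;
    for line in text.lines() {
        if line.trim_start().starts_with("```") {
            match current.take() {
                // Closing fence: finish the block in progress.
                Some(body) => blocks.push(body.join("\n")),
                // Opening fence: start collecting a new block.
                None => current = Some(Vec::new()),
            }
        } else if let Some(body) = current.as_mut() {
            body.push(line);
        }
    }
    blocks
}

/// R1 + R2: scan every block, prefer the one defining `entry_point`,
/// otherwise fall back to the first non-empty block.
pub fn extract_python_code_block_targeted(
    text: &str,
    entry_point: Option<&str>,
) -> Option<String> {
    let blocks = fenced_blocks(text);
    if let Some(entry) = entry_point {
        let needle = format!("def {entry}(");
        if let Some(hit) = blocks.iter().find(|b| b.contains(&needle)) {
            return Some(hit.clone());
        }
    }
    // Fallback (also the legacy path): first non-empty block wins.
    blocks.into_iter().find(|b| !b.trim().is_empty())
}

/// Legacy entry point kept as a thin wrapper.
pub fn extract_python_code_block(text: &str) -> Option<String> {
    extract_python_code_block_targeted(text, None)
}
```

With a response that contains a usage snippet followed by the solution, passing `Some("add")` would select the block containing `def add(`, while passing `None` would still return the first non-empty block.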
Unit tests (7 new + 6 legacy = 13 GREEN):
- prefers_block_containing_entry_point (R2 canonical)
- single_block_matching_entry
- no_entry_match_falls_back_to_first (R2 robustness)
- no_entry_point_first_block_wins (legacy compat)
- mixed_fence_tags_picks_entry_block (R1+R2 combined)
- no_fence_returns_none
- skips_empty_fences_before_match
- (+ 6 legacy extract_python_code_block_tests still passing)
Five-Whys:
1. Why R1+R2? §67 identified 4 refinement candidates; R1+R2 are
cheapest (extraction-only, no compute beyond rerun).
2. Why both together? They share the same multi-pass parser
refactor; splitting them would be artificial.
3. Why not also R3 (Q4K → FP16)? Different artifact (needs
safetensors); separate cascade.
4. Why not R4 (temperature sampling)? Larger compute footprint
(3 samples × 164 problems × ~125s ≈ 17h on gx10). R1+R2 is
the higher-leverage single-PR win.
5. Why ship as robustness even if smoke test shows it doesn't
flip the 3 hardest failures (1, 3, 6)? Unit tests prove
correctness on multi-block scenarios. The 4.31pp gap may
require R3 or R4 to fully close, but R1+R2 is the necessary
robustness baseline for any future eval.
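The R4 compute figure quoted in item 4 can be checked with a short calculation. The function name `eval_hours` is hypothetical, and the sketch assumes generations run strictly one after another at the quoted ~125s each.

```rust
/// Hypothetical helper: wall-clock hours for a sequential sampled run.
/// Assumes every generation takes `secs_per_generation` seconds.
fn eval_hours(samples: u32, problems: u32, secs_per_generation: f64) -> f64 {
    f64::from(samples) * f64::from(problems) * secs_per_generation / 3600.0
}
```

Calling `eval_hours(3, 164, 125.0)` gives roughly 17.1 hours, in line with the ≈ 17h estimate.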
LIVE smoke (gx10 3 problems known-failed pre-fix):
- HumanEval/1 (separate_paren_groups): FAIL (unchanged — model
emits single block; the failure is model-quality at greedy
temp=0, not extraction)
- HumanEval/3 (below_zero): FAIL (unchanged)
- HumanEval/6 (parse_nested_parens): FAIL (unchanged — also
failed in PR #1628 5-problem smoke; hardest problem in the set)
These three are NOT extraction failures; they're greedy-sampling
or Q4K-quantization failures. R3/R4 may flip some. R1+R2 is the
robustness baseline.
A full 164-run on gx10 to measure R1+R2's exact gain is dispatchable
as a follow-up. Expected gain: 0-3pp depending on how many of the
32 failed problems were extraction failures vs sampling failures.
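To relate the percentage figures above to problem counts, here is a small sketch. The names `TOTAL`, `pass_rate`, and `flips_needed` are hypothetical; the sketch assumes the 80.49% baseline corresponds to 132 of 164 problems passing (consistent with the 32 failures quoted) and that the floor is met when the pass rate is greater than or equal to 84.80%.

```rust
/// Number of HumanEval problems in the full run.
const TOTAL: u32 = 164;

/// Pass rate in percent for `passed` problems out of TOTAL.
fn pass_rate(passed: u32) -> f64 {
    100.0 * f64::from(passed) / f64::from(TOTAL)
}

/// Smallest number of additional passing problems needed to reach
/// `floor_pct` (assumption: the floor is inclusive).
fn flips_needed(passed: u32, floor_pct: f64) -> u32 {
    (passed..=TOTAL)
        .find(|&p| pass_rate(p) >= floor_pct)
        .map(|p| p - passed)
        .unwrap_or(TOTAL - passed)
}
```

Under these assumptions each problem is worth about 0.61pp, so `flips_needed(132, 84.80)` returns 8, while the expected 0-3pp gain corresponds to at most about five flipped problems.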
Validation:
- cargo test -p apr-cli --release --features cuda
extract_python_code_block → 13/13 pass (7 new + 6 legacy)
- cargo build -p apr-cli --release --features cuda (gx10 aarch64):
clean
- 3-problem LIVE smoke: confirms robust extraction (no regression)
Spec movement:
- MODEL-1 ship %: stays at 94% (no LIVE-discharge from this PR;
may flip post-full-164 if R1+R2 closes ≥4.31pp)
- MODEL-2 ship %: unchanged at 57%
Refs:
- SPEC-SHIP-TWO-001 §66 (H4 confirmation)
- SPEC-SHIP-TWO-001 §67 (H4 result + R1-R4 scope)
- PR #1628 (H4 fix — base of this refinement)
Closes task #44 PMAT-CODE-SHIP-005-R1-R2-REFINEMENT.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 12, 2026
…1630); failures are sampling/quantization, not extraction (PMAT-CODE-SHIP-TWO-SECTION-68) (#1631)
§67 identified 4 refinement candidates (R1-R4) for the SHIP-005 4.31pp gap. PR #1630 ships R1 (multi-block extraction) + R2 (function-targeted, prefer `def {entry_point}(` block) as the cheapest 1-PR slice. §68 records the empirical finding from the 3-problem LIVE smoke.
Verdict: R1+R2 is a ROBUSTNESS BASELINE, not a gap-closer.
Empirical evidence (gx10 3-problem smoke, known-failed HumanEval /1/3/6 from §67 baseline):
- HumanEval/1: FAIL → FAIL (unchanged)
- HumanEval/3: FAIL → FAIL (unchanged)
- HumanEval/6: FAIL → FAIL (unchanged)
Per-problem inspection via manual apr run: the model emits a SINGLE fenced code block (not multiple). The block contains the expected function but the body is non-canonical at greedy temp=0.
Failure class taxonomy:
- Class A: multi-block / wrong-block. R1+R2 fixes these.
- Class B: model-quality failure (single block, subtly wrong). R1+R2 cannot fix; needs R3 (Q4K → FP16) or R4 (temperature sampling).
The 3 sampled failures are Class B. The 4.31pp residual is predominantly Class B; R3 or R4 are the next levers.
Methodology lesson #15 NEW: smoke-test-driven scope reduction. A 3-problem smoke (~5 min) upper-bounds the achievable gain from a refinement candidate BEFORE dispatching the 5h full 164-rerun. The smoke saved a full rerun's worth of compute. Generalises lesson #14: near-miss results need their refinements empirically calibrated, not assumed.
Changes (1 spec file):
- docs/specifications/aprender-train/ship-two-models-spec.md
- Atomic next action banner: v3.13.0 → v3.14.0
- New §68 section ABOVE §63 (newest-first), 8 sub-sections
- Cumulative methodology lessons table (#6-#15) restated
Refined R-candidate priorities post-§68:
- R1+R2: SHIPPED (PR #1630), robustness baseline
- R3 (Q4K → FP16): NOT YET ATTEMPTED, needs FP16 safetensors
- R4 (temp=0.2 + samples=3): NOT YET ATTEMPTED, 17h gx10 compute
Spec movement:
- v3.13.0 → v3.14.0
- MODEL-1 ship %: stays at 94% (bounded path to 95% requires R3 or R4)
- MODEL-2 ship %: unchanged at 57%
Closes task #45 PMAT-CODE-SHIP-TWO-SECTION-68.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Contributor, Author
Closing as redundant: content is included in the squash-merge of PR #1634 (which carries the full chain: H4 ChatML + R1+R2 extraction + §69 spec amendment + APR_EVAL_DEBUG diagnostic + harness invariant contract). RC3 follow-up is PR #1635; §70 discharge is PR #1636. See §70 of the spec for the full cascade narrative.
auto-merge was automatically disabled
May 12, 2026 08:15
Pull request was closed
Summary
§67 identified four refinement candidates for the SHIP-005 4.31pp residual gap (80.49% → 84.80% floor). This PR ships R1+R2 as the cheapest extraction-layer improvement.
R1+R2
- R1 (multi-block extraction): scan all fenced blocks instead of returning the first one.
- R2 (function-targeted extraction): when `entry_point` is supplied, prefer the fenced block whose body contains `def {entry_point}(`. Anchors extraction to the intended solution.
- Fallback: when no block contains the target function, return the first non-empty block (preserves the legacy `extract_python_code_block`).
Implementation
- `extract_python_code_block_targeted(text, entry_point) -> Option<String>`
- `extract_python_code_block(text)` becomes a thin wrapper calling the targeted variant with `None`
- `run_humaneval_inference` passes `Some(entry)` for HumanEval evals
Unit Tests (13 GREEN, 7 new + 6 legacy)
- `prefers_block_containing_entry_point` (R2 canonical)
- `single_block_matching_entry`
- `no_entry_match_falls_back_to_first` (R2 robustness)
- `no_entry_point_first_block_wins` (legacy compat)
- `mixed_fence_tags_picks_entry_block` (R1+R2 combined)
- `no_fence_returns_none`
- `skips_empty_fences_before_match`
- (+ 6 legacy `extract_python_code_block_tests` still passing)
LIVE Smoke (gx10, 3 problems known-failed pre-fix)
- HumanEval/1 (separate_paren_groups): FAIL (unchanged)
- HumanEval/3 (below_zero): FAIL (unchanged)
- HumanEval/6 (parse_nested_parens): FAIL (unchanged)
These are NOT extraction failures; they're greedy-sampling or Q4K-quantization failures (the model emits single blocks with non-canonical solutions). R3/R4 may flip some. R1+R2 is the robustness baseline for any future eval cascade.
Validation
- `cargo test -p apr-cli --release --features cuda extract_python_code_block`: 13/13 pass
- `cargo build -p apr-cli --release --features cuda` (gx10 aarch64): clean
Ship-% Movement
- MODEL-1 ship %: stays at 94%
- MODEL-2 ship %: unchanged at 57%
Why ship as robustness even if 3-problem smoke shows 0/3 flip?
Unit tests prove correctness on multi-block scenarios. The 4.31pp gap may require R3 (Q4K→FP16) or R4 (temperature sampling) to fully close, but R1+R2 is the necessary robustness baseline. Stacking R1+R2 with either R3 or R4 in a follow-up should reach the floor.
🤖 Generated with Claude Code