fix(apr-cli): function-targeted multi-block extraction in HumanEval (R1+R2 SHIP-005 refinement)#1630

Closed
noahgift wants to merge 3 commits into main from feat/ship-005-r1-r2-refinement

Conversation

@noahgift

Contributor

Summary

§67 identified four refinement candidates for the SHIP-005 4.31pp residual gap (80.49% → 84.80% floor). This PR ships R1+R2 as the cheapest extraction-layer improvement.

R1+R2

  • R1 (multi-block extraction): the model sometimes emits an explanatory snippet block BEFORE the actual solution. The prior first-block-wins extractor returned the snippet; the new path scans ALL blocks.
  • R2 (function-targeted): when `entry_point` is supplied, prefer the fenced block whose body contains `def {entry_point}(`. This anchors extraction to the intended solution.
  • Fallback: no matching block → first non-empty block (backwards-compatible with the legacy `extract_python_code_block`).

Implementation

  • NEW extract_python_code_block_targeted(text, entry_point) -> Option<String>
  • extract_python_code_block(text) becomes a thin wrapper that calls the targeted variant with None
  • run_humaneval_inference passes Some(entry) for HumanEval evals
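
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the two public function names and the `Option<String>` return type come from the description, while the `&str` / `Option<&str>` parameter types, the `fenced_blocks` helper, the line-based fence detection, and the handling of unterminated fences are assumptions made for the example.

```rust
/// Three backticks, written as escapes so this sketch contains no literal fence.
const FENCE: &str = "\u{60}\u{60}\u{60}";

/// Hypothetical helper: bodies of all fenced blocks in `text`, in order.
/// Any line starting with a fence toggles between "inside" and "outside";
/// the language tag is ignored. An unterminated fence is dropped (assumption).
fn fenced_blocks(text: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut current: Option<Vec<&str>> = None;
    for line in text.lines() {
        if line.trim_start().starts_with(FENCE) {
            match current.take() {
                // Closing fence: store the collected body.
                Some(body) => blocks.push(body.join("\n")),
                // Opening fence: start collecting.
                None => current = Some(Vec::new()),
            }
        } else if let Some(body) = current.as_mut() {
            body.push(line);
        }
    }
    blocks
}

/// R1 + R2: scan every block; prefer the one defining `entry_point`;
/// otherwise fall back to the first non-empty block.
fn extract_python_code_block_targeted(text: &str, entry_point: Option<&str>) -> Option<String> {
    let blocks = fenced_blocks(text);
    if let Some(entry) = entry_point {
        let needle = format!("def {entry}(");
        if let Some(hit) = blocks.iter().find(|b| b.contains(needle.as_str())) {
            return Some(hit.clone());
        }
    }
    blocks.into_iter().find(|b| !b.trim().is_empty())
}

/// Legacy entry point: no target, so the first non-empty block wins.
fn extract_python_code_block(text: &str) -> Option<String> {
    extract_python_code_block_targeted(text, None)
}
```

With a completion that shows a usage snippet before the solution, calling the targeted variant with `Some("below_zero")` would return the block that defines `below_zero`, while the legacy wrapper would still return the snippet.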

Unit Tests (13 GREEN, 7 new + 6 legacy)

  • prefers_block_containing_entry_point (R2 canonical)
  • single_block_matching_entry
  • no_entry_match_falls_back_to_first (R2 robustness)
  • no_entry_point_first_block_wins (legacy compat)
  • mixed_fence_tags_picks_entry_block (R1+R2 combined)
  • no_fence_returns_none
  • skips_empty_fences_before_match
  • (+ 6 legacy extract_python_code_block_tests still passing)

LIVE Smoke (gx10, 3 problems known-failed pre-fix)

  • HumanEval/1, /3, /6 — unchanged FAIL

These are NOT extraction failures; they're greedy-sampling or Q4K-quantization failures (the model emits single blocks with non-canonical solutions). R3/R4 may flip some. R1+R2 is the robustness baseline for any future eval cascade.

Validation

  • cargo test -p apr-cli --release --features cuda extract_python_code_block — 13/13 pass
  • cargo build -p apr-cli --release --features cuda (gx10 aarch64): clean
  • LIVE 3-problem smoke confirms no regression

Ship-% Movement

  • MODEL-1 ship %: stays at 94%. May flip post-full-164-rerun if R1+R2 closes ≥4.31pp; full rerun dispatchable as follow-up.
  • MODEL-2 ship %: unchanged at 57%.

Why ship as robustness even if 3-problem smoke shows 0/3 flip?

Unit tests prove correctness on multi-block scenarios. The 4.31pp gap may require R3 (Q4K→FP16) or R4 (temperature sampling) to fully close, but R1+R2 is the necessary robustness baseline. Stacking R1+R2 with R3 or R4 in a follow-up is expected to reach the floor.

🤖 Generated with Claude Code

…PMAT-CODE-SHIP-005-R1-R2-REFINEMENT)

§67 identified four refinement candidates for the SHIP-005 4.31pp
residual gap (80.49% → 84.80% floor). This PR ships R1+R2.

R1: multi-block extraction. The model sometimes emits an
explanatory snippet block BEFORE the actual solution block. The
prior first-block-wins extractor returned the snippet; this PR
scans ALL blocks.

R2: function-targeted extraction. When `entry_point` is supplied,
prefer the fenced block whose body contains `def {entry_point}(`.
This anchors extraction to the intended solution function rather
than relying on block ordering.

Fallback: when no `entry_point` is supplied, or no block defines
the target function, return the first non-empty block. The new
behaviour is therefore a strict superset of the legacy
`extract_python_code_block` behaviour.

Implementation:
- NEW `extract_python_code_block_targeted(text, entry_point) -> Option<String>`
- `extract_python_code_block(text)` is now a thin wrapper that
  calls the targeted variant with `None` (backwards-compatible)
- `run_humaneval_inference` passes `Some(entry)` so HumanEval
  evaluations always use function-targeted extraction

Unit tests (7 new + 6 legacy = 13 GREEN):
- prefers_block_containing_entry_point (R2 canonical)
- single_block_matching_entry
- no_entry_match_falls_back_to_first (R2 robustness)
- no_entry_point_first_block_wins (legacy compat)
- mixed_fence_tags_picks_entry_block (R1+R2 combined)
- no_fence_returns_none
- skips_empty_fences_before_match
- (+ 6 legacy extract_python_code_block_tests still passing)

Five-Whys:
1. Why R1+R2? §67 identified 4 refinement candidates; R1+R2 are
   cheapest (extraction-only, no compute beyond rerun).
2. Why both together? They share the same multi-pass parser
   refactor; splitting them would be artificial.
3. Why not also R3 (Q4K → FP16)? Different artifact (needs
   safetensors); separate cascade.
4. Why not R4 (temperature sampling)? Larger compute footprint
   (3 samples × 164 problems × ~125s ≈ 17h on gx10). R1+R2 is
   the higher-leverage single-PR win.
5. Why ship as robustness even if smoke test shows it doesn't
   flip the 3 hardest failures (1, 3, 6)? Unit tests prove
   correctness on multi-block scenarios. The 4.31pp gap may
   require R3 or R4 to fully close, but R1+R2 is the necessary
   robustness baseline for any future eval.
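
The R4 compute estimate in item 4 can be reproduced with a few lines of arithmetic. The sketch below only restates the figures quoted above (3 samples, 164 problems, roughly 125 s per generation); the function name `eval_hours` is hypothetical and the per-generation time is the approximate figure from the text, not a measurement.

```rust
/// Hypothetical helper: wall-clock hours for a sequential eval run,
/// assuming every generation takes about `secs_per_generation` seconds.
fn eval_hours(problems: u32, samples: u32, secs_per_generation: f64) -> f64 {
    f64::from(problems * samples) * secs_per_generation / 3600.0
}

/// R4 as described: 3 samples per problem over the 164 problems.
fn r4_hours() -> f64 {
    eval_hours(164, 3, 125.0) // 61_500 s, about 17.1 h
}

/// A single-sample pass over the same 164 problems, for comparison.
fn single_pass_hours() -> f64 {
    eval_hours(164, 1, 125.0) // 20_500 s, about 5.7 h
}
```

Under the same assumption, a single-sample rerun costs about a third of R4, which is why the extraction-only change is the cheaper thing to try first.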

LIVE smoke (gx10 3 problems known-failed pre-fix):
- HumanEval/1 (separate_paren_groups): FAIL (unchanged — model
  emits single block; the failure is model-quality at greedy
  temp=0, not extraction)
- HumanEval/3 (below_zero): FAIL (unchanged)
- HumanEval/6 (parse_nested_parens): FAIL (unchanged — also
  failed in PR #1628 5-problem smoke; hardest problem in the set)

These three are NOT extraction failures; they're greedy-sampling
or Q4K-quantization failures. R3/R4 may flip some. R1+R2 is the
robustness baseline.

A full 164-problem run on gx10 to measure R1+R2's exact gain is
dispatchable as a follow-up. Expected gain: 0-3pp, depending on how
many of the 32 failed problems were extraction failures rather than
sampling failures.
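
A rough check of what these percentages mean in whole problems, assuming the 80.49% baseline corresponds to the 32 failures out of 164 mentioned above. The helper names are hypothetical, and the sketch is arithmetic on the quoted figures only, not a prediction of the rerun.

```rust
/// Pass rate in percent.
fn pass_rate_pct(passed: u32, total: u32) -> f64 {
    100.0 * f64::from(passed) / f64::from(total)
}

/// Smallest number of passing problems whose pass rate is at least `floor_pct`.
fn min_passes_for_floor(floor_pct: f64, total: u32) -> u32 {
    (floor_pct / 100.0 * f64::from(total)).ceil() as u32
}

/// How many currently failing problems must flip to PASS to reach the floor.
fn flips_needed(passed: u32, total: u32, floor_pct: f64) -> u32 {
    min_passes_for_floor(floor_pct, total).saturating_sub(passed)
}
```

With 132 of 164 passing, the 84.80% floor works out to 140 passing problems, i.e. 8 flips. Each problem is worth about 0.61pp, so a 3pp gain is fewer than 5 flips, which is consistent with the statement that R3 or R4 may still be needed.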

Validation:
- cargo test -p apr-cli --release --features cuda
  extract_python_code_block → 13/13 pass (7 new + 6 legacy)
- cargo build -p apr-cli --release --features cuda (gx10 aarch64):
  clean
- 3-problem LIVE smoke: confirms robust extraction (no regression)

Spec movement:
- MODEL-1 ship %: stays at 94% (no LIVE-discharge from this PR;
  may flip post-full-164 if R1+R2 closes ≥4.31pp)
- MODEL-2 ship %: unchanged at 57%

Refs:
- SPEC-SHIP-TWO-001 §66 (H4 confirmation)
- SPEC-SHIP-TWO-001 §67 (H4 result + R1-R4 scope)
- PR #1628 (H4 fix — base of this refinement)

Closes task #44 PMAT-CODE-SHIP-005-R1-R2-REFINEMENT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 12, 2026
…1630); failures are sampling/quantization, not extraction (PMAT-CODE-SHIP-TWO-SECTION-68) (#1631)

§67 identified 4 refinement candidates (R1-R4) for the SHIP-005
4.31pp gap. PR #1630 ships R1 (multi-block extraction) + R2
(function-targeted, prefer `def {entry_point}(` block) as the
cheapest 1-PR slice. §68 records the empirical finding from the
3-problem LIVE smoke.

Verdict: R1+R2 is a ROBUSTNESS BASELINE, not a gap-closer.

Empirical evidence (gx10 3-problem smoke on known-failed
HumanEval/1, /3, /6 from the §67 baseline):
- HumanEval/1: FAIL → FAIL (unchanged)
- HumanEval/3: FAIL → FAIL (unchanged)
- HumanEval/6: FAIL → FAIL (unchanged)

Per-problem inspection via manual apr run: the model emits a
SINGLE fenced code block (not multiple). The block contains the
expected function but the body is non-canonical at greedy temp=0.

Failure class taxonomy:
- Class A: multi-block / wrong-block. R1+R2 fixes these.
- Class B: model-quality failure (single block, subtly wrong).
  R1+R2 cannot fix; needs R3 (Q4K → FP16) or R4 (temperature
  sampling).

The 3 sampled failures are Class B. The 4.31pp residual is
predominantly Class B; R3 or R4 are the next levers.
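
One way to triage a failed completion into the two classes mechanically is sketched below. It is an illustration under assumptions, not part of the spec or the harness: the function names are hypothetical, fences are found by splitting on three backticks, and an unterminated final fence is treated as a block. A completion is a Class A candidate only when a block defining the entry point exists but is not the first non-empty block, because that is the only case where targeted extraction returns something different from first-block-wins.

```rust
/// Three backticks, written as escapes so this sketch contains no literal fence.
const FENCE: &str = "\u{60}\u{60}\u{60}";

/// Hypothetical helper: bodies of fenced blocks, with the tag line stripped.
/// Odd-numbered segments of a split on the fence lie inside a block.
fn fenced_bodies(text: &str) -> Vec<&str> {
    text.split(FENCE)
        .skip(1)
        .step_by(2)
        .map(|segment| segment.split_once('\n').map_or("", |(_tag, body)| body))
        .collect()
}

/// True when first-block-wins and function-targeted extraction would
/// pick different blocks (Class A candidate). False means extraction
/// cannot be the cause, so the failure is a Class B candidate.
fn is_class_a_candidate(completion: &str, entry_point: &str) -> bool {
    let needle = format!("def {entry_point}(");
    let bodies = fenced_bodies(completion);
    let first_non_empty = bodies.iter().position(|b| !b.trim().is_empty());
    let first_match = bodies.iter().position(|b| b.contains(needle.as_str()));
    match (first_non_empty, first_match) {
        (Some(first), Some(hit)) => hit != first,
        _ => false,
    }
}
```

A single-block completion that defines the entry point, as observed for the three smoke problems, yields `false` here, matching the Class B reading above.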

Methodology lesson #15 NEW: smoke-test-driven scope reduction.
A 3-problem smoke (~5 min) upper-bounds the achievable gain from
a refinement candidate BEFORE dispatching the 5h full 164-rerun.
The smoke saved a full rerun's worth of compute.

Generalises lesson #14: near-miss results need their refinements
empirically calibrated, not assumed.

Changes (1 spec file):
- docs/specifications/aprender-train/ship-two-models-spec.md
  - Atomic next action banner: v3.13.0 → v3.14.0
  - New §68 section ABOVE §63 (newest-first), 8 sub-sections
  - Cumulative methodology lessons table (#6-#15) restated

Refined R-candidate priorities post-§68:
- R1+R2: SHIPPED (PR #1630) — robustness baseline
- R3 (Q4K → FP16): NOT YET ATTEMPTED — needs FP16 safetensors
- R4 (temp=0.2 + samples=3): NOT YET ATTEMPTED — 17h gx10 compute

Spec movement:
- v3.13.0 → v3.14.0
- MODEL-1 ship %: stays at 94% (bounded path to 95% requires R3 or R4)
- MODEL-2 ship %: unchanged at 57%

Closes task #45 PMAT-CODE-SHIP-TWO-SECTION-68.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Contributor Author

Closing as redundant — content is included in the squash-merge of PR #1634 (which carries the full chain: H4 ChatML + R1+R2 extraction + §69 spec amendment + APR_EVAL_DEBUG diagnostic + harness invariant contract). RC3 follow-up is PR #1635; §70 discharge is PR #1636. See §70 of the spec for the full cascade narrative.

@noahgift noahgift closed this May 12, 2026
auto-merge was automatically disabled May 12, 2026 08:15

Pull request was closed
