feat: ALB-084 HumanEval pass@k with real model inference #429

Merged
noahgift merged 3 commits into main from feat/humaneval-inference on Mar 7, 2026

noahgift commented Mar 7, 2026

Summary

  • Replace HumanEval structural validation stub with end-to-end inference pipeline
  • Load model via SafetensorsToAprConverter, tokenize with BpeTokenizer, generate with forward_with_cache
  • Truncate completions at function boundary (\ndef or \nclass )
  • Execute Python tests via subprocess with proper timeout enforcement (FALSIFY-EVAL-003)
  • Fall back to structural validation when the inference feature is unavailable
  • Verified end-to-end: v4 checkpoint loads → generates → executes tests (0% pass@1 expected for undertrained model)
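
The truncation step above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `truncate_completion` is hypothetical, and the trailing space in each stop string is an assumption made so that an identifier such as `default` does not trigger a cut.

```rust
/// Cut a generated completion at the first following top-level
/// `def` or `class`, keeping only the body of the target function.
/// Hypothetical helper; the stop strings are an assumed reading of the PR text.
fn truncate_completion(completion: &str) -> &str {
    // Assumed stop sequences: newline, keyword, then a space.
    const STOP_SEQUENCES: [&str; 2] = ["\ndef ", "\nclass "];

    // Find the earliest stop sequence; if none occurs, keep everything.
    let cut = STOP_SEQUENCES
        .iter()
        .filter_map(|stop| completion.find(stop))
        .min()
        .unwrap_or(completion.len());

    // `find` returns the byte offset of the '\n', which is always a
    // valid char boundary, so slicing here cannot panic.
    &completion[..cut]
}
```

Cutting at the earliest match matters when the model emits both a helper function and a class after the target body; taking the minimum offset keeps only the first definition.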

Test plan

  • Build succeeds with --release -p apr-cli
  • Smoke test: 1 problem from HumanEval JSONL → correct FAIL + inference mode label
  • JSON output: 3 problems → well-formed JSON with pass@k, mode, timing
  • No duplicate result printing
  • Timeout enforcement via try_wait loop (50ms polling)
  • CI gates pass
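
The timeout enforcement named in the test plan (a `try_wait` loop polling every 50ms) might look roughly like this with the Rust standard library. It is a sketch under assumptions: `run_with_timeout` and `RunOutcome` are hypothetical names, and the PR's actual handling of stdout, stderr, and error cases is not shown in this conversation.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of a bounded child-process run (hypothetical type).
#[derive(Debug)]
enum RunOutcome {
    Exited(ExitStatus),
    TimedOut,
}

/// Spawn `cmd` and poll it with `try_wait` until it exits or `timeout` elapses.
fn run_with_timeout(mut cmd: Command, timeout: Duration) -> io::Result<RunOutcome> {
    // Output is discarded here for brevity; a real harness would capture it.
    let mut child = cmd.stdout(Stdio::null()).stderr(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;

    loop {
        // Non-blocking check: Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(RunOutcome::Exited(status));
        }
        if Instant::now() >= deadline {
            // Kill the child, then wait on it so no zombie process is left.
            let _ = child.kill();
            child.wait()?;
            return Ok(RunOutcome::TimedOut);
        }
        // 50ms polling interval, as stated in the test plan.
        thread::sleep(Duration::from_millis(50));
    }
}
```

A call such as `run_with_timeout(Command::new("python3"), Duration::from_secs(10))` with the test script passed as an argument would then report either the exit status or a timeout; the interpreter name and the 10s limit are illustrative only.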

Refs #64

🤖 Generated with Claude Code

noahgift and others added 3 commits March 7, 2026 11:20
Replace structural validation stub with end-to-end inference pipeline:
- Load model via SafetensorsToAprConverter + BpeTokenizer
- Generate completions with forward_with_cache (greedy, max 256 tokens)
- Truncate at function boundary (\ndef or \nclass)
- Execute Python tests via subprocess with timeout enforcement
- Fall back to structural validation if inference is unavailable

Verified: v4 checkpoint loads, generates, executes tests (0/164 pass@1
expected for undertrained model). ~15s/problem on CPU (about 41 min for
all 164 problems), well within the 2h bound.

Refs #64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Contract requires: benchmark, model, problems, passed, pass_at_k,
per_problem_results. Adds per-problem task_id, entry_point, passed
to both inference and structural validation JSON output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
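
For context on the `pass_at_k` field: this conversation does not say which estimator the command uses. A common choice is the unbiased estimator from the original HumanEval evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c pass, averaged over problems. The sketch below assumes that estimator; both function names are hypothetical.

```rust
/// Unbiased pass@k estimate for one problem, given `n` generated samples
/// of which `c` passed: 1 - C(n - c, k) / C(n, k).
/// Assumed estimator; not confirmed to be what this PR implements.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    assert!(k >= 1 && k <= n && c <= n, "need 1 <= k <= n and c <= n");
    // Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k {
        return 1.0;
    }
    // C(n - c, k) / C(n, k) written as a product to avoid huge binomials:
    // prod over i in (n - c + 1)..=n of (1 - k / i).
    let mut all_fail = 1.0_f64;
    for i in (n - c + 1)..=n {
        all_fail *= 1.0 - k as f64 / i as f64;
    }
    1.0 - all_fail
}

/// Benchmark-level score: mean of the per-problem estimates.
/// Each tuple is (n samples, c passing) for one problem.
fn mean_pass_at_k(per_problem: &[(u64, u64)], k: u64) -> f64 {
    if per_problem.is_empty() {
        return 0.0;
    }
    let total: f64 = per_problem.iter().map(|&(n, c)| pass_at_k(n, c, k)).sum();
    total / per_problem.len() as f64
}
```

With one greedy sample per problem (n = 1, k = 1) the estimate reduces to the plain fraction of problems passed, which matches the 0/164 baseline reported above.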

Contract specifies temp=0.8 for pass@k>1, temp=0.0 for pass@1.
Adds sample_token() with softmax + xorshift64 deterministic RNG.
Currently defaults to greedy (temp=0.0) for pass@1 baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
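
A temperature sampler of the kind this commit describes (softmax plus a deterministic xorshift64 generator, greedy at temp=0.0) could be sketched as below. The `sample_token` signature, the `XorShift64` type, and the 13/7/17 shift constants are assumptions for illustration; the PR's actual code may differ.

```rust
/// Xorshift64 PRNG using Marsaglia's 13/7/17 shifts (assumed variant).
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The state must never be zero, or the generator emits only zeros.
        Self { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Uniform value in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick a token index from `logits`. A temperature of 0.0 (or below)
/// means greedy argmax; otherwise sample from softmax(logits / temperature).
fn sample_token(logits: &[f32], temperature: f32, rng: &mut XorShift64) -> usize {
    assert!(!logits.is_empty(), "logits must not be empty");
    let argmax = logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap();
    if temperature <= 0.0 {
        return argmax;
    }

    // Subtract the max logit before exponentiating for numerical stability.
    let max_logit = logits[argmax];
    let weights: Vec<f64> = logits
        .iter()
        .map(|&l| (((l - max_logit) / temperature) as f64).exp())
        .collect();
    let total: f64 = weights.iter().sum();

    // Inverse-CDF sampling over the unnormalized weights.
    let mut threshold = rng.next_f64() * total;
    for (i, w) in weights.iter().enumerate() {
        if threshold < *w {
            return i;
        }
        threshold -= w;
    }
    logits.len() - 1 // Guard against floating-point rounding at the tail.
}
```

Seeding the generator explicitly keeps runs with temp=0.8 reproducible, which is useful when comparing pass@k across checkpoints.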
noahgift merged commit a7b1da8 into main on Mar 7, 2026
4 checks passed
noahgift deleted the feat/humaneval-inference branch on March 7, 2026 10:39
