feat: ALB-084 HumanEval pass@k with real model inference by noahgift · Pull Request #429 · paiml/aprender

noahgift · 2026-03-07T10:20:23Z

Summary

Replace HumanEval structural validation stub with end-to-end inference pipeline
Load model via SafetensorsToAprConverter, tokenize with BpeTokenizer, generate with forward_with_cache
Truncate completions at function boundary (\ndef or \nclass )
Execute Python tests via subprocess with proper timeout enforcement (FALSIFY-EVAL-003)
Fall back to structural validation when the inference feature is unavailable
Verified end-to-end: v4 checkpoint loads → generates → executes tests (0% pass@1 expected for undertrained model)
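
The truncation step can be sketched as below. This is a minimal illustration, not the PR's code: the helper name `truncate_at_function_boundary` is hypothetical, and the trailing space in both stop strings is an assumption added so that identifiers such as `default` do not trigger a cut.

```rust
/// Cut a generated completion at the first following top-level `def` or
/// `class`, so only the body of the function under test is kept.
/// Hypothetical helper; the stop strings are assumed from the PR summary.
fn truncate_at_function_boundary(completion: &str) -> &str {
    const STOPS: [&str; 2] = ["\ndef ", "\nclass "];
    // Find the earliest occurrence of any stop string.
    let cut = STOPS
        .iter()
        .filter_map(|stop| completion.find(stop))
        .min()
        .unwrap_or(completion.len());
    // `find` returns a byte offset on a char boundary, so slicing is safe.
    &completion[..cut]
}
```

A completion with no stop string is returned unchanged, which matters for greedy generations that end at the token limit before starting a new definition.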

Test plan

Build succeeds with --release -p apr-cli
Smoke test: 1 problem from HumanEval JSONL → correct FAIL + inference mode label
JSON output: 3 problems → well-formed JSON with pass@k, mode, timing
No duplicate result printing
Timeout enforcement via try_wait loop (50ms polling)
CI gates pass
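
For reference, a `try_wait` polling loop of the kind named in the test plan might look like the sketch below. It uses only `std::process`; the helper names `wait_with_timeout` and `run_python_test`, the `python3` executable name, and the decision to discard stdout/stderr are all assumptions rather than details taken from the PR.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Wait for `child` for at most `timeout`, polling with `try_wait`.
/// Returns `Ok(None)` when the child had to be killed for running too long.
fn wait_with_timeout(child: &mut Child, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let poll = Duration::from_millis(50); // polling interval from the test plan
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status)); // finished on its own
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // ignore the error if it exited in the meantime
            child.wait()?; // reap the process so no zombie is left behind
            return Ok(None);
        }
        thread::sleep(poll);
    }
}

/// Run a Python program and report whether it exited successfully in time.
/// Hypothetical helper; assumes `python3` is on PATH.
fn run_python_test(program: &str, timeout: Duration) -> io::Result<bool> {
    let mut child = Command::new("python3")
        .arg("-c")
        .arg(program)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    Ok(matches!(wait_with_timeout(&mut child, timeout)?, Some(s) if s.success()))
}
```

Under these assumptions, a test that loops forever is reported as a failure roughly one polling interval after the deadline instead of blocking the whole evaluation run.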

Refs #64

🤖 Generated with Claude Code

Replace structural validation stub with end-to-end inference pipeline:
- Load model via SafetensorsToAprConverter + BpeTokenizer
- Generate completions with forward_with_cache (greedy, max 256 tokens)
- Truncate at function boundary (\ndef or \nclass)
- Execute Python tests via subprocess with timeout enforcement
- Falls back to structural validation if inference unavailable

Verified: v4 checkpoint loads, generates, executes tests (0/164 pass@1 expected for untrained model). ~15s/problem on CPU, well within 2h bound.

Refs #64

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Contract requires: benchmark, model, problems, passed, pass_at_k, per_problem_results. Adds per-problem task_id, entry_point, passed to both inference and structural validation JSON output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
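
The commit does not say how `pass_at_k` is computed. One common choice is the unbiased estimator from the original HumanEval paper, 1 - C(n-c, k) / C(n, k) for n samples of which c pass, averaged over problems. The sketch below assumes that estimator; the function names are hypothetical. With a single greedy sample per problem (n = 1, k = 1) it reduces to the fraction of problems passed.

```rust
/// Unbiased pass@k estimate for one problem: `n` samples, `c` of them correct.
/// Assumed estimator; the PR does not state which formula it uses.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    assert!(c <= n && k >= 1 && k <= n, "need c <= n and 1 <= k <= n");
    if n - c < k {
        return 1.0; // every size-k subset contains at least one correct sample
    }
    // C(n-c, k) / C(n, k) written as a product to avoid large binomials.
    let all_fail: f64 = ((n - c + 1)..=n)
        .map(|i| 1.0 - k as f64 / i as f64)
        .product();
    1.0 - all_fail
}

/// Benchmark-level score: mean of the per-problem estimates.
/// Each entry of `results` is `(n, c)` for one problem.
fn mean_pass_at_k(results: &[(u64, u64)], k: u64) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    let total: f64 = results.iter().map(|&(n, c)| pass_at_k(n, c, k)).sum();
    total / results.len() as f64
}
```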

Contract specifies temp=0.8 for pass@k>1, temp=0.0 for pass@1. Adds sample_token() with softmax + xorshift64 deterministic RNG. Currently defaults to greedy (temp=0.0) for pass@1 baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
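
A temperature sampler built from softmax and xorshift64 could look roughly like the following. Only the name `sample_token` comes from the commit message; the signature, the `XorShift64` type, the 13/7/17 shift triple, and the fallback to argmax at temperature 0.0 are assumptions made for illustration.

```rust
/// xorshift64 pseudo-random generator; the seed must be non-zero.
/// Hypothetical type, using one of Marsaglia's published shift triples.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform float in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Pick a token index from `logits`. A temperature of 0.0 (or below) means
/// greedy argmax; otherwise sample from softmax(logits / temperature).
fn sample_token(logits: &[f32], temperature: f32, rng: &mut XorShift64) -> usize {
    assert!(!logits.is_empty(), "logits must not be empty");
    let argmax = logits
        .iter()
        .enumerate()
        .fold(0usize, |best, (i, &l)| if l > logits[best] { i } else { best });
    if temperature <= 0.0 {
        return argmax;
    }
    // Subtract the max logit before exponentiating for numerical stability.
    let max = logits[argmax];
    let weights: Vec<f64> = logits
        .iter()
        .map(|&l| (((l - max) / temperature) as f64).exp())
        .collect();
    let total: f64 = weights.iter().sum();
    // Inverse-CDF sampling over the unnormalised weights.
    let mut threshold = rng.next_f64() * total;
    for (i, w) in weights.iter().enumerate() {
        if threshold < *w {
            return i;
        }
        threshold -= w;
    }
    argmax // only reachable through floating-point rounding
}
```

Seeding the generator with a fixed value makes sampled runs reproducible, which is what lets pass@k with k > 1 be re-run and compared across checkpoints.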

noahgift and others added 3 commits March 7, 2026 11:20

noahgift merged commit a7b1da8 into main Mar 7, 2026
4 checks passed

noahgift deleted the feat/humaneval-inference branch March 7, 2026 10:39
