feat: MBPP benchmark evaluation — ALB-085#430

Merged
noahgift merged 1 commit into main from alb-085-mbpp-eval on Mar 7, 2026

noahgift (Contributor) commented on Mar 7, 2026

Summary

  • Add `apr eval --task mbpp --data mbpp.jsonl` for MBPP benchmark evaluation
  • Reuses the ALB-084 inference bridge (`SafetensorsToAprConverter` + `forward_with_cache` + `execute_python_test`)
  • 974 problems, each run as: natural language description → model completion → `test_list` assertion execution
  • `max_new_tokens=512`, `timeout=10s`, JSON output with `per_problem_results` + pass@k

Test plan

  • Compiles clean (no warnings)
  • MBPP JSONL parsing verified (974 problems loaded, process running)
  • CI gates pass
  • MBPP baseline on v4 checkpoint (running in background)

Refs albor#65

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add `apr eval --task mbpp --data mbpp.jsonl` with full inference pipeline:
- MbppProblem struct (text, code, task_id, test_list, test_setup_code)
- run_mbpp() with same inference bridge as HumanEval (ALB-084)
- run_mbpp_inference() with SafetensorsToAprConverter + forward_with_cache
- Natural language prompt → completion → test_list assertion execution
- max_new_tokens=512, timeout=10s (longer than HumanEval)
- JSON output with per_problem_results, pass@k metrics
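
The "completion → test_list assertion execution" step above can be sketched as follows. This is only an illustration under stated assumptions: the field names come from the `MbppProblem` description in this commit message, but the field types and the helper `build_test_program` are hypothetical, and the PR does not show how `execute_python_test()` actually assembles its input.

```rust
/// Field names follow the MbppProblem description in this PR;
/// the field types are assumptions made for this sketch.
#[allow(dead_code)]
struct MbppProblem {
    text: String,            // natural language task description
    code: String,            // reference solution from the dataset
    task_id: u32,            // dataset task identifier
    test_list: Vec<String>,  // Python `assert ...` statements
    test_setup_code: String, // optional setup code, often empty
}

/// Hypothetical helper (not part of the PR): concatenates the setup code,
/// the model completion, and every assertion into one Python script.
/// A failing assertion makes the script exit with a non-zero status.
fn build_test_program(problem: &MbppProblem, completion: &str) -> String {
    let mut script = String::new();

    // Setup code runs first, if the problem provides any.
    if !problem.test_setup_code.trim().is_empty() {
        script.push_str(problem.test_setup_code.trim_end());
        script.push_str("\n\n");
    }

    // The model completion defines the function under test.
    script.push_str(completion.trim_end());
    script.push_str("\n\n");

    // One assertion per line, in dataset order.
    for assertion in &problem.test_list {
        script.push_str(assertion.trim());
        script.push('\n');
    }

    script
}
```

In this sketch a problem counts as passed only if the assembled script runs to completion within the timeout, since any failing assertion aborts the script.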

974 problems from Google MBPP dataset. Reuses sample_token(),
truncate_at_function_boundary(), execute_python_test(), compute_pass_at_k().
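
The body of `compute_pass_at_k()` is not shown in this PR. As a rough sketch of how a pass@k metric is commonly computed, assuming the standard unbiased estimator 1 - C(n - c, k) / C(n, k) per problem, averaged over all problems; the names `pass_at_k` and `mean_pass_at_k` are illustrative and are not the functions in this codebase:

```rust
/// Unbiased pass@k estimate for a single problem.
/// n = samples generated, c = samples that passed, k = sampling budget.
/// Assumption: this is the usual estimator; the PR does not confirm
/// that compute_pass_at_k() uses it.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    assert!(c <= n, "passing samples cannot exceed total samples");
    assert!(k >= 1 && k <= n, "k must be in 1..=n");

    // Fewer than k failing samples: every k-subset contains a pass.
    if n - c < k {
        return 1.0;
    }

    // C(n - c, k) / C(n, k) written as a running product,
    // which avoids overflowing binomial coefficients.
    let mut prob_all_fail = 1.0_f64;
    for i in (n - c + 1)..=n {
        prob_all_fail *= 1.0 - (k as f64) / (i as f64);
    }
    1.0 - prob_all_fail
}

/// Mean pass@k over problems, given (n, c) pairs per problem.
/// Returns 0.0 for an empty slice.
fn mean_pass_at_k(per_problem: &[(u64, u64)], k: u64) -> f64 {
    if per_problem.is_empty() {
        return 0.0;
    }
    let total: f64 = per_problem
        .iter()
        .map(|&(n, c)| pass_at_k(n, c, k))
        .sum();
    total / per_problem.len() as f64
}
```

With a single sample per problem (n = 1, k = 1), this reduces to the fraction of problems whose assertions all pass.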

Refs #65

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

noahgift merged commit 592de9f into main on Mar 7, 2026
4 checks passed
noahgift deleted the alb-085-mbpp-eval branch on March 7, 2026 at 12:25