Create a benchmark suite for tracking LLM code generation performance on Vera, analogous to HumanEval/MBPP for Python and DafnyBench for Dafny.
Motivation
DafnyBench tracked Dafny verification success rates improving from 68% to 96% in one year as models improved. Vera needs an equivalent benchmark to measure progress, identify weak spots, and provide fine-tuning data.
Proposed structure
- 50-100 programming tasks with natural language descriptions
- Each task specifies: function signature, contracts (requires/ensures), effect annotations, expected behavior
- Graded by difficulty: arithmetic, control flow, ADTs, effects, verification, modules
- Evaluation criteria: parses, type-checks, verifies (Tier 1 vs Tier 3), produces correct output
- Machine-readable format for automated evaluation
Relationship to conformance suite
The conformance suite tests language features (does this syntax work?). The benchmark suite tests generation quality (can an LLM produce correct Vera from a natural language prompt?).
Dependencies
Create a benchmark suite for tracking LLM code generation performance on Vera, analogous to HumanEval/MBPP for Python and DafnyBench for Dafny.
Motivation
DafnyBench tracked Dafny verification success rates improving from 68% to 96% in one year as models improved. Vera needs an equivalent benchmark to measure progress, identify weak spots, and provide fine-tuning data.
Proposed structure
Relationship to conformance suite
The conformance suite tests language features (does this syntax work?). The benchmark suite tests generation quality (can an LLM produce correct Vera from a natural language prompt?).
Dependencies