Benchmark suite for LLM code generation

Create a benchmark suite for tracking LLM code generation performance on Vera, analogous to HumanEval/MBPP for Python and DafnyBench for Dafny.

## Motivation

DafnyBench tracked Dafny verification success rates improving from 68% to 96% in one year as models improved. Vera needs an equivalent benchmark to measure progress, identify weak spots, and provide fine-tuning data.

## Proposed structure

- 50-100 programming tasks with natural language descriptions
- Each task specifies: function signature, contracts (requires/ensures), effect annotations, expected behavior
- Graded by difficulty: arithmetic, control flow, ADTs, effects, verification, modules
- Evaluation criteria: parses, type-checks, verifies (Tier 1 vs Tier 3), produces correct output
- Machine-readable format for automated evaluation

## Relationship to conformance suite

The conformance suite tests language features (does this syntax work?). The benchmark suite tests generation quality (can an LLM produce correct Vera from a natural language prompt?).

## Dependencies

- None


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark suite for LLM code generation #225

Motivation

Proposed structure

Relationship to conformance suite

Dependencies

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Benchmark suite for LLM code generation #225

Description

Motivation

Proposed structure

Relationship to conformance suite

Dependencies

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions