Skip to content

Benchmark suite for LLM code generation #225

@aallan

Description

@aallan

Create a benchmark suite for tracking LLM code generation performance on Vera, analogous to HumanEval/MBPP for Python and DafnyBench for Dafny.

Motivation

DafnyBench tracked Dafny verification success rates improving from 68% to 96% in one year as models improved. Vera needs an equivalent benchmark to measure progress, identify weak spots, and provide fine-tuning data.

Proposed structure

  • 50-100 programming tasks with natural language descriptions
  • Each task specifies: function signature, contracts (requires/ensures), effect annotations, expected behavior
  • Graded by difficulty: arithmetic, control flow, ADTs, effects, verification, modules
  • Evaluation criteria: parses, type-checks, verifies (Tier 1 vs Tier 3), produces correct output
  • Machine-readable format for automated evaluation

Relationship to conformance suite

The conformance suite tests language features (does this syntax work?). The benchmark suite tests generation quality (can an LLM produce correct Vera from a natural language prompt?).

Dependencies

  • None

Metadata

Metadata

Assignees

No one assigned

    Labels

    C10C10 — EcosystemenhancementNew feature or requestvera-benchThe issue or PR originated due to aallan/vera-bench

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions