Run benchmark against multiple models (Opus, GPT-4o, DeepSeek, Gemini) #24

@aallan

Context

We currently have results from only one model (Claude Sonnet 4). To be publishable, the benchmark needs a multi-model comparison: tracking how different models perform on the same problems tells a much more interesting story than a single data point.

Models to evaluate

  • Claude Opus 4 (claude-opus-4-20250514)
  • Claude Haiku (claude-haiku-4-20250514 if available)
  • GPT-4o (gpt-4o)
  • DeepSeek V3 or R1
  • Gemini 2.5 Pro

For each model, run:

  • vera-bench run --model MODEL (Vera full-spec)
  • vera-bench run --model MODEL --language python (Python comparison)

Optional but valuable:

  • vera-bench run --model MODEL --mode spec-from-nl (contract inference)
  • vera-bench run --model MODEL --language typescript (TypeScript comparison)
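The runs listed above could be scripted along these lines. This is a minimal sketch, not part of the repository: it uses only the `vera-bench run` flags quoted in this issue, and the helper names `build_commands` and `run_all` are hypothetical. Only the two model identifiers spelled out above are included; identifiers for DeepSeek and Gemini are left out because the issue does not give them.

```python
import subprocess

# Identifiers taken from the issue text. Add others once models.py supports them.
MODELS = ["claude-opus-4-20250514", "gpt-4o"]


def build_commands(model, include_optional=False):
    """Return the vera-bench invocations listed in this issue for one model."""
    base = ["vera-bench", "run", "--model", model]
    commands = [
        list(base),                       # Vera full-spec
        base + ["--language", "python"],  # Python comparison
    ]
    if include_optional:
        commands.append(base + ["--mode", "spec-from-nl"])     # contract inference
        commands.append(base + ["--language", "typescript"])   # TypeScript comparison
    return commands


def run_all(models, include_optional=False, dry_run=True):
    """Print each command (dry run) or execute it; return the commands that failed."""
    failed = []
    for model in models:
        for cmd in build_commands(model, include_optional):
            if dry_run:
                print(" ".join(cmd))
                continue
            # A non-zero exit code is treated as a failed run (an assumption
            # about how vera-bench reports errors).
            if subprocess.run(cmd).returncode != 0:
                failed.append(cmd)
    return failed
```

Calling `run_all(MODELS)` prints the commands without executing anything; passing `dry_run=False` would actually launch them, so the required API keys need to be set first.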

Expected output

A comparison table showing check@1, verify@1, and run_correct across all models, answering: "Which models write the best Vera code?" and "Does the ranking change between Vera and Python?"
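One way the comparison table and the ranking question might be produced is sketched below. The nested `results` dictionary (model, then language, then metric) is an assumed shape, not the format vera-bench actually writes, and the function names are hypothetical. The numbers in `EXAMPLE` are made-up placeholders for illustration, not benchmark results.

```python
METRICS = ["check@1", "verify@1", "run_correct"]

# Placeholder values only; these are NOT real benchmark results.
EXAMPLE = {
    "model-a": {
        "vera": {"check@1": 0.5, "verify@1": 0.25, "run_correct": 0.25},
        "python": {"check@1": 1.0, "verify@1": 0.75, "run_correct": 0.75},
    },
    "model-b": {
        "vera": {"check@1": 0.75, "verify@1": 0.5, "run_correct": 0.5},
        "python": {"check@1": 0.75, "verify@1": 0.5, "run_correct": 0.5},
    },
}


def format_table(results, language):
    """Render a Markdown table with one row per model for the given language."""
    lines = [
        "| Model | " + " | ".join(METRICS) + " |",
        "|---|" + "---|" * len(METRICS),
    ]
    for model in sorted(results):
        scores = results[model][language]
        cells = " | ".join(f"{scores[m]:.2f}" for m in METRICS)
        lines.append(f"| {model} | {cells} |")
    return "\n".join(lines)


def ranking(results, language, metric="run_correct"):
    """Order models best-first by one metric; ties are broken by model name."""
    return sorted(results, key=lambda m: (-results[m][language][metric], m))


def ranking_changes(results, metric="run_correct"):
    """True if the model ordering differs between Vera and Python."""
    return ranking(results, "vera", metric) != ranking(results, "python", metric)
```

With the placeholder data, `ranking(EXAMPLE, "vera")` puts `model-b` first while `ranking(EXAMPLE, "python")` puts `model-a` first, so `ranking_changes(EXAMPLE)` is true.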

Prerequisites

  • OpenAI API key for GPT-4o (add OPENAI_API_KEY to environment)
  • DeepSeek and Gemini may need additional API client support in models.py
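Before starting a long run, a quick check for missing credentials can save time. The sketch below is an assumption-heavy illustration: only `OPENAI_API_KEY` is named in this issue, while the DeepSeek and Gemini variable names and the `missing_keys` helper are hypothetical and would need to match whatever `models.py` ends up reading.

```python
import os

# OPENAI_API_KEY comes from the issue; the other names are assumptions.
REQUIRED_KEYS = {
    "gpt-4o": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",  # assumed variable name
    "gemini": "GEMINI_API_KEY",      # assumed variable name
}


def missing_keys(providers, env=None):
    """Return the environment variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [REQUIRED_KEYS[p] for p in providers if not env.get(REQUIRED_KEYS[p])]
```

For example, `missing_keys(["gpt-4o", "gemini"])` returns the names of any variables that still need to be exported.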

Metadata

Labels: evaluation (Benchmark evaluation modes and model runs)
