Context
We have results from only one model (Claude Sonnet 4). To be publishable, the benchmark needs a multi-model comparison: tracking how different models perform on the same problems tells a far more informative story than a single data point.
Models to evaluate
- claude-opus-4-20250514
- claude-haiku-4-20250514 (if available)
- gpt-4o
For each model, run:
- vera-bench run --model MODEL (Vera full-spec)
- vera-bench run --model MODEL --language python (Python comparison)
Optional but valuable:
- vera-bench run --model MODEL --mode spec-from-nl (contract inference)
- vera-bench run --model MODEL --language typescript (TypeScript comparison)
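The per-model runs above could be scripted rather than typed by hand. The following is a minimal sketch, assuming vera-bench is on PATH and accepts exactly the flags shown in this issue. The model list is an example, and the helper names (build_commands, run_all) are illustrative and not part of vera-bench.

```python
import subprocess

# Example model identifiers; adjust to whatever models.py actually supports.
MODELS = ["claude-opus-4-20250514", "claude-haiku-4-20250514", "gpt-4o"]

# Extra flags for each run, taken from the commands listed in this issue.
REQUIRED_RUNS = [
    [],                        # Vera full-spec
    ["--language", "python"],  # Python comparison
]
OPTIONAL_RUNS = [
    ["--mode", "spec-from-nl"],    # contract inference
    ["--language", "typescript"],  # TypeScript comparison
]


def build_commands(model, include_optional=False):
    """Return the vera-bench argument lists for one model."""
    runs = REQUIRED_RUNS + (OPTIONAL_RUNS if include_optional else [])
    return [["vera-bench", "run", "--model", model, *extra] for extra in runs]


def run_all(models=MODELS, include_optional=False):
    """Run every command in sequence, stopping at the first failure."""
    for model in models:
        for cmd in build_commands(model, include_optional):
            print("running:", " ".join(cmd))
            subprocess.run(cmd, check=True)  # raises CalledProcessError on non-zero exit
```

Calling run_all() executes the two required runs for every model; run_all(include_optional=True) adds the spec-from-nl and TypeScript runs.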
Expected output
A comparison table showing check@1, verify@1, and run_correct across all models, answering: "Which models write the best Vera code?" and "Does the ranking change between Vera and Python?"
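One way the comparison table and the ranking question might be produced is sketched below. It assumes the per-model scores have already been collected into a dict of the form {model: {metric: value}}; that layout and the function names are hypothetical, since this issue does not specify how vera-bench stores its results.

```python
METRICS = ["check@1", "verify@1", "run_correct"]


def format_table(results):
    """Render {model: {metric: value}} as a plain-text table, one row per model."""
    width = max([len("model"), *map(len, results)])
    rows = ["model".ljust(width) + "".join(f"  {m:>12}" for m in METRICS)]
    for model, scores in results.items():
        cells = "".join(f"  {scores[m]:>12.3f}" for m in METRICS)
        rows.append(model.ljust(width) + cells)
    return "\n".join(rows)


def ranking(results, metric="run_correct"):
    """Return the models ordered best-first by one metric."""
    return sorted(results, key=lambda model: results[model][metric], reverse=True)


def ranking_changes(vera, python, metric="run_correct"):
    """Return {model: (vera_rank, python_rank)} for models whose rank differs."""
    shared = [m for m in vera if m in python]  # only models run in both languages
    vera_order = ranking({m: vera[m] for m in shared}, metric)
    python_order = ranking({m: python[m] for m in shared}, metric)
    vera_rank = {m: i + 1 for i, m in enumerate(vera_order)}
    python_rank = {m: i + 1 for i, m in enumerate(python_order)}
    return {m: (vera_rank[m], python_rank[m])
            for m in shared if vera_rank[m] != python_rank[m]}
```

An empty result from ranking_changes would mean the ordering of models is the same in Vera and Python for the chosen metric.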
Prerequisites
- OpenAI API key for GPT-4o (add OPENAI_API_KEY to environment)
- DeepSeek and Gemini may need additional API client support in models.py
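A quick pre-flight check can catch a missing key before a long run starts. This is a minimal sketch: only the gpt-4o / OPENAI_API_KEY pairing comes from this issue, while the prefix-based lookup and the missing_keys helper are assumptions for illustration and not part of models.py.

```python
import os

# Assumed mapping from model-name prefix to the environment variable that
# holds its API key. Only the OpenAI entry is stated in this issue; extend
# it once the other providers' requirements are known.
KEY_FOR_PREFIX = {"gpt-": "OPENAI_API_KEY"}


def missing_keys(models, env=None):
    """Return the sorted names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    needed = {
        var
        for model in models
        for prefix, var in KEY_FOR_PREFIX.items()
        if model.startswith(prefix)
    }
    return sorted(var for var in needed if not env.get(var))
```

Calling missing_keys(["gpt-4o"]) before the first run returns an empty list when the key is set, and a list containing OPENAI_API_KEY otherwise.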