Run benchmark against multiple models (Opus, GPT-4o, DeepSeek, Gemini) #24

## Context

We currently have results from only one model (Claude Sonnet 4). The benchmark needs a multi-model comparison to be publishable: how different models perform on the same problems tells a much more interesting story than a single data point.

## Models to evaluate

- [ ] Claude Opus 4 (`claude-opus-4-20250514`)
- [ ] Claude Haiku (`claude-haiku-4-20250514` if available)
- [ ] GPT-4o (`gpt-4o`)
- [ ] DeepSeek V3 or R1
- [ ] Gemini 2.5 Pro

For each model, run:
- `vera-bench run --model MODEL` (Vera full-spec)
- `vera-bench run --model MODEL --language python` (Python comparison)

Optional but valuable:
- `vera-bench run --model MODEL --mode spec-from-nl` (contract inference)
- `vera-bench run --model MODEL --language typescript` (TypeScript comparison)
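
The run matrix above could be generated rather than typed by hand. The following is a minimal sketch, assuming only the `vera-bench run` flags listed in this issue; the helper name `build_commands` is hypothetical and not part of the tool. It only builds the command lines and does not execute anything.

```python
import shlex

# Required runs per model, as listed in this issue.
REQUIRED_VARIANTS = [
    [],                        # Vera full-spec
    ["--language", "python"],  # Python comparison
]

# Optional but valuable runs, as listed in this issue.
OPTIONAL_VARIANTS = [
    ["--mode", "spec-from-nl"],    # contract inference
    ["--language", "typescript"],  # TypeScript comparison
]


def build_commands(models, include_optional=False):
    """Return one argument list per (model, variant) pair.

    Hypothetical helper: it builds commands only and never runs them.
    """
    variants = REQUIRED_VARIANTS + (OPTIONAL_VARIANTS if include_optional else [])
    return [
        ["vera-bench", "run", "--model", model, *extra]
        for model in models
        for extra in variants
    ]


if __name__ == "__main__":
    # Model IDs taken from the checklist; add the others once their IDs are settled.
    for cmd in build_commands(["claude-opus-4-20250514", "gpt-4o"], include_optional=True):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run` as is, or printed and pasted into a shell.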

## Expected output

A comparison table showing `check@1`, `verify@1`, and `run_correct` for every model, answering two questions: "Which models write the best Vera code?" and "Does the ranking change between Vera and Python?"
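
One possible shape for that table, as a rough sketch: the nested `results` dictionary (model, then language, then metric) is an assumption about how scores might be collected, not the actual `vera-bench` output format, and `render_table` and `rank_models` are hypothetical helpers.

```python
METRICS = ("check@1", "verify@1", "run_correct")


def render_table(results, languages=("vera", "python")):
    """Render a Markdown table from {model: {language: {metric: score}}}.

    Scores are assumed to be fractions in [0, 1]; missing scores show as n/a.
    """
    header = ["Model"] + [f"{lang} {m}" for lang in languages for m in METRICS]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(" --- " for _ in header) + "|",
    ]
    for model in sorted(results):
        cells = [model]
        for lang in languages:
            scores = results[model].get(lang, {})
            for metric in METRICS:
                value = scores.get(metric)
                cells.append("n/a" if value is None else f"{value:.1%}")
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def rank_models(results, language, metric="run_correct"):
    """Return model names sorted best-first on one metric for one language."""
    scored = [
        (model, per_lang[language][metric])
        for model, per_lang in results.items()
        if metric in per_lang.get(language, {})
    ]
    return [model for model, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]


if __name__ == "__main__":
    # Placeholder names and numbers for illustration only, not benchmark results.
    placeholder = {
        "model-a": {"vera": {"run_correct": 0.75}, "python": {"run_correct": 0.5}},
        "model-b": {"vera": {"run_correct": 0.25}, "python": {"run_correct": 1.0}},
    }
    print(render_table(placeholder))
    print(rank_models(placeholder, "vera"), rank_models(placeholder, "python"))
```

Comparing `rank_models(results, "vera")` with `rank_models(results, "python")` addresses the second question directly: if the two lists differ, the ranking changes between languages.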

## Prerequisites

- OpenAI API key for GPT-4o (add `OPENAI_API_KEY` to environment)
- DeepSeek and Gemini may need additional API client support in `models.py`
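
A small preflight check can catch a missing key before a long run starts. This is a sketch only: `OPENAI_API_KEY` comes from the list above, the helper `missing_keys` is hypothetical, and the variable names for the other providers are left to the caller because this issue does not specify them.

```python
import os


def missing_keys(required=("OPENAI_API_KEY",), env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the process environment; pass a dict to test in isolation.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_keys()` before the GPT-4o runs returns an empty list when the key is set, and `["OPENAI_API_KEY"]` when it is unset or empty.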

