Context
Feedback from @strakyo on X:

> Strong benchmark setup. Are you publishing per-task variance and retry counts, or only aggregate success rate? Those two numbers usually decide real agent reliability.
The Ask
Currently the leaderboard shows aggregate scores. For real-world agent reliability assessment, users want to see:
- **Per-task variance** — How consistent is a model across multiple runs of the same task? High variance means the model is unreliable even if its mean score is good.
- **Retry counts** — How many attempts did the agent need? An agent that succeeds on the first try is more reliable than one that needs three retries.
Current State
The benchmark runner already supports `--runs N` for multiple runs per task, and the results JSON includes:

- `grading.runs[]` — individual run scores
- `grading.mean` — average score
- `grading.std` — standard deviation
- `grading.min` / `grading.max` — range

This data exists in the results payload but may not be surfaced on the leaderboard UI.
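A minimal sketch of reading those `grading` fields from a results payload. The field names are the ones listed above; the surrounding top-level structure (`task_id`, etc.) is an assumption for illustration:

```python
import json
import statistics

# Hypothetical payload shaped around the grading fields listed above;
# the top-level layout is an assumption, not the benchmark's real schema.
payload = json.loads("""
{
  "task_id": "demo-task",
  "grading": {
    "runs": [0.9, 0.8, 1.0],
    "mean": 0.9,
    "std": 0.1,
    "min": 0.8,
    "max": 1.0
  }
}
""")

grading = payload["grading"]
runs = grading["runs"]

# Sanity-check the summary stats against the raw per-run scores.
assert abs(statistics.mean(runs) - grading["mean"]) < 1e-9
assert min(runs) == grading["min"] and max(runs) == grading["max"]

print(f"mean={grading['mean']:.2f} std={grading['std']:.2f} "
      f"range=[{grading['min']:.2f}, {grading['max']:.2f}]")
```

Since the per-run scores are already in the payload, the leaderboard can recompute any derived reliability metric server-side rather than trusting submitted aggregates.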
Proposed Changes
Leaderboard UI
- Add columns or expandable detail for:
  - Std Dev or Variance per model
  - Consistency score (e.g., % of tasks where all runs succeeded)
  - First-try success rate vs overall success rate
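The three proposed columns can all be derived from per-run scores. A sketch, assuming a pass threshold of 1.0 and a simple `{task: [run scores]}` shape (both assumptions, not the benchmark's actual schema):

```python
# Hypothetical per-task run scores for one model; a run "succeeds"
# if its score clears the pass threshold (1.0 here is an assumption).
tasks = {
    "task_a": [1.0, 1.0, 1.0],
    "task_b": [1.0, 0.0, 1.0],
    "task_c": [0.0, 0.0, 1.0],
}
PASS = 1.0

def consistency_score(tasks):
    """Fraction of tasks where every run succeeded."""
    solid = sum(all(s >= PASS for s in runs) for runs in tasks.values())
    return solid / len(tasks)

def first_try_rate(tasks):
    """Fraction of tasks solved on the first run."""
    return sum(runs[0] >= PASS for runs in tasks.values()) / len(tasks)

def overall_rate(tasks):
    """Fraction of tasks solved in at least one run."""
    return sum(any(s >= PASS for s in runs) for runs in tasks.values()) / len(tasks)

print(consistency_score(tasks))            # only task_a passes every run
print(first_try_rate(tasks), overall_rate(tasks))
```

The gap between `first_try_rate` and `overall_rate` is exactly the retry signal the feedback asks for: a model that closes most of that gap only on later attempts looks strong in aggregate but weak on first-try reliability.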
Results Upload
- Ensure `runs_per_task` and per-run breakdowns are included in submissions
- Consider requiring minimum N runs for "verified" status
Visualization Ideas
- Sparklines showing score distribution across runs
- Color-coding models by reliability (low variance = green, high = yellow/red)
- Hover/click to see per-task breakdown
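The sparkline and color-coding ideas are cheap to prototype in plain text. A sketch (the std-dev thresholds for the traffic-light buckets are assumptions):

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(scores, lo=0.0, hi=1.0):
    """Render run scores as a unicode sparkline for a leaderboard cell."""
    span = hi - lo
    cells = []
    for s in scores:
        idx = int((s - lo) / span * (len(BARS) - 1))
        cells.append(BARS[max(0, min(idx, len(BARS) - 1))])
    return "".join(cells)

def reliability_color(std, warn=0.1, bad=0.2):
    """Bucket a model's std dev into a traffic light (thresholds are assumptions)."""
    if std < warn:
        return "green"
    return "yellow" if std < bad else "red"

print(sparkline([0.2, 0.8, 1.0, 0.5]))  # → "▂▆█▄"
print(reliability_color(0.05), reliability_color(0.25))
```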
Why This Matters
A model scoring 0.85 with std dev 0.05 is much more useful than one scoring 0.90 with std dev 0.25. Production agents need predictability, not just peak performance.
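One way to make that trade-off concrete is a pessimistic score, mean minus std dev. The penalty factor `k` is an assumption for illustration, not a proposed leaderboard formula:

```python
def pessimistic(mean, std, k=1.0):
    """Penalize inconsistency: rank by mean - k * std."""
    return mean - k * std

# The two models from the example above.
steady = pessimistic(0.85, 0.05)  # ~0.80
flashy = pessimistic(0.90, 0.25)  # ~0.65
print(steady > flashy)  # the steadier model wins despite the lower mean
```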