Context
Feedback from @strakyo on X:

> Strong benchmark setup. Are you publishing per-task variance and retry counts, or only aggregate success rate? Those two numbers usually decide real agent reliability.
The Ask
Currently the leaderboard shows aggregate scores. For real-world agent reliability assessment, users want to see:
- **Per-task variance** — How consistent is a model across multiple runs of the same task? High variance means the model is unreliable even if its mean score is good.
- **Retry counts** — How many attempts did the agent need? An agent that succeeds on the first try is more reliable than one that needs three retries.
Current State
The benchmark runner already supports `--runs N` for multiple runs per task, and the results JSON includes:

- `grading.runs[]` — individual run scores
- `grading.mean` — average score
- `grading.std` — standard deviation
- `grading.min` / `grading.max` — range

This data exists in the results payload but may not be surfaced on the leaderboard UI.
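A minimal sketch of reading those `grading` fields from a results payload. The field names are the ones listed above; the surrounding top-level structure (`task_id`, etc.) is an assumption for illustration:

```python
import json
import statistics

# Hypothetical payload shaped around the grading fields listed above;
# the top-level layout is an assumption, not the benchmark's real schema.
payload = json.loads("""
{
  "task_id": "demo-task",
  "grading": {
    "runs": [0.9, 0.8, 1.0],
    "mean": 0.9,
    "std": 0.1,
    "min": 0.8,
    "max": 1.0
  }
}
""")

grading = payload["grading"]
runs = grading["runs"]

# Sanity-check the summary stats against the raw per-run scores.
assert abs(statistics.mean(runs) - grading["mean"]) < 1e-9
assert min(runs) == grading["min"] and max(runs) == grading["max"]

print(f"mean={grading['mean']:.2f} std={grading['std']:.2f} "
      f"range=[{grading['min']:.2f}, {grading['max']:.2f}]")
```

Since the per-run scores are already in the payload, the leaderboard can recompute any derived reliability metric server-side rather than trusting submitted aggregates.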
Proposed Changes
Leaderboard UI
- Add columns or expandable detail for:
  - Std Dev or Variance per model
  - Consistency score (e.g., % of tasks where all runs succeeded)
  - First-try success rate vs overall success rate
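The three proposed columns can all be derived from per-run scores. A sketch, assuming a pass threshold of 1.0 and a simple `{task: [run scores]}` shape (both assumptions, not the benchmark's actual schema):

```python
# Hypothetical per-task run scores for one model; a run "succeeds"
# if its score clears the pass threshold (1.0 here is an assumption).
tasks = {
    "task_a": [1.0, 1.0, 1.0],
    "task_b": [1.0, 0.0, 1.0],
    "task_c": [0.0, 0.0, 1.0],
}
PASS = 1.0

def consistency_score(tasks):
    """Fraction of tasks where every run succeeded."""
    solid = sum(all(s >= PASS for s in runs) for runs in tasks.values())
    return solid / len(tasks)

def first_try_rate(tasks):
    """Fraction of tasks solved on the first run."""
    return sum(runs[0] >= PASS for runs in tasks.values()) / len(tasks)

def overall_rate(tasks):
    """Fraction of tasks solved in at least one run."""
    return sum(any(s >= PASS for s in runs) for runs in tasks.values()) / len(tasks)

print(consistency_score(tasks))            # only task_a passes every run
print(first_try_rate(tasks), overall_rate(tasks))
```

The gap between `first_try_rate` and `overall_rate` is exactly the retry signal the feedback asks for: a model that closes most of that gap only on later attempts looks strong in aggregate but weak on first-try reliability.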
Results Upload
- Ensure `runs_per_task` and per-run breakdowns are included in submissions
- Consider requiring minimum N runs for "verified" status
Visualization Ideas
- Sparklines showing score distribution across runs
- Color-coding models by reliability (low variance = green, high = yellow/red)
- Hover/click to see per-task breakdown
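The sparkline and color-coding ideas are cheap to prototype in plain text. A sketch (the std-dev thresholds for the traffic-light buckets are assumptions):

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(scores, lo=0.0, hi=1.0):
    """Render run scores as a unicode sparkline for a leaderboard cell."""
    span = hi - lo
    cells = []
    for s in scores:
        idx = int((s - lo) / span * (len(BARS) - 1))
        cells.append(BARS[max(0, min(idx, len(BARS) - 1))])
    return "".join(cells)

def reliability_color(std, warn=0.1, bad=0.2):
    """Bucket a model's std dev into a traffic light (thresholds are assumptions)."""
    if std < warn:
        return "green"
    return "yellow" if std < bad else "red"

print(sparkline([0.2, 0.8, 1.0, 0.5]))  # → "▂▆█▄"
print(reliability_color(0.05), reliability_color(0.25))
```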
Why This Matters
A model scoring 0.85 with std dev 0.05 is much more useful than one scoring 0.90 with std dev 0.25. Production agents need predictability, not just peak performance.
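One way to make that trade-off concrete is a pessimistic score, mean minus std dev. The penalty factor `k` is an assumption for illustration, not a proposed leaderboard formula:

```python
def pessimistic(mean, std, k=1.0):
    """Penalize inconsistency: rank by mean - k * std."""
    return mean - k * std

# The two models from the example above.
steady = pessimistic(0.85, 0.05)  # ~0.80
flashy = pessimistic(0.90, 0.25)  # ~0.65
print(steady > flashy)  # the steadier model wins despite the lower mean
```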