Feature: Publish per-task variance and retry counts on leaderboard #8

@ScuttleBot

Context

Feedback from @strakyo on X:

Strong benchmark setup. Are you publishing per-task variance and retry counts, or only aggregate success rate? Those two numbers usually decide real agent reliability.

The Ask

Currently the leaderboard shows aggregate scores. For real-world agent reliability assessment, users want to see:

  1. Per-task variance: how consistent is a model across multiple runs of the same task? High variance signals an unreliable model even when the mean score is good.

  2. Retry counts: how many attempts did the agent need? An agent that succeeds on the first try is more reliable than one that needs 3 retries.

Current State

The benchmark runner already supports --runs N for multiple runs per task, and the results JSON includes:

  • grading.runs[] — individual run scores
  • grading.mean — average score
  • grading.std — standard deviation
  • grading.min / grading.max — range

This data exists in the results payload but may not be surfaced on the leaderboard UI.
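
As a rough illustration of how the fields listed above relate to each other, here is a minimal Python sketch that derives the summary values from a list of per-run scores. The function name `summarize_runs` is hypothetical, and the choice of population standard deviation is an assumption; the runner may compute `grading.std` differently.

```python
import statistics


def summarize_runs(runs):
    """Build a grading-style summary from per-run scores (hypothetical helper)."""
    if not runs:
        raise ValueError("need at least one run score")
    return {
        "runs": list(runs),
        "mean": statistics.fmean(runs),
        # Assumption: population std dev. The real runner may use the sample form.
        "std": statistics.pstdev(runs),
        "min": min(runs),
        "max": max(runs),
    }


# Four runs of one task, alternating between full and half credit.
grading = summarize_runs([1.0, 0.5, 1.0, 0.5])
print(grading)
```

With these inputs the mean is 0.75 and the spread is 0.25, which is the kind of per-task signal an aggregate score alone would hide.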

Proposed Changes

Leaderboard UI

  • Add columns or expandable detail for:
    • Std Dev or Variance per model
    • Consistency score (e.g., % of tasks where all runs succeeded)
    • First-try success rate vs overall success rate
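
The consistency and first-try metrics proposed above could be computed along these lines. This is a sketch under stated assumptions: the input shape (task ID mapped to an ordered list of run scores), the name `reliability_metrics`, and the success threshold of 1.0 are all hypothetical, and treating later runs as "retries" is an interpretation rather than something the runner is known to do.

```python
def reliability_metrics(task_runs, threshold=1.0):
    """Compute per-model reliability rates from ordered per-task run scores.

    task_runs: dict mapping task ID -> list of scores, first run first.
    threshold: hypothetical cutoff at or above which a run counts as a success.
    """
    if not task_runs:
        raise ValueError("no tasks given")
    if any(len(runs) == 0 for runs in task_runs.values()):
        raise ValueError("every task needs at least one run")

    passed = {t: [s >= threshold for s in runs] for t, runs in task_runs.items()}
    n = len(passed)
    return {
        # Share of tasks where every run succeeded.
        "consistency": sum(all(p) for p in passed.values()) / n,
        # Share of tasks where the first run succeeded.
        "first_try_success": sum(p[0] for p in passed.values()) / n,
        # Share of tasks where at least one run succeeded.
        "overall_success": sum(any(p) for p in passed.values()) / n,
    }


metrics = reliability_metrics({
    "task_a": [1.0, 1.0, 1.0],
    "task_b": [0.0, 1.0, 1.0],
    "task_c": [1.0, 0.0, 1.0],
    "task_d": [0.0, 0.0, 0.0],
})
print(metrics)
```

The gap between `first_try_success` and `overall_success` indicates how much a model depends on extra attempts, which is what the retry-count request is trying to expose.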

Results Upload

  • Ensure runs_per_task and per-run breakdowns are included in submissions
  • Consider requiring minimum N runs for "verified" status
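
One possible shape for the upload check is sketched below. The submission layout (`tasks`, `task_id`, and a nested `grading.runs` list), the function name `check_submission`, and the minimum of 3 runs are assumptions made for illustration; only `runs_per_task` and `grading.runs[]` come from this issue.

```python
MIN_RUNS_FOR_VERIFIED = 3  # hypothetical threshold, not a decided value


def check_submission(submission, min_runs=MIN_RUNS_FOR_VERIFIED):
    """Return (verified, problems) for a results submission (assumed schema)."""
    problems = []

    runs_per_task = submission.get("runs_per_task")
    if not isinstance(runs_per_task, int) or runs_per_task < 1:
        problems.append("runs_per_task missing or invalid")

    for task in submission.get("tasks", []):
        task_id = task.get("task_id", "<unknown>")
        runs = task.get("grading", {}).get("runs")
        if not runs:
            problems.append(f"{task_id}: no per-run breakdown")
        elif isinstance(runs_per_task, int) and len(runs) != runs_per_task:
            problems.append(f"{task_id}: expected {runs_per_task} runs, got {len(runs)}")

    # Verified only if the payload is well formed and has enough runs per task.
    verified = not problems and runs_per_task >= min_runs
    return verified, problems


ok_submission = {
    "runs_per_task": 3,
    "tasks": [{"task_id": "task_a", "grading": {"runs": [1.0, 0.5, 1.0]}}],
}
thin_submission = {
    "runs_per_task": 1,
    "tasks": [{"task_id": "task_a", "grading": {"runs": [1.0]}}],
}
print(check_submission(ok_submission))
print(check_submission(thin_submission))
```

A single-run submission passes validation here but does not earn "verified" status, so existing uploads would remain accepted.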

Visualization Ideas

  • Sparklines showing score distribution across runs
  • Color-coding models by reliability (low variance = green, high = yellow/red)
  • Hover/click to see per-task breakdown
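
To make the sparkline and color-coding ideas concrete, here is a small text-only sketch. The function names and the std dev cutoffs (0.05 and 0.15) are placeholders chosen for illustration, and scores are assumed to lie in the range 0 to 1; a real UI would likely render these with its own charting components.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(scores, lo=0.0, hi=1.0):
    """Render scores as a row of Unicode block characters (assumes hi > lo)."""
    out = []
    for s in scores:
        # Clamp to [0, 1] so out-of-range scores still map to a valid bar.
        frac = min(max((s - lo) / (hi - lo), 0.0), 1.0)
        out.append(BARS[round(frac * (len(BARS) - 1))])
    return "".join(out)


def reliability_color(std, green_max=0.05, yellow_max=0.15):
    """Map a std dev to a color bucket; the cutoffs are hypothetical."""
    if std <= green_max:
        return "green"
    if std <= yellow_max:
        return "yellow"
    return "red"


print(sparkline([0.0, 1.0, 0.0, 1.0]), reliability_color(0.25))
print(sparkline([1.0, 1.0, 1.0, 1.0]), reliability_color(0.0))
```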

Why This Matters

A model scoring 0.85 with std dev 0.05 is much more useful than one scoring 0.90 with std dev 0.25. Production agents need predictability, not just peak performance.
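
One way to see the comparison is to look at a pessimistic band, the mean minus one standard deviation. This is only an illustrative heuristic, not a metric the leaderboard defines, and the helper name `lower_band` is made up for this sketch.

```python
def lower_band(mean, std, k=1.0):
    """Mean minus k standard deviations: a rough 'typical bad run' score."""
    return mean - k * std


steady = lower_band(0.85, 0.05)  # about 0.80
spiky = lower_band(0.90, 0.25)   # about 0.65

print(f"steady model: {steady:.2f}, spiky model: {spiky:.2f}")
```

Under this heuristic the model with the lower mean has the better typical bad run, which matches the point about predictability.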
