Summary
The leaderboard ranks agents by mean score across completed tasks, but does not account for the number of tasks attempted. This creates a systematic bias where agents running fewer (often easier, automated-only) tasks can outrank agents that complete the full 23-task suite.
Example
Agent C demonstrably has broader capability, but ranks lowest. The creative/judge-scored tasks are inherently harder to score perfectly on, so agents that attempt them are penalized relative to those that skip them.
Suggested Fixes
- Normalize by task count: weight the score by tasks_completed / total_tasks so comprehensive runs are rewarded
- Require a minimum task count for ranking: e.g., only rank runs with ≥N tasks (or suite=all)
- Separate leaderboards: group by suite or task count range so comparisons are apples-to-apples
- Display task count prominently: at minimum, show how many tasks each run completed so viewers can judge for themselves
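To make the first two suggestions concrete, here is a rough sketch of how coverage weighting and a minimum-task filter could work. It is not PinchBench's actual scoring code. The `Run` fields, the function names, and the sample numbers are all made up for illustration. Only the 23-task total comes from the suite size mentioned above, and scores are assumed to lie in [0, 1].

```python
from dataclasses import dataclass

TOTAL_TASKS = 23  # full suite size mentioned in the summary


@dataclass
class Run:
    """Hypothetical leaderboard entry; field names are assumptions."""
    agent: str
    mean_score: float     # mean over completed tasks, assumed in [0, 1]
    tasks_completed: int


def coverage_weighted(run: Run, total_tasks: int = TOTAL_TASKS) -> float:
    """Scale the mean score by the fraction of the suite that was completed."""
    return run.mean_score * run.tasks_completed / total_tasks


def rank(runs: list[Run], min_tasks: int = 0) -> list[Run]:
    """Drop runs below min_tasks, then sort by coverage-weighted score."""
    eligible = [r for r in runs if r.tasks_completed >= min_tasks]
    return sorted(eligible, key=coverage_weighted, reverse=True)


# Illustrative numbers only, not real leaderboard data.
runs = [
    Run("run_partial_8", mean_score=0.95, tasks_completed=8),
    Run("run_partial_12", mean_score=0.90, tasks_completed=12),
    Run("run_full", mean_score=0.80, tasks_completed=23),
]

for r in rank(runs):
    print(f"{r.agent}: {coverage_weighted(r):.3f} ({r.tasks_completed}/{TOTAL_TASKS} tasks)")
```

Multiplying the mean by tasks_completed / total_tasks is the same as scoring every skipped task as zero, which is a fairly harsh penalty. Calling `rank(runs, min_tasks=TOTAL_TASKS)` instead keeps the plain mean meaningful by only ranking full-suite runs.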
Why This Matters
PinchBench is a great benchmark, but if the leaderboard incentivizes cherry-picking easy tasks, it undermines the signal. Agents should be rewarded for attempting the full suite, not penalized for it.
Thanks for building this — just want to help make the rankings more meaningful.