#1 · Base Model · claude-opus-4-7 · Pass^3: 79.4% · Pass@3: 87.3% · Thinking: Off · Tasks: 50/63 · Date: 2026-04-20 · Total Cost: $60.39 · Avg Cost: $0.320 · Avg Tokens: 157k
o11y-bench
A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.
| # | Agent | Model | Thinking | Pass^3 | Pass@3 | Tasks | Total Cost | Avg Cost | Avg Tokens | Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Base Model | claude-opus-4-7 | Off | 79.4% | 87.3% | 50/63 | $60.39 | $0.320 | 157k | 2026-04-20 |
| 2 | Base Model | | High | 73.0% | 90.5% | 46/63 | $74.34 | $0.393 | 170k | 2026-04-21 |
| 3 | Base Model | | High | 68.3% | 84.1% | 43/63 | $36.82 | $0.195 | 119k | 2026-04-21 |
| 4 | Base Model | | Off | 66.7% | 90.5% | 42/63 | $53.19 | $0.281 | 131k | 2026-04-21 |
| 5 | Base Model | | High | 64.9% | 87.7% | 37/57 | $24.59 | $0.146 | 119k | 2026-04-21 |
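
In every row of the leaderboard, the Pass^3 percentage appears to equal the Tasks fraction (tasks passed divided by tasks attempted), rounded to one decimal place. The sketch below recomputes it from the table values as a sanity check. The `rows` list and the `pass_rate` helper are illustrative names, not part of o11y-bench, and the relationship itself is inferred from the numbers rather than stated by the benchmark.

```python
# Values copied from the leaderboard: (rank, tasks passed, tasks attempted, reported Pass^3 %).
rows = [
    (1, 50, 63, 79.4),
    (2, 46, 63, 73.0),
    (3, 43, 63, 68.3),
    (4, 42, 63, 66.7),
    (5, 37, 57, 64.9),
]


def pass_rate(passed: int, total: int) -> float:
    """Return passed/total as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


for rank, passed, total, reported in rows:
    computed = pass_rate(passed, total)
    status = "matches" if computed == reported else "differs"
    print(f"#{rank}: {passed}/{total} = {computed}% ({status} reported {reported}%)")
```

Row 5 was scored on 57 tasks rather than 63, so its percentage is not directly comparable with the others on task count alone.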
Category scores use Pass^3 consistency across the three benchmark trials per task. Green is 90%+, yellow is 70%+, and red is below 70%.
| Model | Dashboards | Grafana API | Investigation | Logs | Metrics | Traces |
|---|---|---|---|---|---|---|
| claude-opus-4-7 | 57% | 100% | 73% | 80% | 88% | 77% |
| claude-opus-4-7 | 43% | 100% | 45% | 80% | 88% | 77% |
| claude-sonnet-4-6 | 29% | 100% | 45% | 50% | 94% | 77% |
| claude-opus-4-6 | 43% | 100% | 45% | 60% | 75% | 77% |
| gpt-5.4-2026-03-05 | 0% | 83% | 64% | 40% | 81% | 62% |
| claude-opus-4-7 | 43% | 100% | 36% | 70% | 69% | 69% |
| gemini-3.1-pro-preview | 43% | 100% | 27% | 50% | 94% | 62% |
| gemini-3-flash-preview | 14% | 100% | 64% | 40% | 81% | 62% |
| gemini-3-flash-preview | 29% | 100% | 36% | 40% | 88% | 54% |
| claude-sonnet-4-6 | 14% | 100% | 55% | 30% | 81% | 62% |
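
The two headline metrics and the color legend can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code. It assumes that Pass^3 counts a task only when all three trials pass and that Pass@3 counts a task when at least one of the three trials passes. The trial data and the function names `pass_hat_k`, `pass_at_k`, and `color_band` are hypothetical.

```python
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Percentage of tasks whose trials ALL passed (assumed meaning of Pass^3)."""
    passed = sum(1 for results in trials.values() if all(results))
    return 100 * passed / len(trials)


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Percentage of tasks with AT LEAST ONE passing trial (assumed meaning of Pass@3)."""
    passed = sum(1 for results in trials.values() if any(results))
    return 100 * passed / len(trials)


def color_band(score_pct: float) -> str:
    """Map a category score to the legend: green is 90%+, yellow is 70%+, red is below 70%."""
    if score_pct >= 90:
        return "green"
    if score_pct >= 70:
        return "yellow"
    return "red"


# Hypothetical outcomes for four tasks, three trials each.
example_trials = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],
    "task-c": [False, False, False],
    "task-d": [True, True, True],
}

print(pass_hat_k(example_trials))  # 50.0: only task-a and task-d pass every trial
print(pass_at_k(example_trials))   # 75.0: task-b also passes at least once
print(color_band(pass_hat_k(example_trials)))  # red
```

Under these assumptions Pass^3 can never exceed Pass@3, which agrees with every row of the leaderboard.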