o11y-bench

The first observability benchmark for AI agents

A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.

Top Agents

#	Agent	Model	Thinking	Pass^3	Pass@3	Tasks	Total Cost	Avg Cost	Avg Tokens	Date
1	Base Model	claude-opus-4-7	Off	79.4%	87.3%	50/63	$60.39	$0.320	157k	2026-04-20
2	Base Model	claude-opus-4-7	High	73.0%	90.5%	46/63	$74.34	$0.393	170k	2026-04-21
3	Base Model	claude-sonnet-4-6	High	68.3%	84.1%	43/63	$36.82	$0.195	119k	2026-04-21
4	Base Model	claude-opus-4-6	Off	66.7%	90.5%	42/63	$53.19	$0.281	131k	2026-04-21
5	Base Model	gpt-5.4-2026-03-05	High	64.9%	87.7%	37/57	$24.59	$0.146	119k	2026-04-21

Top 10 By Category

Category scores use Pass^3, which measures consistency across the three benchmark trials run for each task. Cells are color-coded: green marks 90% or higher, yellow 70% or higher, and red anything below 70%.
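As a rough illustration of how such scores could be computed, the sketch below assumes that Pass^3 counts a task only when all three trials pass and that Pass@3 counts a task when at least one trial passes. The page does not spell these definitions out, though the leaderboard's Tasks column (for example 50/63 = 79.4%) is consistent with the first. The function names, the `example` data, and the task labels are hypothetical and are not taken from the benchmark's code or results.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Share of tasks whose trials ALL passed (assumed meaning of Pass^3)."""
    return sum(all(trials) for trials in results.values()) / len(results)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Share of tasks with AT LEAST ONE passing trial (assumed meaning of Pass@3)."""
    return sum(any(trials) for trials in results.values()) / len(results)


def color_band(score: float) -> str:
    """Map a score in [0, 1] to the color thresholds stated on the page."""
    if score >= 0.90:
        return "green"
    if score >= 0.70:
        return "yellow"
    return "red"


# Hypothetical trial outcomes for illustration only; not real benchmark data.
example = {
    "task-a": [True, True, True],     # passes every trial
    "task-b": [True, False, True],    # inconsistent: counts for Pass@3 only
    "task-c": [False, False, False],  # never passes
    "task-d": [True, True, True],     # passes every trial
}

consistency = pass_hat_k(example)  # 2 of 4 tasks pass all three trials
best_of_three = pass_at_k(example)  # 3 of 4 tasks pass at least once
print(f"Pass^3 = {consistency:.1%} ({color_band(consistency)})")
print(f"Pass@3 = {best_of_three:.1%} ({color_band(best_of_three)})")
```

Under these assumptions, the gap between the two numbers indicates how often an agent solves a task only some of the time, which is why a model can rank lower on Pass^3 while matching or exceeding another on Pass@3.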

Model	Dashboards	Grafana API	Investigation	Logs	Metrics	Traces
claude-opus-4-7	57%	100%	73%	80%	88%	77%
claude-opus-4-7	43%	100%	45%	80%	88%	77%
claude-sonnet-4-6	29%	100%	45%	50%	94%	77%
claude-opus-4-6	43%	100%	45%	60%	75%	77%
gpt-5.4-2026-03-05	0%	83%	64%	40%	81%	62%
claude-opus-4-7	43%	100%	36%	70%	69%	69%
gemini-3.1-pro-preview	43%	100%	27%	50%	94%	62%
gemini-3-flash-preview	14%	100%	64%	40%	81%	62%
gemini-3-flash-preview	29%	100%	36%	40%	88%	54%
claude-sonnet-4-6	14%	100%	55%	30%	81%	62%

Featured Tasks

Grafana API

audit-service-overview-datasources

Can you audit the saved "Service Overview Audit" dashboard (`service-overview-audit`) for me? I just want a quick panel-...

Grafana API

audit-service-overview-variable

Audit the saved "Service Overview Variable Audit" dashboard (`service-overview-variable-audit`) for me. I want to know w...

Investigation

cache-incident-blast-radius

Before we call this a broad backend rollout issue, check the earlier user-service cache-refresh incident and tell me whe...