o11y-bench

The first observability benchmark for AI agents

A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.

Top Agents

#1

Base Model

claude-opus-4-7

Pass^3

79.4%

Thinking

Off

Pass@3

87.3%

Tasks

50/63

Date

2026-04-20

Total Cost

$60.39

Avg Cost

$0.320

Avg Tokens

157k

#2

Base Model

claude-opus-4-7

Pass^3

73.0%

Thinking

High

Pass@3

90.5%

Tasks

46/63

Date

2026-04-21

Total Cost

$74.34

Avg Cost

$0.393

Avg Tokens

170k

#3

Base Model

claude-sonnet-4-6

Pass^3

68.3%

Thinking

High

Pass@3

84.1%

Tasks

43/63

Date

2026-04-21

Total Cost

$36.82

Avg Cost

$0.195

Avg Tokens

119k

#4

Base Model

claude-opus-4-6

Pass^3

66.7%

Thinking

Off

Pass@3

90.5%

Tasks

42/63

Date

2026-04-21

Total Cost

$53.19

Avg Cost

$0.281

Avg Tokens

131k

#5

Base Model

gpt-5.4-2026-03-05

Pass^3

64.9%

Thinking

High

Pass@3

87.7%

Tasks

37/57

Date

2026-04-21

Total Cost

$24.59

Avg Cost

$0.146

Avg Tokens

119k
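
The figures on each card fit together with simple arithmetic. The sketch below is one reading, not the benchmark's own scoring code: it assumes Pass^3 is the share of tasks passed in all three trials, Pass@3 is the share passed in at least one trial, and Avg Cost is total cost divided by the number of trials. The function names and the per-task results are hypothetical, with the synthetic results shaped only to match the #1 card's published totals.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]]) -> float:
    """Share of tasks whose trials ALL passed (assumed meaning of Pass^3)."""
    return sum(all(trials) for trials in results.values()) / len(results)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Share of tasks with AT LEAST ONE passing trial (assumed meaning of Pass@3)."""
    return sum(any(trials) for trials in results.values()) / len(results)


def avg_cost_per_trial(total_cost: float, n_tasks: int, n_trials: int = 3) -> float:
    """Total cost spread evenly over every trial (assumed meaning of Avg Cost)."""
    return total_cost / (n_tasks * n_trials)


# Synthetic, made-up per-task outcomes: 63 tasks, 50 passing all three
# trials, 5 passing only once, 8 never passing. Only the totals are
# chosen to line up with the #1 card; the split itself is illustrative.
results: Dict[str, List[bool]] = {}
for i in range(63):
    if i < 50:
        results[f"task-{i}"] = [True, True, True]
    elif i < 55:
        results[f"task-{i}"] = [True, False, False]
    else:
        results[f"task-{i}"] = [False, False, False]

print(f"Pass^3:   {pass_hat_k(results):.1%}")                  # 79.4%
print(f"Pass@3:   {pass_at_k(results):.1%}")                   # 87.3%
print(f"Avg cost: ${avg_cost_per_trial(60.39, 63):.3f}")       # $0.320
```

Under these assumptions, Pass^3 can never exceed Pass@3, which is why the cards are ranked by the stricter consistency number while Pass@3 shows the ceiling an agent reaches when any single trial succeeds.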

Top 10 By Category

Category scores use Pass^3 consistency across the three benchmark trials per task. Green is 90% or higher, yellow is 70% up to but not including 90%, and red is below 70%.

Model                   Dashboards  Grafana API  Investigation  Logs  Metrics  Traces
claude-opus-4-7         57%         100%         73%            80%   88%      77%
claude-opus-4-7         43%         100%         45%            80%   88%      77%
claude-sonnet-4-6       29%         100%         45%            50%   94%      77%
claude-opus-4-6         43%         100%         45%            60%   75%      77%
gpt-5.4-2026-03-05      0%          83%          64%            40%   81%      62%
claude-opus-4-7         43%         100%         36%            70%   69%      69%
gemini-3.1-pro-preview  43%         100%         27%            50%   94%      62%
gemini-3-flash-preview  14%         100%         64%            40%   81%      62%
gemini-3-flash-preview  29%         100%         36%            40%   88%      54%
claude-sonnet-4-6       14%         100%         55%            30%   81%      62%
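
The color bands described above the table can be sketched as a small threshold function. This is a minimal illustration, not the site's rendering code: `color_band` is a hypothetical name, and the sample row simply reuses the first row of the table.

```python
def color_band(score_pct: float) -> str:
    """Map a category Pass^3 score (in percent) to its display color."""
    if score_pct >= 90:
        return "green"
    if score_pct >= 70:
        return "yellow"
    return "red"


# First table row (claude-opus-4-7), scores in percent.
row = {
    "Dashboards": 57,
    "Grafana API": 100,
    "Investigation": 73,
    "Logs": 80,
    "Metrics": 88,
    "Traces": 77,
}

bands = {category: color_band(score) for category, score in row.items()}
for category, band in bands.items():
    print(f"{category}: {row[category]}% -> {band}")
```

For this row, only Grafana API lands in the green band and only Dashboards lands in red; the other four categories fall in yellow.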

Featured Tasks
