- Quick Start
- Running A Single Job
- Running With Different Agents
- Running Your Own Models
- Submitting Results To The Leaderboard
o11y-bench is an open benchmark for evaluating LLM agents on observability and SRE tasks.
It is built on top of Harbor and runs agents against a real
Grafana stack with Prometheus, Loki, and Tempo.
The repo includes:
- benchmark task specs
- a default custom Harbor agent
- grading and reporting logic
- Docker images and config for the synthetic observability environment
Each run produces machine-readable artifacts plus HTML reports for either a single job or a full comparison suite.
tasks-spec/ is the source of truth for benchmark scenarios.
tasks/ is generated output and should not be edited by hand.
At a high level, a benchmark run does the following:
- Materializes benchmark tasks from tasks-spec/
- Starts the Harbor task container plus the observability sidecar stack
- Runs an agent against one or more tasks
- Grades the result against deterministic checks plus rubric criteria
- Writes job artifacts and HTML reports under jobs/
The default agent lives in agents/o11y_agent.py, but you can also run Harbor built-in agents or
your own custom Harbor agent class by import path. See the Harbor agent docs for more on agent types.
Each run also persists one scenario clock in scenario_time.txt under the job or suite directory.
That keeps reruns and regrades aligned to the same synthetic data window.
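The persisted scenario clock can be pictured with a minimal sketch. This is not the repo's implementation: the function name `load_or_create_scenario_time`, the ISO-8601 file contents, and the rule that an override takes precedence over the saved file are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def load_or_create_scenario_time(run_dir: Path, override_iso: Optional[str] = None) -> datetime:
    """Return the scenario clock for run_dir (hypothetical helper, not the repo's API)."""
    if override_iso:
        # Assumption: an explicit override (e.g. O11Y_SCENARIO_TIME_ISO) wins.
        return datetime.fromisoformat(override_iso)
    clock_file = run_dir / "scenario_time.txt"
    if clock_file.exists():
        # Reruns and regrades reuse the timestamp saved by the first run.
        return datetime.fromisoformat(clock_file.read_text().strip())
    # First run: pick a clock once and persist it for later runs.
    now = datetime.now(timezone.utc).replace(microsecond=0)
    run_dir.mkdir(parents=True, exist_ok=True)
    clock_file.write_text(now.isoformat())
    return now
```

Calling the helper twice on the same directory returns the same timestamp, which is the property that keeps the synthetic data window stable across reruns.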
You need all of the following installed locally:
You also need model-provider API keys in your environment.
Minimum environment variables:
- ANTHROPIC_API_KEY: used by the grading pipeline.
- Provider key(s) for the model you want to run:
  - OPENAI_API_KEY
  - ANTHROPIC_API_KEY
  - GOOGLE_API_KEY or GEMINI_API_KEY
  - OPENROUTER_API_KEY: run any model available on OpenRouter
Optional environment variables:
- OPENAI_API_BASE: use this for OpenAI-compatible endpoints.
- O11Y_SCENARIO_TIME_ISO: override the scenario clock for debugging.
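Before starting a run, it can help to check that the required keys are set. The sketch below is illustrative only: `PROVIDER_KEYS` and `missing_keys` are hypothetical names, and the mapping from model prefix to environment variable is inferred from the variable names above rather than taken from the repo.

```python
import os
from typing import Mapping

# Assumed mapping: provider prefix -> variables, any one of which is sufficient.
PROVIDER_KEYS = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "google": ("GOOGLE_API_KEY", "GEMINI_API_KEY"),
    "openrouter": ("OPENROUTER_API_KEY",),
}


def missing_keys(model: str, env: Mapping[str, str] = os.environ) -> list:
    """List the unmet key requirements for running and grading `model`."""
    missing = []
    # The grading pipeline needs an Anthropic key regardless of the model under test.
    if not env.get("ANTHROPIC_API_KEY"):
        missing.append("ANTHROPIC_API_KEY")
    provider = model.split("/", 1)[0]
    options = PROVIDER_KEYS.get(provider, ())
    if options and not any(env.get(name) for name in options):
        label = " or ".join(options)
        if label not in missing:
            missing.append(label)
    return missing
```

For example, `missing_keys("openai/gpt-5.4-nano")` returns an empty list only when both ANTHROPIC_API_KEY and OPENAI_API_KEY are present in the environment.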
Clone the repo and install the pinned toolchain and Python environment:
git clone <your-fork-or-repo-url>
cd o11y-bench
mise install
uv sync
That is the normal one-time setup.
If you want to confirm the local toolchain is working before running a benchmark:
mise run setup:sync
mise run lint
mise run test
Run a single task with the default repo agent:
mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --n-concurrent 1
This will:
- regenerate tasks/ from tasks-spec/
- run the selected task with 3 attempts by default
- write artifacts under jobs/<job-name>/
- generate jobs/<job-name>/run_report.html
If you want a quiet version of the same command:
mise run bench:job:quiet -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --n-concurrent 1
Browse all available tasks on the Tasks Explorer or see how models compare on the Leaderboard.
To regenerate tasks explicitly:
mise run setup:sync
If you invoke raw Harbor commands yourself instead of mise run bench:* or
uv run python -m o11y_bench ..., run preflight first:
mise run setup:preflight
That pre-builds shared images and cleans stale Harbor Docker projects.
Run one model across the benchmark:
mise run bench:job -- --model anthropic/claude-sonnet-4-6
Run a single task only:
mise run bench:job -- --model anthropic/claude-sonnet-4-6 --task-name query-cpu-metrics --n-concurrent 1
Run with a different reasoning level:
mise run bench:job -- --model openai/gpt-5.4-mini --reasoning-effort high
Run with a custom output location or name:
mise run bench:job -- --model openai/gpt-5.4-mini --jobs-dir /tmp/o11y-bench-jobs --job-name my-smoke-run
bench:job resumes by job directory.
If the job already exists and the saved config is compatible, it reuses completed work and reruns
only missing or retryable trials.
This means:
- rerunning the same command usually resumes
- changing the model, reasoning effort, or agent configuration creates a distinct job variant
- you can always force separation with --job-name
Example default auto-generated job names:
- openai-gpt-5-4-nano-off-k3
- openai-gpt-5-4-nano-high-k3
- openai-gpt-5-4-nano-off-opencode-k3
- openai-gpt-5-4-nano-off-agents-langchain-o11y-agent-langchaino11ybenchagent-k3
If you want a fresh run instead of resuming, pass a fresh --job-name.
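The example names suggest a simple slug scheme: model, reasoning level, optional agent, then an attempt count. The sketch below reproduces the names listed above, but it is a guess at the pattern, not the repo's code. `slug` and `default_job_name` are hypothetical, and reading the `k3` suffix as "3 attempts" is an assumption.

```python
import re
from typing import Optional


def slug(text: str) -> str:
    """Lowercase text and collapse each run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def default_job_name(
    model: str,
    reasoning: str = "off",
    agent: Optional[str] = None,
    attempts: int = 3,
) -> str:
    """Build a job name shaped like the documented examples (assumed scheme)."""
    parts = [slug(model), slug(reasoning)]
    if agent:
        # Either a built-in agent name or a module:Class import path.
        parts.append(slug(agent))
    parts.append(f"k{attempts}")
    return "-".join(parts)
```

Because the name is derived from the configuration, changing the model, reasoning effort, or agent yields a different directory, which is consistent with the "distinct job variant" behavior described above.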
This is the normal path and uses the custom repo agent:
mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics
You can switch to a Harbor built-in agent with --agent:
mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --agent opencode
You can run any importable Harbor agent class with --agent-import-path:
mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --agent-import-path agents.langchain_o11y_agent:LangChainO11yBenchAgent
Use either --agent or --agent-import-path, not both.
The LangChain agent in this repo is intentionally a simple example of wiring a custom Harbor agent entrypoint through the existing benchmark flow.
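An import path of the form module:ClassName can be resolved with the standard library. The following is a generic sketch of that pattern, including the "one flag or the other" rule. `resolve_agent` is a hypothetical helper and does not describe how Harbor or this repo actually loads agents.

```python
import importlib
from typing import Optional


def resolve_agent(agent: Optional[str] = None, agent_import_path: Optional[str] = None):
    """Return a built-in agent name, an imported class, or None for the default agent."""
    if agent and agent_import_path:
        raise ValueError("use either --agent or --agent-import-path, not both")
    if agent_import_path:
        module_name, sep, class_name = agent_import_path.partition(":")
        if not sep or not module_name or not class_name:
            raise ValueError("expected an import path of the form 'module:ClassName'")
        # Import the module, then look the class up by name.
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    return agent
```

With this shape, `resolve_agent(agent_import_path="agents.langchain_o11y_agent:LangChainO11yBenchAgent")` would return the class object, provided the module is importable from the working directory.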
By default, agents interact with Grafana through mcp-grafana (MCP tools). To benchmark
agents using the gcx CLI instead, use the GcxOpenCodeAgent agent class:
mise run bench:job -- --model anthropic/claude-sonnet-4-6 --task-name query-cpu-metrics --agent-import-path agents.gcx_opencode_agent:GcxOpenCodeAgent
This runs OpenCode with gcx and gcx skills pre-installed in the container, but
with MCP tools stripped so the agent can only reach Grafana through gcx.
The container image has GRAFANA_SERVER and GRAFANA_ORG_ID set so gcx
connects to the sidecar Grafana automatically.
This agent can only run models available in OpenCode.
To test gcx changes before they are on main, you can use a locally-built binary instead of the published release.
Set LOCAL_GCX to the path of a gcx executable for Linux when running preflight or bench runs.
When set, the Docker image will use your local binary instead of downloading
from GitHub. The environment/gcx file is gitignored and cleaned up after build.
Use GOARCH=amd64 if your Docker is running x86_64 images.
cd ~/workspace/gcx && GOOS=linux GOARCH=arm64 mise run build
LOCAL_GCX=/path/to/gcx/bin/gcx-linux mise run bench:job -- --model openai/gpt-5.4-nano
If your model is reachable through Harbor and LiteLLM, pass it as provider/model.
Examples:
mise run bench:job -- --model openai/gpt-5.4-mini
mise run bench:job -- --model anthropic/claude-haiku-4-5-20251001
mise run bench:job -- --model google/gemini-3-flash-preview
You can run any model available on OpenRouter by using the openrouter/ prefix.
Set OPENROUTER_API_KEY in your environment, then:
mise run bench:job -- --model openrouter/deepseek/deepseek-v3.2 --job-name openrouter-deepseek-v3-2
You can dry-run the job planner without executing Harbor:
uv run python -m o11y_bench job --model openai/gpt-5.4-nano --task-name query-cpu-metrics --dry-run
Run the standard comparison suite:
mise run bench:suite
By default, bench:suite resumes the latest suite directory when possible.
To force a fresh suite directory:
mise run bench:suite -- --jobs-dir jobs/full-suite-$(date +%Y%m%d-%H%M%S)
To disable resume explicitly:
mise run bench:suite -- --no-resume --jobs-dir jobs/full-suite-$(date +%Y%m%d-%H%M%S)
To reduce local resource pressure:
mise run bench:suite -- --jobs-dir jobs/full-suite-$(date +%Y%m%d-%H%M%S) --n-concurrent 1
The standard suite currently covers these provider/model/reasoning combinations:
- Anthropic
  - claude-haiku-4-5-20251001: off, low, high
  - claude-opus-4-6: off, low, high
  - claude-sonnet-4-5: off, high
  - claude-sonnet-4-6: off, low, high
- OpenAI
  - gpt-5.1-codex-mini: off, high
  - gpt-5.2-codex: off, high
  - gpt-5.2-2025-12-11: off, high
  - gpt-5.4-2026-03-05: off, low, high
  - gpt-5.4-mini: off, low, high
  - gpt-5.4-nano: off, low, high
- Google
  - gemini-3-flash-preview: off, high
  - gemini-3.1-pro-preview: off, low, high
  - gemini-3.1-flash-lite-preview: off, low, high
The suite uses the default repo agent.
If you want to benchmark a custom agent across the same matrix, run bench:job variants yourself
or extend suite orchestration in code.
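One way to run bench:job variants yourself is to generate one command per model and reasoning level. The sketch below only builds command strings and does not execute anything. `SUITE` and `custom_agent_commands` are hypothetical names, and it assumes that each model takes its provider prefix and that `--reasoning-effort` accepts `off` and `low` in addition to `high`.

```python
import shlex

# Model -> reasoning levels, transcribed from the standard suite described above.
SUITE = {
    "anthropic/claude-haiku-4-5-20251001": ["off", "low", "high"],
    "anthropic/claude-opus-4-6": ["off", "low", "high"],
    "anthropic/claude-sonnet-4-5": ["off", "high"],
    "anthropic/claude-sonnet-4-6": ["off", "low", "high"],
    "openai/gpt-5.1-codex-mini": ["off", "high"],
    "openai/gpt-5.2-codex": ["off", "high"],
    "openai/gpt-5.2-2025-12-11": ["off", "high"],
    "openai/gpt-5.4-2026-03-05": ["off", "low", "high"],
    "openai/gpt-5.4-mini": ["off", "low", "high"],
    "openai/gpt-5.4-nano": ["off", "low", "high"],
    "google/gemini-3-flash-preview": ["off", "high"],
    "google/gemini-3.1-pro-preview": ["off", "low", "high"],
    "google/gemini-3.1-flash-lite-preview": ["off", "low", "high"],
}


def custom_agent_commands(agent_import_path: str) -> list:
    """Build one bench:job command line per (model, reasoning level) pair."""
    commands = []
    for model, levels in SUITE.items():
        for level in levels:
            args = [
                "mise", "run", "bench:job", "--",
                "--model", model,
                "--reasoning-effort", level,
                "--agent-import-path", agent_import_path,
            ]
            # shlex.join quotes any argument that needs it for a POSIX shell.
            commands.append(shlex.join(args))
    return commands
```

Printing the result of `custom_agent_commands("agents.langchain_o11y_agent:LangChainO11yBenchAgent")` gives a list you can review before running the commands one at a time.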
Single-job report:
jobs/<job-name>/run_report.html
Full-suite report:
jobs/<suite-id>/comparison.html
Useful artifacts inside each trial directory:
- agent/instruction.txt
- agent/trajectory.json
- agent/command-0/stdout.txt
- verifier/grading_details.json
- verifier/reward.txt
- result.json
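These artifacts are plain files, so they are easy to post-process. As a rough sketch, the helper below averages every verifier/reward.txt found under a job directory. `mean_reward` is a hypothetical name, and it assumes each reward.txt holds a single number, which this document does not specify.

```python
from pathlib import Path
from statistics import mean
from typing import Optional


def mean_reward(job_dir: Path) -> Optional[float]:
    """Average all verifier/reward.txt values below job_dir, or None if none exist."""
    rewards = []
    for reward_file in sorted(job_dir.glob("**/verifier/reward.txt")):
        text = reward_file.read_text().strip()
        if text:
            # Assumption: the file contains one numeric reward.
            rewards.append(float(text))
    return mean(rewards) if rewards else None
```

For richer analysis, verifier/grading_details.json and result.json are likely better sources, but their schemas are not described here.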
If you changed grading and want to reuse existing transcripts without rerunning agents:
uv run python -m o11y_bench regrade --jobs-dir jobs/<suite-id> --path tasks
To regrade a specific job:
uv run python -m o11y_bench regrade --jobs-dir jobs --job-name <job-name> --path tasks
regrade reruns the verifier against saved transcripts and rewrites verifier outputs in place.
For tasks whose checks need the live Grafana stack, it also starts a temporary local sidecar stack.
Rebuild a single run report:
uv run python -m reporting.run_report --job-dir jobs/<job-name>
Rebuild a suite report:
mise run report -- --jobs-dir jobs/<suite-id>
Compare two job directories directly:
uv run python -m reporting.compare_report --job-dir jobs/<suite-id>/<job-a> --job-dir jobs/<suite-id>/<job-b>
After completing a benchmark run, you can submit your results to the o11y-bench leaderboard via the Hugging Face submission repo.
To submit:
- Fork the submission repo
- Create a branch and add your completed job directory under submissions/o11y-bench/1.0/<agent>__<model>/
- Include a metadata.yaml with agent and model info
- Open a Pull Request
See the submission repo README for the full submission structure, validation rules, and example layout.
mise run setup:sync
mise run setup:preflight
mise run setup:smoke
mise run lint
mise run format
mise run typecheck
mise run test
mise run bench:job -- --model openai/gpt-5.4-nano --task-name query-cpu-metrics --n-concurrent 1
mise run bench:suite
- Edit tasks-spec/, not tasks/
- Regenerate tasks after task-spec changes with mise run setup:sync
- Keep tests small and behavior-focused
- Be conservative with local concurrency if Docker resources are limited
This project is licensed under the GNU Affero General Public License v3.0.