
fix(bench): dispatcher singleton + vocab snap + object_a1 + llm_alone control arm#2759

Merged
YauhenBichel merged 15 commits into main from fix/2074-bench-running-fixing-docs-clean-code on Jun 6, 2026

Conversation

@YauhenBichel

Collaborator

Fixes #2074

Two related improvements to CloudOpsBench measurement, both targeting the same root issue surfaced by the 06-05 bench data: the agent solves these cases (best-of-3 oracle = 83% / 80%) but loses ~0.30 A@1 to inconsistency, and production opensre tools were leaking into the bench investigation (88% of cases cited AccessDenied errors as evidence).

Lever #0 — _filter_tools hook on ConnectedInvestigationAgent

Bench investigations were reaching production tools (app/tools/EKSEventsTool/, HermesLogsTool/, etc.) that hit live AWS / Hermes endpoints the bench Fargate task role cannot reach. The LLM was citing AccessDenied responses as evidence. Now the agent's tool_schemas payload contains only bench-package tools.

  • app/agent/investigation.py: new optional _filter_tools(self, tools) -> list[RegisteredTool] hook. The default returns its input unchanged (zero production behavior change). It is wired into run() between tool discovery and state derivation, so filtered tools also disappear from state["available_sources"] and state["available_action_names"].
  • tests/benchmarks/cloudopsbench/bench_agent.py: BenchInvestigationAgent overrides _filter_tools to whitelist by the origin_module prefix tests.benchmarks.cloudopsbench.tools. (with the trailing dot). The whitelist is a ClassVar[tuple[str, ...]] (ALLOWED_TOOL_MODULE_PREFIXES), so a one-off experiment can override it without rebuilding. Tools with an empty origin_module log a WARNING (registry-bug signal) instead of silently disappearing.
  • app/tools/registry.py: separation-of-concerns hygiene. Removed bench/benchmark mentions from the register_external_tool_package docstring; production code shouldn't name a specific consumer.
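
A minimal sketch of how such a hook could look, assuming a simplified RegisteredTool that carries only name and origin_module. The real classes in app/agent/investigation.py and bench_agent.py hold more state, and dropping empty-module tools after the warning is an assumption of this sketch.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import ClassVar

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class RegisteredTool:
    """Hypothetical stand-in for the registry's tool record."""

    name: str
    origin_module: str


class ConnectedInvestigationAgent:
    def _filter_tools(self, tools: list[RegisteredTool]) -> list[RegisteredTool]:
        # Default hook: return the input unchanged, so production is unaffected.
        return tools


class BenchInvestigationAgent(ConnectedInvestigationAgent):
    # The trailing dot means only modules *inside* the package match;
    # the bare package root itself does not.
    ALLOWED_TOOL_MODULE_PREFIXES: ClassVar[tuple[str, ...]] = (
        "tests.benchmarks.cloudopsbench.tools.",
    )

    def _filter_tools(self, tools: list[RegisteredTool]) -> list[RegisteredTool]:
        kept: list[RegisteredTool] = []
        for tool in tools:
            if not tool.origin_module:
                # Assumption: warn, then drop, so a registry bug stays visible.
                logger.warning("tool %r has no origin_module; dropping", tool.name)
                continue
            # str.startswith accepts a tuple of candidate prefixes.
            if tool.origin_module.startswith(self.ALLOWED_TOOL_MODULE_PREFIXES):
                kept.append(tool)
        return kept
```

With this shape, a tool registered from app.tools.EKSEventsTool is dropped, while a tool from any submodule of the bench tools package is kept. Overriding ALLOWED_TOOL_MODULE_PREFIXES on a subclass is enough to widen the whitelist for an experiment.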

Lever #2.5 — consistency-selected stratum (majority vote on top-prediction taxonomy)

The bench runs 3 self-consistency seeds per (case, model) and reports median A@1 across all 3. Because per-run A@1 is binary, the median credits a case only when at least 2 of the 3 seeds are correct, so it ignores the bo3 oracle ceiling, which needs just one correct seed. A new optional adapter hook plus framework wiring produce an additional per-stratum entry that picks 1 of 3 by majority vote on the predicted root-cause taxonomy.

  • tests/benchmarks/_framework/adapters.py: new optional BenchmarkAdapter.select_best_run(case, runs) -> int | None method. The default returns None, so benchmarks without multi-seed protocols are unaffected.
  • tests/benchmarks/_framework/runner.py: _aggregate_per_stratum accepts adapter=, groups cells by (case_id, mode, llm), calls the selector per group, and emits a consistency-selected stratum alongside the existing all median. Selector exceptions are logged and the entry is skipped; they never abort the report.
  • tests/benchmarks/cloudopsbench/adapter.py: select_best_run implementation. Majority vote on final_diagnosis.top_3_predictions[0].fault_taxonomy, tiebreak by earliest run index. Returns None only when no run produced any prediction. Zero extra LLM-call cost.
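
The selection rule above can be sketched as follows. This is an illustration under stated assumptions, not the adapter's actual code: run records are modelled as plain dicts, and top_taxonomy is a hypothetical helper.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


def top_taxonomy(run: dict[str, Any]) -> str:
    """Return the top-1 fault taxonomy of a run, or "" if it has none."""
    preds = (run.get("final_diagnosis") or {}).get("top_3_predictions") or []
    if not preds:
        return ""
    return str(preds[0].get("fault_taxonomy") or "").strip()


def select_best_run(case: Any, runs: list[dict[str, Any]]) -> int | None:
    # Keep (run index, taxonomy) pairs only for runs that predicted something.
    votes = [(i, top_taxonomy(r)) for i, r in enumerate(runs)]
    votes = [(i, t) for i, t in votes if t]
    if not votes:
        return None  # no run produced any prediction

    counts = Counter(t for _, t in votes)
    best = max(counts.values())
    # Walk runs in index order, so ties resolve to the earliest run.
    for i, t in votes:
        if counts[t] == best:
            return i
    return None
```

For three seeds predicting network, storage, storage, the function returns 1, the first run that carries the majority taxonomy. No LLM call is involved, which is why the selector adds no cost.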

Docs: tests/benchmarks/cloudopsbench/README.md updated with the local-dev vs AWS-Fargate comparison table, the three-workflow CI chain documentation, the key-rotation procedure, and the rollback procedure.

Tests — 27 new + 246 existing tests pass across tests/benchmarks/ and tests/agent/:

| File | Tests |
| --- | --- |
| tests/benchmarks/cloudopsbench/test_bench_agent.py | 19 (4 added for filter + 2 Greptile edge cases: prefix-root drop, empty-module warning) |
| tests/benchmarks/cloudopsbench/test_consistency_selector.py (new) | 9 (unanimous, 2-of-3, all-different, blank predictions, single run, edge cases) |
| tests/benchmarks/_framework/test_runner_aggregation.py | 6 new (no-adapter default, None=skip, picked metrics override median, called-once-per-scenario, exception swallowed, out-of-bounds index) |

Lint (ruff), format-check (ruff format), typecheck (mypy) all clean on touched files.

Demo/Screenshot for feature changes and bug fixes -

Live run after deploying Lever #0 (dev-2026-06-05T11-46-43Z):

| Metric | 06-05 08:45 (pre-#0) | 06-05 11:46 (post-#0) | Δ |
| --- | --- | --- | --- |
| gpt-4o A@1 (mean) | 0.467 | 0.511 | +0.044 (beats paper 0.49) |
| gpt-5 A@1 (mean) | 0.522 | 0.633 | +0.111 (within 4 pts of 0.67) |
| EKS/Hermes leak rate | 81% / 74% | 3% / 0% | leak essentially gone |
| Tool coverage (cov) | 0.64 / 0.60 | 0.80 / 0.80 | +0.16 / +0.20 |
| Median steps | 7 / 7 | 9 / 9 | +2 (more meaningful investigation depth) |

Lever #2.5 projected against the 11:46 case data (framework code not yet in the image; replay validates the lift before deploy):

| Model | live mean | projected consistency-selected | vs paper baseline |
| --- | --- | --- | --- |
| gpt-4o | 0.511 | 0.567 | +0.077 ✓ beats paper |
| gpt-5 | 0.633 | 0.667 | −0.003 ≈ matches paper |

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The 06-05 bench data made two failure modes visible. First, 88% of case-runs had get_eks_events or get_hermes_logs in their cited evidence — production tools were leaking into the bench agent and the LLM was citing AccessDenied failures as if they were real evidence. Second, even after Levers #1 (MIN_TOOL_CALLS=8) and #2 (capability-first tool descriptions) lifted A@1 from ~0 → 0.47 / 0.52, the oracle best-of-3 was 0.83 / 0.80 — the agent already solves the cases, the framework was just picking blindly.

Alternatives considered for Lever #0:

  • Grant the bench task role EKS + Hermes read permissions. Rejected — the bench is supposed to run against deterministic State-Snapshot replay data per the Cloud-OpsBench paper protocol. Real-world reads would make results non-reproducible and would not fix the cross-account sts:AssumeRole failure mode that surfaced in the 10:07 trace.
  • Block at the tool level (each EKS / Hermes tool checks a "bench mode" flag and refuses). Rejected — violates separation of concerns; production tools must not know about benchmarking; the check would accumulate across every new production tool.
  • Hardcode an exact tool-name whitelist on the bench agent. Rejected — drifts out of sync the moment a new bench tool is added. origin_module prefix picks them up automatically.

Alternatives considered for Lever #2.5:

  • LLM-as-judge selector (one extra LLM call per scenario). Viable but ~$3 extra per bench. Deferred to a follow-up if majority-vote hits a ceiling. The hook is generic so the swap is one method body.
  • Increase runs_per_case from 3 to 5, keep median. Diminishing returns; doesn't close the median/bo3 gap, because the median still needs a majority of seeds to be correct while the bo3 oracle needs only one.
  • Free deterministic selectors based on cov, citation_grounding, steps, etc. Empirically tested against the 06-05 data; they fail to capture the bo3 gap. Only structured-prediction majority vote captures it because it's the prediction the paper actually scores against.
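
To make the median/bo3 gap concrete, here is a toy calculation on made-up per-seed outcomes. The case names and values are hypothetical and are not taken from the 06-05 data.

```python
from __future__ import annotations

# Hypothetical per-seed A@1 outcomes (1 = top-1 prediction correct).
outcomes: dict[str, list[int]] = {
    "case-a": [1, 1, 0],
    "case-b": [0, 0, 1],
    "case-c": [0, 0, 0],
    "case-d": [1, 1, 1],
}


def summarize(per_case: dict[str, list[int]]) -> dict[str, float]:
    n = len(per_case)
    runs = list(per_case.values())
    return {
        # Middle element of the sorted seeds: the median for an odd seed count.
        "median": sum(sorted(v)[len(v) // 2] for v in runs) / n,
        # Average of the per-case mean over seeds.
        "mean": sum(sum(v) / len(v) for v in runs) / n,
        # Best-of-3 oracle: a case counts if any seed got it right.
        "bo3": sum(max(v) for v in runs) / n,
    }


summary = summarize(outcomes)
```

On this toy data the median and the mean both land at about 0.5, while bo3 is 0.75. The difference comes entirely from case-b, where only one seed is correct: the oracle credits it and the median does not.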

Empirical validation: Lever #0 was deployed in the 11:46 run and lifted gpt-4o A@1 mean from 0.467 → 0.511 (beats paper baseline 0.49) and gpt-5 from 0.522 → 0.633 (within 4 pts of 0.67). Lever #2.5 was replayed against the same 11:46 case data and projects to gpt-4o 0.567 / gpt-5 0.667 — the latter matching the paper baseline exactly.


Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

github-actions Bot commented Jun 5, 2026

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel changed the title from "report consistency-selected A@1" to "fix(bench): isolate bench agent tool set; report consistency-selected A@1" Jun 5, 2026
@greptile-apps

greptile-apps Bot commented Jun 5, 2026

Contributor

Greptile Summary

This PR delivers two measurement-quality fixes surfaced by the 2026-06-05 CloudOpsBench run: a tool-filter hook that prevents production opensre tools from leaking into bench investigations (eliminating the 88% AccessDenied citation rate), and a majority-vote consistency selector that closes 60–100% of the median/oracle gap without any extra LLM calls. It also fixes two previously silent bugs — OpenAI tool-call trimming was skipped entirely (causing gpt-4o context overflows), and the agent LLM client singleton wasn't being reset between LLMs in multi-model grids.

  • Lever #0 (_filter_tools hook): BenchInvestigationAgent now whitelists tools by origin_module prefix; the hook is a no-op in production. Three bench agent classes define the opensre+llm / llm_alone / llm_alone_pure three-arm contrast needed for attributable lift measurements.
  • Lever #2.5 (consistency-selected stratum): BenchmarkAdapter.select_best_run is an opt-in hook; _aggregate_per_stratum emits a consistency-selected stratum alongside the existing median all stratum. Selector exceptions are swallowed so the report is never aborted.
  • Bug fixes: _trim_oldest_tool_pair now handles OpenAI's tool_calls shape; _context_budget_ceiling_for_model derives the trim ceiling per model; _reset_opensre_singletons resets the agent LLM client singleton; git status --porcelain path slicing is fixed.
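
The consistency-selected aggregation described in the second bullet could be sketched roughly as below. The cell shape, the a1 key, averaging per-group medians into the all entry, and skipping out-of-range indices are all assumptions of this sketch rather than the runner's actual behavior.

```python
from __future__ import annotations

import logging
from collections import defaultdict
from statistics import median
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)

# Hypothetical selector type: takes one group's runs, returns an index or None.
Selector = Callable[[list[dict[str, Any]]], Optional[int]]


def aggregate_per_stratum(
    cells: list[dict[str, Any]], selector: Selector | None = None
) -> dict[str, float]:
    # Group seeds that belong to the same (case, mode, model) scenario.
    groups: dict[tuple[str, str, str], list[dict[str, Any]]] = defaultdict(list)
    for cell in cells:
        groups[(cell["case_id"], cell["mode"], cell["llm"])].append(cell)

    medians: list[float] = []
    picked: list[float] = []
    for key, runs in groups.items():
        medians.append(median(r["a1"] for r in runs))
        if selector is None:
            continue
        try:
            index = selector(runs)
        except Exception:
            # Log and skip this group; the report itself must not abort.
            logger.exception("selector failed for %s; skipping group", key)
            continue
        if index is None or not 0 <= index < len(runs):
            continue
        picked.append(runs[index]["a1"])

    report: dict[str, float] = {}
    if medians:
        report["all"] = sum(medians) / len(medians)
    if picked:
        report["consistency-selected"] = sum(picked) / len(picked)
    return report
```

Calling it without a selector yields only the all entry, which mirrors the opt-in nature of the hook.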

Confidence Score: 5/5

Safe to merge — all production code paths have zero-change defaults, and the bench-only changes are well-tested with 27 new tests.

The production investigation agent is unchanged by default: the _filter_tools hook returns its input unchanged, _build_system_prompt delegates to the existing builder, and the context-budget logic only applies model-specific ceilings. Bench-side changes are confined to the test tree and are thoroughly covered. The two style nits have no runtime impact.

No files require special attention for merge safety. app/agent/investigation.py has the most production impact but the changes are backwards-compatible improvements to context trimming.

Important Files Changed

| Filename | Overview |
| --- | --- |
| app/agent/investigation.py | Adds per-model context-window sizing, OpenAI tool-trimming support, and a _truncate_largest_message escape hatch; introduces _filter_tools and _build_system_prompt hooks. Dict key ordering in _MODEL_CONTEXT_WINDOWS is correct but relies on insertion order. |
| tests/benchmarks/_framework/runner.py | Adds llm_alone / llm_alone_pure mode dispatch, consistency-selected stratum, and passes adapter= to _aggregate_per_stratum. Pre-flight guards and out-of-bounds index checks are in place. |
| tests/benchmarks/cloudopsbench/bench_agent.py | Extracts _filter_to_bench_package, adds BaselineLLMAloneAgent and PureBaselineAgent, makes MIN_TOOL_CALLS env-configurable. Class docstring still says Default 8 after the default changed to 5. |
| tests/benchmarks/cloudopsbench/adapter.py | Wires baseline_agent_class, pure_baseline_agent_class, and select_best_run; implements majority-vote selector, build_baseline_tools delegation, object_a1/a3 metrics, and latency_ms propagation for MTTI. |
| tests/benchmarks/cloudopsbench/predictor.py | Adds controlled-vocabulary snapping for root_cause and fault_object with a blocked-concept-pair guard. rerank_predictions_by_evidence is intentionally not wired in per the inline comment. |
| tests/benchmarks/_framework/llm_dispatch.py | Fixes silent multi-LLM grid bug: _reset_opensre_singletons now also resets the agent LLM client singleton so subsequent models don't reuse the first model's client. |
| tests/benchmarks/_framework/provenance.py | Fixes git status --porcelain path-slicing bug; adds BENCH_MIN_TOOL_CALLS to env allowlist; surfaces min_tool_calls in run_inputs for self-documenting floor sweeps. |
| tests/benchmarks/cloudopsbench/scoring.py | Adds object_a1/object_a3 metrics; fixes calculate_total_latency to prefer wall-clock latency_ms over unreliable per-step replay latencies. |
| tests/benchmarks/_framework/reporting.py | Adds paper-baseline reference data, paired-scenario delta statistics, and per-model comparison panels. Splits _cells_by_llm into _cells_by_llm_mode to avoid pooling opensre+llm and llm_alone cells. |
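
The multi-LLM grid bug noted for llm_dispatch.py is easiest to see in a toy model. The sketch below is purely illustrative: the cache, get_agent_llm_client, and reset_singletons are hypothetical names, not the project's API.

```python
from __future__ import annotations

# Hypothetical module-level cache standing in for the agent LLM client singleton.
_CLIENT_CACHE: dict[str, dict[str, str]] = {}


def get_agent_llm_client(model: str) -> dict[str, str]:
    # The singleton is built once and then returned regardless of `model`,
    # which is what lets a stale client leak into the next model's cells.
    if "client" not in _CLIENT_CACHE:
        _CLIENT_CACHE["client"] = {"model": model}
    return _CLIENT_CACHE["client"]


def reset_singletons() -> None:
    # Clearing the cache between models forces a fresh client to be built.
    _CLIENT_CACHE.clear()


first = get_agent_llm_client("gpt-4o")["model"]
stale = get_agent_llm_client("gpt-5")["model"]  # still the first model's client
reset_singletons()
fresh = get_agent_llm_client("gpt-5")["model"]
```

Without the reset, the second model in a grid silently runs on the first model's client, so its scores would be attributed to the wrong model.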

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[BenchmarkRunner._run_inner] --> B{mode?}
    B -->|opensre+llm| C[build_opensre_integrations + investigation_agent_class]
    B -->|llm_alone| D[build_baseline_tools + baseline_agent_class]
    B -->|llm_alone_pure| E[build_baseline_tools + pure_baseline_agent_class]
    C --> F[run_investigation]
    D --> F
    E --> F
    F --> G[_filter_tools hook - whitelist bench-package tools only]
    G --> H[ConnectedInvestigationAgent.run - ReAct loop]
    H --> I[_enforce_context_budget - per-model ceiling]
    I --> J{over budget?}
    J -->|trim tool pair| J
    J -->|exhausted| K[_truncate_largest_message]
    K --> J
    J -->|under budget| L[llm.invoke]
    L --> M[CaseScore + RunResult]
    M --> N[_aggregate_per_stratum]
    N --> O[all stratum - median across seeds]
    N --> P{select_best_run - majority vote}
    P -->|index returned| Q[consistency-selected stratum]
    P -->|None| R[skip group]
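
The budget loop in the flowchart (trim the oldest tool pair, fall back to truncating the largest message) might look like the sketch below. It assumes OpenAI-style chat messages, where an assistant message lists tool_calls with ids and each tool reply carries a matching tool_call_id. Measuring size in characters and the keep parameter are simplifications of this sketch; the function names only loosely mirror the ones in the review.

```python
from __future__ import annotations

from typing import Any

Message = dict[str, Any]


def message_size(messages: list[Message]) -> int:
    # Simplification: count characters of content, not tokens.
    return sum(len(str(m.get("content") or "")) for m in messages)


def trim_oldest_tool_pair(messages: list[Message]) -> bool:
    """Drop the oldest assistant tool call together with its tool replies."""
    for i, msg in enumerate(messages):
        calls = msg.get("tool_calls")
        if msg.get("role") == "assistant" and calls:
            ids = {call["id"] for call in calls}
            messages[:] = [
                m
                for j, m in enumerate(messages)
                if j != i
                and not (m.get("role") == "tool" and m.get("tool_call_id") in ids)
            ]
            return True
    return False


def truncate_largest_message(messages: list[Message], keep: int) -> bool:
    """Cut the longest message down to `keep` characters, if it is longer."""
    if not messages:
        return False
    idx = max(
        range(len(messages)),
        key=lambda i: len(str(messages[i].get("content") or "")),
    )
    content = str(messages[idx].get("content") or "")
    if len(content) <= keep:
        return False
    messages[idx] = {**messages[idx], "content": content[:keep]}
    return True


def enforce_context_budget(
    messages: list[Message], ceiling: int, keep: int = 200
) -> list[Message]:
    while message_size(messages) > ceiling:
        if trim_oldest_tool_pair(messages):
            continue
        if not truncate_largest_message(messages, keep):
            break  # nothing left to shrink
    return messages
```

Removing the assistant call and its replies together matters because a tool message whose tool_call_id has no matching call is rejected by the OpenAI chat API.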

Reviews (8): Last reviewed commit: "running experiment, set default min tool..." | Re-trigger Greptile

Comment thread tests/benchmarks/_framework/adapters.py
Comment on lines +412 to +413
if len(runs) <= 1:
    return 0 if runs else None


P1 Single blank-prediction run is included in consistency-selected

When len(runs) == 1 and that sole run has an empty taxonomy (predictor failed), select_best_run returns 0, so the runner writes the blank-prediction run into the consistency-selected stratum. The multi-run path (len > 1) would return None in the same scenario (all votes empty → if not votes: return None). The divergence means a degenerate 1-run batch with no prediction silently contaminates the selected stratum, even though the stated contract is "Returns None only when no run produced any prediction at all."
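
One way to remove the divergence described above is to let the vote itself handle the single-run case. The sketch below contrasts the two behaviors on a simplified input, a list of top-1 taxonomy strings; both function names are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def select_uniform(taxonomies: list[str]) -> int | None:
    """Majority vote with no special case for a single run."""
    votes = [(i, t.strip()) for i, t in enumerate(taxonomies) if t.strip()]
    if not votes:
        return None  # holds for one blank run as well as for many
    counts = Counter(t for _, t in votes)
    top = max(counts.values())
    # Earliest run whose taxonomy has the highest vote count.
    return next(i for i, t in votes if counts[t] == top)


def select_with_early_return(taxonomies: list[str]) -> int | None:
    """Mirrors the flagged lines: a lone run is picked without inspection."""
    if len(taxonomies) <= 1:
        return 0 if taxonomies else None
    return select_uniform(taxonomies)
```

For a single blank prediction, select_with_early_return returns 0 while select_uniform returns None, which matches the stated contract that None is returned when no run produced a prediction.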

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

YauhenBichel and others added 2 commits June 5, 2026 16:42
Move the llm_alone explanation into a comment block above `modes`,
tying it to the prereg comparison_protocol (opensre+llm vs llm_alone,
same model_version). No behavioral change — the mode was already wired.

Co-authored-by: Cursor <cursoragent@cursor.com>
@YauhenBichel changed the title from "fix(bench): isolate bench agent tool set; report consistency-selected A@1" to "fix(bench): dispatcher singleton + vocab snap + object_a1 + llm_alone control arm" Jun 5, 2026
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 6, 2026 13:50
@YauhenBichel YauhenBichel merged commit 2a98ba3 into main Jun 6, 2026
17 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-running-fixing-docs-clean-code branch June 6, 2026 14:00
@github-actions

github-actions Bot commented Jun 6, 2026

Contributor

😤 @YauhenBichel said "I will fix this" and then actually fixed it. Legendary behavior.

Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

1 participant