fix(bench): dispatcher singleton + vocab snap + object_a1 + llm_alone control arm #2759
@greptile review
Greptile Summary

This PR delivers two measurement-quality fixes surfaced by the 2026-06-05 CloudOpsBench run: a tool-filter hook that prevents production opensre tools from leaking into bench investigations (eliminating the 88% AccessDenied citation rate), and a majority-vote consistency selector that closes 60–100% of the median/oracle gap without any extra LLM calls. It also fixes two previously silent bugs: OpenAI tool-call trimming was skipped entirely (causing gpt-4o context overflows), and the agent LLM client singleton wasn't being reset between LLMs in multi-model grids.
Confidence Score: 5/5

Safe to merge: all production code paths have zero-change defaults, and the bench-only changes are well-tested with 27 new tests. The production investigation agent is unchanged by default: the `_filter_tools` hook returns its input unchanged, `_build_system_prompt` delegates to the existing builder, and the context-budget logic only applies model-specific ceilings. Bench-side changes are confined to the test tree and are thoroughly covered. The two style nits have no runtime impact. No files require special attention for merge safety. `app/agent/investigation.py` has the most production impact but the changes are backwards-compatible improvements to context trimming.

Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[BenchmarkRunner._run_inner] --> B{mode?}
B -->|opensre+llm| C[build_opensre_integrations + investigation_agent_class]
B -->|llm_alone| D[build_baseline_tools + baseline_agent_class]
B -->|llm_alone_pure| E[build_baseline_tools + pure_baseline_agent_class]
C --> F[run_investigation]
D --> F
E --> F
F --> G[_filter_tools hook - whitelist bench-package tools only]
G --> H[ConnectedInvestigationAgent.run - ReAct loop]
H --> I[_enforce_context_budget - per-model ceiling]
I --> J{over budget?}
J -->|trim tool pair| J
J -->|exhausted| K[_truncate_largest_message]
K --> J
J -->|under budget| L[llm.invoke]
L --> M[CaseScore + RunResult]
M --> N[_aggregate_per_stratum]
N --> O[all stratum - median across seeds]
N --> P{select_best_run - majority vote}
P -->|index returned| Q[consistency-selected stratum]
P -->|None| R[skip group]
Reviews (8): Last reviewed commit: "running experiment, set default min tool..."
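The budget loop in the flowchart (trim tool pairs first, fall back to truncating the largest message) can be sketched roughly as follows. This is an illustration only, not the code in `app/agent/investigation.py`: the message shape, the four-characters-per-token estimate, and the names `estimate_tokens`, `trim_oldest_tool_pair`, `truncate_largest_message`, and `enforce_context_budget` are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any

# Hypothetical message shape: {"role": ..., "content": ..., "tool_calls": ...}
Message = dict[str, Any]
TRUNCATION_MARKER = " [truncated]"


def estimate_tokens(messages: list[Message]) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return sum(len(str(m.get("content", ""))) for m in messages) // 4


def trim_oldest_tool_pair(messages: list[Message]) -> bool:
    """Drop the oldest assistant tool call together with its tool result.

    Returns False when no complete pair is left to trim.
    """
    for i in range(len(messages) - 1):
        if messages[i].get("tool_calls") and messages[i + 1].get("role") == "tool":
            del messages[i : i + 2]
            return True
    return False


def truncate_largest_message(messages: list[Message], keep_chars: int = 200) -> bool:
    """Shorten the longest message; returns False if nothing can shrink further."""
    if not messages:
        return False
    largest = max(messages, key=lambda m: len(str(m.get("content", ""))))
    content = str(largest.get("content", ""))
    # A message that was already truncated is exactly this long, so stop there.
    if len(content) <= keep_chars + len(TRUNCATION_MARKER):
        return False
    largest["content"] = content[:keep_chars] + TRUNCATION_MARKER
    return True


def enforce_context_budget(messages: list[Message], max_tokens: int) -> list[Message]:
    """Return a trimmed copy of ``messages`` that fits ``max_tokens`` if possible."""
    trimmed = [dict(m) for m in messages]  # shallow copies; the input is untouched
    while estimate_tokens(trimmed) > max_tokens:
        if trim_oldest_tool_pair(trimmed):
            continue
        if not truncate_largest_message(trimmed):
            break  # nothing left to shrink; the caller decides what happens next
    return trimmed
```

Removing the assistant tool call and its tool result together keeps the transcript well-formed for APIs that reject an orphaned half of the pair. A per-model ceiling would simply be a different `max_tokens` per model name.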
    if len(runs) <= 1:
        return 0 if runs else None
Single blank-prediction run is included in `consistency-selected`
When len(runs) == 1 and that sole run has an empty taxonomy (predictor failed), select_best_run returns 0, so the runner writes the blank-prediction run into the consistency-selected stratum. The multi-run path (len > 1) would return None in the same scenario (all votes empty → if not votes: return None). The divergence means a degenerate 1-run batch with no prediction silently contaminates the selected stratum, even though the stated contract is "Returns None only when no run produced any prediction at all."
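One way to remove the divergence is to drop the early return and run the same vote for every batch size, so a single blank run falls through to the `if not votes` branch. The sketch below assumes each run is a plain dict shaped like `final_diagnosis.top_3_predictions[0].fault_taxonomy`; the real run objects, the helper name `_top_taxonomy`, and the standalone signature are assumptions, not the adapter's actual code.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


def _top_taxonomy(run: dict[str, Any]) -> str:
    """Return the top-1 predicted fault taxonomy, or "" when it is missing."""
    predictions = (run.get("final_diagnosis") or {}).get("top_3_predictions") or []
    if not predictions:
        return ""
    return str(predictions[0].get("fault_taxonomy") or "").strip()


def select_best_run(runs: list[dict[str, Any]]) -> int | None:
    """Majority vote on the top-1 taxonomy; ties go to the earliest run index.

    Returns None when no run produced a prediction, including the case of a
    single run whose prediction is blank.
    """
    votes: list[tuple[int, str]] = []
    for index, run in enumerate(runs):
        taxonomy = _top_taxonomy(run)
        if taxonomy:  # blank predictions never vote
            votes.append((index, taxonomy))
    if not votes:
        return None

    counts = Counter(taxonomy for _, taxonomy in votes)
    best = max(counts.values())
    # votes is in run order, so the first match is the earliest index.
    for index, taxonomy in votes:
        if counts[taxonomy] == best:
            return index
    return None  # unreachable; keeps type checkers satisfied
```

With this shape, a one-run batch with a real prediction still returns 0, while a one-run batch with an empty taxonomy returns `None`, which matches the stated contract.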
Move the llm_alone explanation into a comment block above `modes`, tying it to the prereg comparison_protocol (opensre+llm vs llm_alone, same model_version). No behavioral change: the mode was already wired.

Co-authored-by: Cursor <cursoragent@cursor.com>
😤 @YauhenBichel said "I will fix this" and then actually fixed it. Legendary behavior.
Fixes #2074
Two related improvements to CloudOpsBench measurement, both targeting the same root issue surfaced by the 06-05 bench data: the agent solves these cases (best-of-3 oracle = 83% / 80%) but loses ~0.30 A@1 to inconsistency, and production opensre tools were leaking into the bench investigation (88% of cases cited AccessDenied errors as evidence).
Lever #0: `_filter_tools` hook on `ConnectedInvestigationAgent`

Bench investigations were reaching production tools (`app/tools/EKSEventsTool/`, `HermesLogsTool/`, etc.) that hit live AWS / Hermes endpoints the bench Fargate task role cannot reach. The LLM was citing AccessDenied responses as evidence. Now the agent's `tool_schemas` payload contains only bench-package tools.

- `app/agent/investigation.py`: new optional `_filter_tools(self, tools) -> list[RegisteredTool]` hook. Default returns input unchanged (zero production behavior change). Wired into `run()` between tool discovery and state derivation so filtered tools also disappear from `state["available_sources"]` and `state["available_action_names"]`.
- `tests/benchmarks/cloudopsbench/bench_agent.py`: `BenchInvestigationAgent` overrides `_filter_tools` to whitelist by `origin_module` prefix `tests.benchmarks.cloudopsbench.tools.`. The whitelist is a `ClassVar[tuple[str, ...]]` (`ALLOWED_TOOL_MODULE_PREFIXES`) so a one-off experiment can override it without rebuilding. Tools with an empty `origin_module` log a WARNING (registry-bug signal) instead of silently disappearing.
- `app/tools/registry.py`: separation-of-concerns hygiene. Removed bench/benchmark mentions from the `register_external_tool_package` docstring. Production code shouldn't name a specific consumer.

Lever #2.5: `consistency-selected` stratum (majority vote on top-prediction taxonomy)

The bench runs 3 self-consistency seeds per (case, model) and reports median A@1 across all 3. Median is strictly stricter than mean and ignores the bo3 oracle ceiling. New optional adapter hook + framework wiring produce an additional per-stratum entry that picks 1 of 3 by majority vote on the predicted root-cause taxonomy.

- `tests/benchmarks/_framework/adapters.py`: new optional `BenchmarkAdapter.select_best_run(case, runs) -> int | None` method. Default returns `None`, so benchmarks without multi-seed protocols are unaffected.
- `tests/benchmarks/_framework/runner.py`: `_aggregate_per_stratum` accepts `adapter=`, groups cells by `(case_id, mode, llm)`, calls the selector per group, and emits a `consistency-selected` stratum alongside the existing `all` median. Selector exceptions are logged and the entry is skipped; they never abort the report.
- `tests/benchmarks/cloudopsbench/adapter.py`: `select_best_run` implementation: majority vote on `final_diagnosis.top_3_predictions[0].fault_taxonomy`, tiebreak by earliest run index. Returns `None` only when no run produced any prediction. Zero extra LLM-call cost.

Docs

`tests/benchmarks/cloudopsbench/README.md` updated with the local-dev vs AWS-Fargate comparison table, the three-workflow CI chain documentation, the key-rotation procedure, and the rollback procedure.

Tests

27 new + 246 existing tests pass across `tests/benchmarks/` and `tests/agent/`:

- `tests/benchmarks/cloudopsbench/test_bench_agent.py`
- `tests/benchmarks/cloudopsbench/test_consistency_selector.py` (new)
- `tests/benchmarks/_framework/test_runner_aggregation.py`

Lint (ruff), format-check (ruff format), typecheck (mypy) all clean on touched files.
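The Lever #0 hook described above can be pictured with a small sketch. It is a simplified illustration under stated assumptions: `RegisteredTool` here is a stand-in dataclass with only the two fields the filter needs, the base class carries only the hook, the example module paths are made up, and dropping tools with an empty `origin_module` (after warning) is a guess at the intended behavior rather than a copy of the PR's code.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import ClassVar

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class RegisteredTool:
    """Hypothetical stand-in for the registry's tool record."""

    name: str
    origin_module: str = ""


class ConnectedInvestigationAgent:
    """Stand-in base class showing only the hook."""

    def _filter_tools(self, tools: list[RegisteredTool]) -> list[RegisteredTool]:
        # Default: pass everything through, so production behavior is unchanged.
        return tools


class BenchInvestigationAgent(ConnectedInvestigationAgent):
    # A class-level tuple lets a one-off experiment override it in a subclass.
    ALLOWED_TOOL_MODULE_PREFIXES: ClassVar[tuple[str, ...]] = (
        "tests.benchmarks.cloudopsbench.tools.",
    )

    def _filter_tools(self, tools: list[RegisteredTool]) -> list[RegisteredTool]:
        kept: list[RegisteredTool] = []
        for tool in tools:
            if not tool.origin_module:
                # An empty origin points at a registry bug, so make it visible.
                logger.warning("tool %r has no origin_module; excluding it", tool.name)
                continue
            # str.startswith accepts a tuple of prefixes.
            if tool.origin_module.startswith(self.ALLOWED_TOOL_MODULE_PREFIXES):
                kept.append(tool)
        return kept
```

Because the filter keys on module prefix instead of tool names, a new tool added under the bench package passes without touching the whitelist, and a subclass that redefines `ALLOWED_TOOL_MODULE_PREFIXES` can widen or narrow the set for a single experiment.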
Demo/Screenshot for feature changes and bug fixes

Live run after deploying Lever #0 (`dev-2026-06-05T11-46-43Z`):

Lever #2.5 projected against the 11:46 case data (framework code not yet in the image; replay validates the lift before deploy):

Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
The 06-05 bench data made two failure modes visible. First, 88% of case-runs had `get_eks_events` or `get_hermes_logs` in their cited evidence: production tools were leaking into the bench agent and the LLM was citing AccessDenied failures as if they were real evidence. Second, even after Levers #1 (MIN_TOOL_CALLS=8) and #2 (capability-first tool descriptions) lifted A@1 from ~0 → 0.47 / 0.52, the oracle best-of-3 was 0.83 / 0.80: the agent already solves the cases, the framework was just picking blindly.

Alternatives considered for Lever #0:

- `sts:AssumeRole` failure mode that surfaced in the 10:07 trace.
- `origin_module` prefix picks them up automatically.

Alternatives considered for Lever #2.5:

- `runs_per_case` from 3 to 5, keep median. Diminishing returns; doesn't close the median/bo3 gap because median is strictly stricter than mean is strictly stricter than max.

Empirical validation: Lever #0 was deployed in the 11:46 run and lifted gpt-4o A@1 mean from 0.467 → 0.511 (beats paper baseline 0.49) and gpt-5 from 0.522 → 0.633 (within 4 pts of 0.67). Lever #2.5 was replayed against the same 11:46 case data and projects to gpt-4o 0.567 / gpt-5 0.667, the latter matching the paper baseline exactly.
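To make the aggregates discussed above concrete, here is a toy replay that groups cells by `(case_id, mode, llm)` and reports the per-group median, the best-of-3 oracle, and the majority-vote pick. The `Cell` type, the function names, and every value in `TOY_CELLS` are invented for illustration; none of it is taken from the bench runs or the framework code.

```python
from __future__ import annotations

from collections import Counter, defaultdict
from statistics import mean, median
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical flattened record: one seed of one (case, mode, llm)."""

    case_id: str
    mode: str
    llm: str
    predicted: str  # top-1 fault taxonomy, "" when the run produced nothing
    truth: str


def majority_pick(predictions: list[str]) -> int | None:
    """Index of the majority prediction; ties go to the earliest run."""
    votes = [(i, p) for i, p in enumerate(predictions) if p]
    if not votes:
        return None
    counts = Counter(p for _, p in votes)
    top = max(counts.values())
    return next(i for i, p in votes if counts[p] == top)


def _mean_or_nan(values: list[float]) -> float:
    return mean(values) if values else float("nan")


def summarize(cells: list[Cell]) -> dict[str, float]:
    groups: dict[tuple[str, str, str], list[Cell]] = defaultdict(list)
    for cell in cells:
        groups[(cell.case_id, cell.mode, cell.llm)].append(cell)

    medians, oracles, selected = [], [], []
    for group in groups.values():
        hits = [1.0 if c.predicted == c.truth else 0.0 for c in group]
        medians.append(median(hits))
        oracles.append(max(hits))  # best-of-N ceiling
        pick = majority_pick([c.predicted for c in group])
        if pick is not None:  # groups with no prediction are skipped
            selected.append(hits[pick])
    return {
        "median": _mean_or_nan(medians),
        "oracle": _mean_or_nan(oracles),
        "consistency_selected": _mean_or_nan(selected),
    }


def _seeds(case_id: str, truth: str, predictions: list[str]) -> list[Cell]:
    return [Cell(case_id, "opensre+llm", "toy-model", p, truth) for p in predictions]


# Invented data: four cases, three seeds each.
TOY_CELLS = (
    _seeds("c1", "oom", ["oom", "oom", "net"])
    + _seeds("c2", "dns", ["net", "dns", "cfg"])
    + _seeds("c3", "cfg", ["cfg", "", "net"])
    + _seeds("c4", "net", ["oom", "oom", "net"])
)

if __name__ == "__main__":
    print(summarize(TOY_CELLS))
```

The selector never sees `truth`, so it costs no extra LLM calls and cannot exceed the oracle; whether it beats the median on real data is an empirical question that the replay above answers for the 11:46 run.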
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.