fix(bench, CI): calling real EKS during bench running #2756
Greptile Summary
This PR fixes bench runs incorrectly calling live AWS/EKS and Hermes endpoints by adding a
Confidence Score: 5/5
Safe to merge: the hook is an identity no-op on the base class, so production investigations are unaffected; the bench subclass now consistently restricts the tool set to replay-only tools throughout the entire agent run.
The base-class _filter_tools returns its input unchanged, so every existing production code path continues without modification. The bench override is self-contained in the test tree, well-tested with five new unit tests, and applies uniformly to tool schemas, seed calls, and all parallel execution. No migration or schema changes are involved.
No files require special attention.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Run as run()
participant GAT as _get_available_tools()
participant FT as _filter_tools()
participant BCT as _build_connected_tool_context()
participant LLM as LLM loop
Run->>GAT: resolved_integrations
GAT-->>Run: all available tools (prod + bench)
Run->>FT: all available tools
note over FT: Base class: identity (production)<br/>BenchAgent: keep origin_module<br/>startswith(ALLOWED_TOOL_MODULE_PREFIXES)
FT-->>Run: filtered tools
Run->>BCT: filtered tools
BCT-->>Run: available_sources, available_action_names
Run->>LLM: tool_schemas(filtered tools), seed_calls(filtered tools), _run_parallel(filtered tools)
Reviews (3): Last reviewed commit: "fixed greptile notes"
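The call order in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Tool` dataclass, the `SketchAgent` and `ReplayOnlyAgent` classes, the tool names, and the module strings are hypothetical stand-ins, not the real opensre code. Only the method names `_get_available_tools`, `_filter_tools`, and `_build_connected_tool_context` and the state keys come from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for a registered tool."""
    name: str
    source: str
    origin_module: str


class SketchAgent:
    """Hypothetical agent that mirrors the call order in the diagram."""

    def _get_available_tools(self) -> list[Tool]:
        # Assumption: the registry hands back production and bench tools merged.
        return [
            Tool("eks_events", "eks", "app.tools.EKSEventsTool"),
            Tool("replay_events", "replay", "tests.benchmarks.cloudopsbench.tools.replay"),
        ]

    def _filter_tools(self, tools: list[Tool]) -> list[Tool]:
        # Base behaviour: identity, so a production run sees every tool.
        return tools

    def _build_connected_tool_context(self, tools: list[Tool]) -> dict[str, list[str]]:
        # The context is derived only from tools that survived filtering.
        return {
            "available_sources": sorted({t.source for t in tools}),
            "available_action_names": [t.name for t in tools],
        }

    def run(self) -> dict[str, list[str]]:
        # Filter first, then build the context from the filtered list.
        tools = self._filter_tools(self._get_available_tools())
        return self._build_connected_tool_context(tools)


class ReplayOnlyAgent(SketchAgent):
    """Hypothetical subclass that drops everything outside the test tree."""

    def _filter_tools(self, tools: list[Tool]) -> list[Tool]:
        return [t for t in tools if t.origin_module.startswith("tests.")]
```

Because the context is built after filtering, a dropped tool disappears from both `available_sources` and `available_action_names`, not only from the schema list.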

Fixes #2074 (bench runs were calling real EKS)
Describe the changes you have made in this PR -
Issue: bench runs still call real EKS:
Code Understanding and AI Usage
Explain your implementation approach:
Root cause. The CloudOpsBench investigation agent was reaching into PRODUCTION opensre tools (app/tools/EKSEventsTool/, app/tools/HermesLogsTool/, etc.) that hit live AWS / Hermes endpoints. The bench Fargate task role intentionally cannot reach those: bench cases are supposed to run against deterministic State-Snapshot replay data per the Cloud-OpsBench paper protocol. Both production tools and bench-package tools were visible to the agent because the registry merges them and the agent's tool_schemas payload included everything.

Fix. One generic hook on the production agent, one override on the bench subclass:

app/agent/investigation.py: _filter_tools(self, tools) hook on ConnectedInvestigationAgent. Default returns input unchanged. Wired into run() between _get_available_tools(resolved) and _build_connected_tool_context(resolved, tools). Filtered tools also disappear from state["available_sources"] / state["available_action_names"], so the agent is not told about sources it can't reach. Zero behavior change for production.

tests/benchmarks/cloudopsbench/bench_agent.py: BenchInvestigationAgent overrides _filter_tools to keep only tools whose origin_module starts with tests.benchmarks.cloudopsbench.tools. (trailing dot included). The whitelist is a ClassVar[tuple[str, ...]] (ALLOWED_TOOL_MODULE_PREFIXES) so a one-off experiment can override it without rebuilding the agent, the same convention as MIN_TOOL_CALLS.

tests/benchmarks/cloudopsbench/test_bench_agent.py: ALLOWED_TOOL_MODULE_PREFIXES is overridable.

Separation-of-concerns hygiene in the same PR (same conceptual change: production code should not know about benchmarking):

app/agent/investigation.py (_filter_tools docstring)
app/tools/registry.py (register_external_tool_package comment + docstring)

Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.
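
To make the implementation approach above concrete, here is a minimal sketch of the prefix whitelist as a class-level tuple. It is not the code in this PR: `Tool`, `BaseAgent`, `BenchAgent`, `ExperimentAgent`, and the `tests.benchmarks.experimental.` prefix are hypothetical names chosen for illustration. Only `_filter_tools`, `origin_module`, `ALLOWED_TOOL_MODULE_PREFIXES`, and the `tests.benchmarks.cloudopsbench.tools.` prefix come from the description.

```python
from dataclasses import dataclass
from typing import ClassVar


@dataclass(frozen=True)
class Tool:
    """Hypothetical stand-in for a registered tool."""
    name: str
    origin_module: str


class BaseAgent:
    """Hypothetical production agent: the hook is an identity no-op."""

    def _filter_tools(self, tools: list[Tool]) -> list[Tool]:
        return tools


class BenchAgent(BaseAgent):
    """Hypothetical bench agent: keep only tools from allowed modules."""

    ALLOWED_TOOL_MODULE_PREFIXES: ClassVar[tuple[str, ...]] = (
        "tests.benchmarks.cloudopsbench.tools.",
    )

    def _filter_tools(self, tools: list[Tool]) -> list[Tool]:
        # str.startswith accepts a tuple and matches any of its prefixes.
        prefixes = self.ALLOWED_TOOL_MODULE_PREFIXES
        return [t for t in tools if t.origin_module.startswith(prefixes)]


class ExperimentAgent(BenchAgent):
    """One-off experiment: widen the whitelist without touching the filter."""

    # Hypothetical extra prefix, added only for this illustration.
    ALLOWED_TOOL_MODULE_PREFIXES = BenchAgent.ALLOWED_TOOL_MODULE_PREFIXES + (
        "tests.benchmarks.experimental.",
    )
```

Reading the tuple through `self` is what lets a subclass change the whitelist by reassigning one class attribute. The trailing dot in the prefix keeps a sibling package such as a hypothetical `tests.benchmarks.cloudopsbench.tools_extra` from matching by accident.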