fix(bench): bench experiments, refactoring, fixing code smell#2776
Greptile code review
This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads; see CONTRIBUTING.md.
Run a review: add a PR comment with: @greptile review
Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.
Optional: automate with the greploop skill.
|
@greptile review |
Greptile Summary
This PR refactors the benchmark framework by splitting the monolithic
Confidence Score: 5/5
The refactoring is clean and backward-compatible; all previously flagged runtime bugs (missing seed field, silent registry suppression, missing lint guard) are resolved in this PR.
The module split is mechanically straightforward with a complete backward-compat shim, the new agent classes are well-isolated bench-only code, and the three experiment configs are properly pre-registered and independently verified by thorough test suites. The only finding is a docstring that misstates the block ordering in _build_user_prompt for the opensre+llm path; the code itself is correct.
tests/benchmarks/cloudopsbench/predictor/llm_call.py: the _build_user_prompt docstring block-order description is inverted for the opensre+llm path relative to the implementation.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
CLI["CLI: bench run config.yml"] --> LC["load_config(path)"]
LC --> AV["apply_config_overrides(config)"]
AV -->|"min_tool_calls set"| MT["BenchInvestigationAgent.MIN_TOOL_CALLS = N"]
AV -->|"agent_variant=trimmed_prompt"| TP["investigation_agent_class → TrimmedPromptAgent"]
CLI --> RUN["BenchmarkRunner.run()"]
RUN --> LC2["adapter.load_cases(CaseFilters + seed)"]
LC2 --> CELL["per cell: mode x llm x run"]
CELL -->|"opensre+llm"| BENCH["BenchInvestigationAgent\n(floor=MIN_TOOL_CALLS)"]
CELL -->|"llm_alone"| BASE["BaselineLLMAloneAgent\n(no floor)"]
CELL -->|"llm_alone_pure"| PURE["PureBaselineAgent\n(minimal prompt, no floor)"]
BENCH --> FF["format_final_answer()\nemit_paper_predictions()"]
BASE --> FF
PURE --> FF
FF --> SC["score_case()\n15 paper metrics + validity"]
SC --> SBR["select_best_run()\nmajority vote on taxonomy"]
SBR --> REPORT["render_report_dir()"]
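
The select_best_run() step in the flowchart is described only as a "majority vote on taxonomy". A minimal Python sketch of that idea follows, under stated assumptions: the RunResult record, the function name select_best_run_sketch, the tie-breaking rule (the label seen first wins), and the "Other_Fault" label used in the usage note are all hypothetical and are not taken from the PR's code. Only the "Performance_Fault" label appears in this PR page.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical per-run record; the real bench types are not shown here."""

    run_index: int
    taxonomy: str  # predicted taxonomy category, e.g. "Performance_Fault"


def select_best_run_sketch(runs: list[RunResult]) -> RunResult:
    """Return the earliest run whose taxonomy equals the majority label.

    Assumption: when two labels are equally frequent, the label that
    appeared first among the runs wins.
    """
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(run.taxonomy for run in runs)
    # most_common keeps first-encountered order among equal counts.
    winner, _ = counts.most_common(1)[0]
    # Pick the first run that carries the winning label.
    return next(run for run in runs if run.taxonomy == winner)
```

For example, with three runs labelled "Performance_Fault", "Other_Fault", "Other_Fault", this sketch returns the second run (run_index 1), since "Other_Fault" holds the majority and that run is the first to carry it.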
Reviews (5): Last reviewed commit: "bench registry refactoring"
|
@greptile review |
|
|
||
| from __future__ import annotations | ||
|
|
||
| _TAXONOMY_CATEGORIES: tuple[str, ...] = ( |
| "Performance_Fault", | ||
| ) | ||
|
|
||
| _ROOT_CAUSES: tuple[str, ...] = ( |
| # strings the LLM emits as long as they match the case's ground-truth | ||
| # exactly (post-normalize), but giving the LLM the universe of known | ||
| # values keeps it from inventing prefixes. | ||
| _FAULT_OBJECT_SERVICES: tuple[str, ...] = ( |
| "ts-ticket-office-service", | ||
| ) | ||
|
|
||
| _FAULT_OBJECT_NODES: tuple[str, ...] = ("master", "worker-01", "worker-02", "worker-03") |
| ) | ||
|
|
||
| _FAULT_OBJECT_NODES: tuple[str, ...] = ("master", "worker-01", "worker-02", "worker-03") | ||
| _FAULT_OBJECT_NAMESPACES: tuple[str, ...] = ("boutique", "train-ticket") |
|
@greptile review |
|
@greptile review |
|
🧠 @YauhenBichel opened a PR. Maintainers feared them. CI genuflected. It merged. 🚨 👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome. |

Fixes #2074
Describe the changes you have made in this PR -
Cloudopsbench benchmark experiments
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
I changed the config files and analyzed the results.
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.