fix(bench): bench experiments, refactoring, fixing code smell#2776

Merged
YauhenBichel merged 7 commits into main from fix/2074-bench-experiment-floor0-full-run on Jun 9, 2026

@YauhenBichel

Collaborator

Fixes #2074

Describe the changes you have made in this PR -

CloudOpsBench benchmark experiments

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:
I changed the config files and analyzed the results.

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This PR refactors the benchmark framework by splitting the monolithic adapters.py into focused modules (types.py, adapter_base.py, registry.py) with a backward-compat shim, and adds three new CloudOpsBench experiment configs (floor=0 full-N, trimmed-prompt pilot, trimmed-prompt full-N) alongside the infrastructure to support them.

  • Splits predictor.py into a predictor/ package (vocabulary.py, snapping.py, rerank.py, llm_call.py), adds the select_best_run majority-vote hook and format_final_answer predictor integration to the adapter, and introduces BenchInvestigationAgentTrimmedPrompt, BaselineLLMAloneAgent, and PureBaselineAgent agent classes for the three-arm comparison design.
  • Addresses previously flagged issues: seed field is present on CaseFilters, registry import errors are now logged at WARNING before suppression, and the agent_variant cross-field lint guard is implemented in config.py.
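
The cross-field lint guard mentioned above could look roughly like the sketch below. This is an illustration, not the code in config.py: `BenchConfig`, `KNOWN_VARIANTS`, `lint_agent_variant`, and the rule itself (a variant only makes sense when the `opensre+llm` mode is enabled) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchConfig:
    """Hypothetical stand-in for the real config object in config.py."""

    modes: list[str] = field(default_factory=lambda: ["opensre+llm"])
    agent_variant: Optional[str] = None  # e.g. "trimmed_prompt"


# Assumed set of known variants; only "trimmed_prompt" appears in the PR.
KNOWN_VARIANTS = {"trimmed_prompt"}


def lint_agent_variant(config: BenchConfig) -> list[str]:
    """Return human-readable lint errors; an empty list means the config passes."""
    errors: list[str] = []
    if config.agent_variant is None:
        return errors
    if config.agent_variant not in KNOWN_VARIANTS:
        errors.append(f"unknown agent_variant: {config.agent_variant!r}")
    # Assumed cross-field rule: a variant only affects the opensre+llm arm,
    # so setting it without that mode is almost certainly a mistake.
    if "opensre+llm" not in config.modes:
        errors.append("agent_variant is set but 'opensre+llm' is not in modes")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a config loader report every issue in one pass.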

Confidence Score: 5/5

The refactoring is clean and backward-compatible; all previously flagged runtime bugs (missing seed field, silent registry suppression, missing lint guard) are resolved in this PR.

The module split is mechanically straightforward with a complete backward-compat shim, the new agent classes are well-isolated bench-only code, and the three experiment configs are properly pre-registered and independently verified by thorough test suites. The only finding is a docstring that misstates the block ordering in _build_user_prompt for the opensre+llm path; the code itself is correct.

tests/benchmarks/cloudopsbench/predictor/llm_call.py — the _build_user_prompt docstring block-order description is inverted for the opensre+llm path relative to the implementation.
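
The "silent registry suppression" fix referenced in this review follows a common pattern: keep suppressing the import error, but log it at WARNING first, and use a sentinel so the bootstrap runs once. The sketch below shows that pattern under stated assumptions; `_ADAPTER_MODULES`, `_REGISTRY`, and `bootstrap` are hypothetical names, and the module list is a placeholder rather than the real adapter list.

```python
import importlib
import logging

logger = logging.getLogger("bench.registry")

# Placeholder module names: one that imports, one that does not.
_ADAPTER_MODULES = ("json", "not_a_real_adapter_module")

_REGISTRY: dict[str, object] = {}
_bootstrapped = False  # sentinel so repeated calls do not re-import


def bootstrap() -> dict[str, object]:
    """Import each adapter module once; log and skip any that fail to import."""
    global _bootstrapped
    if _bootstrapped:
        return _REGISTRY
    for name in _ADAPTER_MODULES:
        try:
            _REGISTRY[name] = importlib.import_module(name)
        except ImportError as exc:
            # Suppress so one broken adapter does not block the others,
            # but leave a visible trace instead of failing silently.
            logger.warning("could not load adapter %s: %s", name, exc)
    _bootstrapped = True
    return _REGISTRY
```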

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/_framework/types.py | Split out from the original adapters.py; contains CaseFilters (seed field now present, fixing prior AttributeError), RunResult, CaseScore, MetricSchema, and related dataclasses. |
| tests/benchmarks/_framework/registry.py | ImportError on adapter load is now logged at WARNING before being suppressed; bootstrap sentinel prevents repeated imports; clean and correct. |
| tests/benchmarks/_framework/config.py | Adds agent_variant field and its cross-field lint guard; adds min_tool_calls; system-path output_dir check; all previously flagged issues addressed. |
| tests/benchmarks/cloudopsbench/adapter.py | Implements BenchmarkAdapter with CloudOpsBench-specific logic; adds format_final_answer predictor hook and majority-vote select_best_run; apply_config_overrides handles min_tool_calls and agent_variant knobs correctly. |
| tests/benchmarks/cloudopsbench/bench_agent.py | Adds BenchInvestigationAgentTrimmedPrompt, BaselineLLMAloneAgent, PureBaselineAgent; MIN_TOOL_CALLS calibrated to 5; well-documented three-arm comparison design. |
| tests/benchmarks/cloudopsbench/predictor/llm_call.py | Predictor LLM call, prompt construction, and response parsing. Docstring block-order description contradicts the opensre+llm path implementation (summary before perf block, not perf before summary). |
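
As a rough illustration of the majority-vote `select_best_run` hook listed for adapter.py, the sketch below picks the run whose taxonomy label is most common across repeated runs. The `Run` dataclass, the tie-break rule, and the placeholder label `Other_Fault` are assumptions; the real hook operates on the framework's own result types and may break ties differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical per-run record; the real RunResult has more fields."""

    run_id: int
    taxonomy: str


def select_best_run(runs: list[Run]) -> Run:
    """Return the first run whose taxonomy label won the majority vote."""
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(run.taxonomy for run in runs)
    top = max(counts.values())
    # Assumed tie-break: among labels tied for the top count,
    # prefer the one that appears earliest in the run order.
    for run in runs:
        if counts[run.taxonomy] == top:
            return run
    raise AssertionError("unreachable: some label always has the top count")


runs = [Run(0, "Other_Fault"), Run(1, "Performance_Fault"), Run(2, "Performance_Fault")]
best = select_best_run(runs)
```

A deterministic tie-break matters here because repeated benchmark reports should not change when the vote is split evenly.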

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    CLI["CLI: bench run config.yml"] --> LC["load_config(path)"]
    LC --> AV["apply_config_overrides(config)"]
    AV -->|"min_tool_calls set"| MT["BenchInvestigationAgent.MIN_TOOL_CALLS = N"]
    AV -->|"agent_variant=trimmed_prompt"| TP["investigation_agent_class → TrimmedPromptAgent"]
    CLI --> RUN["BenchmarkRunner.run()"]
    RUN --> LC2["adapter.load_cases(CaseFilters + seed)"]
    LC2 --> CELL["per cell: mode x llm x run"]
    CELL -->|"opensre+llm"| BENCH["BenchInvestigationAgent\n(floor=MIN_TOOL_CALLS)"]
    CELL -->|"llm_alone"| BASE["BaselineLLMAloneAgent\n(no floor)"]
    CELL -->|"llm_alone_pure"| PURE["PureBaselineAgent\n(minimal prompt, no floor)"]
    BENCH --> FF["format_final_answer()\nemit_paper_predictions()"]
    BASE --> FF
    PURE --> FF
    FF --> SC["score_case()\n15 paper metrics + validity"]
    SC --> SBR["select_best_run()\nmajority vote on taxonomy"]
    SBR --> REPORT["render_report_dir()"]
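
The two override edges in the flowchart (`min_tool_calls` and `agent_variant`) can be sketched as follows. The class names come from the PR summary, but the function signature, the `_VARIANTS` mapping, and the empty class bodies are assumptions made to keep the example self-contained; the real `apply_config_overrides` takes the loaded config.

```python
from typing import Optional


class BenchInvestigationAgent:
    MIN_TOOL_CALLS = 5  # the PR summary says the floor is calibrated to 5


class BenchInvestigationAgentTrimmedPrompt(BenchInvestigationAgent):
    pass


# Assumed mapping from variant name to agent class.
_VARIANTS = {"trimmed_prompt": BenchInvestigationAgentTrimmedPrompt}


def apply_config_overrides(
    min_tool_calls: Optional[int], agent_variant: Optional[str]
) -> type:
    """Apply the two knobs and return the agent class for the opensre+llm arm."""
    if min_tool_calls is not None:
        # Class attribute, so subclasses that do not override it see the new floor.
        BenchInvestigationAgent.MIN_TOOL_CALLS = min_tool_calls
    if agent_variant is None:
        return BenchInvestigationAgent
    return _VARIANTS[agent_variant]


# A floor=0 run with the trimmed-prompt variant:
agent_cls = apply_config_overrides(min_tool_calls=0, agent_variant="trimmed_prompt")
```

Note that setting a class attribute is process-wide state, so in this sketch an override from one config would persist into later runs in the same process unless it is reset.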

Reviews (5): Last reviewed commit: "bench registry refactoring"

Comment thread tests/benchmarks/_framework/config.py
Comment thread tests/benchmarks/_framework/cli.py Outdated
@YauhenBichel

Collaborator Author

@greptile review


from __future__ import annotations

_TAXONOMY_CATEGORIES: tuple[str, ...] = (
    # ... (other categories collapsed in the diff view)
    "Performance_Fault",
)

_ROOT_CAUSES: tuple[str, ...] = (
    # ... (values collapsed in the diff view)
)

# ... (start of comment collapsed in the diff view)
# strings the LLM emits as long as they match the case's ground-truth
# exactly (post-normalize), but giving the LLM the universe of known
# values keeps it from inventing prefixes.
_FAULT_OBJECT_SERVICES: tuple[str, ...] = (
    # ... (other services collapsed in the diff view)
    "ts-ticket-office-service",
)

_FAULT_OBJECT_NODES: tuple[str, ...] = ("master", "worker-01", "worker-02", "worker-03")
_FAULT_OBJECT_NAMESPACES: tuple[str, ...] = ("boutique", "train-ticket")

Comment thread tests/benchmarks/_framework/types.py
@YauhenBichel

Collaborator Author

@greptile review

Comment thread tests/benchmarks/_framework/registry.py Outdated
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel changed the title from "fix(bench): bench experiments" to "fix(bench): bench experiments, refactoring, fixing code smell" on Jun 9, 2026
@YauhenBichel marked this pull request as ready for review June 9, 2026 11:27
@YauhenBichel merged commit 67999fb into main Jun 9, 2026
17 checks passed
@YauhenBichel deleted the fix/2074-bench-experiment-floor0-full-run branch June 9, 2026 11:29
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

🧠 @YauhenBichel opened a PR. Maintainers feared them. CI genuflected. It merged. 🚨




Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

2 participants