fix(bench): bench experiments, refactoring, fixing code smell by YauhenBichel · Pull Request #2776 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-09T09:57:13Z

Fixes #2074

Describe the changes you have made in this PR -

CloudOpsBench benchmark experiments

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:
I am changing config files and analyzing the results.

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-09T09:57:25Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-09T09:57:35Z

@greptile review

greptile-apps · 2026-06-09T10:01:36Z

Greptile Summary

This PR refactors the benchmark framework by splitting the monolithic adapters.py into focused modules (types.py, adapter_base.py, registry.py) with a backward-compat shim, and adds three new CloudOpsBench experiment configs (floor=0 full-N, trimmed-prompt pilot, trimmed-prompt full-N) alongside the infrastructure to support them.

Splits predictor.py into a predictor/ package (vocabulary.py, snapping.py, rerank.py, llm_call.py), adds the select_best_run majority-vote hook and format_final_answer predictor integration to the adapter, and introduces BenchInvestigationAgentTrimmedPrompt, BaselineLLMAloneAgent, and PureBaselineAgent agent classes for the three-arm comparison design.
Addresses previously flagged issues: seed field is present on CaseFilters, registry import errors are now logged at WARNING before suppression, and the agent_variant cross-field lint guard is implemented in config.py.
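For readers unfamiliar with the majority-vote idea behind select_best_run, here is a minimal sketch. It is not the PR's implementation: the RunPrediction type, the "Network_Fault" label, and the tie-break rule (the first label seen wins) are assumptions made for illustration only.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunPrediction:
    """Hypothetical stand-in for one run's output on a single case."""

    run_index: int
    taxonomy: str


def select_best_run(runs: list[RunPrediction]) -> RunPrediction:
    """Return the earliest run whose taxonomy label won the majority vote."""
    if not runs:
        raise ValueError("select_best_run needs at least one run")
    counts = Counter(run.taxonomy for run in runs)
    # max() returns the first maximal key in insertion order, so ties go to
    # the label that appeared first (an assumed tie-break, not the PR's).
    winner = max(counts, key=counts.__getitem__)
    return next(run for run in runs if run.taxonomy == winner)


runs = [
    RunPrediction(0, "Network_Fault"),  # made-up label for the example
    RunPrediction(1, "Performance_Fault"),
    RunPrediction(2, "Performance_Fault"),
]
best = select_best_run(runs)
print(best.run_index, best.taxonomy)  # prints: 1 Performance_Fault
```

Returning a whole run rather than just the winning label keeps the rest of that run's output available for scoring, which is one plausible reason to vote this way.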

Confidence Score: 5/5

The refactoring is clean and backward-compatible; all previously flagged runtime bugs (missing seed field, silent registry suppression, missing lint guard) are resolved in this PR.

The module split is mechanically straightforward with a complete backward-compat shim, the new agent classes are well-isolated bench-only code, and the three experiment configs are properly pre-registered and independently verified by thorough test suites. The only finding is a docstring that misstates the block ordering in _build_user_prompt for the opensre+llm path; the code itself is correct.

tests/benchmarks/cloudopsbench/predictor/llm_call.py — the _build_user_prompt docstring block-order description is inverted for the opensre+llm path relative to the implementation.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/_framework/types.py | Split out from the original adapters.py; contains CaseFilters (seed field now present, fixing prior AttributeError), RunResult, CaseScore, MetricSchema, and related dataclasses. |
| tests/benchmarks/_framework/registry.py | ImportError on adapter load is now logged at WARNING before being suppressed; bootstrap sentinel prevents repeated imports; clean and correct. |
| tests/benchmarks/_framework/config.py | Adds agent_variant field and its cross-field lint guard; adds min_tool_calls; system-path output_dir check; all previously flagged issues addressed. |
| tests/benchmarks/cloudopsbench/adapter.py | Implements BenchmarkAdapter with CloudOpsBench-specific logic; adds format_final_answer predictor hook and majority-vote select_best_run; apply_config_overrides handles min_tool_calls and agent_variant knobs correctly. |
| tests/benchmarks/cloudopsbench/bench_agent.py | Adds BenchInvestigationAgentTrimmedPrompt, BaselineLLMAloneAgent, PureBaselineAgent; MIN_TOOL_CALLS calibrated to 5; well-documented three-arm comparison design. |
| tests/benchmarks/cloudopsbench/predictor/llm_call.py | Predictor LLM call, prompt construction, and response parsing. Docstring block-order description contradicts the opensre+llm path implementation (summary before perf block, not perf before summary). |
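The registry behaviour described in the table (log the ImportError at WARNING, then suppress it, with a sentinel that prevents repeated imports) could look roughly like the sketch below. The logger name, the bootstrap function, and the module names are hypothetical; the real registry.py may be structured differently.

```python
from __future__ import annotations

import importlib
import logging

logger = logging.getLogger("bench.registry")  # hypothetical logger name

_REGISTRY: dict[str, object] = {}
_bootstrapped = False  # sentinel: adapter modules are imported at most once


def bootstrap(module_names: list[str]) -> dict[str, object]:
    """Import each adapter module once, logging (not raising) on ImportError."""
    global _bootstrapped
    if _bootstrapped:
        return _REGISTRY
    _bootstrapped = True
    for name in module_names:
        try:
            _REGISTRY[name] = importlib.import_module(name)
        except ImportError as exc:
            # Surface the failure at WARNING, then carry on with the rest.
            logger.warning("could not load adapter %r: %s", name, exc)
    return _REGISTRY


logging.basicConfig(level=logging.WARNING)
# "json" stands in for a loadable adapter; the second name does not exist.
loaded = bootstrap(["json", "no_such_adapter_module"])
print(sorted(loaded))  # prints: ['json']
```

A second call to bootstrap returns the same registry without importing anything, so a failing adapter is reported once rather than on every lookup.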

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    CLI["CLI: bench run config.yml"] --> LC["load_config(path)"]
    LC --> AV["apply_config_overrides(config)"]
    AV -->|"min_tool_calls set"| MT["BenchInvestigationAgent.MIN_TOOL_CALLS = N"]
    AV -->|"agent_variant=trimmed_prompt"| TP["investigation_agent_class → TrimmedPromptAgent"]
    CLI --> RUN["BenchmarkRunner.run()"]
    RUN --> LC2["adapter.load_cases(CaseFilters + seed)"]
    LC2 --> CELL["per cell: mode x llm x run"]
    CELL -->|"opensre+llm"| BENCH["BenchInvestigationAgent\n(floor=MIN_TOOL_CALLS)"]
    CELL -->|"llm_alone"| BASE["BaselineLLMAloneAgent\n(no floor)"]
    CELL -->|"llm_alone_pure"| PURE["PureBaselineAgent\n(minimal prompt, no floor)"]
    BENCH --> FF["format_final_answer()\nemit_paper_predictions()"]
    BASE --> FF
    PURE --> FF
    FF --> SC["score_case()\n15 paper metrics + validity"]
    SC --> SBR["select_best_run()\nmajority vote on taxonomy"]
    SBR --> REPORT["render_report_dir()"]
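The two override edges in the flowchart (min_tool_calls and agent_variant) can be illustrated with the sketch below. The BenchConfig dataclass and the function's return value are assumptions made for this example; only the class names, the two knob names, and the default floor of 5 come from the review.

```python
from __future__ import annotations

from dataclasses import dataclass


class BenchInvestigationAgent:
    MIN_TOOL_CALLS = 5  # default floor, per the review's file overview


class BenchInvestigationAgentTrimmedPrompt(BenchInvestigationAgent):
    """Trimmed-prompt variant; the body is omitted in this sketch."""


@dataclass
class BenchConfig:
    """Hypothetical config shape; only the two knobs from the flowchart."""

    min_tool_calls: int | None = None
    agent_variant: str | None = None


def apply_config_overrides(config: BenchConfig) -> type[BenchInvestigationAgent]:
    """Apply the two knobs and return the agent class to instantiate."""
    if config.min_tool_calls is not None:  # 0 is a valid floor, so test for None
        BenchInvestigationAgent.MIN_TOOL_CALLS = config.min_tool_calls
    if config.agent_variant == "trimmed_prompt":
        return BenchInvestigationAgentTrimmedPrompt
    return BenchInvestigationAgent


agent_class = apply_config_overrides(
    BenchConfig(min_tool_calls=0, agent_variant="trimmed_prompt")
)
print(agent_class.__name__, agent_class.MIN_TOOL_CALLS)
# prints: BenchInvestigationAgentTrimmedPrompt 0
```

The explicit `is not None` check matters for a floor=0 experiment: a plain truthiness test would treat 0 as "not set" and silently keep the default floor.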

Reviews (5): Last reviewed commit: "bench registry refactoring"

YauhenBichel · 2026-06-09T10:39:57Z

@greptile review

+
+from __future__ import annotations
+
+_TAXONOMY_CATEGORIES: tuple[str, ...] = (


+    "Performance_Fault",
+)
+
+_ROOT_CAUSES: tuple[str, ...] = (


+# strings the LLM emits as long as they match the case's ground-truth
+# exactly (post-normalize), but giving the LLM the universe of known
+# values keeps it from inventing prefixes.
+_FAULT_OBJECT_SERVICES: tuple[str, ...] = (


+    "ts-ticket-office-service",
+)
+
+_FAULT_OBJECT_NODES: tuple[str, ...] = ("master", "worker-01", "worker-02", "worker-03")


+)
+
+_FAULT_OBJECT_NODES: tuple[str, ...] = ("master", "worker-01", "worker-02", "worker-03")
+_FAULT_OBJECT_NAMESPACES: tuple[str, ...] = ("boutique", "train-ticket")




YauhenBichel · 2026-06-09T10:59:45Z

@greptile review

YauhenBichel · 2026-06-09T11:21:22Z

@greptile review

github-actions · 2026-06-09T11:29:55Z

🧠 @YauhenBichel opened a PR. Maintainers feared them. CI genuflected. It merged. 🚨

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

YauhenBichel added 2 commits June 8, 2026 16:28

added config for min tool calls 0 for full run using openai

dbd13a6

experiment: exp_trimmed_prompt

d899cf3

greptile-apps Bot reviewed Jun 9, 2026

Comment thread tests/benchmarks/_framework/config.py

Comment thread tests/benchmarks/_framework/cli.py Outdated

YauhenBichel added 2 commits June 9, 2026 11:01

added new config experiment for full run

f64a3e8

fixing greptile issues and refactoring

3bce7d3

github-advanced-security AI found potential problems Jun 9, 2026

github-code-quality Bot found potential problems Jun 9, 2026

greptile-apps Bot reviewed Jun 9, 2026

Comment thread tests/benchmarks/_framework/types.py

YauhenBichel added 2 commits June 9, 2026 11:54

fixing notes

a87d72e

fixing notes

e2d3d8d

greptile-apps Bot reviewed Jun 9, 2026

Comment thread tests/benchmarks/_framework/registry.py Outdated

bench registry refactoring

6c1b991

YauhenBichel changed the title ~~fix(bench): bench experiments~~ fix(bench): bench experiments, refactoring, fixing code smell Jun 9, 2026

YauhenBichel marked this pull request as ready for review June 9, 2026 11:27

YauhenBichel merged commit 67999fb into main Jun 9, 2026
17 checks passed

YauhenBichel deleted the fix/2074-bench-experiment-floor0-full-run branch June 9, 2026 11:29

YauhenBichel merged 7 commits into main from fix/2074-bench-experiment-floor0-full-run

2 participants

