refactor(bench): _framework decoupling — 5 phases + workflow/UX cleanup by YauhenBichel · Pull Request #2801 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-12T12:34:13Z

Fixes #2074

Describe the changes you have made in this PR -

Before this PR, adding a new benchmark adapter to tests/benchmarks/
required editing _framework/ in 6+ places (hardcoded module path,
name-based config checks, direct CloudOpsBench imports for provenance
and overfit, etc.). This PR makes the framework adapter-agnostic
across five focused phases. The result: adding a new adapter is now
zero _framework/ edits.

Also clarifies the GitHub Actions workflows so operators can't mistake
them for CloudOpsBench-only, and introduces a per-experiment folder
layout for configs so each experiment's pre-registration, smokes,
full-N, and run ledger live in one place.

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

After this PR, adding a new adapter (OpenRCA, ToolCallBench, internal
synthetic suites) is exactly this:

mkdir tests/benchmarks/<name>/
Write tests/benchmarks/<name>/adapter.py with the class,
capabilities, optional dimensions, optional provenance hook, and
the abstract methods.
End the file with register_adapter("<name>", MyAdapter).

That's it. No edits in _framework/. The walk picks it up; capability
lookups respect what it declares; overfit guards use its dimensions;
provenance captures its keys.

Every existing CloudOpsBench config and call site continues to work:

CloudOpsBench's AdapterCapabilities declaration matches the
previous hardcoded checks bit-for-bit.
OverfitDimensions defaults match CloudOpsBench's metadata schema,
so every pre-existing call to the overfit guards (none of which
passed dimensions=) is unaffected.
extend_provenance default is identity, so adapters that don't
override behave like the framework's old standard sections.
The shim at _framework/adapters.py re-exports
AdapterCapabilities + OverfitDimensions + capabilities_for
so external import paths remain stable.

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-12T12:34:23Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

github-actions · 2026-06-12T12:34:43Z

terraform-bench plan

step	outcome
fmt	✅ `success`
init	✅ `success`
validate	✅ `success`
plan	✅ `success`

Plan output

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_ecs_task_definition.bench must be replaced
-/+ resource "aws_ecs_task_definition" "bench" {
      ~ arn                      = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:35" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ image            = "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:c27bee9" -> "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:bootstrap"
                  - mountPoints      = []
                    name             = "bench"
                  - portMappings     = []
                  - systemControls   = []
                  - volumesFrom      = []
                    # (4 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "opensre-bench" -> (known after apply)
      ~ revision                 = 35 -> (known after apply)
      - tags                     = {} -> null
        # (10 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Changes to Outputs:
  ~ task_definition_arn     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:35" -> (known after apply)

Updated by terraform-bench.yml.

YauhenBichel · 2026-06-12T12:36:41Z

@greptile review

greptile-apps · 2026-06-12T12:41:34Z

Greptile Summary

This PR refactors tests/benchmarks/_framework/ to be fully adapter-agnostic across five focused phases, making it possible to add new benchmark adapters with zero edits to the framework. It also reorganizes configs into per-experiment folders and updates GitHub Actions workflows to drop CloudOpsBench-specific language.

Phase 3–5 (core): OverfitDimensions parameterizes metadata key names for overfit guards; extend_provenance hook moves adapter-specific provenance capture to the adapter; filesystem auto-discovery replaces the hardcoded _KNOWN_ADAPTER_MODULES tuple.
Phase 4: The framework's provenance.py no longer directly imports cloudopsbench.bench_agent; CloudOpsBench overrides extend_provenance to inject min_tool_calls itself.
Workflow/UX: Default config in benchmark-run.yml is removed (forcing intentional selection), doc paths updated to infra/bench/AWS_BENCH_SETUP.md, and per-experiment folder layout introduced for cleaner config/result traceability.

Confidence Score: 5/5

Safe to merge — the five decoupling phases are internally consistent, existing CloudOpsBench behaviour is preserved bit-for-bit, and the new hooks are covered by dedicated tests.

All three previous review comments are fully addressed: top_clusters output keys now use dims.system_key/dims.stratum_key, the overfit CLI has an --adapter flag, and smoke.py no longer hardcodes the CloudOpsBench import. The OverfitDimensions defaults, extend_provenance identity default, and AdapterCapabilities shim all maintain backward compatibility. The filesystem auto-discovery is sound and thoroughly tested with both live-tree and synthetic-tree cases.

smoke.py — the case listing diagnostic still hardcodes CloudOpsBench metadata keys; benign for today's single adapter but would confuse authors adding a second adapter.

Important Files Changed

Filename	Overview
tests/benchmarks/_framework/registry.py	Phase 5: replaces hardcoded `_KNOWN_ADAPTER_MODULES` with `_discover_adapter_modules()` filesystem walk; discovery rules (underscore-prefix skip, `adapter.py` requirement) are sound and well-tested.
tests/benchmarks/_framework/adapter_base.py	Adds `OverfitDimensions` (frozen Pydantic model) and two new `BenchmarkAdapter` hooks (`overfit_dimensions`, `extend_provenance`); defaults preserve CloudOpsBench back-compat; design is clean.
tests/benchmarks/_framework/overfit.py	Phase 3: Guards A/B/C now accept `dimensions: OverfitDimensions
tests/benchmarks/_framework/provenance.py	Phase 4: `_resolved_min_tool_calls` removed; `capture_provenance` now calls `adapter.extend_provenance(provenance)` as the last step; framework is cleanly decoupled from CloudOpsBench internals.
tests/benchmarks/_framework/smoke.py	Now adapter-agnostic via `--adapter` CLI flag and registry lookup; `_real_run_result` correctly uses only `BenchmarkAdapter` abstract methods; case listing diagnostic still hardcodes CloudOpsBench metadata keys (`system`, `fault_category`, `root_cause`).
tests/benchmarks/cloudopsbench/adapter.py	Correctly implements `extend_provenance` to inject `min_tool_calls`; defensive `isinstance(run_inputs, dict)` guard handles the edge case; existing CloudOpsBench logic untouched.
tests/benchmarks/_framework/tests/test_overfit_dimensions.py	Thorough coverage: frozen model, extra-field rejection, ABC default, all three guards with custom keys, `analyze` end-to-end; output-label test pins the previous hardcoded-key bug.
tests/benchmarks/_framework/tests/test_registry_discovery.py	Covers live-tree discovery (cloudopsbench found, _framework skipped, interactive_shell skipped, sorted order) and synthetic-tree walk via monkeypatching `__file__`; edge-case of empty tree handled.
tests/benchmarks/_framework/tests/test_provenance.py	Three new tests pin the `extend_provenance` hook: default identity, custom key injection, and CloudOpsBench end-to-end contract; `min_tool_calls` absence from framework-level output is explicitly asserted.
.github/workflows/benchmark-run.yml	Default config removed (forces explicit selection), doc path updated to `infra/bench/AWS_BENCH_SETUP.md`, workflow commentary now correctly describes adapter-agnostic dispatch.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / Runner
    participant Reg as registry.py
    participant FS as Filesystem Walk
    participant Adp as adapter.py (any)
    participant OvF as overfit.py

    CLI->>Reg: build_adapter("cloudopsbench")
    Reg->>Reg: ensure_known_adapters_registered()
    Reg->>FS: _discover_adapter_modules()
    FS-->>Reg: ["tests.benchmarks.cloudopsbench.adapter", ...]
    Reg->>Adp: importlib.import_module(path)
    Adp->>Reg: register_adapter("cloudopsbench", CloudOpsBenchAdapter)
    Reg-->>CLI: CloudOpsBenchAdapter()

    CLI->>Adp: adapter.overfit_dimensions()
    Adp-->>CLI: "OverfitDimensions(system_key="system", ...)"

    CLI->>OvF: "analyze(baseline, variant, dimensions=dims)"
    OvF->>OvF: "per_system_uniformity(..., dimensions=dims)"
    OvF->>OvF: "per_stratum_uniformity(..., dimensions=dims)"
    OvF->>OvF: "flipped_loss_to_win_clusters(..., dimensions=dims)"
    OvF-->>CLI: OverfitReport

    CLI->>Adp: adapter.extend_provenance(provenance)
    Adp-->>CLI: "provenance + {run_inputs: {min_tool_calls: N}}"

_{Reviews (3): Last reviewed commit: "fixing issues" | Re-trigger Greptile}

YauhenBichel · 2026-06-12T12:46:44Z

@greptile review

github-actions · 2026-06-12T12:55:49Z

🍵 @YauhenBichel made tea, opened a PR, and merged before it cooled. No notes. ☕

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

refactoring

5213461

YauhenBichel added 2 commits June 12, 2026 13:41

fixing naming convention

da63155

fixing issues

f408101

YauhenBichel marked this pull request as ready for review June 12, 2026 12:49

YauhenBichel merged commit 2d0c80b into main Jun 12, 2026
20 checks passed

YauhenBichel deleted the fix/2074-bench-framework-refactoring branch June 12, 2026 12:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(bench): _framework decoupling — 5 phases + workflow/UX cleanup#2801

refactor(bench): _framework decoupling — 5 phases + workflow/UX cleanup#2801
YauhenBichel merged 3 commits into
mainfrom
fix/2074-bench-framework-refactoring

YauhenBichel commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

YauhenBichel commented Jun 12, 2026

Uh oh!

greptile-apps Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

YauhenBichel commented Jun 12, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

YauhenBichel commented Jun 12, 2026

Describe the changes you have made in this PR -

Code Understanding and AI Usage

Checklist before requesting a review

Uh oh!

github-actions Bot commented Jun 12, 2026

Greptile code review

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

terraform-bench plan

Uh oh!

YauhenBichel commented Jun 12, 2026

Uh oh!

greptile-apps Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

YauhenBichel commented Jun 12, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Jun 12, 2026 •

edited

Loading

greptile-apps Bot commented Jun 12, 2026 •

edited

Loading