Skip to content

refactor(bench): _framework decoupling — 5 phases + workflow/UX cleanup#2801

Merged
YauhenBichel merged 3 commits into
mainfrom
fix/2074-bench-framework-refactoring
Jun 12, 2026
Merged

refactor(bench): _framework decoupling — 5 phases + workflow/UX cleanup#2801
YauhenBichel merged 3 commits into
mainfrom
fix/2074-bench-framework-refactoring

Conversation

@YauhenBichel

Copy link
Copy Markdown
Collaborator

Fixes #2074

Describe the changes you have made in this PR -

Before this PR, adding a new benchmark adapter to tests/benchmarks/
required editing _framework/ in 6+ places (hardcoded module path,
name-based config checks, direct CloudOpsBench imports for provenance
and overfit, etc.). This PR makes the framework adapter-agnostic
across five focused phases. The result: adding a new adapter is now
zero _framework/ edits.

Also clarifies the GitHub Actions workflows so operators can't mistake
them for CloudOpsBench-only, and introduces a per-experiment folder
layout for configs so each experiment's pre-registration, smokes,
full-N, and run ledger live in one place.


Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

After this PR, adding a new adapter (OpenRCA, ToolCallBench, internal
synthetic suites) is exactly this:

  1. mkdir tests/benchmarks/<name>/
  2. Write tests/benchmarks/<name>/adapter.py with the class,
    capabilities, optional dimensions, optional provenance hook, and
    the abstract methods.
  3. End the file with register_adapter("<name>", MyAdapter).

That's it. No edits in _framework/. The walk picks it up; capability
lookups respect what it declares; overfit guards use its dimensions;
provenance captures its keys.

Every existing CloudOpsBench config and call site continues to work:

  • CloudOpsBench's AdapterCapabilities declaration matches the
    previous hardcoded checks bit-for-bit.
  • OverfitDimensions defaults match CloudOpsBench's metadata schema,
    so every pre-existing call to the overfit guards (none of which
    passed dimensions=) is unaffected.
  • extend_provenance default is identity, so adapters that don't
    override behave like the framework's old standard sections.
  • The shim at _framework/adapters.py re-exports
    AdapterCapabilities + OverfitDimensions + capabilities_for
    so external import paths remain stable.

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

Copy link
Copy Markdown
Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

terraform-bench plan

step outcome
fmt success
init success
validate success
plan success
Plan output
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_ecs_task_definition.bench must be replaced
-/+ resource "aws_ecs_task_definition" "bench" {
      ~ arn                      = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:35" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ image            = "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:c27bee9" -> "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:bootstrap"
                  - mountPoints      = []
                    name             = "bench"
                  - portMappings     = []
                  - systemControls   = []
                  - volumesFrom      = []
                    # (4 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "opensre-bench" -> (known after apply)
      ~ revision                 = 35 -> (known after apply)
      - tags                     = {} -> null
        # (10 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Changes to Outputs:
  ~ task_definition_arn     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:35" -> (known after apply)

Updated by terraform-bench.yml.

@YauhenBichel

Copy link
Copy Markdown
Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR refactors tests/benchmarks/_framework/ to be fully adapter-agnostic across five focused phases, making it possible to add new benchmark adapters with zero edits to the framework. It also reorganizes configs into per-experiment folders and updates GitHub Actions workflows to drop CloudOpsBench-specific language.

  • Phase 3–5 (core): OverfitDimensions parameterizes metadata key names for overfit guards; extend_provenance hook moves adapter-specific provenance capture to the adapter; filesystem auto-discovery replaces the hardcoded _KNOWN_ADAPTER_MODULES tuple.
  • Phase 4: The framework's provenance.py no longer directly imports cloudopsbench.bench_agent; CloudOpsBench overrides extend_provenance to inject min_tool_calls itself.
  • Workflow/UX: Default config in benchmark-run.yml is removed (forcing intentional selection), doc paths updated to infra/bench/AWS_BENCH_SETUP.md, and per-experiment folder layout introduced for cleaner config/result traceability.

Confidence Score: 5/5

Safe to merge — the five decoupling phases are internally consistent, existing CloudOpsBench behaviour is preserved bit-for-bit, and the new hooks are covered by dedicated tests.

All three previous review comments are fully addressed: top_clusters output keys now use dims.system_key/dims.stratum_key, the overfit CLI has an --adapter flag, and smoke.py no longer hardcodes the CloudOpsBench import. The OverfitDimensions defaults, extend_provenance identity default, and AdapterCapabilities shim all maintain backward compatibility. The filesystem auto-discovery is sound and thoroughly tested with both live-tree and synthetic-tree cases.

smoke.py — the case listing diagnostic still hardcodes CloudOpsBench metadata keys; benign for today's single adapter but would confuse authors adding a second adapter.

Important Files Changed

Filename Overview
tests/benchmarks/_framework/registry.py Phase 5: replaces hardcoded _KNOWN_ADAPTER_MODULES with _discover_adapter_modules() filesystem walk; discovery rules (underscore-prefix skip, adapter.py requirement) are sound and well-tested.
tests/benchmarks/_framework/adapter_base.py Adds OverfitDimensions (frozen Pydantic model) and two new BenchmarkAdapter hooks (overfit_dimensions, extend_provenance); defaults preserve CloudOpsBench back-compat; design is clean.
tests/benchmarks/_framework/overfit.py Phase 3: Guards A/B/C now accept `dimensions: OverfitDimensions
tests/benchmarks/_framework/provenance.py Phase 4: _resolved_min_tool_calls removed; capture_provenance now calls adapter.extend_provenance(provenance) as the last step; framework is cleanly decoupled from CloudOpsBench internals.
tests/benchmarks/_framework/smoke.py Now adapter-agnostic via --adapter CLI flag and registry lookup; _real_run_result correctly uses only BenchmarkAdapter abstract methods; case listing diagnostic still hardcodes CloudOpsBench metadata keys (system, fault_category, root_cause).
tests/benchmarks/cloudopsbench/adapter.py Correctly implements extend_provenance to inject min_tool_calls; defensive isinstance(run_inputs, dict) guard handles the edge case; existing CloudOpsBench logic untouched.
tests/benchmarks/_framework/tests/test_overfit_dimensions.py Thorough coverage: frozen model, extra-field rejection, ABC default, all three guards with custom keys, analyze end-to-end; output-label test pins the previous hardcoded-key bug.
tests/benchmarks/_framework/tests/test_registry_discovery.py Covers live-tree discovery (cloudopsbench found, _framework skipped, interactive_shell skipped, sorted order) and synthetic-tree walk via monkeypatching __file__; edge-case of empty tree handled.
tests/benchmarks/_framework/tests/test_provenance.py Three new tests pin the extend_provenance hook: default identity, custom key injection, and CloudOpsBench end-to-end contract; min_tool_calls absence from framework-level output is explicitly asserted.
.github/workflows/benchmark-run.yml Default config removed (forces explicit selection), doc path updated to infra/bench/AWS_BENCH_SETUP.md, workflow commentary now correctly describes adapter-agnostic dispatch.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / Runner
    participant Reg as registry.py
    participant FS as Filesystem Walk
    participant Adp as adapter.py (any)
    participant OvF as overfit.py

    CLI->>Reg: build_adapter("cloudopsbench")
    Reg->>Reg: ensure_known_adapters_registered()
    Reg->>FS: _discover_adapter_modules()
    FS-->>Reg: ["tests.benchmarks.cloudopsbench.adapter", ...]
    Reg->>Adp: importlib.import_module(path)
    Adp->>Reg: register_adapter("cloudopsbench", CloudOpsBenchAdapter)
    Reg-->>CLI: CloudOpsBenchAdapter()

    CLI->>Adp: adapter.overfit_dimensions()
    Adp-->>CLI: "OverfitDimensions(system_key="system", ...)"

    CLI->>OvF: "analyze(baseline, variant, dimensions=dims)"
    OvF->>OvF: "per_system_uniformity(..., dimensions=dims)"
    OvF->>OvF: "per_stratum_uniformity(..., dimensions=dims)"
    OvF->>OvF: "flipped_loss_to_win_clusters(..., dimensions=dims)"
    OvF-->>CLI: OverfitReport

    CLI->>Adp: adapter.extend_provenance(provenance)
    Adp-->>CLI: "provenance + {run_inputs: {min_tool_calls: N}}"
Loading

Reviews (3): Last reviewed commit: "fixing issues" | Re-trigger Greptile

@YauhenBichel

Copy link
Copy Markdown
Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 12, 2026 12:49
@YauhenBichel YauhenBichel merged commit 2d0c80b into main Jun 12, 2026
20 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-framework-refactoring branch June 12, 2026 12:55
@github-actions

Copy link
Copy Markdown
Contributor

🍵 @YauhenBichel made tea, opened a PR, and merged before it cooled. No notes. ☕


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

1 participant