refactor(bench): _framework decoupling — 5 phases + workflow/UX cleanup#2801
Conversation
Greptile code reviewThis repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md. Run a review — add a PR comment with: Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5. Optional: automate with the greploop skill. |
terraform-bench plan
Plan outputTerraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement
Terraform will perform the following actions:
# aws_ecs_task_definition.bench must be replaced
-/+ resource "aws_ecs_task_definition" "bench" {
~ arn = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:35" -> (known after apply)
~ arn_without_revision = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench" -> (known after apply)
~ container_definitions = jsonencode(
~ [
~ {
~ image = "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:c27bee9" -> "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:bootstrap"
- mountPoints = []
name = "bench"
- portMappings = []
- systemControls = []
- volumesFrom = []
# (4 unchanged attributes hidden)
},
] # forces replacement
)
~ enable_fault_injection = false -> (known after apply)
~ id = "opensre-bench" -> (known after apply)
~ revision = 35 -> (known after apply)
- tags = {} -> null
# (10 unchanged attributes hidden)
}
Plan: 1 to add, 0 to change, 1 to destroy.
Changes to Outputs:
~ task_definition_arn = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:35" -> (known after apply)
Updated by |
|
@greptile review |
Greptile SummaryThis PR refactors
Confidence Score: 5/5Safe to merge — the five decoupling phases are internally consistent, existing CloudOpsBench behaviour is preserved bit-for-bit, and the new hooks are covered by dedicated tests. All three previous review comments are fully addressed: top_clusters output keys now use dims.system_key/dims.stratum_key, the overfit CLI has an --adapter flag, and smoke.py no longer hardcodes the CloudOpsBench import. The OverfitDimensions defaults, extend_provenance identity default, and AdapterCapabilities shim all maintain backward compatibility. The filesystem auto-discovery is sound and thoroughly tested with both live-tree and synthetic-tree cases. smoke.py — the case listing diagnostic still hardcodes CloudOpsBench metadata keys; benign for today's single adapter but would confuse authors adding a second adapter. Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI as CLI / Runner
participant Reg as registry.py
participant FS as Filesystem Walk
participant Adp as adapter.py (any)
participant OvF as overfit.py
CLI->>Reg: build_adapter("cloudopsbench")
Reg->>Reg: ensure_known_adapters_registered()
Reg->>FS: _discover_adapter_modules()
FS-->>Reg: ["tests.benchmarks.cloudopsbench.adapter", ...]
Reg->>Adp: importlib.import_module(path)
Adp->>Reg: register_adapter("cloudopsbench", CloudOpsBenchAdapter)
Reg-->>CLI: CloudOpsBenchAdapter()
CLI->>Adp: adapter.overfit_dimensions()
Adp-->>CLI: "OverfitDimensions(system_key="system", ...)"
CLI->>OvF: "analyze(baseline, variant, dimensions=dims)"
OvF->>OvF: "per_system_uniformity(..., dimensions=dims)"
OvF->>OvF: "per_stratum_uniformity(..., dimensions=dims)"
OvF->>OvF: "flipped_loss_to_win_clusters(..., dimensions=dims)"
OvF-->>CLI: OverfitReport
CLI->>Adp: adapter.extend_provenance(provenance)
Adp-->>CLI: "provenance + {run_inputs: {min_tool_calls: N}}"
Reviews (3): Last reviewed commit: "fixing issues" | Re-trigger Greptile |
|
@greptile review |
|
🍵 @YauhenBichel made tea, opened a PR, and merged before it cooled. No notes. ☕ 👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome. |

Fixes #2074
Describe the changes you have made in this PR -
Before this PR, adding a new benchmark adapter to
tests/benchmarks/required editing
_framework/in 6+ places (hardcoded module path,name-based config checks, direct CloudOpsBench imports for provenance
and overfit, etc.). This PR makes the framework adapter-agnostic
across five focused phases. The result: adding a new adapter is now
zero
_framework/edits.Also clarifies the GitHub Actions workflows so operators can't mistake
them for CloudOpsBench-only, and introduces a per-experiment folder
layout for configs so each experiment's pre-registration, smokes,
full-N, and run ledger live in one place.
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
After this PR, adding a new adapter (OpenRCA, ToolCallBench, internal
synthetic suites) is exactly this:
mkdir tests/benchmarks/<name>/tests/benchmarks/<name>/adapter.pywith the class,capabilities, optional dimensions, optional provenance hook, and
the abstract methods.
register_adapter("<name>", MyAdapter).That's it. No edits in
_framework/. The walk picks it up; capabilitylookups respect what it declares; overfit guards use its dimensions;
provenance captures its keys.
Every existing CloudOpsBench config and call site continues to work:
AdapterCapabilitiesdeclaration matches theprevious hardcoded checks bit-for-bit.
OverfitDimensionsdefaults match CloudOpsBench's metadata schema,so every pre-existing call to the overfit guards (none of which
passed
dimensions=) is unaffected.extend_provenancedefault is identity, so adapters that don'toverride behave like the framework's old standard sections.
_framework/adapters.pyre-exportsAdapterCapabilities+OverfitDimensions+capabilities_forso external import paths remain stable.
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.