Skip to content

refactor(bench): _framework decoupling Phase 1 + 2 — capability flags#2800

Merged
YauhenBichel merged 2 commits into
mainfrom
fix/2074-bench-low-coupling
Jun 12, 2026
Merged

refactor(bench): _framework decoupling Phase 1 + 2 — capability flags#2800
YauhenBichel merged 2 commits into
mainfrom
fix/2074-bench-low-coupling

Conversation

@YauhenBichel

Copy link
Copy Markdown
Collaborator

Fixes #2074

Describe the changes you have made in this PR -

First two of five planned phases to decouple _framework/ from
CloudOpsBench-specific assumptions. Today the framework hardcodes
if config.benchmark != "cloudopsbench" in multiple places — any new
adapter (OpenRCA, ToolCallBench) needs framework changes to opt in to
features it actually supports. This PR replaces name-based dispatch
with a capability flag layer.

Phase 1 — AdapterCapabilities model + ABC integration
  • New AdapterCapabilities pydantic model in _framework/adapter_base.py.
    Frozen, extra="forbid". Two flags to start: supports_agent_variant,
    supports_predictor_variant. Default all-False so a new adapter is
    locked down to the minimum surface until it opts in deliberately.
  • BenchmarkAdapter ABC gets capabilities: ClassVar[AdapterCapabilities]
    with all-False default.
  • CloudOpsBenchAdapter declares
    capabilities = AdapterCapabilities(supports_agent_variant=True, supports_predictor_variant=True) — exactly what the previous
    hardcoded check let through.
Phase 2 — Config validation goes capability-aware
  • New capabilities_for(name) helper in _framework/registry.py
    returns the registered adapter's flags without forcing the caller to
    instantiate one.
  • _framework/config.py replaces both if benchmark != "cloudopsbench"
    guards with if not adapter_caps.supports_<feature>. Unknown
    adapter → all-False default → guard refuses the gated knob (a typo
    in config.benchmark surfaces with a clear error for each gated
    knob, plus the underlying unknown-benchmark error at runner build
    time).

Backward compat

  • Existing CloudOpsBench configs (every *.yml in the repo) continue
    to validate cleanly — the capability declaration on the adapter
    matches the previous hardcoded behavior bit-for-bit.
  • The shim at _framework/adapters.py re-exports AdapterCapabilities
    • capabilities_for so external imports use the same path as before.
  • Pydantic extra="forbid" catches capability-name typos at adapter
    declaration time, not at run time.

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

Copy link
Copy Markdown
Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Copy link
Copy Markdown
Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR replaces two hardcoded if config.benchmark != "cloudopsbench" guards in _framework/config.py with a generic capability-flag layer, making the framework open to new adapters without framework changes. All three concerns raised in the prior review round (unnecessary adapter instantiation to read a ClassVar, misleading late-import circular-dependency comment, and the resulting _capabilities_for_lint shim) are cleanly resolved.

  • Phase 1 adds a frozen, extra-forbidden AdapterCapabilities Pydantic model and attaches it as a ClassVar to BenchmarkAdapter with an all-False safe default.
  • Phase 2 adds capabilities_for to the registry — reads the ClassVar directly for class factories (no instantiation) and falls back to factory().capabilities for closure factories — and threads it through _capabilities_for_lint in config.py, replacing both name-based guards.
  • CloudOpsBenchAdapter declares supports_agent_variant=True, supports_predictor_variant=True, reproducing the previous hardcoded behaviour bit-for-bit; existing configs continue to validate cleanly.

Confidence Score: 5/5

Safe to merge — the refactor is strictly additive, backward compatibility is preserved bit-for-bit, and all previous review concerns have been addressed.

The capability-flag layer is well-scoped: frozen Pydantic model prevents runtime mutation, extra="forbid" catches declaration typos at construction time, and capabilities_for avoids adapter instantiation for the common class-factory pattern. Existing CloudOpsBench configs validate identically to before. The only gap is the closure-factory fallback branch in capabilities_for having no test coverage, but this path is explicitly marked uncommon and affects no current adapter.

No files require special attention. test_adapter_capabilities.py could add a closure-factory test to complete coverage of the two-branch capabilities_for implementation.

Important Files Changed

Filename Overview
tests/benchmarks/_framework/adapter_base.py Adds AdapterCapabilities frozen Pydantic model with extra="forbid" and ClassVar[AdapterCapabilities] on the ABC — well-designed; safe default (all-False) correctly locks down new adapters.
tests/benchmarks/_framework/registry.py New capabilities_for reads the ClassVar directly from the class when the factory IS the class (no instantiation); falls back to factory().capabilities for closure factories. Previous reviewer concern about unnecessary instantiation is fully resolved.
tests/benchmarks/_framework/config.py Replaces both benchmark != "cloudopsbench" guards with capability lookups via _capabilities_for_lint. Previous review's misleading late-import comment is gone; capabilities_for is imported at module level. Logic is correct.
tests/benchmarks/_framework/adapters.py Adds AdapterCapabilities and capabilities_for to __all__ and re-exports — clean shim update, backward-compatible.
tests/benchmarks/_framework/tests/test_adapter_capabilities.py Thorough tests for the new capability layer including the no-instantiation contract; closure-factory fallback path in capabilities_for has no test coverage.
tests/benchmarks/_framework/tests/test_config.py Test names and assertions updated to match capability-based error messages; new passing tests for CloudOpsBench's declared capabilities — well-aligned with the refactor.
tests/benchmarks/cloudopsbench/adapter.py Adds capabilities = AdapterCapabilities(supports_agent_variant=True, supports_predictor_variant=True) — exactly matches what the previous hardcoded checks allowed, preserving full backward compatibility.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["BenchmarkConfig.lint()"] --> B["_capabilities_for_lint(benchmark)"]
    B --> C["capabilities_for(name) — registry.py"]
    C --> D{Factory is a class?}
    D -- Yes --> E["factory.capabilities (ClassVar read, no __init__)"]
    D -- No --> F["factory().capabilities (closure fallback)"]
    E --> G["AdapterCapabilities"]
    F --> G
    C -- KeyError / ImportError --> H["AdapterCapabilities() — all-False default"]
    B --> G
    G --> I{agent_variant != 'default'?}
    I -- supports_agent_variant=False --> J["Error: adapter must declare supports_agent_variant=True"]
    I -- supports_agent_variant=True --> K["✓ Accepted"]
    G --> L{predictor_variant == 'structured'?}
    L -- supports_predictor_variant=False --> M["Error: adapter must declare supports_predictor_variant=True"]
    L -- supports_predictor_variant=True --> N["Check OpenAI LLM constraint"]
Loading

Reviews (3): Last reviewed commit: "fixed issues" | Re-trigger Greptile

Comment thread tests/benchmarks/_framework/registry.py Outdated
Comment thread tests/benchmarks/_framework/config.py Outdated
Comment thread tests/benchmarks/_framework/config.py
@YauhenBichel

Copy link
Copy Markdown
Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 12, 2026 11:36
@YauhenBichel YauhenBichel merged commit 2d9b732 into main Jun 12, 2026
16 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-low-coupling branch June 12, 2026 11:36
@github-actions

Copy link
Copy Markdown
Contributor

🧑‍💻 @YauhenBichel has entered the contributor hall of fame. Merged. Done. Shipped. Go touch grass (then come back with another PR). 🌱


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

1 participant