fix(bench): cloudopsbench vocab + scope rule + fix-a + taxonomy fix by YauhenBichel · Pull Request #2768 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-07T11:21:53Z

Fixes #2074

Describe the changes you have made in this PR -

CloudOpsBench predictor + scoring patch closing the unseen-shape A@1 collapse
observed in the 2026-06-06 powered run (a1 = 0.014 across 96 unseen-shape
cases). Three logical groups in
tests/benchmarks/cloudopsbench/.

Group A — vocabulary additions + scope rule
(predictor.py, test_predictor_snapping.py)

Seven _ROOT_CAUSES tokens that match unseen-shape ground truths but were
absent on 2026-06-06: pod_network_delay, pod_cpu_overload,
namespace_cpu_quota_exceeded, namespace_memory_quota_exceeded,
namespace_pod_quota_exceeded, namespace_service_quota_exceeded,
namespace_storage_quota_exceeded. Without them ~93% of unseen-shape ground
truths had no legal target, and pod_network_delay mis-snapped onto
node_network_delay (Infrastructure_Fault — wrong family).

A "Scope rule" block in _build_system_prompt addressing namespace-vs-app
confusion in 51 of 169 A.3 cells: namespace_* root_cause requires
namespace/<X> object, multi-service same-namespace failure → namespace-scope
rank-1, plus a single-service carve-out (port / image / probe / secret) so
the rule doesn't over-fire on the 158 seen-shape cases where the matched
baseline scores 0.56.

Group B — Fix-A authoritative investigation framing
(predictor.py, adapter.py, analyze_validation.py, two new configs)

Makes opensre's investigation conclusion AUTHORITATIVE for the predictor's
rank-1 instead of letting the predictor re-diagnose and drop it. The 06-06
run leaked the correct component opensre named from the predictor's top-3
on 15.2% of opensre+llm failures vs 5.7% for llm_alone — a measurement
under-attribution of opensre's contribution. analyze_validation.py is a
read-only post-run analyzer that measures the leak directly per arm.

Validation surface added:

cloudopsbench_fixa_validation_openai.yml — paired 40-case two-arm slice
cloudopsbench_control_openai.yml — three-arm contrast at the chosen floor

Group D — Scoring taxonomy fix
(scoring.py, test_scoring_taxonomy.py)

missing_service_account was mapped to Scheduling_Fault but the dataset's
ground_truth.fault_taxonomy says Admission_Fault on all 10
boutique/admission/* cases (the apiserver rejects pod creation at admission
time). Standalone scorer bug; cost a1 on every affected case.

Plus three new bench configs supporting the pilot + follow-up:

cloudopsbench_vocabpilot_openai.yml
cloudopsbench_vocabpilot_anthropic.yml
cloudopsbench_postpatch_anthropic.yml (full-N follow-up on Claude)

Demo/Screenshot for feature changes and bug fixes -

Pilot on 60 unseen-shape cells, claude-4-sonnet, --dev, 52 min wall time:

metric	2026-06-06 baseline (gpt-4o, unseen-shape)	pilot (claude-4-sonnet)	lift
a1	0.014	0.467	33×
object_a1	0.354	0.667	1.9×
partial_a1	0.017	0.567	33×

28 of 60 cells scored a1 = 1.0. Per-stratum: admission a1 = 0.758,
boutique a1 = 0.718. Trainticket-performance residual is a known follow-up.

Test surface:
$ uv run pytest tests/benchmarks/cloudopsbench/
============================= 173 passed ==============================

Out-of-repo writeups (full audit + ANALYSIS.md addendum):
~/DevBox/tracer-cloud/bench-results-openai/2026-06-06T14-11-16Z_cloudopsbench/{ANALYSIS.md,audit_vocab_fix_lift.py}

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The 2026-06-06 three-arm run aborted at 9h (OpenAI credit exhausted) at 28%
coverage. The unseen-shape stratum collapsed to a1 ≈ 0.01 across all arms
while object_a1 was ~0.40 — a 35-point gap that looked structural. A
counterfactual audit on the 2,286 written cells showed 92.7% of unseen-shape
ground-truth tokens were missing from the predictor's controlled vocabulary,
so the agent localized the right component but the scorer's strict-match step
always failed. The seven Group A additions make those tokens legal targets in
the prompt and snap dictionary; the existing snap cutoff (0.8) and
blocked-concept pairs guard are preserved so vocabulary growth cannot regress
prior snaps.

A second audit found 51 of 169 A.3 cells (rank-1 object wrong) had
namespace-scope ground truth but app-scope rank-1 predictions; reports already
named the right namespace, the LLM just defaulted to picking an affected
service. A post-hoc family-based rewrite would have rescued ~1 of 51, so the
fix has to teach upstream — that's the Group A scope rule. The single-service
carve-out is load-bearing: without it the rule over-fires on the seen-shape
app-scope cases.

Group B is independent: even with correct vocab and scope, the predictor was
discarding opensre's named component on 15.2% of failures. Making the
investigation summary authoritative + measuring the leak directly (via
analyze_validation.py) closes that loop. cloudopsbench_fixa_validation_openai.yml
is the paired-arm config that quantifies Fix-A's contribution in isolation.

Group D is a standalone scorer bug; verified against
benchmark/<system>/<category>/<id>/metadata.json ground-truth files.

A deterministic performance-fault localizer was prototyped and held back from
this PR pending separate validation — see WIP branch wip/group-c-perf-localizer.

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-07T11:22:02Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-07T11:25:59Z

@greptile review

greptile-apps · 2026-06-07T11:30:40Z

Greptile Summary

Three targeted fixes to the CloudOpsBench predictor/scorer pipeline, addressing the unseen-shape a1 collapse (a1 ≈ 0.01) observed in the 2026-06-06 run by correcting a vocabulary gap, a prompt framing bug, and a taxonomy misclassification.

Group A (vocab + scope rule): Adds 7 missing _ROOT_CAUSES tokens that caused ~93% of unseen-shape ground truths to have no legal target, and adds a namespace-vs-app scope rule to the system prompt with a single-service carve-out.
Group B (Fix-A framing): _summarize_investigation now leads with opensre's identified component; the system prompt marks a provided investigation summary as AUTHORITATIVE for rank-1; adds analyze_validation.py and performance_alert_localization.py.
Group D (taxonomy bug): missing_service_account remapped from Scheduling_Fault to Admission_Fault to match ground-truth metadata; five new benchmark configs added.

Confidence Score: 4/5

Safe to merge with one implementation gap to resolve: the namespace parameter in infer_performance_localization is documented as filtering node-level entities but is never used.

The core fixes are correct and well-tested. The namespace parameter in infer_performance_localization is accepted as required and documented as actively used for node-entity rejection, but is dead code — the implementation uses a hardcoded node-name set instead. Any benchmark environment with differently-named nodes would silently produce invalid app/<node-name> fault objects, zeroing a1 on those cells.

tests/benchmarks/cloudopsbench/performance_alert_localization.py — the namespace parameter and its node-filtering intent need reconciliation before results from heterogeneous clusters can be trusted.

Important Files Changed

Filename	Overview
tests/benchmarks/cloudopsbench/performance_alert_localization.py	New module for deterministic alert-driven performance localization. Contains a dead `namespace` parameter in `infer_performance_localization` — documented as used for node-entity rejection but never referenced; actual filtering uses a hardcoded node set.
tests/benchmarks/cloudopsbench/adapter.py	Group B Fix-A changes: `_summarize_investigation` now leads with the identified component, and `performance_context_for_case_dir` is called unconditionally before the `fault_category` guard discards results for non-performance cases.
tests/benchmarks/cloudopsbench/predictor.py	Adds 7 missing vocab tokens to `_ROOT_CAUSES`, new scope rule and performance-disambiguation block in the system prompt. The investigation-summary header in `_build_user_prompt` leaks 'performance localization block below' language into non-performance prompts.
tests/benchmarks/cloudopsbench/scoring.py	Standalone taxonomy bug fix: `missing_service_account` moved from `Scheduling_Fault` to `Admission_Fault` to match ground-truth metadata on all 10 boutique/admission cases.
tests/benchmarks/cloudopsbench/analyze_validation.py	New read-only post-run analyzer for Fix-A validation. The `scen_a1` closure already uses the default-argument capture idiom to pin the loop variable correctly.
tests/benchmarks/cloudopsbench/test_predictor_snapping.py	New tests pin the 7 vocab additions as snap-stable, verify `pod_network_delay` doesn't collapse onto `node_network_delay`, and guard the scope-rule and carve-out prompt directives.
tests/benchmarks/cloudopsbench/test_scoring_taxonomy.py	Updated taxonomy test: `missing_service_account` mapped to `Admission_Fault`, old `Scheduling_Fault` assertion removed. Clean.
tests/benchmarks/cloudopsbench/test_performance_alert_localization.py	New tests for the performance localization module; parametrize against real benchmark case IDs with a skip guard for missing data.
tests/benchmarks/configs/cloudopsbench_postpatch_anthropic.yml	New full-N Anthropic config for publication-grade run; well-documented with cost projections and decision rules.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CloudOpsBenchAdapter.post_process_run] --> B[build_alert]
    A --> C[_require_case]
    A --> D[_summarize_investigation]
    C --> E{fault_category == performance?}
    E -- yes --> F[performance_context_for_case_dir]
    F --> G[metric_alerts]
    F --> H[perf_hint]
    E -- no --> I[metric_alerts=empty, perf_hint=None]
    D --> J[emit_paper_predictions]
    G --> J
    H --> J
    I --> J
    J --> K[_build_system_prompt]
    J --> L[_build_user_prompt]
    L --> M[LLM call]
    M --> N[top_3_predictions]
    N --> O[enriched RunResult]

_{Reviews (4): Last reviewed commit: "fix(bench): guard metric_alerts + obj co..." | Re-trigger Greptile}

YauhenBichel · 2026-06-07T11:36:29Z

@greptile review

YauhenBichel · 2026-06-07T11:47:34Z

@greptile review

YauhenBichel · 2026-06-07T11:56:39Z

@greptile review

github-actions · 2026-06-07T12:04:00Z

🐉 Legend says enough merged PRs and you ascend. @YauhenBichel is dangerously close. 🌤️

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

fix bench config

2c13c17

greptile-apps Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread tests/benchmarks/cloudopsbench/analyze_validation.py Outdated

Comment thread tests/benchmarks/cloudopsbench/analyze_validation.py

Comment thread tests/benchmarks/cloudopsbench/analyze_validation.py

YauhenBichel changed the title ~~fix(bench): cloudopsbench predictor — vocab + scope rule for unseen-shape lift~~ fix(bench): cloudopsbench vocab + scope rule + fix-a + taxonomy fix Jun 7, 2026

improving bench predictor

92c24ac

github-advanced-security AI found potential problems Jun 7, 2026

View reviewed changes

Comment thread tests/benchmarks/cloudopsbench/test_performance_alert_localization.py Fixed

github-code-quality Bot found potential problems Jun 7, 2026

View reviewed changes

Comment thread tests/benchmarks/cloudopsbench/test_performance_alert_localization.py Fixed

refactoring

279876b

YauhenBichel added 3 commits June 7, 2026 12:48

format fixed

c1bccf2

fix(bench): skip perf-localization tests when corpus is absent

68cca15

fix(bench): guard metric_alerts + obj counter symmetry

da6e442

YauhenBichel marked this pull request as ready for review June 7, 2026 11:56

YauhenBichel merged commit 02a1fdd into main Jun 7, 2026
17 checks passed

YauhenBichel deleted the fix/2074-bench-openai-vocab-fargat branch June 7, 2026 12:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bench): cloudopsbench vocab + scope rule + fix-a + taxonomy fix#2768

fix(bench): cloudopsbench vocab + scope rule + fix-a + taxonomy fix#2768
YauhenBichel merged 6 commits into
mainfrom
fix/2074-bench-openai-vocab-fargat

YauhenBichel commented Jun 7, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 7, 2026

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

greptile-apps Bot commented Jun 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

Uh oh!

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

YauhenBichel commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the changes you have made in this PR -

Demo/Screenshot for feature changes and bug fixes -

Code Understanding and AI Usage

Checklist before requesting a review

Uh oh!

github-actions Bot commented Jun 7, 2026

Greptile code review

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

greptile-apps Bot commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

Uh oh!

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

YauhenBichel commented Jun 7, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

YauhenBichel commented Jun 7, 2026 •

edited

Loading

greptile-apps Bot commented Jun 7, 2026 •

edited

Loading