fix(bench): cloudopsbench vocab + scope rule + fix-a + taxonomy fix #2768

Merged
YauhenBichel merged 6 commits into main from fix/2074-bench-openai-vocab-fargat
Jun 7, 2026

@YauhenBichel YauhenBichel commented Jun 7, 2026

Collaborator

Fixes #2074

Describe the changes you have made in this PR -

CloudOpsBench predictor + scoring patch closing the unseen-shape A@1 collapse
observed in the 2026-06-06 powered run (a1 = 0.014 across 96 unseen-shape
cases). Three logical groups in
tests/benchmarks/cloudopsbench/.

Group A — vocabulary additions + scope rule
(predictor.py, test_predictor_snapping.py)

Seven _ROOT_CAUSES tokens that match unseen-shape ground truths but were
absent on 2026-06-06: pod_network_delay, pod_cpu_overload,
namespace_cpu_quota_exceeded, namespace_memory_quota_exceeded,
namespace_pod_quota_exceeded, namespace_service_quota_exceeded,
namespace_storage_quota_exceeded. Without them ~93% of unseen-shape ground
truths had no legal target, and pod_network_delay mis-snapped onto
node_network_delay (Infrastructure_Fault — wrong family).

A "Scope rule" block in _build_system_prompt addressing namespace-vs-app
confusion in 51 of 169 A.3 cells: namespace_* root_cause requires
namespace/<X> object, multi-service same-namespace failure → namespace-scope
rank-1, plus a single-service carve-out (port / image / probe / secret) so
the rule doesn't over-fire on the 158 seen-shape cases where the matched
baseline scores 0.56.

Group B — Fix-A authoritative investigation framing
(predictor.py, adapter.py, analyze_validation.py, two new configs)

Makes opensre's investigation conclusion AUTHORITATIVE for the predictor's
rank-1 instead of letting the predictor re-diagnose and drop it. In the 06-06
run, the correct component that opensre named was missing from the predictor's
top-3 on 15.2% of opensre+llm failures vs 5.7% for llm_alone, so the scores
under-attributed opensre's contribution. analyze_validation.py is a
read-only post-run analyzer that measures the leak directly per arm.

Validation surface added:

  • cloudopsbench_fixa_validation_openai.yml — paired 40-case two-arm slice
  • cloudopsbench_control_openai.yml — three-arm contrast at the chosen floor

Group D — Scoring taxonomy fix
(scoring.py, test_scoring_taxonomy.py)

missing_service_account was mapped to Scheduling_Fault but the dataset's
ground_truth.fault_taxonomy says Admission_Fault on all 10
boutique/admission/* cases (the apiserver rejects pod creation at admission
time). Standalone scorer bug; cost a1 on every affected case.
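The effect of the remap can be sketched as a lookup plus an equality check. This is a minimal illustration, not the contents of scoring.py: the dictionary name and the helper are hypothetical, and only the missing_service_account entry comes from the description above.

```python
# Hypothetical mapping for illustration; only this one entry is taken
# from the PR description. The real table in scoring.py may look different.
ROOT_CAUSE_TO_TAXONOMY = {
    "missing_service_account": "Admission_Fault",  # was "Scheduling_Fault" before the fix
}


def taxonomy_matches(predicted_root_cause: str, ground_truth_taxonomy: str) -> bool:
    """Return True when the scorer's taxonomy for a root cause equals the dataset's."""
    return ROOT_CAUSE_TO_TAXONOMY.get(predicted_root_cause) == ground_truth_taxonomy
```

With the old Scheduling_Fault entry, a check of this kind would fail on every case whose ground truth says Admission_Fault, even when the predicted root cause is right.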

Plus three new bench configs supporting the pilot + follow-up:

  • cloudopsbench_vocabpilot_openai.yml
  • cloudopsbench_vocabpilot_anthropic.yml
  • cloudopsbench_postpatch_anthropic.yml (full-N follow-up on Claude)

Demo/Screenshot for feature changes and bug fixes -

Pilot on 60 unseen-shape cells, claude-4-sonnet, --dev, 52 min wall time:

| metric | 2026-06-06 baseline (gpt-4o, unseen-shape) | pilot (claude-4-sonnet) | lift |
| --- | --- | --- | --- |
| a1 | 0.014 | 0.467 | 33× |
| object_a1 | 0.354 | 0.667 | 1.9× |
| partial_a1 | 0.017 | 0.567 | 33× |
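The lift column is the pilot value divided by the baseline value. A short check of that arithmetic, using only the numbers reported here (the variable and function names are illustrative):

```python
# Values copied from the pilot results above.
baseline = {"a1": 0.014, "object_a1": 0.354, "partial_a1": 0.017}
pilot = {"a1": 0.467, "object_a1": 0.667, "partial_a1": 0.567}


def lift(before: float, after: float) -> float:
    """Multiplicative lift of the pilot over the baseline."""
    return after / before


lifts = {metric: lift(baseline[metric], pilot[metric]) for metric in baseline}

# 28 of 60 cells scoring a1 = 1.0 corresponds to the reported pilot a1.
pilot_a1_from_counts = round(28 / 60, 3)
```

Note that the baseline and pilot columns use different models, so the ratio mixes the patch effect with the model change.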

28 of 60 cells scored a1 = 1.0. Per-stratum: admission a1 = 0.758,
boutique a1 = 0.718. Trainticket-performance residual is a known follow-up.

Test surface:
$ uv run pytest tests/benchmarks/cloudopsbench/
============================= 173 passed ==============================

Out-of-repo writeups (full audit + ANALYSIS.md addendum):
~/DevBox/tracer-cloud/bench-results-openai/2026-06-06T14-11-16Z_cloudopsbench/{ANALYSIS.md,audit_vocab_fix_lift.py}


Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

The 2026-06-06 three-arm run aborted after 9h at 28% coverage when the OpenAI
credit ran out. The unseen-shape stratum collapsed to a1 ≈ 0.01 across all arms
while object_a1 was ~0.40, a 35-point gap that looked structural. A
counterfactual audit on the 2,286 written cells showed 92.7% of unseen-shape
ground-truth tokens were missing from the predictor's controlled vocabulary,
so the agent localized the right component but the scorer's strict-match step
always failed. The seven Group A additions make those tokens legal targets in
the prompt and snap dictionary; the existing snap cutoff (0.8) and
blocked-concept pairs guard are preserved so vocabulary growth cannot regress
prior snaps.
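A minimal sketch of how a snap step with a 0.8 cutoff behaves, assuming a difflib-style similarity ratio. The actual matcher in predictor.py is not shown in this PR description, so the snap function, the token-level blocked-pair check, and the small vocabularies below are assumptions for illustration only.

```python
import difflib
from typing import Optional, Sequence, Tuple

SNAP_CUTOFF = 0.8  # cutoff value quoted in the description


def snap(
    raw: str,
    vocab: Sequence[str],
    blocked_pairs: Sequence[Tuple[str, str]] = (),
) -> Optional[str]:
    """Snap a free-text label onto the closest legal token, or return None."""
    if raw in vocab:
        return raw  # exact tokens never move
    matches = difflib.get_close_matches(raw, vocab, n=1, cutoff=SNAP_CUTOFF)
    if not matches:
        return None
    candidate = matches[0]
    raw_tokens, cand_tokens = set(raw.split("_")), set(candidate.split("_"))
    for a, b in blocked_pairs:
        # Reject snaps that would swap one blocked concept for the other.
        if (a in raw_tokens and b in cand_tokens) or (b in raw_tokens and a in cand_tokens):
            return None
    return candidate


old_vocab = ["node_network_delay"]                       # token absent, as on 2026-06-06
new_vocab = ["node_network_delay", "pod_network_delay"]  # after the vocabulary addition

mis_snapped = snap("pod_network_delay", old_vocab)
exact = snap("pod_network_delay", new_vocab)
guarded = snap("pod_network_delay", old_vocab, blocked_pairs=[("pod", "node")])
```

Under these assumptions, the missing token snaps onto node_network_delay, the added token maps to itself, and a hypothetical pod/node blocked pair turns the cross-family snap into no match.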

A second audit found 51 of 169 A.3 cells (rank-1 object wrong) had
namespace-scope ground truth but app-scope rank-1 predictions. The reports
already named the right namespace; the LLM simply defaulted to picking an
affected service. A post-hoc family-based rewrite would have rescued ~1 of 51,
so the fix has to happen upstream, in the prompt: that's the Group A scope rule.
The single-service carve-out is load-bearing: without it the rule over-fires on
the seen-shape app-scope cases.
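The kind of count behind the "51 of 169" figure can be sketched as follows, assuming fault objects are written as scope/name strings such as namespace/<X> and app/<service>. The helper names and the sample cells are hypothetical; the audit script itself lives outside the repo.

```python
from typing import Iterable, Tuple


def scope_of(fault_object: str) -> str:
    """Return the scope prefix of an object like 'namespace/boutique' or 'app/cart'."""
    return fault_object.split("/", 1)[0]


def count_scope_confusions(cells: Iterable[Tuple[str, str]]) -> int:
    """Count cells with namespace-scope ground truth but an app-scope rank-1 prediction."""
    return sum(
        1
        for truth, rank1 in cells
        if scope_of(truth) == "namespace" and scope_of(rank1) == "app"
    )


# Made-up (ground truth, rank-1 prediction) pairs, for illustration only.
sample_cells = [
    ("namespace/boutique", "app/cartservice"),   # namespace-vs-app confusion
    ("namespace/boutique", "namespace/boutique"),
    ("app/frontend", "app/checkout"),            # wrong object, same scope
]
confusions = count_scope_confusions(sample_cells)
```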

Group B is independent: even with correct vocab and scope, the predictor was
discarding opensre's named component on 15.2% of failures. Making the
investigation summary authoritative + measuring the leak directly (via
analyze_validation.py) closes that loop. cloudopsbench_fixa_validation_openai.yml
is the paired-arm config that quantifies Fix-A's contribution in isolation.
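A sketch of the leak metric as described: the share of failed cases where opensre named the correct component but that component is absent from the predictor's top-3. The record schema (truth, opensre_component, top3) is an assumption and may not match what analyze_validation.py actually reads.

```python
from typing import Dict, List


def leak_rate(failures: List[Dict]) -> float:
    """Fraction of failures where opensre's correct component is missing from top-3."""
    if not failures:
        return 0.0
    leaked = sum(
        1
        for f in failures
        if f["opensre_component"] == f["truth"] and f["truth"] not in f["top3"]
    )
    return leaked / len(failures)


# Hypothetical failure records for one arm.
sample_failures = [
    {"truth": "app/cart", "opensre_component": "app/cart", "top3": ["app/web", "app/db", "app/auth"]},
    {"truth": "app/cart", "opensre_component": "app/cart", "top3": ["app/web", "app/cart", "app/db"]},
    {"truth": "app/cart", "opensre_component": "app/web", "top3": ["app/web", "app/db", "app/auth"]},
    {"truth": "app/pay", "opensre_component": None, "top3": ["app/web", "app/db", "app/auth"]},
]
sample_leak = leak_rate(sample_failures)
```

Computing this per arm gives the comparison quoted above (15.2% for opensre+llm vs 5.7% for llm_alone).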

Group D is a standalone scorer bug; verified against
benchmark/<system>/<category>/<id>/metadata.json ground-truth files.

A deterministic performance-fault localizer was prototyped and held back from
this PR pending separate validation — see WIP branch wip/group-c-perf-localizer.


Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

github-actions Bot commented Jun 7, 2026

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 7, 2026

Contributor

Greptile Summary

Three targeted fixes to the CloudOpsBench predictor/scorer pipeline, addressing the unseen-shape a1 collapse (a1 ≈ 0.01) observed in the 2026-06-06 run by correcting a vocabulary gap, a prompt framing bug, and a taxonomy misclassification.

  • Group A (vocab + scope rule): Adds 7 missing _ROOT_CAUSES tokens that caused ~93% of unseen-shape ground truths to have no legal target, and adds a namespace-vs-app scope rule to the system prompt with a single-service carve-out.
  • Group B (Fix-A framing): _summarize_investigation now leads with opensre's identified component; the system prompt marks a provided investigation summary as AUTHORITATIVE for rank-1; adds analyze_validation.py and performance_alert_localization.py.
  • Group D (taxonomy bug): missing_service_account remapped from Scheduling_Fault to Admission_Fault to match ground-truth metadata; five new benchmark configs added.

Confidence Score: 4/5

Safe to merge with one implementation gap to resolve: the namespace parameter in infer_performance_localization is documented as filtering node-level entities but is never used.

The core fixes are correct and well-tested. The namespace parameter in infer_performance_localization is accepted as required and documented as actively used for node-entity rejection, but is dead code — the implementation uses a hardcoded node-name set instead. Any benchmark environment with differently-named nodes would silently produce invalid app/<node-name> fault objects, zeroing a1 on those cells.

tests/benchmarks/cloudopsbench/performance_alert_localization.py — the namespace parameter and its node-filtering intent need reconciliation before results from heterogeneous clusters can be trusted.
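The failure mode described here can be illustrated with a small stand-in. This is not the module's code: the function, the node names, and the parameter are all hypothetical, and it only shows why a hardcoded node set breaks on clusters whose nodes are named differently.

```python
from typing import FrozenSet, Optional

# Hypothetical stand-in for a hardcoded node-name set.
HARDCODED_NODES: FrozenSet[str] = frozenset({"master", "worker-01"})


def to_fault_object(
    entity: str, known_nodes: FrozenSet[str] = HARDCODED_NODES
) -> Optional[str]:
    """Map an alert entity to an app-scope fault object, rejecting node names."""
    if entity in known_nodes:
        return None  # node-level entity: not a valid app/<name> target
    return f"app/{entity}"


rejected = to_fault_object("worker-01")
# A node from a differently named cluster slips through the hardcoded set.
slipped = to_fault_object("ip-10-0-0-7")
# Supplying the cluster's own node names closes the gap.
fixed = to_fault_object("ip-10-0-0-7", frozenset({"ip-10-0-0-7"}))
```

One possible reconciliation is to derive the node set from the environment the caller already knows about rather than from a constant. Whether the namespace parameter is the right carrier for that is a design decision this sketch does not settle.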

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/performance_alert_localization.py | New module for deterministic alert-driven performance localization. Contains a dead namespace parameter in infer_performance_localization: documented as used for node-entity rejection but never referenced; actual filtering uses a hardcoded node set. |
| tests/benchmarks/cloudopsbench/adapter.py | Group B Fix-A changes: _summarize_investigation now leads with the identified component, and performance_context_for_case_dir is called unconditionally before the fault_category guard discards results for non-performance cases. |
| tests/benchmarks/cloudopsbench/predictor.py | Adds 7 missing vocab tokens to _ROOT_CAUSES, new scope rule and performance-disambiguation block in the system prompt. The investigation-summary header in _build_user_prompt leaks 'performance localization block below' language into non-performance prompts. |
| tests/benchmarks/cloudopsbench/scoring.py | Standalone taxonomy bug fix: missing_service_account moved from Scheduling_Fault to Admission_Fault to match ground-truth metadata on all 10 boutique/admission cases. |
| tests/benchmarks/cloudopsbench/analyze_validation.py | New read-only post-run analyzer for Fix-A validation. The scen_a1 closure already uses the default-argument capture idiom to pin the loop variable correctly. |
| tests/benchmarks/cloudopsbench/test_predictor_snapping.py | New tests pin the 7 vocab additions as snap-stable, verify pod_network_delay doesn't collapse onto node_network_delay, and guard the scope-rule and carve-out prompt directives. |
| tests/benchmarks/cloudopsbench/test_scoring_taxonomy.py | Updated taxonomy test: missing_service_account mapped to Admission_Fault, old Scheduling_Fault assertion removed. Clean. |
| tests/benchmarks/cloudopsbench/test_performance_alert_localization.py | New tests for the performance localization module; parametrize against real benchmark case IDs with a skip guard for missing data. |
| tests/benchmarks/configs/cloudopsbench_postpatch_anthropic.yml | New full-N Anthropic config for publication-grade run; well-documented with cost projections and decision rules. |
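The default-argument capture idiom mentioned for analyze_validation.py addresses Python's late-binding closures. The sketch below shows the general pattern only; the function names and the row shape are illustrative and are not taken from the analyzer.

```python
def make_scorers_late(scenarios):
    # Each lambda looks up `s` when called, after the loop has finished,
    # so every scorer ends up comparing against the last scenario.
    return [lambda row: row["scenario"] == s for s in scenarios]


def make_scorers_pinned(scenarios):
    # `s=s` evaluates the default at definition time, pinning the
    # current loop value inside each lambda.
    return [lambda row, s=s: row["scenario"] == s for s in scenarios]


row = {"scenario": "admission"}
late = make_scorers_late(["admission", "performance"])
pinned = make_scorers_pinned(["admission", "performance"])

late_first = late[0](row)      # compares against "performance"
pinned_first = pinned[0](row)  # compares against "admission"
```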

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CloudOpsBenchAdapter.post_process_run] --> B[build_alert]
    A --> C[_require_case]
    A --> D[_summarize_investigation]
    C --> E{fault_category == performance?}
    E -- yes --> F[performance_context_for_case_dir]
    F --> G[metric_alerts]
    F --> H[perf_hint]
    E -- no --> I[metric_alerts=empty, perf_hint=None]
    D --> J[emit_paper_predictions]
    G --> J
    H --> J
    I --> J
    J --> K[_build_system_prompt]
    J --> L[_build_user_prompt]
    L --> M[LLM call]
    M --> N[top_3_predictions]
    N --> O[enriched RunResult]

Reviews (4): Last reviewed commit: "fix(bench): guard metric_alerts + obj co..."

Comment thread tests/benchmarks/cloudopsbench/analyze_validation.py Outdated
Comment thread tests/benchmarks/cloudopsbench/analyze_validation.py
Comment thread tests/benchmarks/cloudopsbench/analyze_validation.py
@YauhenBichel YauhenBichel changed the title fix(bench): cloudopsbench predictor — vocab + scope rule for unseen-shape lift fix(bench): cloudopsbench vocab + scope rule + fix-a + taxonomy fix Jun 7, 2026
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 7, 2026 11:56
@YauhenBichel YauhenBichel merged commit 02a1fdd into main Jun 7, 2026
17 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-openai-vocab-fargat branch June 7, 2026 12:03
@github-actions

github-actions Bot commented Jun 7, 2026

Contributor

🐉 Legend says enough merged PRs and you ascend. @YauhenBichel is dangerously close. 🌤️


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

2 participants