fix(bench): cloudopsbench vocab + scope rule + fix-a + taxonomy fix #2768
Greptile Summary
Three targeted fixes to the CloudOpsBench predictor/scorer pipeline, addressing the unseen-shape a1 collapse (a1 ≈ 0.01) observed in the 2026-06-06 run by correcting a vocabulary gap, a prompt framing bug, and a taxonomy misclassification.
Confidence Score: 4/5
Safe to merge with one implementation gap to resolve: the
The core fixes are correct and well-tested. The
tests/benchmarks/cloudopsbench/performance_alert_localization.py: the
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[CloudOpsBenchAdapter.post_process_run] --> B[build_alert]
A --> C[_require_case]
A --> D[_summarize_investigation]
C --> E{fault_category == performance?}
E -- yes --> F[performance_context_for_case_dir]
F --> G[metric_alerts]
F --> H[perf_hint]
E -- no --> I[metric_alerts=empty, perf_hint=None]
D --> J[emit_paper_predictions]
G --> J
H --> J
I --> J
J --> K[_build_system_prompt]
J --> L[_build_user_prompt]
L --> M[LLM call]
M --> N[top_3_predictions]
N --> O[enriched RunResult]
Reviews (4): Last reviewed commit: "fix(bench): guard metric_alerts + obj co..."

Fixes #2074
Describe the changes you have made in this PR -
CloudOpsBench predictor + scoring patch closing the unseen-shape A@1 collapse
observed in the 2026-06-06 powered run (a1 = 0.014 across 96 unseen-shape
cases). Three logical groups in tests/benchmarks/cloudopsbench/.

Group A: vocabulary additions + scope rule
(predictor.py, test_predictor_snapping.py)

Seven _ROOT_CAUSES tokens that match unseen-shape ground truths but were
absent on 2026-06-06: pod_network_delay, pod_cpu_overload,
namespace_cpu_quota_exceeded, namespace_memory_quota_exceeded,
namespace_pod_quota_exceeded, namespace_service_quota_exceeded,
namespace_storage_quota_exceeded. Without them ~93% of unseen-shape ground
truths had no legal target, and pod_network_delay mis-snapped onto
node_network_delay (Infrastructure_Fault, the wrong family).

A "Scope rule" block in _build_system_prompt addressing namespace-vs-app
confusion in 51 of 169 A.3 cells: a namespace_* root_cause requires a
namespace/<X> object, a multi-service same-namespace failure gets a
namespace-scope rank-1, plus a single-service carve-out (port / image /
probe / secret) so the rule doesn't over-fire on the 158 seen-shape cases
where the matched baseline scores 0.56.

Group B: Fix-A authoritative investigation framing
(predictor.py, adapter.py, analyze_validation.py, two new configs)

Makes opensre's investigation conclusion AUTHORITATIVE for the predictor's
rank-1 instead of letting the predictor re-diagnose and drop it. In the 06-06
run, the correct component that opensre had named went missing from the
predictor's top-3 on 15.2% of opensre+llm failures vs 5.7% for llm_alone, a
measurement under-attribution of opensre's contribution.
analyze_validation.py is a read-only post-run analyzer that measures the leak
directly per arm.

Validation surface added:
cloudopsbench_fixa_validation_openai.yml: paired 40-case two-arm slice
cloudopsbench_control_openai.yml: three-arm contrast at the chosen floor

Group D: Scoring taxonomy fix
(scoring.py, test_scoring_taxonomy.py)

missing_service_account was mapped to Scheduling_Fault but the dataset's
ground_truth.fault_taxonomy says Admission_Fault on all 10
boutique/admission/* cases (the apiserver rejects pod creation at admission
time). Standalone scorer bug; cost a1 on every affected case.
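
A minimal sketch of the kind of check that catches this class of bug, assuming each metadata.json carries a ground_truth object with fault_taxonomy and a root-cause field. The key name root_cause, the helper names, and the two-entry mapping are hypothetical; the actual table lives in scoring.py and is not reproduced here.

```python
from __future__ import annotations

import json
from pathlib import Path

# Illustrative two-entry mapping; the real table in scoring.py is larger.
ROOT_CAUSE_TO_TAXONOMY = {
    "missing_service_account": "Admission_Fault",  # was Scheduling_Fault before the fix
    "node_network_delay": "Infrastructure_Fault",
}


def taxonomy_mismatch(metadata: dict, mapping: dict[str, str]) -> tuple[str, str, str] | None:
    """Return (root_cause, mapped, dataset) when the scorer mapping disagrees with the dataset."""
    truth = metadata["ground_truth"]
    root_cause = truth["root_cause"]  # assumed key name
    mapped = mapping.get(root_cause)
    if mapped is not None and mapped != truth["fault_taxonomy"]:
        return (root_cause, mapped, truth["fault_taxonomy"])
    return None


def audit_benchmark(root: Path, mapping: dict[str, str]) -> list[tuple[str, str, str]]:
    """Walk <system>/<category>/<id>/metadata.json files and collect every disagreement."""
    found = []
    for meta_path in sorted(root.glob("*/*/*/metadata.json")):
        mismatch = taxonomy_mismatch(json.loads(meta_path.read_text()), mapping)
        if mismatch is not None:
            found.append(mismatch)
    return found
```

Running such an audit against the dataset, rather than against hand-written expectations, keeps the scorer's taxonomy tied to the ground-truth files.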

Plus three new bench configs supporting the pilot + follow-up:
cloudopsbench_vocabpilot_openai.yml
cloudopsbench_vocabpilot_anthropic.yml
cloudopsbench_postpatch_anthropic.yml (full-N follow-up on Claude)

Demo/Screenshot for feature changes and bug fixes -
Pilot on 60 unseen-shape cells, claude-4-sonnet, --dev, 52 min wall time:
28 of 60 cells scored a1 = 1.0. Per-stratum: admission a1 = 0.758,
boutique a1 = 0.718. Trainticket-performance residual is a known follow-up.
Test surface:
$ uv run pytest tests/benchmarks/cloudopsbench/
============================= 173 passed ==============================
Out-of-repo writeups (full audit + ANALYSIS.md addendum):
~/DevBox/tracer-cloud/bench-results-openai/2026-06-06T14-11-16Z_cloudopsbench/{ANALYSIS.md,audit_vocab_fix_lift.py}

Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
The 2026-06-06 three-arm run aborted at 9h (OpenAI credit exhausted) at 28%
coverage. The unseen-shape stratum collapsed to a1 ≈ 0.01 across all arms
while object_a1 was ~0.40, a 35-point gap that looked structural. A
counterfactual audit on the 2,286 written cells showed 92.7% of unseen-shape
ground-truth tokens were missing from the predictor's controlled vocabulary,
so the agent localized the right component but the scorer's strict-match step
always failed. The seven Group A additions make those tokens legal targets in
the prompt and snap dictionary; the existing snap cutoff (0.8) and
blocked-concept pairs guard are preserved so vocabulary growth cannot regress
prior snaps.
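
To illustrate why the missing tokens mattered, here is a rough sketch of cutoff-based snapping with a blocked-pair guard. It uses difflib from the standard library as a stand-in; the actual matcher in predictor.py, the contents of the blocked-pair list, and the names snap, OLD_VOCAB, NEW_VOCAB and BLOCKED_PAIRS are all assumptions. Only the 0.8 cutoff and the token names come from the description above.

```python
from __future__ import annotations

from difflib import get_close_matches

SNAP_CUTOFF = 0.8  # cutoff quoted in this PR description

# Illustrative vocabularies, not the full _ROOT_CAUSES list.
OLD_VOCAB = ["node_network_delay", "namespace_memory_quota_exceeded"]
NEW_VOCAB = OLD_VOCAB + ["pod_network_delay", "namespace_cpu_quota_exceeded"]

# Hypothetical blocked-concept pairs: a raw token containing the first word
# must never snap to a vocabulary entry containing the second.
BLOCKED_PAIRS = [("cpu", "memory"), ("memory", "cpu")]


def snap(raw: str, vocab: list[str]) -> str | None:
    """Map a free-text root cause onto the controlled vocabulary, or None."""
    token = raw.strip().lower().replace("-", "_").replace(" ", "_")
    if token in vocab:
        return token  # exact matches always win
    words = token.split("_")
    for candidate in get_close_matches(token, vocab, n=3, cutoff=SNAP_CUTOFF):
        cand_words = candidate.split("_")
        if any(a in words and b in cand_words for a, b in BLOCKED_PAIRS):
            continue  # similar string, conflicting concept
        return candidate
    return None
```

With the old vocabulary, snap("pod_network_delay", OLD_VOCAB) lands on node_network_delay because the two strings are well above the cutoff; once the token is a legal target, the exact-match branch returns it unchanged.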
A second audit found 51 of 169 A.3 cells (rank-1 object wrong) had
namespace-scope ground truth but app-scope rank-1 predictions; reports already
named the right namespace, the LLM just defaulted to picking an affected
service. A post-hoc family-based rewrite would have rescued ~1 of 51, so the
fix has to teach upstream — that's the Group A scope rule. The single-service
carve-out is load-bearing: without it the rule over-fires on the seen-shape
app-scope cases.
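
The shape of such a prompt block can be sketched as follows. The wording is hypothetical and is not the text added to _build_system_prompt; the names build_scope_rule, violates_scope_rule and SINGLE_SERVICE_CARVE_OUTS are invented for illustration. Only the three rule components and the four carve-out kinds come from the description above.

```python
# Carve-out kinds named in the PR description.
SINGLE_SERVICE_CARVE_OUTS = ("port", "image", "probe", "secret")


def build_scope_rule() -> str:
    """Return an illustrative scope-rule block for a system prompt."""
    carve_outs = " / ".join(SINGLE_SERVICE_CARVE_OUTS)
    return "\n".join(
        [
            "Scope rule:",
            "1. A namespace_* root_cause must be paired with a namespace/<X> object.",
            "2. If several services in the same namespace fail together, "
            "rank the namespace-scope cause first.",
            f"3. Exception: a single-service failure ({carve_outs}) stays app-scope.",
        ]
    )


def violates_scope_rule(root_cause: str, obj: str) -> bool:
    """Flag a prediction whose namespace_* root cause names a non-namespace object."""
    return root_cause.startswith("namespace_") and not obj.startswith("namespace/")
```

The violates_scope_rule helper only detects an inconsistent pair; it does not rewrite predictions, in line with the finding that post-hoc rewriting rescues almost nothing.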
Group B is independent: even with correct vocab and scope, the predictor was
discarding opensre's named component on 15.2% of failures. Making the
investigation summary authoritative + measuring the leak directly (via
analyze_validation.py) closes that loop. cloudopsbench_fixa_validation_openai.yml is the paired-arm config that quantifies Fix-A's contribution in isolation.
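
One way to express the leak metric, as a sketch only: among failed cells in each arm, count those where the investigation named the ground-truth component but the predictor's top-3 omitted it. The Cell record and its field names are hypothetical and do not reflect the actual schema read by analyze_validation.py.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Cell:
    arm: str                          # e.g. "opensre+llm" or "llm_alone"
    correct: bool                     # rank-1 prediction matched ground truth
    truth_object: str                 # ground-truth component
    investigation_objects: list[str]  # components named by the investigation
    top3_objects: list[str]           # components in the predictor's top-3


def leak_rate_by_arm(cells: list[Cell]) -> dict[str, float]:
    """Fraction of failures per arm where a correctly named component was dropped."""
    failures: dict[str, int] = defaultdict(int)
    leaks: dict[str, int] = defaultdict(int)
    for cell in cells:
        if cell.correct:
            continue
        failures[cell.arm] += 1
        named = cell.truth_object in cell.investigation_objects
        kept = cell.truth_object in cell.top3_objects
        if named and not kept:
            leaks[cell.arm] += 1
    return {arm: leaks[arm] / total for arm, total in failures.items()}
```

Because the function only reads finished results, it fits the read-only, post-run role described for the analyzer.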
Group D is a standalone scorer bug; verified against
benchmark/<system>/<category>/<id>/metadata.json ground-truth files.

A deterministic performance-fault localizer was prototyped and held back from
this PR pending separate validation; see WIP branch
wip/group-c-perf-localizer.

Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.