fix(bench): fixed vocabulary and prepared config for next experiment #2787
Greptile Summary
This PR fixes a vocabulary completeness bug in the cloudopsbench predictor where 33.7% of benchmark cells (714/2118) had a ground-truth
Confidence Score: 5/5
Safe to merge: the vocabulary expansion is a pure additive correction to a static tuple, and the YAML arm addition is isolated to experiment configuration with no runtime side effects on existing arms.
Both changes are narrow and well-justified: the vocabulary tuple is extended with services already present in the benchmark dataset, and the config YAML gains one extra mode that does not alter the existing two arms. No logic, data flow, or scoring paths are modified.
No files require special attention; the only open item is a stale header comment in the YAML config that could mislead experiment-design readers.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[LLM emits fault_object] --> B{snap_fault_object}
B --> C{Service in vocab?}
C -->|exact match| D[Return canonical service]
C -->|no match - fuzzy snap| E[Nearest in-vocab substitute]
E --> F[Wrong service scored - A1 zero on affected cells]
D --> G[Correct score]
subgraph Before - 19 services
E
F
end
subgraph After - 43 services
D
G
end
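
The fuzzy-snap path in the flowchart can be sketched roughly as follows. This is a minimal illustration, not the project's `snap_fault_object`: the function name `snap_to_vocab`, the three-entry `VOCAB` tuple, and the use of `difflib` are all assumptions made for the example. It shows why an out-of-vocabulary ground truth can never be predicted correctly when the snap always returns some in-vocab name.

```python
import difflib

# Hypothetical stand-in for the closed vocabulary; the real tuple lives in
# the predictor and is much longer. Entries beyond the first are illustrative.
VOCAB = ("ts-inside-payment-service", "ts-order-service", "ts-travel-service")


def snap_to_vocab(service: str, vocab=VOCAB) -> str:
    """Return `service` if it is in the vocabulary, else the closest in-vocab name."""
    if service in vocab:
        return service
    # cutoff=0.0 forces a substitute even for a poor match, which is the
    # failure mode described above: the result is always *some* vocab entry.
    return difflib.get_close_matches(service, vocab, n=1, cutoff=0.0)[0]


if __name__ == "__main__":
    print(snap_to_vocab("ts-order-service"))  # exact match, returned unchanged
    print(snap_to_vocab("tsdb-mysql"))        # out of vocab, snapped to a substitute
```

Under this sketch, extending the vocabulary is the only way an out-of-vocab service can ever be returned, which matches the additive nature of the fix.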

Fixes #2074
CaseFilters.seed silent drop + silent ImportError in adapter registry
llm_alone arm to the trimmed-prompt definitive config
seed bugs during review
Describe the changes you have made in this PR -
Root cause — vocabulary completeness
Loss-mode triage on the n=2118 case-level results from dev-2026-06-09T10-09-04Z_cloudopsbench (full N, trimmed prompt) found that 62% of opensre+llm Runtime losses are OBJECT_MISS (wrong service entirely). Drilling in: GT=app/tsdb-mysql accounts for 90 cells at A@1=0.00, because the predictor was substituting the nearest in-vocab service (ts-inside-payment-service) every time.

_FAULT_OBJECT_SERVICES listed 19 services. The dataset has 41 distinct fault_object values. The system prompt declares those 19 as "service is one of: ..." so the LLM treats it as a closed set. 714 of 2118 cells (33.7%) had a GT not in the vocab.

Predicted impact
The vocab fix is a scoring-artifact correction, not an opensre-side improvement. Both arms gain equally. The structural opensre-vs-LLM-alone question remains for the Tier 1 experiments scoped in the prior issue comment.
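
A vocabulary-coverage check of the kind that surfaces this root cause could look like the sketch below. The helper name `out_of_vocab_report` and the toy inputs are hypothetical; only the 714 / 2118 figures come from the triage above.

```python
from collections import Counter


def out_of_vocab_report(gt_services, vocab):
    """Count ground-truth services missing from the vocabulary.

    Returns (Counter of missing service -> cell count, share of all cells).
    """
    gt_list = list(gt_services)
    vocab_set = set(vocab)
    missing = Counter(s for s in gt_list if s not in vocab_set)
    share = sum(missing.values()) / len(gt_list) if gt_list else 0.0
    return missing, share


if __name__ == "__main__":
    # Toy data: four cells, two of them with a GT outside the vocabulary.
    toy_gt = ["tsdb-mysql", "ts-order-service", "tsdb-mysql", "ts-order-service"]
    toy_vocab = ("ts-order-service", "ts-inside-payment-service")
    missing, share = out_of_vocab_report(toy_gt, toy_vocab)
    print(missing.most_common(), f"{share:.1%}")

    # Arithmetic behind the reported figure: 714 of 2118 cells.
    print(f"{714 / 2118:.1%}")
```

Running a check like this against the dataset before each experiment would flag a closed-set vocabulary that has drifted behind the benchmark's `fault_object` values.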