feat(bench): structured-outputs predictor + overfit controls #2794
@greptile review
Greptile Summary
This PR introduces an OpenAI structured-outputs predictor variant for the CloudOpsBench harness, targeting the residual 24% predictor drift (OBJECT_HIT_RC_MISS) on the post-vocab-fix baseline. It also reverts a prior DB-localization prompt rule (confirmed net-negative) and first-classes five overfit-control guards (

Confidence Score: 5/5
Bench-only change with no production code paths touched; all previously flagged issues are fully resolved in this revision. The core new module is well-isolated, with a clear None-on-failure contract that preserves the existing keyword-bridge fallback, and its anti-overfit invariants are locked by 12 tests. The overfit guards are first-class code with thorough coverage, including the tricky single-stratum edge case. Only two minor documentation inconsistencies remain, neither of which affects runtime behavior. No files require special attention.

Reviews (8): Last reviewed commit: "fixed float division"
💼 Interviewer: describe a time you shipped something impactful. @YauhenBichel: points at this PR. Interviewer: you're hired. 🤝

Fixes #2074 (experiments)
Describe the changes you have made in this PR -
Adds an OpenAI structured-outputs predictor variant that constrains `root_cause` and `fault_taxonomy`. Targets the residual 24% predictor drift measured on the post-vocab-fix baseline.

Mechanism

OpenAI's `client.beta.chat.completions.parse()` with a Pydantic schema whose `root_cause` and `fault_taxonomy` are `Literal[...]` enums built programmatically from `vocabulary.py` (single source of truth with the scorer). The LLM cannot emit out-of-enum tokens; adjacent-vocabulary cells (e.g. `mysql_invalid_port → db_connection_exhaustion`) are blocked at the sampler, not at the parser.

Bench-only: production opensre is untouched. The dispatcher in `adapter.py` routes to this variant only when `predictor_variant: structured` is set AND the configured LLM is OpenAI (a cross-field lint enforces this).

Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
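As an illustration of the mechanism described above, here is a minimal sketch of deriving enum constraints from one shared vocabulary. It is a stand-in under stated assumptions, not the PR's code: the tuple contents (apart from the two root-cause tokens quoted in this description), and the names `enum_schema`, `prediction_schema`, and `is_in_vocabulary`, are hypothetical. It also builds a plain JSON schema with the standard library instead of a Pydantic model passed to `client.beta.chat.completions.parse()`.

```python
from typing import Literal, get_args

# Hypothetical stand-ins for the tuples in vocabulary.py; the real contents
# are not shown in this PR description.
ROOT_CAUSES = ("mysql_invalid_port", "db_connection_exhaustion")
TAXONOMY_CATEGORIES = ("example_category_a", "example_category_b")

# At runtime, subscripting Literal with a tuple is equivalent to listing
# its members, so the types are built from the vocabulary rather than typed by hand.
RootCause = Literal[ROOT_CAUSES]
FaultTaxonomy = Literal[TAXONOMY_CATEGORIES]


def enum_schema(literal_type) -> dict:
    """Return a JSON-schema fragment restricting a string to the Literal's members."""
    return {"type": "string", "enum": list(get_args(literal_type))}


def prediction_schema() -> dict:
    """Schema with both fields constrained to the shared vocabulary."""
    return {
        "type": "object",
        "properties": {
            "root_cause": enum_schema(RootCause),
            "fault_taxonomy": enum_schema(FaultTaxonomy),
        },
        "required": ["root_cause", "fault_taxonomy"],
        "additionalProperties": False,
    }


def is_in_vocabulary(prediction: dict) -> bool:
    """Check a parsed prediction against the same Literal members the schema uses."""
    return (
        prediction.get("root_cause") in get_args(RootCause)
        and prediction.get("fault_taxonomy") in get_args(FaultTaxonomy)
    )
```

Because the schema and the membership check both read from the same tuples, adding a token to the vocabulary updates both at once, which is the point of keeping a single source of truth with the scorer.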
Changes
- `predictor/llm_call.py`
- `predictor/llm_call_structured_openai.py` (NEW): `Literal` enums; cost-tracking via `_emit_usage`
- `predictor/__init__.py`: `emit_paper_predictions_structured`; documents multi-provider roadmap
- `_framework/config.py`: `predictor_variant: Literal["default", "structured"]` field + cross-field lint refusing non-OpenAI + non-cloudopsbench
- `cloudopsbench/adapter.py`: `self._predictor_variant` set in `apply_config_overrides`; dispatch in `score_case`
- `configs/cloudopsbench_structured_outputs_smoke_openai.yml` (NEW)
- `configs/cloudopsbench_structured_outputs_openai.yml` (NEW)
- `configs/preregistrations/exp_structured_outputs_v1.yml` (NEW)
- `tests/test_predictor_structured.py` (NEW)
- `scripts/overfit_attribution.py` (NEW)

Anti-overfit discipline (the differentiator)
Nine guards fire across four phases. Each one has a pre-registered threshold; the variant has to clear all of them to promote to default.
- `vocabulary._ROOT_CAUSES` / `_TAXONOMY_CATEGORIES`
- `is`-identity check: structured variant uses text variant's prompt builders by reference, not copy
- `predictor_variant=structured` with non-OpenAI LLMs or non-cloudopsbench
- `held_out_lift / optimize_lift ≥ 0.70` to ship; `< 0.30` rejects as overfit (BDIL Phase F; `held_out_seed=42`, never re-rolled)

The pre-reg locks decision rules BEFORE the run: the promote / null / reject-as-overfit / reject-as-regression branches are committed, so the call doesn't depend on post-hoc interpretation.
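To make the pre-registered decision rule concrete, here is a hedged sketch of how the four branches could be encoded. Only the two thresholds (`≥ 0.70` to ship, `< 0.30` rejects as overfit) come from this description. The function name `decide`, the treatment of ratios between the two thresholds as `null`, and the conditions for `reject-as-regression` and for non-positive `optimize_lift` are assumptions made for illustration; the actual rules live in the pre-registration file.

```python
PROMOTE_RATIO = 0.70   # from the PR description: ratio >= 0.70 ships
OVERFIT_RATIO = 0.30   # from the PR description: ratio < 0.30 rejects as overfit


def decide(optimize_lift: float, held_out_lift: float) -> str:
    """Map the two lifts to one of the four pre-registered branches.

    Assumptions (not stated in the PR text):
    - a negative held-out lift counts as a regression;
    - a non-positive optimize lift means there is nothing to attribute (null);
    - ratios in [0.30, 0.70) fall into the null branch.
    """
    if held_out_lift < 0:
        return "reject-as-regression"
    if optimize_lift <= 0:
        return "null"
    ratio = held_out_lift / optimize_lift
    if ratio >= PROMOTE_RATIO:
        return "promote"
    if ratio < OVERFIT_RATIO:
        return "reject-as-overfit"
    return "null"
```

Writing the rule as a pure function of the two lifts is one way to keep the outcome independent of post-hoc interpretation: the same inputs always produce the same branch.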
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.