
feat(bench): structured-outputs predictor + overfit controls#2794

Merged
YauhenBichel merged 10 commits into
mainfrom
fix/2074-bench-predictor-structured-outputs
Jun 10, 2026

Conversation

@YauhenBichel

Collaborator

Fixes #2074 (experiments)

Describe the changes you have made in this PR -

  • Add OpenAI structured-outputs predictor variant — grammar-constrained sampling at the API layer for root_cause and fault_taxonomy. Targets the residual 24% predictor drift measured on the post-vocab-fix baseline.
  • Revert the prior DB-localization prompt rule (negative result: aggregate A@1 -2.5pp, Runtime -6.9pp, DB over-firing 73%). Details in linked issue comment.
  • Land six overfit-control guards as first-class code so future bench experiments can't ship a hidden stratum-level regression as an aggregate "win".

Mechanism

The variant calls OpenAI's client.beta.chat.completions.parse() with a Pydantic schema whose root_cause and fault_taxonomy fields are Literal[...] enums built programmatically from vocabulary.py (a single source of truth shared with the scorer). The LLM cannot emit out-of-enum tokens; adjacent-vocabulary cells (e.g. mysql_invalid_port → db_connection_exhaustion) are blocked at the sampler, not at the parser.
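A minimal, stdlib-only sketch of deriving the enum from the vocabulary rather than retyping it. The `ROOT_CAUSES` tuple is a hypothetical stand-in for `vocabulary.py`; only the two labels quoted above come from this PR, and `is_in_vocabulary` is an illustrative helper, not a function from the repo.

```python
from typing import Literal, get_args

# Hypothetical stand-in for vocabulary.py. The first two labels appear in the
# PR description; the third is invented for illustration.
ROOT_CAUSES: tuple[str, ...] = (
    "mysql_invalid_port",
    "db_connection_exhaustion",
    "pod_oom_killed",
)

# Subscripting Literal with a tuple is equivalent to Literal["a", "b", ...],
# so the enum is derived from the vocabulary instead of being duplicated.
RootCause = Literal[ROOT_CAUSES]


def is_in_vocabulary(label: str) -> bool:
    """Return True only for labels the Literal type admits."""
    return label in get_args(RootCause)
```

In the PR, Literal types like this are fields of the Pydantic model handed to `parse()`, so the same set of values constrains the sampler and the scorer.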

Bench-only — production opensre is untouched. The dispatcher in adapter.py routes to this variant only when predictor_variant: structured is set AND the configured LLM is OpenAI (cross-field lint enforces).
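The cross-field rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in config.py: `BenchConfig`, `lint_predictor_variant`, and the prefix list are hypothetical (the review mentions o1/o3/o4 prefixes; `gpt-` is a guess).

```python
from dataclasses import dataclass

# Assumed model-name prefixes treated as OpenAI. The o-series entries are
# mentioned in the review; "gpt-" is an assumption for this sketch.
OPENAI_MODEL_PREFIXES = ("gpt-", "o1", "o3", "o4")


@dataclass(frozen=True)
class BenchConfig:
    """Hypothetical subset of the bench config fields."""
    benchmark: str
    model: str
    predictor_variant: str = "default"


def lint_predictor_variant(cfg: BenchConfig) -> list[str]:
    """Return lint errors; an empty list means the config may dispatch."""
    errors: list[str] = []
    if cfg.predictor_variant not in ("default", "structured"):
        errors.append(f"unknown predictor_variant: {cfg.predictor_variant!r}")
    if cfg.predictor_variant == "structured":
        # str.startswith accepts a tuple of prefixes.
        if not cfg.model.startswith(OPENAI_MODEL_PREFIXES):
            errors.append("structured variant requires an OpenAI model")
        if cfg.benchmark != "cloudopsbench":
            errors.append("structured variant is cloudopsbench-only")
    return errors
```

Returning a list of errors rather than raising on the first one lets a lint pass report both violations for a config that breaks both rules.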


Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

Changes

| File | Change |
| --- | --- |
| predictor/llm_call.py | Revert DB-localization prompt rule |
| predictor/llm_call_structured_openai.py | (NEW) OpenAI structured-outputs predictor; Pydantic schema with vocabulary-derived Literal enums; cost-tracking via _emit_usage |
| predictor/`__init__.py` | Re-export emit_paper_predictions_structured; documents multi-provider roadmap |
| _framework/config.py | predictor_variant: Literal["default", "structured"] field + cross-field lint refusing non-OpenAI + non-cloudopsbench |
| cloudopsbench/adapter.py | self._predictor_variant set in apply_config_overrides; dispatch in score_case |
| configs/cloudopsbench_structured_outputs_smoke_openai.yml | (NEW) 40-case three-arm smoke |
| configs/cloudopsbench_structured_outputs_openai.yml | (NEW) Full-N three-arm |
| configs/preregistrations/exp_structured_outputs_v1.yml | (NEW) Pre-reg with 6 overfit guards + decision matrix |
| tests/test_predictor_structured.py | (NEW) 12 invariant tests: schema/vocabulary parity, shared-prompt identity, dispatch routing, fallback behavior |
| scripts/overfit_attribution.py | (NEW) Runtime per-system / per-stratum / cluster / held-out analysis |

Anti-overfit discipline (the differentiator)

Nine guards fire across four phases. Each one has a pre-registered threshold; the variant has to clear all of them to promote to default.

| Phase | Guard | Mechanism |
| --- | --- | --- |
| Pre-commit (CI) | Schema-vocabulary invariant | Pydantic enum == vocabulary._ROOT_CAUSES / _TAXONOMY_CATEGORIES |
| Pre-commit (CI) | Shared prompt invariant | is-identity check: structured variant uses text variant's prompt builders by reference, not copy |
| Pre-dispatch | Cross-field lint | Refuses predictor_variant=structured with non-OpenAI LLMs or non-cloudopsbench |
| Smoke (n=40) | OBJECT_HIT_RC_MISS share drop | Direct mechanism check before paying for full-N |
| Full-N analysis | Per-system uniformity | boutique / trainticket lift spread ≤ 0.05 |
| Full-N analysis | Per-stratum uniformity | max-stratum-lift / median-stratum-lift ≤ 2× |
| Full-N analysis | Per-case attribution clustering | No single (system, category, GT-service-prefix) cluster owns > 60% of loss→win flips |
| Full-N analysis | Held-out generalization gate | held_out_lift / optimize_lift ≥ 0.70 to ship; < 0.30 rejects as overfit (BDIL Phase F; held_out_seed=42, never re-rolled) |
| Full-N analysis | A/A consistency | Two-seed aggregate diff < 0.02 (bounds the noise floor) |
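As a rough illustration of how three of the Full-N thresholds above could be checked, here is a sketch under assumptions. The function names and input shapes are hypothetical and do not mirror overfit.py; the single-stratum and non-positive-median branches are choices made for this example only.

```python
from collections import Counter
from statistics import median


def per_system_spread_ok(lift_by_system: dict[str, float], max_spread: float = 0.05) -> bool:
    """Lift may differ between systems by at most max_spread."""
    lifts = list(lift_by_system.values())
    return max(lifts) - min(lifts) <= max_spread


def per_stratum_ratio_ok(lift_by_stratum: dict[str, float], max_ratio: float = 2.0) -> bool:
    """The best stratum may not exceed max_ratio times the median stratum."""
    lifts = list(lift_by_stratum.values())
    if len(lifts) < 2:
        return True  # assumption: one stratum cannot be non-uniform
    med = median(lifts)
    if med <= 0:
        return False  # assumption: ratio is undefined, so fail closed
    return max(lifts) / med <= max_ratio


def cluster_concentration_ok(flips: list[tuple[str, str, str]], max_share: float = 0.60) -> bool:
    """No (system, category, service-prefix) cluster may own > max_share of flips."""
    if not flips:
        return True
    top_count = Counter(flips).most_common(1)[0][1]
    return top_count / len(flips) <= max_share
```

Each check returns a plain boolean so a report can list which guard failed instead of collapsing them into one aggregate verdict.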

The pre-reg locks decision rules BEFORE the run — promote / null / reject-as-overfit / reject-as-regression branches are committed, so the call doesn't depend on post-hoc interpretation.
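A pre-committed rule of this kind can be written as a pure function of the measured lifts, so the verdict is fixed before any numbers exist. The sketch below covers only the held-out gate and the A/A check, using the thresholds stated in the guard table; the outcome labels `no_lift` and `inconclusive` are assumptions for this example, since the PR does not spell out what happens between 0.30 and 0.70.

```python
def held_out_gate(
    optimize_lift: float,
    held_out_lift: float,
    ship_ratio: float = 0.70,
    reject_ratio: float = 0.30,
) -> str:
    """Map the held-out / optimize lift ratio onto a pre-committed outcome."""
    if optimize_lift <= 0:
        return "no_lift"  # assumption: nothing to generalize from
    ratio = held_out_lift / optimize_lift
    if ratio >= ship_ratio:
        return "ship"
    if ratio < reject_ratio:
        return "reject_as_overfit"
    return "inconclusive"  # assumption: the middle band is not defined in the PR


def aa_consistent(aggregate_seed_a: float, aggregate_seed_b: float, max_diff: float = 0.02) -> bool:
    """Two identical-config runs must agree to within the noise floor."""
    return abs(aggregate_seed_a - aggregate_seed_b) < max_diff
```

Because the function takes only the two lifts and fixed thresholds, there is no parameter left to tune after seeing the results.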


Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Contributor

Greptile Summary

This PR introduces an OpenAI structured-outputs predictor variant for the CloudOpsBench harness, targeting the residual 24% predictor drift (OBJECT_HIT_RC_MISS) on the post-vocab-fix baseline. It also reverts a prior DB-localization prompt rule (confirmed net-negative) and promotes five overfit-control guards (overfit.py), with their test suite, to first-class framework code.

  • New predictor (llm_call_structured_openai.py): uses client.beta.chat.completions.parse() with Pydantic Literal enums built programmatically from vocabulary.py; prompts are shared by reference with the text predictor (invariant enforced by test); no API-level seed= to keep replicate variance meaningful.
  • Overfit guards (overfit.py): five independent guards (per-system uniformity, per-stratum uniformity, cluster concentration, held-out generalization, A/A consistency) now live as first-class framework code, each with pre-registered thresholds and an exhaustive test suite.
  • Config lint (config.py): predictor_variant field + cross-field validator that rejects structured for non-OpenAI or non-cloudopsbench configs, now including o-series prefixes.
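The "shared by reference" invariant in the first bullet rests on Python's `is` operator: a copied prompt builder can return identical text today and still drift later, while a re-bound name cannot. A small sketch of the distinction, with hypothetical function names that are not taken from the repo:

```python
from typing import Callable


def build_system_prompt() -> str:
    """Stand-in for the text predictor's prompt builder."""
    return "You are an SRE assistant. Pick exactly one root cause."


# Shared by reference: the structured variant re-binds the same function object.
structured_build_system_prompt = build_system_prompt


def copied_build_system_prompt() -> str:
    """A copy with identical output today, free to drift tomorrow."""
    return "You are an SRE assistant. Pick exactly one root cause."


def shares_prompt_builder(candidate: Callable[[], str], reference: Callable[[], str]) -> bool:
    """Identity check: True only when both names point at one function object."""
    return candidate is reference
```

An equality check on the rendered prompt would pass for the copy as well, which is why an identity check is the stricter invariant.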

Confidence Score: 5/5

Bench-only change with no production code paths touched; all previously flagged issues are fully resolved in this revision.

The core new module is well-isolated with a clear None-on-failure contract preserving the existing keyword-bridge fallback, and its anti-overfit invariants are locked by 12 tests. The overfit guards are first-class code with thorough coverage including the tricky single-stratum edge case. Only two minor documentation inconsistencies remain, neither of which affects runtime behavior.
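The None-on-failure contract mentioned above can be pictured like this. It is a sketch under assumptions: `safe_structured`, `predict_with_fallback`, and the stub predictors are hypothetical names, and the real keyword-bridge fallback is not reproduced here.

```python
from typing import Callable, Optional

Prediction = dict[str, str]


def safe_structured(
    evidence: str, call: Callable[[str], Prediction]
) -> Optional[Prediction]:
    """Run the structured predictor; any failure becomes None, never an exception."""
    try:
        return call(evidence)
    except Exception:
        return None


def predict_with_fallback(
    evidence: str,
    structured: Callable[[str], Prediction],
    fallback: Callable[[str], Prediction],
) -> Prediction:
    """Use the structured result when present, otherwise the existing fallback."""
    result = safe_structured(evidence, structured)
    return result if result is not None else fallback(evidence)


# Stub predictors for illustration only.
def failing_structured(evidence: str) -> Prediction:
    raise RuntimeError("simulated API error")


def working_structured(evidence: str) -> Prediction:
    return {"root_cause": "mysql_invalid_port", "source": "structured"}


def keyword_fallback(evidence: str) -> Prediction:
    return {"root_cause": "db_connection_exhaustion", "source": "fallback"}
```

Keeping the failure signal as `None` means the caller's fallback path is the same one it used before the structured variant existed.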

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/predictor/llm_call_structured_openai.py | New structured-outputs predictor: Pydantic schema with vocabulary-derived Literal enums, no hardcoded seed, prompts imported by reference from llm_call.py, cost tracking via _emit_usage, graceful None-on-failure contract preserved. |
| tests/benchmarks/_framework/overfit.py | Five overfit guards (A–E) as first-class framework code; Guard C correctly uses scenario-level mean aggregation (fixing the prior run_index collapse), Guard B has explicit single-stratum branch. Minor stale docstring in OverfitReport. |
| tests/benchmarks/_framework/tests/test_overfit.py | Comprehensive test suite for all five guards; previously flagged float exact-equality is fixed with pytest.approx; multi-replicate and partial-rescue scenarios are explicitly covered. |
| tests/benchmarks/cloudopsbench/adapter.py | Dispatch logic in format_final_answer correctly routes to the structured predictor when _predictor_variant=="structured", forwarding run.model_version; None fallback on failure preserved. |
| tests/benchmarks/_framework/config.py | predictor_variant field and cross-field validator added; o-series prefixes (o1, o3, o4) included; stale filename corrected to llm_call_structured_openai.py. |
| tests/benchmarks/cloudopsbench/predictor/`__init__.py` | Re-exports updated to include emit_paper_predictions_structured; module docstring on line 5 still references the old planned filename llm_call_structured.py instead of the actual llm_call_structured_openai.py. |
| tests/benchmarks/cloudopsbench/tests/test_predictor_structured.py | 12 invariant tests covering schema-vocabulary parity, off-vocab rejection, shared-prompt identity, happy-path shape, failure-fallback, and taxonomy-override behavior. |
| tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml | Pre-registration with locked decision rules, six overfit guards with thresholds, adversarial pre-mortem, and explicit will-not-do list; held_out_seed=42 matches overfit.py constant. |

Reviews (8): Last reviewed commit: "fixed float division" | Re-trigger Greptile

Comment thread tests/benchmarks/_framework/config.py Outdated
Comment thread tests/benchmarks/_framework/config.py Outdated
Comment thread tests/benchmarks/_framework/tests/test_overfit.py Fixed
Comment thread tests/benchmarks/_framework/tests/test_overfit.py Fixed
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

Comment thread tests/benchmarks/_framework/overfit.py
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

Comment thread tests/benchmarks/_framework/tests/test_overfit.py Outdated
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 10, 2026 12:01
@YauhenBichel YauhenBichel merged commit 1ba680c into main Jun 10, 2026
20 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-predictor-structured-outputs branch June 10, 2026 20:15
@github-actions

Contributor

💼 Interviewer: describe a time you shipped something impactful.

@YauhenBichel: points at this PR

Interviewer: you're hired. 🤝


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.



Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

2 participants