feat(bench): structured-outputs predictor + overfit controls by YauhenBichel · Pull Request #2794 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-10T09:54:22Z

Fixes #2074 (experiments)

Describe the changes you have made in this PR -

Add OpenAI structured-outputs predictor variant — grammar-constrained sampling at the API layer for root_cause and fault_taxonomy. Targets the residual 24% predictor drift measured on the post-vocab-fix baseline.
Revert the prior DB-localization prompt rule (negative result: aggregate A@1 -2.5pp, Runtime -6.9pp, DB over-firing 73%). Details in linked issue comment.
Land six overfit-control guards as first-class code so future bench experiments can't ship a hidden stratum-level regression as an aggregate "win".

Mechanism

Uses OpenAI's client.beta.chat.completions.parse() with a Pydantic schema whose root_cause and fault_taxonomy are Literal[...] enums built programmatically from vocabulary.py (single source of truth with the scorer). The LLM cannot emit out-of-enum values; adjacent-vocabulary cells (e.g. mysql_invalid_port → db_connection_exhaustion) are blocked at the sampler, not at the parser.
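
A minimal sketch of the enum-building idea described above, assuming Pydantic is installed. The vocabulary tuples, the taxonomy labels, and the `Prediction` model name are hypothetical stand-ins, not the contents of vocabulary.py or the PR's actual schema. In the PR the schema is handed to client.beta.chat.completions.parse(); here the model is only validated locally to show that off-vocabulary labels are rejected.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

# Hypothetical stand-ins for vocabulary._ROOT_CAUSES / _TAXONOMY_CATEGORIES.
ROOT_CAUSES = ("mysql_invalid_port", "db_connection_exhaustion")
TAXONOMY_CATEGORIES = ("database", "runtime")

# Literal[("a", "b")] is the same type as Literal["a", "b"], so the enum
# follows the vocabulary tuple without a hand-maintained copy.
RootCause = Literal[ROOT_CAUSES]
FaultTaxonomy = Literal[TAXONOMY_CATEGORIES]


class Prediction(BaseModel):
    """Schema whose two label fields only accept vocabulary members."""

    root_cause: RootCause
    fault_taxonomy: FaultTaxonomy


def is_in_vocabulary(root_cause: str, fault_taxonomy: str) -> bool:
    """Return True when both labels pass schema validation."""
    try:
        Prediction(root_cause=root_cause, fault_taxonomy=fault_taxonomy)
    except ValidationError:
        return False
    return True
```

Because the `Literal` types are derived from the tuples, adding a label to the vocabulary changes the schema without a second edit.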

Bench-only — production opensre is untouched. The dispatcher in adapter.py routes to this variant only when predictor_variant: structured is set AND the configured LLM is OpenAI (cross-field lint enforces).
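
The routing condition can be pictured roughly as below. This is a sketch under assumptions: `BenchConfig`, its field names, and `lint_predictor_variant` are hypothetical names chosen for illustration, not the code in config.py or adapter.py.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class BenchConfig:
    """Hypothetical slice of a bench config with the fields the lint needs."""

    benchmark: str
    llm_provider: str
    predictor_variant: Literal["default", "structured"] = "default"


def lint_predictor_variant(cfg: BenchConfig) -> list[str]:
    """Return cross-field errors; an empty list means the config is accepted."""
    errors: list[str] = []
    if cfg.predictor_variant != "structured":
        return errors  # the default variant has no cross-field constraints
    if cfg.llm_provider != "openai":
        errors.append("predictor_variant=structured requires an OpenAI LLM")
    if cfg.benchmark != "cloudopsbench":
        errors.append("predictor_variant=structured is cloudopsbench-only")
    return errors


def use_structured_predictor(cfg: BenchConfig) -> bool:
    """Dispatch only when the variant is requested and the lint is clean."""
    return cfg.predictor_variant == "structured" and not lint_predictor_variant(cfg)
```

Returning a list of messages rather than raising on the first problem lets a config with several mistakes report all of them in one pass.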

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

Changes

File	Change
`predictor/llm_call.py`	Revert DB-localization prompt rule
`predictor/llm_call_structured_openai.py` (NEW)	OpenAI structured-outputs predictor; Pydantic schema with vocabulary-derived `Literal` enums; cost-tracking via `_emit_usage`
`predictor/__init__.py`	Re-export `emit_paper_predictions_structured`; documents multi-provider roadmap
`_framework/config.py`	`predictor_variant: Literal["default", "structured"]` field + cross-field lint refusing non-OpenAI + non-cloudopsbench
`cloudopsbench/adapter.py`	`self._predictor_variant` set in `apply_config_overrides`; dispatch in `score_case`
`configs/cloudopsbench_structured_outputs_smoke_openai.yml` (NEW)	40-case three-arm smoke
`configs/cloudopsbench_structured_outputs_openai.yml` (NEW)	Full-N three-arm
`configs/preregistrations/exp_structured_outputs_v1.yml` (NEW)	Pre-reg with 6 overfit guards + decision matrix
`tests/test_predictor_structured.py` (NEW)	12 invariant tests — schema/vocabulary parity, shared-prompt identity, dispatch routing, fallback behavior
`scripts/overfit_attribution.py` (NEW)	Runtime per-system / per-stratum / cluster / held-out analysis

Anti-overfit discipline (the differentiator)

Nine guards fire across four phases. Each one has a pre-registered threshold; the variant has to clear all of them to promote to default.

Phase	Guard	Mechanism
Pre-commit (CI)	Schema-vocabulary invariant	Pydantic enum == `vocabulary._ROOT_CAUSES` / `_TAXONOMY_CATEGORIES`
Pre-commit (CI)	Shared prompt invariant	`is`-identity check — structured variant uses text variant's prompt builders by reference, not copy
Pre-dispatch	Cross-field lint	Refuses `predictor_variant=structured` with non-OpenAI LLMs or non-cloudopsbench
Smoke (n=40)	OBJECT_HIT_RC_MISS share drop	Direct mechanism check before paying for full-N
Full-N analysis	Per-system uniformity	boutique / trainticket lift spread ≤ 0.05
Full-N analysis	Per-stratum uniformity	max-stratum-lift / median-stratum-lift ≤ 2×
Full-N analysis	Per-case attribution clustering	No single (system, category, GT-service-prefix) cluster owns > 60% of loss→win flips
Full-N analysis	Held-out generalization gate	`held_out_lift / optimize_lift ≥ 0.70` to ship; `< 0.30` rejects as overfit (BDIL Phase F; `held_out_seed=42`, never re-rolled)
Full-N analysis	A/A consistency	Two-seed aggregate diff < 0.02 (bounds the noise floor)
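
The four threshold-style Full-N checks in the table can be sketched as small pure functions. The thresholds come from the table; the function names, input shapes, and edge-case handling (single stratum, non-positive median, no flips) are assumptions and may differ from overfit.py.

```python
from collections import Counter
from statistics import median


def per_system_uniform(lift_by_system: dict[str, float], max_spread: float = 0.05) -> bool:
    """Lift spread across systems must stay within max_spread."""
    lifts = list(lift_by_system.values())
    return max(lifts) - min(lifts) <= max_spread


def per_stratum_uniform(lift_by_stratum: dict[str, float], max_ratio: float = 2.0) -> bool:
    """Largest stratum lift may be at most max_ratio times the median lift."""
    lifts = list(lift_by_stratum.values())
    if len(lifts) < 2:
        return True  # assumption: a single stratum cannot be non-uniform
    mid = median(lifts)
    if mid <= 0:
        return False  # assumption: a non-positive median lift fails the guard
    return max(lifts) / mid <= max_ratio


def cluster_concentration_ok(
    flip_clusters: list[tuple[str, str, str]], max_share: float = 0.60
) -> bool:
    """No (system, category, GT-service-prefix) cluster may exceed max_share of flips."""
    if not flip_clusters:
        return True  # assumption: no loss-to-win flips, nothing to concentrate
    top_count = Counter(flip_clusters).most_common(1)[0][1]
    return top_count / len(flip_clusters) <= max_share


def aa_consistent(aggregate_seed_a: float, aggregate_seed_b: float, max_diff: float = 0.02) -> bool:
    """Two runs of the same arm must differ by less than max_diff."""
    return abs(aggregate_seed_a - aggregate_seed_b) < max_diff
```

Each function returns a plain boolean, so a promotion check can require `all(...)` over the guards without any one of them being able to offset another.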

The pre-reg locks decision rules BEFORE the run — promote / null / reject-as-overfit / reject-as-regression branches are committed, so the call doesn't depend on post-hoc interpretation.
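
One way to picture a rule that is locked before the run is a frozen object holding the thresholds, with the verdict computed only from measured lifts. The sketch below covers the held-out gate alone. The 0.70 and 0.30 ratios come from the guard table; the class name, the "inconclusive" label for the middle band, and the treatment of a non-positive optimize lift as "null" are assumptions, since the pre-registration's actual decision matrix is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeldOutGate:
    """Thresholds fixed at construction; frozen so they cannot be edited later."""

    ship_ratio: float = 0.70
    overfit_ratio: float = 0.30

    def verdict(self, optimize_lift: float, held_out_lift: float) -> str:
        """Map the two measured lifts onto a pre-committed branch name."""
        if optimize_lift <= 0:
            return "null"  # assumption: no lift on the optimize split, nothing to promote
        ratio = held_out_lift / optimize_lift
        if ratio >= self.ship_ratio:
            return "ship"
        if ratio < self.overfit_ratio:
            return "reject_as_overfit"
        return "inconclusive"  # assumption: the source does not name the middle band
```

For example, `HeldOutGate().verdict(0.10, 0.02)` gives a ratio of 0.2 and returns "reject_as_overfit", whatever the aggregate number looked like on the optimize split.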

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-10T09:54:31Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-10T09:54:42Z

@greptile review

greptile-apps · 2026-06-10T10:00:19Z

Greptile Summary

This PR introduces an OpenAI structured-outputs predictor variant for the CloudOpsBench harness, targeting the residual 24% predictor drift (OBJECT_HIT_RC_MISS) on the post-vocab-fix baseline. It also reverts a prior DB-localization prompt rule (confirmed net-negative) and promotes five overfit-control guards (overfit.py), together with their test suite, to first-class framework code.

New predictor (llm_call_structured_openai.py): uses client.beta.chat.completions.parse() with Pydantic Literal enums built programmatically from vocabulary.py; prompts are shared by reference with the text predictor (invariant enforced by test); no API-level seed= to keep replicate variance meaningful.
Overfit guards (overfit.py): five independent guards (per-system uniformity, per-stratum uniformity, cluster concentration, held-out generalization, A/A consistency) now live as first-class framework code, each with pre-registered thresholds and an exhaustive test suite.
Config lint (config.py): predictor_variant field + cross-field validator that rejects structured for non-OpenAI or non-cloudopsbench configs, now including o-series prefixes.

Confidence Score: 5/5

Bench-only change with no production code paths touched; all previously flagged issues are fully resolved in this revision.

The core new module is well-isolated with a clear None-on-failure contract preserving the existing keyword-bridge fallback, and its anti-overfit invariants are locked by 12 tests. The overfit guards are first-class code with thorough coverage including the tricky single-stratum edge case. Only two minor documentation inconsistencies remain, neither of which affects runtime behavior.
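
The None-on-failure contract mentioned above can be illustrated as follows. This is a sketch, not the module's code: `guarded`, `predict`, and the two example predictors are hypothetical, and `keyword_bridge` only stands in for the existing fallback path.

```python
from typing import Callable, Optional

Prediction = dict[str, str]


def guarded(predictor: Callable[[str], Prediction]) -> Callable[[str], Optional[Prediction]]:
    """Wrap a predictor so any failure becomes None instead of an exception."""

    def wrapper(evidence: str) -> Optional[Prediction]:
        try:
            return predictor(evidence)
        except Exception:
            return None

    return wrapper


def predict(
    evidence: str,
    structured: Callable[[str], Optional[Prediction]],
    fallback: Callable[[str], Prediction],
) -> Prediction:
    """Use the structured result when present, otherwise the fallback."""
    result = structured(evidence)
    return result if result is not None else fallback(evidence)


def failing_structured(evidence: str) -> Prediction:
    """Hypothetical structured predictor whose API call fails."""
    raise RuntimeError("simulated API failure")


def keyword_bridge(evidence: str) -> Prediction:
    """Hypothetical stand-in for the existing keyword-based fallback."""
    label = "mysql_invalid_port" if "port" in evidence else "db_connection_exhaustion"
    return {"root_cause": label}
```

Keeping the failure signal as `None` means the caller needs no knowledge of which exceptions the API client can raise; it only has to check for a missing result.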

No files require special attention.

Important Files Changed

Filename	Overview
tests/benchmarks/cloudopsbench/predictor/llm_call_structured_openai.py	New structured-outputs predictor: Pydantic schema with vocabulary-derived Literal enums, no hardcoded seed, prompts imported by reference from llm_call.py, cost tracking via _emit_usage, graceful None-on-failure contract preserved.
tests/benchmarks/_framework/overfit.py	Five overfit guards (A–E) as first-class framework code; Guard C correctly uses scenario-level mean aggregation (fixing the prior run_index collapse), Guard B has explicit single-stratum branch. Minor stale docstring in OverfitReport.
tests/benchmarks/_framework/tests/test_overfit.py	Comprehensive test suite for all five guards; previously flagged float exact-equality is fixed with pytest.approx; multi-replicate and partial-rescue scenarios are explicitly covered.
tests/benchmarks/cloudopsbench/adapter.py	Dispatch logic in format_final_answer correctly routes to the structured predictor when _predictor_variant=="structured", forwarding run.model_version; None fallback on failure preserved.
tests/benchmarks/_framework/config.py	predictor_variant field and cross-field validator added; o-series prefixes (o1, o3, o4) included; stale filename corrected to llm_call_structured_openai.py.
tests/benchmarks/cloudopsbench/predictor/__init__.py	Re-exports updated to include emit_paper_predictions_structured; module docstring on line 5 still references the old planned filename llm_call_structured.py instead of the actual llm_call_structured_openai.py.
tests/benchmarks/cloudopsbench/tests/test_predictor_structured.py	12 invariant tests covering schema-vocabulary parity, off-vocab rejection, shared-prompt identity, happy-path shape, failure-fallback, and taxonomy-override behavior.
tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml	Pre-registration with locked decision rules, six overfit guards with thresholds, adversarial pre-mortem, and explicit will-not-do list; held_out_seed=42 matches overfit.py constant.
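
The Guard C note on scenario-level mean aggregation suggests averaging replicate runs per scenario before arms are compared. How overfit.py does this is not shown on this page; the sketch below is one plausible reading, with hypothetical names and a hypothetical row shape of (scenario_id, run_index, score). The tolerance-based comparison uses the standard library's math.isclose in the same spirit as the pytest.approx fix noted for the test suite.

```python
import math
from collections import defaultdict
from statistics import fmean


def scenario_means(rows: list[tuple[str, int, float]]) -> dict[str, float]:
    """Average replicate scores per scenario so run_index no longer matters."""
    scores_by_scenario: dict[str, list[float]] = defaultdict(list)
    for scenario_id, _run_index, score in rows:
        scores_by_scenario[scenario_id].append(score)
    return {sid: fmean(scores) for sid, scores in scores_by_scenario.items()}


def is_loss_to_win_flip(baseline_mean: float, variant_mean: float) -> bool:
    """Assumption: a flip is a scenario the baseline fully misses and the variant does not."""
    return math.isclose(baseline_mean, 0.0, abs_tol=1e-9) and variant_mean > 0.0
```

Averaging first keeps a scenario with three replicates from counting three times, which is the kind of collapse the review note describes as fixed.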

Reviews (8): Last reviewed commit: "fixed float division"

YauhenBichel · 2026-06-10T10:22:05Z

@greptile review

YauhenBichel · 2026-06-10T10:37:56Z

@greptile review

YauhenBichel · 2026-06-10T10:56:53Z

@greptile review

YauhenBichel · 2026-06-10T11:09:06Z

@greptile review

YauhenBichel · 2026-06-10T11:29:49Z

@greptile review

YauhenBichel · 2026-06-10T11:43:50Z

@greptile review

github-actions · 2026-06-10T20:15:20Z

💼 Interviewer: describe a time you shipped something impactful.

@YauhenBichel: points at this PR

Interviewer: you're hired. 🤝

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

full experiment package (revert + new variant + overfit controls)

ef51daa

greptile-apps Bot reviewed Jun 10, 2026

Comment thread tests/benchmarks/cloudopsbench/predictor/llm_call_structured_openai.py

Comment thread tests/benchmarks/_framework/config.py Outdated

Comment thread tests/benchmarks/_framework/config.py Outdated

added overfit into bench framework

a9c2b0a

github-code-quality Bot found potential problems Jun 10, 2026

Comment thread tests/benchmarks/_framework/tests/test_overfit.py Fixed

github-advanced-security AI found potential problems Jun 10, 2026

Comment thread tests/benchmarks/_framework/tests/test_overfit.py Fixed

YauhenBichel added 2 commits June 10, 2026 11:12

fix(bench): address greptile review on structured-outputs PR

bc52607

fixed lint issues

3ec6cab

fixed notes

dd3cda8

fixed A/A variant issue

2e18e92

greptile-apps Bot reviewed Jun 10, 2026

Comment thread tests/benchmarks/_framework/overfit.py

fixed greptile note

ed32633

fixed description in config

bc9b85b

greptile-apps Bot reviewed Jun 10, 2026

Comment thread tests/benchmarks/_framework/tests/test_overfit.py Outdated

fixed float division

284a7c2

YauhenBichel marked this pull request as ready for review June 10, 2026 12:01

the same experiment but for N=100

f8fce04

YauhenBichel merged commit 1ba680c into main Jun 10, 2026
20 checks passed

YauhenBichel deleted the fix/2074-bench-predictor-structured-outputs branch June 10, 2026 20:15
