
feat(bench): DB-evidence pipeline (planner rule + B1 handoff) #2796

Merged
YauhenBichel merged 5 commits into main from fix/2074-bench-admission-traige
Jun 11, 2026

Conversation

@YauhenBichel YauhenBichel commented Jun 10, 2026

Collaborator

Fixes #2074

Describe the changes you have made in this PR -

  • Add a coordinated pair of mechanisms targeting the 50.6% DB-localization share of Runtime opensre+llm losses identified in the post-exp_structured_outputs triage.

  • Layer 1: dependency-traversal planner rule in BenchInvestigationAgentTrimmedPrompt's system prompt — expands what evidence the investigation gathers when DB-shaped symptoms appear; does NOT bias localization.

  • Layer 2: B1 investigation-handoff post-pass on top_3_predictions — promotes rank-2 to rank-1 when the investigation prose supports it better (token-overlap scoring with _MIN_PROMOTION_SCORE = 2). Catches the runtime/56 translation-loss class where the predictor LLM re-diagnoses from the alert and buries the right answer at rank-2.

  • Ship together — the two layers form an evidence-collection → evidence-translation pipeline. Smoke (cloudopsbench_db_pod_logs_smoke_openai.yml) tests their JOINT effect; layer-1-only ablation deferred to follow-up if smoke advances to full-N.

  • Amend the locked exp_structured_outputs_v1.yml pre-registration after the n=100 smoke revealed the OBJECT_HIT_RC_MISS share gate was conceptually flawed (measures a transitional artifact, not mechanism effectiveness).

  • Replace one indirect gate with two direct mechanism checks (aggregate A@1 lift + per-pattern mysql confusion reduction) — tightening, not loosening.

  • Document the Admission stratum -15pp regression as expected mechanism cost of grammar-constrained sampling on the namespace_*_quota_exceeded family — pre-commit to acknowledging the cost so it can't be hidden in the final report.

  • Single-file change — only tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml. No code changes.
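The Layer 2 promotion rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in investigation_handoff.py: only `_MIN_PROMOTION_SCORE = 2` and the "promote rank-2 when the prose supports it better" behavior come from the description. The tokenizer, the `root_cause` field name, and the helper names `overlap_score` and `promote_rank2` are assumptions made for the example.

```python
from __future__ import annotations

import re

_MIN_PROMOTION_SCORE = 2  # threshold quoted in the PR description


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores and hyphens act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(prediction: dict, prose: str) -> int:
    """Count distinct prediction tokens that also occur in the investigation prose."""
    # Field names are hypothetical; the real prediction schema may differ.
    pred_tokens = _tokens(f"{prediction['fault_object']} {prediction['root_cause']}")
    return len(pred_tokens & _tokens(prose))


def promote_rank2(top_3: list[dict], prose: str) -> list[dict]:
    """Swap rank-2 into rank-1 only if it strictly outscores rank-1
    and reaches the minimum score; otherwise return the list unchanged."""
    if len(top_3) < 2 or not prose.strip():
        return top_3
    s1 = overlap_score(top_3[0], prose)
    s2 = overlap_score(top_3[1], prose)
    if s2 > s1 and s2 >= _MIN_PROMOTION_SCORE:
        return [top_3[1], top_3[0], *top_3[2:]]
    return top_3


# Illustrative data only, not taken from the benchmark.
top_3 = [
    {"fault_object": "frontend", "root_cause": "pod_crash"},
    {"fault_object": "mysql", "root_cause": "connection_pool_exhausted"},
    {"fault_object": "cart", "root_cause": "network_delay"},
]
prose = "Pod logs show the mysql connection pool exhausted before frontend errors began."
promoted = promote_rank2(top_3, prose)
```

With this data rank-1 overlaps on two tokens and rank-2 on four, so the mysql prediction moves to the top. An empty summary leaves the ranking untouched.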

Why this isn't a re-test of the rejected DB-localization rule

The prompt rule rejected on 2026-06-09 operated at the labeling layer (fault_object MUST be the DB) and over-fired 73% of the time. The two layers in this PR are structurally distinct:

| Prior rejected rule | This PR |
| --- | --- |
| Forced an outcome regardless of evidence | Layer 1 adds evidence; Layer 2 promotes only when prose actually supports the alt |
| Over-fired because trigger ("DB-shaped symptoms") was too broad → false DB predictions | Trigger is broad, but Layer 1's action is "gather evidence" (token cost only) and Layer 2's action is "re-rank within already-emitted top-3"; neither can localize to a service the predictor didn't already propose as a candidate |
| Decision signal | Smoke's (d) no-over-firing check (FP rate ≤ 20%) catches it directly if either layer regresses to the prior failure mode |
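One way the no-over-firing check could be computed is sketched below. Only the 20% ceiling comes from the comparison above. How a false positive is defined here (a case where the handoff fired but the final rank-1 is still wrong), and the names `CaseResult`, `false_positive_rate`, and `passes_over_firing_check`, are assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_FP_RATE = 0.20  # ceiling quoted for the (d) no-over-firing check


@dataclass(frozen=True)
class CaseResult:
    promoted: bool       # did the handoff change rank-1 for this case?
    rank1_correct: bool  # does the final rank-1 match ground truth?


def false_positive_rate(results: list[CaseResult]) -> float:
    """Share of fired cases whose final rank-1 is wrong (assumed FP definition)."""
    fired = [r for r in results if r.promoted]
    if not fired:
        return 0.0  # nothing fired, so nothing over-fired
    return sum(not r.rank1_correct for r in fired) / len(fired)


def passes_over_firing_check(results: list[CaseResult]) -> bool:
    return false_positive_rate(results) <= MAX_FP_RATE


# Illustrative data: five promotions, one of them wrong, plus one untouched case.
results = [
    CaseResult(promoted=True, rank1_correct=True),
    CaseResult(promoted=True, rank1_correct=True),
    CaseResult(promoted=True, rank1_correct=True),
    CaseResult(promoted=True, rank1_correct=True),
    CaseResult(promoted=True, rank1_correct=False),
    CaseResult(promoted=False, rank1_correct=False),
]
```

Cases where the layer did not fire are excluded from the denominator, so the rate measures how often firing was a mistake rather than overall accuracy.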

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Contributor

Greptile Summary

This PR ships a two-layer DB-evidence pipeline targeting the 50.6% DB-localization share of Runtime opensre+llm losses: a dependency-traversal planner rule added to the investigation agent's system prompt (Layer 1), and a deterministic B1 investigation-handoff post-pass on top_3_predictions that promotes rank-2 when investigation prose better supports it (Layer 2). It also amends the locked exp_structured_outputs_v1.yml pre-registration to replace the conceptually-flawed OBJECT_HIT_RC_MISS share gate with direct mechanism checks and fixes the previously inconsistent stopping_rules.null_result reference.

  • investigation_handoff.py: New module implementing token-overlap scoring with a _MIN_PROMOTION_SCORE = 2 threshold, a conclusion-lines-only object gate to prevent spurious cross-object promotion, and a two-stage apply_investigation_handoff pipeline (B1 align → conservative rerank). B1 is correctly gated to predictor_variant == "default" in adapter.py so control arms and the rejected structured variant remain unaffected.
  • exp_structured_outputs_v1.yml: Replaces the object_hit_rc_miss_share_max gate with mysql_confusion_reduction_pct ≥ 80%, lowers admission_a1_min from 0.55 → 0.40 to acknowledge the documented namespace_*_quota_exceeded regression, and aligns stopping_rules.null_result with the amended decision_rules — resolving the previously flagged contradiction.
  • cloudopsbench_db_pod_logs_smoke_openai.yml: New 40-case three-arm smoke config with a joint mechanism gate (A@1 lift + b1 planner fire rate + b2 handoff fire rate) and a Fargate-only execution policy for provenance integrity.
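The two amended gates named above could be evaluated along these lines. This is a hedged sketch: the thresholds (80% and 0.40) come from the summary, while the function names, the count-based definition of "confusion reduction", and the input shape are assumptions. The aggregate A@1 lift gate is left out because its threshold is not stated here.

```python
from __future__ import annotations

MYSQL_CONFUSION_REDUCTION_PCT_MIN = 80.0  # from the amended pre-registration
ADMISSION_A1_MIN = 0.40                   # lowered from 0.55 in the amendment


def confusion_reduction_pct(baseline_count: int, treatment_count: int) -> float:
    """Percentage drop in mysql-confusion cases relative to baseline (assumed definition)."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (baseline_count - treatment_count) / baseline_count


def evaluate_gates(
    baseline_mysql_confusions: int,
    treatment_mysql_confusions: int,
    admission_a1: float,
) -> dict[str, bool]:
    """Return a pass/fail flag per gate so a failure cannot hide in an aggregate."""
    reduction = confusion_reduction_pct(
        baseline_mysql_confusions, treatment_mysql_confusions
    )
    return {
        "mysql_confusion_reduction_pct": reduction >= MYSQL_CONFUSION_REDUCTION_PCT_MIN,
        "admission_a1_min": admission_a1 >= ADMISSION_A1_MIN,
    }


# Illustrative numbers only, not results from the benchmark.
gates = evaluate_gates(
    baseline_mysql_confusions=25,
    treatment_mysql_confusions=4,
    admission_a1=0.42,
)
```

Reporting each gate separately mirrors the stated intent of replacing one indirect gate with direct mechanism checks.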

Confidence Score: 5/5

Safe to merge. All changes are in the benchmark evaluation layer, with no production code paths touched. The B1 handoff is correctly gated and the pre-registration amendment resolves previously flagged inconsistencies.

The core logic in investigation_handoff.py is deterministic, well-tested across 8 scenarios including both cross-object promotion paths, and gated so it only applies to the default predictor variant. The pre-registration amendment resolves the previously noted stopping_rules contradiction. The two style-level observations (unreachable guard, substring false-positive risk) have no impact on correctness for the expected real-world data distribution.

No files require special attention. investigation_handoff.py has a minor dead-code guard and a low-risk substring matching note, neither of which affects correctness in the targeted failure class.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/predictor/investigation_handoff.py | New B1 handoff module: deterministic token-overlap promotion of rank-2 predictions with object gate and min-score guard. One unreachable guard in align_predictions_to_investigation (post-loop best_alt_score <= rank1_score check). |
| tests/benchmarks/cloudopsbench/adapter.py | Wires B1 handoff into the pipeline, correctly gated to predictor_variant == "default" with a well-documented lazy import. Control arms and the structured variant remain unaffected. |
| tests/benchmarks/cloudopsbench/bench_agent.py | Appends the dependency-traversal planner rule to _TRIMMED_BENCH_SYSTEM_PROMPT. Rule text explicitly limits its scope to evidence-gathering, not localization. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_pod_logs_smoke_openai.yml | New smoke config (40 cases, three-arm, Fargate-only). Well-documented with joint mechanism gate conditions (a, b1, b2) and explicit trade-off acknowledgments. |
| tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml | Amendment 1 replaces the flawed OBJECT_HIT_RC_MISS gate with direct mechanism checks, lowers admission_a1_min, and fixes stopping_rules.null_result to match decision_rules. All previously flagged inconsistencies are resolved. |
| tests/benchmarks/cloudopsbench/predictor/__init__.py | Adds align_predictions_to_investigation and apply_investigation_handoff to the package's public re-exports. |
| tests/benchmarks/cloudopsbench/tests/test_investigation_handoff.py | Comprehensive tests covering the happy path, empty-summary no-op, rank-1-already-best, cross-object blocking (two variants), DB-object-named-in-conclusion allow, rank-3 spurious hits, and adapter gate verification. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Investigation run] --> B{predictor_variant == default?}
    B -- No --> C[Use top_3_predictions as-is]
    B -- Yes --> D[emit_paper_predictions]
    D --> E{payload is None?}
    E -- Yes --> F[Return run unchanged]
    E -- No --> G[apply_investigation_handoff]
    G --> H["align_predictions_to_investigation<br/>B1: score each prediction<br/>vs investigation prose"]
    H --> I{"best_alt strictly outscores rank-1<br/>AND score >= _MIN_PROMOTION_SCORE?"}
    I -- No --> J["rerank_predictions_by_evidence<br/>conservative gate"]
    I -- Yes --> K{"object_gate_allows_promotion?<br/>check conclusion lines only"}
    K -- No --> J
    K -- Yes --> L["Promote alt to rank-1<br/>rewrite rank + fault_taxonomy"]
    L --> J
    L --> J
    J --> M[enriched_diagnosis with final top_3]
    C --> M
    M --> N[Return updated run]
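The object gate and the variant gate in the flowchart might look roughly like the sketch below. Only the behaviors named in the review are taken from the source: cross-object promotion is checked against conclusion lines only, and the handoff runs only for predictor_variant == "default" with a non-empty payload. The conclusion-line prefixes, the pass-through for same-object promotions, and the `maybe_apply_handoff` wrapper are assumptions for illustration.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical markers; the source only says the gate reads "conclusion lines".
_CONCLUSION_PREFIXES = ("conclusion:", "root cause:")


def conclusion_lines(summary: str) -> list[str]:
    """Lowercased lines of the summary that start with a conclusion marker."""
    lines = (ln.strip().lower() for ln in summary.splitlines())
    return [ln for ln in lines if ln.startswith(_CONCLUSION_PREFIXES)]


def object_gate_allows_promotion(rank1: dict, alt: dict, summary: str) -> bool:
    """Same-object promotions pass (assumption); cross-object promotions need
    the alt's object to be named in a conclusion line."""
    if alt["fault_object"] == rank1["fault_object"]:
        return True
    name = alt["fault_object"].lower()
    # Plain substring matching, so a short name can match inside a longer word.
    return any(name in ln for ln in conclusion_lines(summary))


def maybe_apply_handoff(
    predictor_variant: str,
    payload: Optional[list[dict]],
    summary: str,
    handoff: Callable[[list[dict], str], list[dict]],
) -> Optional[list[dict]]:
    """Run the handoff only for the default variant with a non-empty payload."""
    if predictor_variant != "default" or payload is None:
        return payload
    return handoff(payload, summary)


rank1 = {"fault_object": "frontend"}
alt = {"fault_object": "mysql"}
blocked_summary = (
    "Checked frontend pods.\n"
    "mysql logs look noisy.\n"
    "Conclusion: frontend pod is crash looping."
)
allowed_summary = (
    "Checked frontend pods.\n"
    "Conclusion: mysql connection pool exhausted."
)
```

In the first summary mysql appears only in a body line, so the gate blocks the promotion. In the second it appears in the conclusion line, so the promotion is allowed. The substring comparison is the kind of matching the review flags as a low-risk false-positive source.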

Reviews (4): Last reviewed commit: "fixed issues"

@YauhenBichel YauhenBichel changed the title docs(bench): exp_structured_outputs pre-reg amendment (n=100 triage) feat(bench): DB-evidence pipeline (planner rule + B1 handoff) Jun 11, 2026
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 11, 2026 11:01
@YauhenBichel YauhenBichel merged commit 6a562f1 into main Jun 11, 2026
19 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-admission-traige branch June 11, 2026 11:03
@github-actions

Contributor

🎯 Bullseye. @YauhenBichel opened a PR, kept the vibes clean, and got it merged. Absolute cinema. 🎬


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)
