feat(bench): DB-evidence pipeline (planner rule + B1 handoff) #2796
Greptile Summary

This PR ships a two-layer DB-evidence pipeline targeting the 50.6% DB-localization share of Runtime opensre+llm losses: a dependency-traversal planner rule added to the investigation agent's system prompt (Layer 1), and a deterministic B1 investigation-handoff post-pass on

Confidence Score: 5/5

Safe to merge. All changes are in the benchmark evaluation layer, with no production code paths touched. The B1 handoff is correctly gated and the pre-registration amendment resolves previously flagged inconsistencies. The core logic in

No files require special attention.

Important Files Changed

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Investigation run] --> B{predictor_variant == default?}
B -- No --> C[Use top_3_predictions as-is]
B -- Yes --> D[emit_paper_predictions]
D --> E{payload is None?}
E -- Yes --> F[Return run unchanged]
E -- No --> G[apply_investigation_handoff]
G --> H[align_predictions_to_investigation
B1: score each prediction
vs investigation prose]
H --> I{best_alt strictly outscores rank-1
AND score >= _MIN_PROMOTION_SCORE?}
I -- No --> J[rerank_predictions_by_evidence
conservative gate]
I -- Yes --> K{object_gate_allows_promotion?
check conclusion lines only}
K -- No --> J
K -- Yes --> L[Promote alt to rank-1
rewrite rank + fault_taxonomy]
L --> J
J --> M[enriched_diagnosis with final top_3]
C --> M
M --> N[Return updated run]
Reviews (4): Last reviewed commit: "fixed issues"
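
The promotion gate in the flowchart above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation. Only `_MIN_PROMOTION_SCORE = 2`, the strict-outscore condition, and the conclusion-lines-only object gate come from the PR text. The tokenizer, the `root_cause` field, the "line starts with conclusion" heuristic, and every function body are hypothetical. The sketch rewrites only `rank`; the `fault_taxonomy` rewrite and the later `rerank_predictions_by_evidence` step are omitted.

```python
import re

_MIN_PROMOTION_SCORE = 2  # value stated in the PR description


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(prediction: dict, prose: str) -> int:
    """Count distinct prediction tokens that also appear in the prose."""
    pred_text = f"{prediction['fault_object']} {prediction['root_cause']}"
    return len(_tokens(pred_text) & _tokens(prose))


def object_gate_allows(prediction: dict, prose: str) -> bool:
    """Hypothetical object gate: the fault object must be named on a
    conclusion line (assumed here to be a line starting with 'conclusion')."""
    obj = prediction["fault_object"].lower()
    return any(
        obj in line.lower()
        for line in prose.splitlines()
        if line.strip().lower().startswith("conclusion")
    )


def apply_handoff(top_3: list[dict], prose: str) -> list[dict]:
    """Promote the best-scoring alternative to rank 1 only if every gate passes."""
    if len(top_3) < 2:
        return top_3
    scores = [overlap_score(p, prose) for p in top_3]
    best_alt = max(range(1, len(top_3)), key=lambda i: scores[i])
    promote = (
        scores[best_alt] > scores[0]  # must strictly outscore rank-1
        and scores[best_alt] >= _MIN_PROMOTION_SCORE
        and object_gate_allows(top_3[best_alt], prose)
    )
    if not promote:
        return top_3
    reordered = [top_3[best_alt]] + [p for i, p in enumerate(top_3) if i != best_alt]
    # Rewrite the rank field so it matches the new ordering.
    return [{**p, "rank": rank} for rank, p in enumerate(reordered, start=1)]
```

Requiring both a strict win over rank-1 and a minimum absolute score means a tie, or a single shared token, never reorders the predictions. That matches the conservative intent described for the handoff.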
@YauhenBichel opened this PR, and it was merged.

Fixes #2074
Describe the changes you have made in this PR -
Add a coordinated pair of mechanisms targeting the 50.6% DB-localization share of Runtime opensre+llm losses identified in the post-exp_structured_outputs triage.

Layer 1: dependency-traversal planner rule in BenchInvestigationAgentTrimmedPrompt's system prompt. It expands what evidence the investigation gathers when DB-shaped symptoms appear; it does NOT bias localization.

Layer 2: B1 investigation-handoff post-pass on top_3_predictions. It promotes rank-2 to rank-1 when the investigation prose supports it better (token-overlap scoring with _MIN_PROMOTION_SCORE = 2). This catches the runtime/56 translation-loss class, where the predictor LLM re-diagnoses from the alert and buries the right answer at rank-2.

Ship together: the two layers form an evidence-collection → evidence-translation pipeline. Smoke (cloudopsbench_db_pod_logs_smoke_openai.yml) tests their JOINT effect; the layer-1-only ablation is deferred to a follow-up if smoke advances to full-N.

Amend the locked exp_structured_outputs_v1.yml pre-registration after the n=100 smoke revealed the OBJECT_HIT_RC_MISS share gate was conceptually flawed (it measures a transitional artifact, not mechanism effectiveness).

Replace one indirect gate with two direct mechanism checks (aggregate A@1 lift + per-pattern mysql confusion reduction). This is a tightening, not a loosening.

Document the Admission stratum -15pp regression as the expected mechanism cost of grammar-constrained sampling on the namespace_*_quota_exceeded family; pre-commit to acknowledging the cost so it can't be hidden in the final report.

Single-file change: only tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml. No code changes.

Why this isn't a re-test of the rejected DB-localization rule

The prompt rule rejected on 2026-06-09 operated at the labeling layer (fault_object MUST be the DB) and over-fired 73% of the time. The two layers in this PR are structurally distinct:

(d) no-over-firing check (FP rate ≤ 20%) catches it directly if either layer regresses to the prior failure mode

Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.