feat(bench): DB-evidence pipeline (planner rule + B1 handoff) by YauhenBichel · Pull Request #2796 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-10T20:33:10Z

Fixes #2074

Describe the changes you have made in this PR -

Add a coordinated pair of mechanisms targeting the 50.6% DB-localization share of Runtime opensre+llm losses identified in the post-exp_structured_outputs triage.
Layer 1: dependency-traversal planner rule in BenchInvestigationAgentTrimmedPrompt's system prompt — expands what evidence the investigation gathers when DB-shaped symptoms appear; does NOT bias localization.
Layer 2: B1 investigation-handoff post-pass on top_3_predictions — promotes rank-2 to rank-1 when the investigation prose supports it better (token-overlap scoring with _MIN_PROMOTION_SCORE = 2). Catches the runtime/56 translation-loss class where the predictor LLM re-diagnoses from the alert and buries the right answer at rank-2.
Ship together — the two layers form an evidence-collection → evidence-translation pipeline. Smoke (cloudopsbench_db_pod_logs_smoke_openai.yml) tests their JOINT effect; layer-1-only ablation deferred to follow-up if smoke advances to full-N.
Amend the locked exp_structured_outputs_v1.yml pre-registration after the n=100 smoke revealed the OBJECT_HIT_RC_MISS share gate was conceptually flawed (measures a transitional artifact, not mechanism effectiveness).
Replace one indirect gate with two direct mechanism checks (aggregate A@1 lift + per-pattern mysql confusion reduction) — tightening, not loosening.
Document the Admission stratum -15pp regression as expected mechanism cost of grammar-constrained sampling on the namespace_*_quota_exceeded family — pre-commit to acknowledging the cost so it can't be hidden in the final report.
The pre-registration amendment itself is a single-file change: only tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml. The amendment makes no code changes.
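
The Layer 2 promotion described above can be sketched roughly as follows. This is a minimal illustration, not the code in investigation_handoff.py: only `_MIN_PROMOTION_SCORE = 2` and the "promote rank-2 when it strictly outscores rank-1" behavior come from this PR. The tokenizer, the `root_cause` field name, and the function names `overlap_score` and `promote_rank2` are assumptions made for the example.

```python
import re

# Threshold quoted in the PR description; everything else here is illustrative.
_MIN_PROMOTION_SCORE = 2


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of length >= 3 (a hypothetical tokenizer)."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= 3}


def overlap_score(prediction: dict, prose: str) -> int:
    """Count distinct prediction tokens that also appear in the investigation prose."""
    pred_tokens = _tokens(f"{prediction['fault_object']} {prediction['root_cause']}")
    return len(pred_tokens & _tokens(prose))


def promote_rank2(predictions: list[dict], prose: str) -> list[dict]:
    """Swap rank-2 into rank-1 only if it strictly outscores rank-1 and clears the minimum."""
    if len(predictions) < 2 or not prose.strip():
        return predictions  # nothing to compare, or no prose to score against
    rank1 = overlap_score(predictions[0], prose)
    rank2 = overlap_score(predictions[1], prose)
    if rank2 > rank1 and rank2 >= _MIN_PROMOTION_SCORE:
        return [predictions[1], predictions[0], *predictions[2:]]
    return predictions
```

The sketch only reorders the list. It does not rewrite rank or taxonomy fields, and it applies no object gate, both of which the real module is described as doing.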

Why this isn't a re-test of the rejected DB-localization rule

The prompt rule rejected on 2026-06-09 operated at the labeling layer (fault_object MUST be the DB) and over-fired 73% of the time. The two layers in this PR are structurally distinct:

| Prior rejected rule | This PR |
| --- | --- |
| Forced an outcome regardless of evidence | Layer 1 adds evidence; Layer 2 promotes only when prose actually supports the alt |
| Over-fired because trigger ("DB-shaped symptoms") was too broad → false DB predictions | Trigger is broad, but Layer 1's action is "gather evidence" (token cost only) and Layer 2's action is "re-rank within already-emitted top-3"; neither can localize to a service the predictor didn't already candidate |
| Decision signal | Smoke's `(d)` no-over-firing check (FP rate ≤ 20%) catches it directly if either layer regresses to the prior failure mode |
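
For illustration, a no-over-firing check of the kind referenced in the last row might look like the sketch below. Only the 20% ceiling comes from this PR. The definition of a false positive used here (a case where a layer fired and the final rank-1 object is wrong), the `CaseResult` record, and both function names are assumptions, and the smoke config may define the check differently.

```python
from dataclasses import dataclass

FP_RATE_MAX = 0.20  # ceiling quoted in the PR description


@dataclass
class CaseResult:
    fired: bool            # hypothetical flag: did either layer act on this case?
    predicted_object: str  # final rank-1 fault object
    true_object: str       # ground-truth fault object


def false_positive_rate(results: list[CaseResult]) -> float:
    """Share of fired cases whose final rank-1 object is wrong (assumed definition)."""
    fired = [r for r in results if r.fired]
    if not fired:
        return 0.0  # nothing fired, so nothing over-fired
    wrong = sum(1 for r in fired if r.predicted_object != r.true_object)
    return wrong / len(fired)


def passes_over_firing_check(results: list[CaseResult]) -> bool:
    """True when the false-positive rate is at or below the ceiling."""
    return false_positive_rate(results) <= FP_RATE_MAX
```

Under this definition the prior rule's reported 73% over-firing would fail the check by a wide margin.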

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-10T20:33:20Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-10T20:34:27Z

@greptile review

greptile-apps · 2026-06-10T20:37:36Z

Greptile Summary

This PR ships a two-layer DB-evidence pipeline targeting the 50.6% DB-localization share of Runtime opensre+llm losses: a dependency-traversal planner rule added to the investigation agent's system prompt (Layer 1), and a deterministic B1 investigation-handoff post-pass on top_3_predictions that promotes rank-2 when investigation prose better supports it (Layer 2). It also amends the locked exp_structured_outputs_v1.yml pre-registration to replace the conceptually-flawed OBJECT_HIT_RC_MISS share gate with direct mechanism checks and fixes the previously inconsistent stopping_rules.null_result reference.

investigation_handoff.py: New module implementing token-overlap scoring with a _MIN_PROMOTION_SCORE = 2 threshold, a conclusion-lines-only object gate to prevent spurious cross-object promotion, and a two-stage apply_investigation_handoff pipeline (B1 align → conservative rerank). B1 is correctly gated to predictor_variant == "default" in adapter.py so control arms and the rejected structured variant remain unaffected.
exp_structured_outputs_v1.yml: Replaces the object_hit_rc_miss_share_max gate with mysql_confusion_reduction_pct ≥ 80%, lowers admission_a1_min from 0.55 → 0.40 to acknowledge the documented namespace_*_quota_exceeded regression, and aligns stopping_rules.null_result with the amended decision_rules — resolving the previously flagged contradiction.
cloudopsbench_db_pod_logs_smoke_openai.yml: New 40-case three-arm smoke config with a joint mechanism gate (A@1 lift + b1 planner fire rate + b2 handoff fire rate) and a Fargate-only execution policy for provenance integrity.
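
As a rough sketch of how the two amended thresholds named above could be evaluated: the values 80% and 0.40 come from the summary, while the formula for the reduction percentage, the function names, and the input shape are assumptions. The aggregate A@1 lift check is left out because its threshold is not stated here.

```python
# Thresholds named in the summary above.
MYSQL_CONFUSION_REDUCTION_PCT_MIN = 80.0
ADMISSION_A1_MIN = 0.40


def confusion_reduction_pct(control_count: int, treatment_count: int) -> float:
    """Percent drop in mysql confusions relative to control (assumed formula)."""
    if control_count <= 0:
        raise ValueError("control_count must be positive")
    return 100.0 * (control_count - treatment_count) / control_count


def gates_pass(control_confusions: int, treatment_confusions: int, admission_a1: float) -> bool:
    """Check the two gates whose thresholds are given in the summary."""
    reduction = confusion_reduction_pct(control_confusions, treatment_confusions)
    return (
        reduction >= MYSQL_CONFUSION_REDUCTION_PCT_MIN
        and admission_a1 >= ADMISSION_A1_MIN
    )
```

For example, going from 25 confusions to 5 is an 80% reduction and clears the first gate exactly; going from 25 to 6 (76%) does not.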

Confidence Score: 5/5

Safe to merge. All changes are in the benchmark evaluation layer, with no production code paths touched. The B1 handoff is correctly gated and the pre-registration amendment resolves previously flagged inconsistencies.

The core logic in investigation_handoff.py is deterministic, well-tested across 8 scenarios including both cross-object promotion paths, and gated so it only applies to the default predictor variant. The pre-registration amendment resolves the previously noted stopping_rules contradiction. The two style-level observations (unreachable guard, substring false-positive risk) have no impact on correctness for the expected real-world data distribution.

No files require special attention. investigation_handoff.py has a minor dead-code guard and a low-risk substring matching note, neither of which affects correctness in the targeted failure class.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/predictor/investigation_handoff.py | New B1 handoff module: deterministic token-overlap promotion of rank-2 predictions with object gate and min-score guard. One unreachable guard in `align_predictions_to_investigation` (post-loop `best_alt_score <= rank1_score` check). |
| tests/benchmarks/cloudopsbench/adapter.py | Wires B1 handoff into the pipeline, correctly gated to `predictor_variant == "default"` with a well-documented lazy import. Control arms and the structured variant remain unaffected. |
| tests/benchmarks/cloudopsbench/bench_agent.py | Appends the dependency-traversal planner rule to `_TRIMMED_BENCH_SYSTEM_PROMPT`. Rule text explicitly limits its scope to evidence-gathering, not localization. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_pod_logs_smoke_openai.yml | New smoke config (40 cases, three-arm, Fargate-only). Well-documented with joint mechanism gate conditions (a, b1, b2) and explicit trade-off acknowledgments. |
| tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml | Amendment 1 replaces the flawed OBJECT_HIT_RC_MISS gate with direct mechanism checks, lowers `admission_a1_min`, and fixes `stopping_rules.null_result` to match `decision_rules`. All previously flagged inconsistencies are resolved. |
| tests/benchmarks/cloudopsbench/predictor/`__init__.py` | Adds `align_predictions_to_investigation` and `apply_investigation_handoff` to the package's public re-exports. |
| tests/benchmarks/cloudopsbench/tests/test_investigation_handoff.py | Comprehensive tests covering the happy path, empty-summary no-op, rank-1-already-best, cross-object blocking (two variants), DB-object-named-in-conclusion allow, rank-3 spurious hits, and adapter gate verification. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Investigation run] --> B{predictor_variant == default?}
    B -- No --> C[Use top_3_predictions as-is]
    B -- Yes --> D[emit_paper_predictions]
    D --> E{payload is None?}
    E -- Yes --> F[Return run unchanged]
    E -- No --> G[apply_investigation_handoff]
    G --> H["align_predictions_to_investigation<br/>B1: score each prediction<br/>vs investigation prose"]
    H --> I{"best_alt strictly outscores rank-1<br/>AND score >= _MIN_PROMOTION_SCORE?"}
    I -- No --> J["rerank_predictions_by_evidence<br/>conservative gate"]
    I -- Yes --> K{"object_gate_allows_promotion?<br/>check conclusion lines only"}
    K -- No --> J
    K -- Yes --> L["Promote alt to rank-1<br/>rewrite rank + fault_taxonomy"]
    L --> J
    J --> M[enriched_diagnosis with final top_3]
    C --> M
    M --> N[Return updated run]
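
The object-gate step in the flowchart can be illustrated with a small sketch. The name `object_gate_allows_promotion` appears in the flowchart, but its signature, the way conclusion lines are detected (lines starting with "Conclusion" or "Root cause"), and the helper `conclusion_lines` are assumptions for this example. It uses plain substring matching, which carries the same false-positive risk the review notes above.

```python
def conclusion_lines(prose: str) -> list[str]:
    """Return lines that look like conclusions (hypothetical marker-based heuristic)."""
    markers = ("conclusion", "root cause")
    return [ln for ln in prose.splitlines() if ln.strip().lower().startswith(markers)]


def object_gate_allows_promotion(rank1_object: str, alt_object: str, prose: str) -> bool:
    """Allow a cross-object promotion only if a conclusion line names the alt object."""
    if alt_object == rank1_object:
        return True  # same object: no cross-object risk to gate
    needle = alt_object.lower()
    # Substring match: "mysql" would also match "mysql-proxy".
    return any(needle in ln.lower() for ln in conclusion_lines(prose))
```

With this gate, a passing mention of the alternative object in the body of the investigation is not enough; the object has to appear in a line the investigation presents as its conclusion.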

Reviews (4): Last reviewed commit: "fixed issues"

YauhenBichel · 2026-06-11T10:24:16Z

@greptile review

YauhenBichel · 2026-06-11T10:51:52Z

@greptile review

github-actions · 2026-06-11T11:03:22Z

🎯 Bullseye. @YauhenBichel opened a PR, kept the vibes clean, and got it merged. Absolute cinema. 🎬

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

docs(bench): amend exp_structured_outputs_v1 pre-reg after n=100 smok…

39879cd

…e triage

greptile-apps Bot reviewed Jun 10, 2026

Comment thread tests/benchmarks/cloudopsbench/configs/preregistrations/exp_structured_outputs_v1.yml

experiment: exp_db_evidence_pipeline

cd14ceb

YauhenBichel changed the title ~~docs(bench): exp_structured_outputs pre-reg amendment (n=100 triage)~~ feat(bench): DB-evidence pipeline (planner rule + B1 handoff) Jun 11, 2026

fixed lint

fe3bfb0

YauhenBichel added 2 commits June 11, 2026 11:28

fixed lint issues

1360493

fixed issues

6a92295

YauhenBichel marked this pull request as ready for review June 11, 2026 11:01

YauhenBichel merged commit 6a562f1 into main Jun 11, 2026
19 checks passed

YauhenBichel deleted the fix/2074-bench-admission-traige branch June 11, 2026 11:03

YauhenBichel merged 5 commits into main from fix/2074-bench-admission-traige
