fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools; refactoring #2770
@greptile review
Greptile Summary
This PR adds the B2 false-healthy guard plumbing and a suite of bridge/predictor contract tests. The core changes are: a new
Confidence Score: 4/5
Safe to merge with one fix recommended: cloudopsbench_floor0_ablation_openai.yml is missing min_tool_calls: 0, so running it without the env var silently produces floor=5 data under a floor=0 label.

The newly added cloudopsbench_floor0_ablation_openai.yml is self-inconsistent: its entire purpose is the floor=0 ablation hypothesis, but it lacks min_tool_calls: 0, the exact mechanism this PR introduces. Running it without BENCH_MIN_TOOL_CALLS=0 produces a silent duplicate of the floor=5 baseline, invalidating the experiment without any error. The v2 config was added in the same PR specifically because floor0_ablation was broken in this way. All other changes (config-field plumbing, evidence truncation, test reorganization, and new predictor contract tests) are straightforward and correct.

tests/benchmarks/cloudopsbench/configs/cloudopsbench_floor0_ablation_openai.yml needs min_tool_calls: 0 added.

Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[CLI: _cmd_run] --> B{config.min_tool_calls not None AND benchmark == cloudopsbench?}
B -- Yes --> C[BenchInvestigationAgent.MIN_TOOL_CALLS = config.min_tool_calls]
B -- No --> D[Agent uses default / env-var floor]
C --> E[BenchmarkRunner.run]
D --> E
E --> F[RunResult with evidence_entries]
F --> G[_cell_to_dict]
G --> H[_truncate_evidence_entries - cap data keys to 2000 chars]
H --> I[Cell artifact JSON: evidence_entries_count + truncated evidence_entries]
F --> J[B2 guard reads full in-memory evidence_entries at runtime]
I --> K[analyze_validation.py: _b2_fired matches downgrade signature in persisted final_diagnosis]
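The first branch of the flowchart (config value vs. default / env-var floor) is where the floor0 ablation issue comes from, so a rough sketch of that precedence may help. This is an illustration only: `resolve_min_tool_calls` and `DEFAULT_MIN_TOOL_CALLS` are hypothetical names, the default of 5 is taken from the "floor=5 baseline" wording in the review, and the way the real agent reads `BENCH_MIN_TOOL_CALLS` is assumed rather than copied from the repo.

```python
import os
from typing import Mapping, Optional

# Assumed default floor, inferred from the review's "floor=5 baseline".
DEFAULT_MIN_TOOL_CALLS = 5


def resolve_min_tool_calls(
    config_value: Optional[int],
    benchmark: str,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical helper: config field first, then env var, then default."""
    env = os.environ if env is None else env
    # Compare against None, not truthiness, so an explicit 0 is honoured.
    if config_value is not None and benchmark == "cloudopsbench":
        return config_value
    raw = env.get("BENCH_MIN_TOOL_CALLS")
    return int(raw) if raw is not None else DEFAULT_MIN_TOOL_CALLS
```

Under these assumptions, a config without min_tool_calls and no env var resolves to 5, which is the silent floor=5 run the review warns about; setting min_tool_calls: 0 in the YAML makes the result independent of the environment.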
Reviews (5): Last reviewed commit: "move configs/ + cloudopsbench AWS docs i..."
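For the evidence-capping step in the flowchart above (_truncate_evidence_entries, "cap data keys to 2000 chars"), a minimal sketch could look like the following. The entry shape (a dict with a nested "data" dict) and the function name `truncate_evidence_entries` are assumptions made for illustration; the real helper may serialize or mark truncated values differently.

```python
import copy
import json
from typing import Any

MAX_DATA_CHARS = 2000  # cap named in the flowchart


def truncate_evidence_entries(
    entries: list[dict[str, Any]], limit: int = MAX_DATA_CHARS
) -> list[dict[str, Any]]:
    """Return a copy of entries with each over-long value under 'data' cut to `limit` chars."""
    out = []
    for entry in entries:
        entry = copy.deepcopy(entry)  # leave the in-memory entries untouched
        data = entry.get("data")  # assumed location of the bulky payload
        if isinstance(data, dict):
            for key, value in data.items():
                # Non-string payloads are measured by their JSON text.
                text = value if isinstance(value, str) else json.dumps(value, default=str)
                if len(text) > limit:
                    data[key] = text[:limit]
        out.append(entry)
    return out
```

Copying before truncating matters here because, per the flowchart, the B2 guard reads the full in-memory evidence_entries at runtime while only the persisted artifact is capped.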

Fixes #2074
Describe the changes you have made in this PR -
After the Fix-A validation showed a null aggregate contrast and 6 seen-shape losses, a deeper failure analysis surfaced that opensre+llm sometimes declared the cluster healthy while its own tool observations disagreed (CrashLoopBackOff, ImagePullBackOff, Pending pods). The B2 guard in false_healthy_guard.py downgrades these conclusions to unresolved BEFORE the predictor runs, applied to BenchInvestigationAgent only so the matched control arm is unchanged.

Offline analysis against the 240 Fargate Fix-A cells (the same n=40 slice, gpt-4o, seed 42):
Net ceiling: +10 cells / Δa1 ≈ +0.083, pushing the seen-shape contrast from −0.083 → ~0.000 in the best case. Validation re-run on Fargate is the next data point.
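To make the guard's behaviour concrete, here is a minimal sketch of the downgrade rule described above. It is not the code in false_healthy_guard.py: the `Diagnosis` dataclass, the `apply_false_healthy_guard` function, and plain substring matching on observation text are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, replace

# Pod states the PR description lists as contradicting a "healthy" verdict.
UNHEALTHY_MARKERS = ("CrashLoopBackOff", "ImagePullBackOff", "Pending")


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical stand-in for the agent's conclusion."""
    verdict: str
    note: str = ""


def apply_false_healthy_guard(diagnosis: Diagnosis, observations: list[str]) -> Diagnosis:
    """Downgrade 'healthy' to 'unresolved' when tool output shows unhealthy pods."""
    if diagnosis.verdict != "healthy":
        return diagnosis  # only healthy conclusions are second-guessed
    # Substring matching is an assumption; the real guard may parse structured output.
    hits = [m for m in UNHEALTHY_MARKERS if any(m in obs for obs in observations)]
    if not hits:
        return diagnosis
    return replace(
        diagnosis,
        verdict="unresolved",
        note="downgraded: observations show " + ", ".join(hits),
    )
```

In this sketch the guard would run on the agent's conclusion before it is handed to the predictor, and only in the BenchInvestigationAgent arm, so the control arm sees the original verdict.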
Also fixed: the cell artifact JSON had dropped evidence_entries (persisting only the count). It now persists a truncated copy so post-hoc analyzers can verify which cells the guard fired on.

Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.