fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools; refactoring by YauhenBichel · Pull Request #2770 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-07T14:07:34Z

Fixes #2074

Describe the changes you have made in this PR -

After the Fix-A validation showed a null aggregate contrast and 6 seen-shape losses, a deeper failure analysis surfaced that opensre+llm sometimes declared the cluster healthy while its own tool observations disagreed (CrashLoopBackOff, ImagePullBackOff, Pending pods). The B2 guard in false_healthy_guard.py downgrades these conclusions to unresolved BEFORE the predictor runs, applied to BenchInvestigationAgent only so the matched control arm is unchanged.
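
The guard behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in false_healthy_guard.py: the function name `downgrade_false_healthy`, the phrase lists, the `UNRESOLVED` text, and the evidence shape (dicts with an `output` string) are all assumptions.

```python
from typing import Iterable, Mapping

# Hypothetical phrase lists; the real guard's matching rules are not shown here.
HEALTHY_PHRASES = ("cluster is healthy", "no issues found")
UNHEALTHY_SIGNALS = ("CrashLoopBackOff", "ImagePullBackOff", "Pending")
UNRESOLVED = "unresolved: healthy conclusion contradicted by tool observations"


def downgrade_false_healthy(conclusion: str, evidence: Iterable[Mapping[str, str]]) -> str:
    """Return the conclusion unchanged unless it claims health while evidence disagrees."""
    claims_healthy = any(phrase in conclusion.lower() for phrase in HEALTHY_PHRASES)
    if not claims_healthy:
        return conclusion
    for entry in evidence:
        output = entry.get("output", "")
        if any(signal in output for signal in UNHEALTHY_SIGNALS):
            # The agent's own tool output contradicts its "healthy" verdict.
            return UNRESOLVED
    return conclusion
```

Running a check like this before the predictor means a contradicted "healthy" verdict never reaches scoring as a confident answer.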

Offline analysis against the 240 Fargate Fix-A cells (the same n=40 slice, gpt-4o, seed 42):

stratum	opensre+llm cells where guard would fire (phrase match)	currently a1=0 (potential rescue)	currently a1=1 (regress risk)
all	14 / 120	11	3 (2 protected by Group C perf_hint)

Net ceiling: +10 cells / Δa1 ≈ +0.083, pushing the seen-shape contrast from −0.083 → ~0.000 in the best case. Validation re-run on Fargate is the next data point.
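
One reading of the table that reproduces these numbers is sketched below. The netting rule is an assumption, since the PR does not spell out the arithmetic: the 2 cells protected by perf_hint are removed from the 3 regress-risk cells before netting against the 11 potential rescues.

```python
# Counts come from the table above; the netting rule is an assumed reading.
opensre_llm_cells = 120
potential_rescues = 11       # guard would fire, currently a1=0
regress_risk = 3             # guard would fire, currently a1=1
protected_by_perf_hint = 2   # regress-risk cells shielded by Group C perf_hint

net_cells = potential_rescues - (regress_risk - protected_by_perf_hint)
delta_a1 = net_cells / opensre_llm_cells
best_case_contrast = -0.083 + delta_a1

print(net_cells, round(delta_a1, 3), round(best_case_contrast, 3))
```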

Also fixed: the cell artifact JSON had dropped evidence_entries (persisting only the count). Now persists a truncated copy so post-hoc analyzers can verify which cells the guard fired on.
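
A truncation step of this kind might look like the sketch below. The key names and the 2000-character cap are taken from the Greptile summary of runner.py in this thread; the function name `truncate_evidence_entries`, its body, and the assumption that entries are flat dicts are illustrative rather than the PR's actual code.

```python
from typing import Any

MAX_CHARS = 2000
CAPPED_KEYS = ("output", "content", "text", "message")


def truncate_evidence_entries(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return a copy of entries with long string values under CAPPED_KEYS cut to MAX_CHARS."""
    truncated = []
    for entry in entries:
        copy = dict(entry)  # shallow copy so the in-memory list stays intact
        for key in CAPPED_KEYS:
            value = copy.get(key)
            if isinstance(value, str) and len(value) > MAX_CHARS:
                copy[key] = value[:MAX_CHARS]
        truncated.append(copy)
    return truncated
```

Copying each entry rather than mutating it keeps the full list available to anything that still reads it at runtime.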

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-07T14:07:43Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-07T14:08:45Z

@greptile review

greptile-apps · 2026-06-07T14:13:30Z

Greptile Summary

This PR adds the B2 false-healthy guard plumbing and a suite of bridge/predictor contract tests. The core changes are: a new min_tool_calls config field plus a CLI override that lets floor-ablation experiments be self-describing rather than env-var-dependent, truncated evidence_entries now persisted in cell artifacts for post-hoc guard auditing, and B2 guard activation statistics in analyze_validation.py. The bulk of the diff is test-file reorganization into tests/ subdirectories with zero logic change.

min_tool_calls plumbing (config.py, cli.py): adds an optional int | None field with ge=0 and wires it into a one-shot BenchInvestigationAgent.MIN_TOOL_CALLS override in the CLI, gated strictly to the cloudopsbench adapter.
Evidence truncation (runner.py): _truncate_evidence_entries() caps large string keys to 2000 chars and persists the result alongside evidence_entries_count; the in-memory full list continues to feed the B2 guard at runtime.
New configs (cloudopsbench_definitive_openai.yml, cloudopsbench_floor0_ablation_openai.yml, cloudopsbench_floor_ablation_v2_openai.yml): document the full-corpus comparison and two iterations of the floor-ablation experiment; the v2 correctly uses min_tool_calls: 0 while the original does not.
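
The config-to-agent wiring in the first bullet could be sketched as below. Only `min_tool_calls`, `MIN_TOOL_CALLS`, `BenchInvestigationAgent`, and the `cloudopsbench` gate come from this thread. Everything else is a stand-in: the `BenchConfig` dataclass replaces whatever validated model the real config.py uses, `apply_floor_override` is a hypothetical helper name, and the default floor of 5 is inferred from the review's mention of floor=5 data.

```python
from dataclasses import dataclass
from typing import Optional


class BenchInvestigationAgent:
    """Stand-in for the real agent class; only the class attribute matters here."""
    MIN_TOOL_CALLS = 5  # assumed default floor


@dataclass
class BenchConfig:
    """Hypothetical, simplified config."""
    benchmark: str
    min_tool_calls: Optional[int] = None

    def __post_init__(self) -> None:
        # Mirrors the ge=0 constraint by hand.
        if self.min_tool_calls is not None and self.min_tool_calls < 0:
            raise ValueError("min_tool_calls must be >= 0")


def apply_floor_override(config: BenchConfig) -> None:
    """Override the agent floor only for cloudopsbench and only when the field is set."""
    if config.benchmark == "cloudopsbench" and config.min_tool_calls is not None:
        BenchInvestigationAgent.MIN_TOOL_CALLS = config.min_tool_calls
```

The explicit `is not None` check matters because `0` is a valid floor and a plain truthiness test would skip it.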

Confidence Score: 4/5

Safe to merge with one fix recommended: cloudopsbench_floor0_ablation_openai.yml is missing min_tool_calls: 0, so running it without the env var silently produces floor=5 data under a floor=0 label.

The newly added cloudopsbench_floor0_ablation_openai.yml is self-inconsistent: its entire purpose is the floor=0 ablation hypothesis, but it lacks min_tool_calls: 0 — the exact mechanism this PR introduces. Running it without BENCH_MIN_TOOL_CALLS=0 produces a silent duplicate of the floor=5 baseline, invalidating the experiment without any error. The v2 config was added in the same PR specifically because floor0_ablation was broken in this way. All other changes — config-field plumbing, evidence truncation, test reorganization, and new predictor contract tests — are straightforward and correct.

tests/benchmarks/cloudopsbench/configs/cloudopsbench_floor0_ablation_openai.yml needs min_tool_calls: 0 added.

Important Files Changed

Filename	Overview
tests/benchmarks/_framework/config.py	Adds optional min_tool_calls: int | None field with ge=0.
tests/benchmarks/_framework/cli.py	Injects config-driven MIN_TOOL_CALLS override into BenchInvestigationAgent after adapter import; gated on config.benchmark == cloudopsbench and config.min_tool_calls is not None, so other adapters are unaffected.
tests/benchmarks/_framework/runner.py	Adds _truncate_evidence_entries() (caps output/content/text/message keys at 2000 chars) and persists truncated list alongside the existing count in cell artifact JSON; in-memory guard continues to read the full list.
tests/benchmarks/cloudopsbench/analyze_validation.py	Adds B2 guard activation section; detection phrase in _b2_fired is hardcoded and not imported from false_healthy_guard, creating a silent-fail coupling if the downgrade signature ever changes.
tests/benchmarks/cloudopsbench/configs/cloudopsbench_floor0_ablation_openai.yml	New config for floor=0 ablation but missing min_tool_calls: 0; silently runs at the default floor when the required BENCH_MIN_TOOL_CALLS=0 env var is absent, invalidating the experiment.
tests/benchmarks/cloudopsbench/configs/cloudopsbench_floor_ablation_v2_openai.yml	Correct v2 retry using min_tool_calls: 0 to bake the floor into the config; well-documented pre-registered predictions and decision rules.
tests/benchmarks/cloudopsbench/tests/test_predictor.py	Adds bridge contract tests (positive/negative cases for infer_final_answer_from_opensre_text), taxonomy derivation override tests, and rate-limit retry tests; well-organized and clearly documented.
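
The coupling flagged for analyze_validation.py could be avoided by having the guard and the analyzer share one constant. A minimal sketch, assuming a hypothetical marker string and function names; only the `final_diagnosis` field and the idea of matching a downgrade signature come from this review.

```python
# In a real layout this constant would live in false_healthy_guard.py and be
# imported by analyze_validation.py; both roles are shown in one file here.
DOWNGRADE_MARKER = "[B2 guard] downgraded to unresolved"  # hypothetical text


def downgrade(diagnosis: str) -> str:
    """Guard side: prefix the diagnosis with the shared marker."""
    return f"{DOWNGRADE_MARKER}: {diagnosis}"


def b2_fired(cell: dict) -> bool:
    """Analyzer side: detect the guard via the same constant, not a copied literal."""
    return DOWNGRADE_MARKER in cell.get("final_diagnosis", "")
```

With a single source for the marker, a change to the downgrade wording cannot silently zero out the analyzer's fire count.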

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CLI: _cmd_run] --> B{config.min_tool_calls not None AND benchmark == cloudopsbench?}
    B -- Yes --> C[BenchInvestigationAgent.MIN_TOOL_CALLS = config.min_tool_calls]
    B -- No --> D[Agent uses default / env-var floor]
    C --> E[BenchmarkRunner.run]
    D --> E
    E --> F[RunResult with evidence_entries]
    F --> G[_cell_to_dict]
    G --> H[_truncate_evidence_entries - cap data keys to 2000 chars]
    H --> I[Cell artifact JSON: evidence_entries_count + truncated evidence_entries]
    F --> J[B2 guard reads full in-memory evidence_entries at runtime]
    I --> K[analyze_validation.py: _b2_fired matches downgrade signature in persisted final_diagnosis]

Reviews (5): Last reviewed commit: "move configs/ + cloudopsbench AWS docs i..."

YauhenBichel · 2026-06-07T14:17:33Z

@greptile review

YauhenBichel · 2026-06-08T13:17:43Z

@greptile review

YauhenBichel · 2026-06-08T13:39:21Z

@greptile review

github-actions · 2026-06-08T13:50:18Z

💼 Interviewer: describe a time you shipped something impactful.

@YauhenBichel: points at this PR

Interviewer: you're hired. 🤝

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

fix(bench): false-healthy guard + plumbing + bridge contract tests

6755018

added B2-fire stats column

769cb98

greptile-apps Bot reviewed Jun 7, 2026

fixed greptile notes

f62abc2

YauhenBichel added 6 commits June 7, 2026 15:35

fix(bench): gate corpus-required false-healthy tests + B2 validation …

494c02c

…config

revert false health as it did not work

d1a0880

config for openai comparison

f063cb1

bench openai config for floor 0

8e54ced

added config for experiment with floor 0 of tools

b7945c7

fix(bench): add min_tool_calls config field + CLI override

e7950be

YauhenBichel added 2 commits June 8, 2026 14:31

chore(bench): move configs/ into cloudopsbench/configs/

14ab696

move configs/ + cloudopsbench AWS docs into cloudopsbench/

c234f7e

YauhenBichel marked this pull request as ready for review June 8, 2026 13:48

YauhenBichel changed the title ~~fix(bench): false-healthy guard with plumbing and bridge contract tests~~ fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools, refactoring Jun 8, 2026

YauhenBichel changed the title ~~fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools, refactoring~~ fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools; refactoring Jun 8, 2026

YauhenBichel merged commit c0f07fa into main Jun 8, 2026
16 checks passed

YauhenBichel deleted the fix/2074-bench-running-openai-fixa-validation branch June 8, 2026 13:50

YauhenBichel merged 11 commits into main from fix/2074-bench-running-openai-fixa-validation
