
fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools; refactoring #2770

Merged: YauhenBichel merged 11 commits into main from fix/2074-bench-running-openai-fixa-validation on Jun 8, 2026

@YauhenBichel YauhenBichel commented Jun 7, 2026

Collaborator

Fixes #2074

Describe the changes you have made in this PR -

After the Fix-A validation showed a null aggregate contrast and 6 seen-shape losses, a deeper failure analysis found that opensre+llm sometimes declared the cluster healthy while its own tool observations disagreed (CrashLoopBackOff, ImagePullBackOff, Pending pods). The B2 guard in false_healthy_guard.py downgrades these conclusions to unresolved before the predictor runs. It is applied to BenchInvestigationAgent only, so the matched control arm is unchanged.
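As a rough illustration of the guard idea described above, the sketch below downgrades a "healthy" conclusion when the tool observations contain one of the unhealthy pod states. The function name, `HEALTHY_PHRASES`, the downgrade text, and the `output` key on each evidence entry are assumptions made for illustration; they are not taken from false_healthy_guard.py.

```python
from __future__ import annotations

# Pod states named in the PR description as contradicting a "healthy" verdict.
UNHEALTHY_MARKERS = ("CrashLoopBackOff", "ImagePullBackOff", "Pending")

# Hypothetical phrases that signal a "cluster is healthy" conclusion.
HEALTHY_PHRASES = ("cluster is healthy", "no issues found")

# Hypothetical downgrade text; the real guard's signature may differ.
DOWNGRADE_PREFIX = "UNRESOLVED (false-healthy guard)"


def apply_false_healthy_guard(diagnosis: str, evidence_entries: list[dict]) -> str:
    """Return the diagnosis, downgraded if it claims health against the evidence."""
    claims_healthy = any(phrase in diagnosis.lower() for phrase in HEALTHY_PHRASES)
    if not claims_healthy:
        return diagnosis

    # Naive substring match over tool outputs; collect each marker once.
    seen = sorted(
        {
            marker
            for entry in evidence_entries
            for marker in UNHEALTHY_MARKERS
            if marker in str(entry.get("output", ""))
        }
    )
    if not seen:
        return diagnosis
    return f"{DOWNGRADE_PREFIX}: tool observations show {', '.join(seen)}"


evidence = [{"output": "pod api-0 CrashLoopBackOff"}, {"output": "pod db-0 Pending"}]
print(apply_false_healthy_guard("The cluster is healthy.", evidence))
```

Running the guard before the predictor, as the description states, means the predictor only ever sees the downgraded text for these cells.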

Offline analysis against the 240 Fargate Fix-A cells (the same n=40 slice, gpt-4o, seed 42):

| Stratum | opensre+llm cells where guard would fire (phrase match) | Currently a1=0 (potential rescue) | Currently a1=1 (regress risk) |
|---|---|---|---|
| all | 14 / 120 | 11 | 3 (2 protected by Group C perf_hint) |

Net ceiling: +10 cells (11 potential rescues minus the 1 unprotected regress-risk cell), i.e. Δa1 ≈ 10/120 ≈ +0.083, pushing the seen-shape contrast from −0.083 to ~0.000 in the best case. The validation re-run on Fargate is the next data point.
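The ceiling arithmetic can be reproduced in a few lines. This is only a worked check of the reported figures; it assumes that the 2 protected cells do not regress and, as the description does, that the Δa1 gain maps directly onto the seen-shape contrast.

```python
# Figures taken from the offline-analysis numbers above.
total_cells = 120        # opensre+llm cells in the slice
potential_rescues = 11   # guard would fire, currently a1=0
regress_risk = 3         # guard would fire, currently a1=1
protected = 2            # regress-risk cells protected by Group C perf_hint

# Assumption: protected cells keep a1=1, so only the remainder can regress.
unprotected_regressions = regress_risk - protected
net_cells = potential_rescues - unprotected_regressions
delta_a1 = net_cells / total_cells

current_contrast = -0.083  # seen-shape contrast reported in the description
best_case_contrast = current_contrast + delta_a1

print(net_cells, round(delta_a1, 3), round(best_case_contrast, 3))
```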

Also fixed: the cell artifact JSON previously dropped evidence_entries and persisted only the count. It now persists a truncated copy as well, so post-hoc analyzers can verify which cells the guard fired on.

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions Bot commented Jun 7, 2026

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

greptile-apps Bot commented Jun 7, 2026

Contributor

Greptile Summary

This PR adds the B2 false-healthy guard plumbing and a suite of bridge/predictor contract tests. The core changes are: a new min_tool_calls config field + CLI override that lets floor-ablation experiments be self-describing rather than env-var-dependent, truncated evidence_entries now persisted in cell artifacts for post-hoc guard auditing, and B2 guard activation statistics in analyze_validation.py. Bulk of the diff is test-file reorganization into tests/ subdirectories with zero logic change.

  • min_tool_calls plumbing (config.py, cli.py): adds an optional int | None field with ge=0 and wires it into a one-shot BenchInvestigationAgent.MIN_TOOL_CALLS override in the CLI, gated strictly to the cloudopsbench adapter.
  • Evidence truncation (runner.py): _truncate_evidence_entries() caps large string keys to 2000 chars and persists the result alongside evidence_entries_count; the in-memory full list continues to feed the B2 guard at runtime.
  • New configs (cloudopsbench_definitive_openai.yml, cloudopsbench_floor0_ablation_openai.yml, cloudopsbench_floor_ablation_v2_openai.yml): document the full-corpus comparison and two iterations of the floor-ablation experiment; the v2 correctly uses min_tool_calls: 0 while the original does not.
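The evidence-truncation step in the list above might look roughly like the following. It is a minimal sketch, not the code in runner.py: the function name is hypothetical, and the set of capped keys is taken from the file overview in this review (output, content, text, message).

```python
from __future__ import annotations

from typing import Any

MAX_CHARS = 2000  # cap stated in the review summary
TRUNCATED_KEYS = ("output", "content", "text", "message")  # keys named in the file overview


def truncate_evidence_entries_sketch(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return shallow copies of entries with long string values capped.

    The input list and its dicts are left untouched, so an in-memory consumer
    (such as the runtime guard) can still read the full text.
    """
    truncated = []
    for entry in entries:
        copy = dict(entry)
        for key in TRUNCATED_KEYS:
            value = copy.get(key)
            if isinstance(value, str) and len(value) > MAX_CHARS:
                copy[key] = value[:MAX_CHARS]
        truncated.append(copy)
    return truncated


entries = [{"tool": "kubectl", "output": "x" * 5000}]
artifact = {
    "evidence_entries_count": len(entries),
    "evidence_entries": truncate_evidence_entries_sketch(entries),
}
print(len(artifact["evidence_entries"][0]["output"]), len(entries[0]["output"]))
```

Copying each entry rather than mutating it is what keeps the persisted artifact small while the full list stays available at runtime.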

Confidence Score: 4/5

Safe to merge with one fix recommended: cloudopsbench_floor0_ablation_openai.yml is missing min_tool_calls: 0, so running it without the env var silently produces floor=5 data under a floor=0 label.

The newly added cloudopsbench_floor0_ablation_openai.yml is self-inconsistent: its entire purpose is the floor=0 ablation hypothesis, but it lacks min_tool_calls: 0 — the exact mechanism this PR introduces. Running it without BENCH_MIN_TOOL_CALLS=0 produces a silent duplicate of the floor=5 baseline, invalidating the experiment without any error. The v2 config was added in the same PR specifically because floor0_ablation was broken in this way. All other changes — config-field plumbing, evidence truncation, test reorganization, and new predictor contract tests — are straightforward and correct.

tests/benchmarks/cloudopsbench/configs/cloudopsbench_floor0_ablation_openai.yml needs min_tool_calls: 0 added.
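The silent fallback described above can be made concrete with a small resolution function. This is a sketch under assumptions: the precedence (config field, then BENCH_MIN_TOOL_CALLS, then the default of 5) is inferred from this review, and `resolve_min_tool_calls` is a hypothetical helper rather than anything in cli.py.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

DEFAULT_MIN_TOOL_CALLS = 5  # baseline floor mentioned in the review
ENV_VAR = "BENCH_MIN_TOOL_CALLS"


def resolve_min_tool_calls(
    config_value: Optional[int],
    env: Mapping[str, str] = os.environ,
) -> tuple[int, str]:
    """Return (floor, source), preferring config, then env var, then default."""
    # Compare against None, not truthiness: a configured floor of 0 is valid.
    if config_value is not None:
        if config_value < 0:
            raise ValueError("min_tool_calls must be >= 0")
        return config_value, "config"
    raw = env.get(ENV_VAR)
    if raw is not None:
        return int(raw), "env"
    return DEFAULT_MIN_TOOL_CALLS, "default"


# A floor=0 config without the field and without the env var falls back to 5.
print(resolve_min_tool_calls(None, {}))
# Baking the floor into the config makes the run self-describing.
print(resolve_min_tool_calls(0, {}))
```

Recording the returned source alongside the floor in the run artifact would be one way to detect a mislabeled ablation after the fact.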

Important Files Changed

| Filename | Overview |
|---|---|
| tests/benchmarks/_framework/config.py | Adds optional min_tool_calls: int \| None field (ge=0). |
| tests/benchmarks/_framework/cli.py | Injects config-driven MIN_TOOL_CALLS override into BenchInvestigationAgent after adapter import; gated on config.benchmark == cloudopsbench and config.min_tool_calls is not None, so other adapters are unaffected. |
| tests/benchmarks/_framework/runner.py | Adds _truncate_evidence_entries() (caps output/content/text/message keys at 2000 chars) and persists truncated list alongside the existing count in cell artifact JSON; in-memory guard continues to read the full list. |
| tests/benchmarks/cloudopsbench/analyze_validation.py | Adds B2 guard activation section; detection phrase in _b2_fired is hardcoded and not imported from false_healthy_guard, creating a silent-fail coupling if the downgrade signature ever changes. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_floor0_ablation_openai.yml | New config for floor=0 ablation but missing min_tool_calls: 0; silently runs at the default floor when the required BENCH_MIN_TOOL_CALLS=0 env var is absent, invalidating the experiment. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_floor_ablation_v2_openai.yml | Correct v2 retry using min_tool_calls: 0 to bake the floor into the config; well-documented pre-registered predictions and decision rules. |
| tests/benchmarks/cloudopsbench/tests/test_predictor.py | Adds bridge contract tests (positive/negative cases for infer_final_answer_from_opensre_text), taxonomy derivation override tests, and rate-limit retry tests; well-organized and clearly documented. |
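One way to avoid the hardcoded-phrase coupling noted for analyze_validation.py is to have the guard and the analyzer share a single constant. The sketch below keeps both sides in one file so it runs on its own; in the repository the analyzer would import the constant from false_healthy_guard. The constant's name and value and both function names are hypothetical; only the persisted final_diagnosis field comes from this review.

```python
from __future__ import annotations

# Hypothetical constant; it would live in false_healthy_guard.py.
DOWNGRADE_SIGNATURE = "[false-healthy guard]"


def downgrade_sketch(reason: str) -> str:
    """Guard side: build the downgraded diagnosis from the shared signature."""
    return f"{DOWNGRADE_SIGNATURE} unresolved: {reason}"


def b2_fired_sketch(cell: dict) -> bool:
    """Analyzer side: detect the guard via the same constant, not a copied literal."""
    return DOWNGRADE_SIGNATURE in str(cell.get("final_diagnosis", ""))


cells = [
    {"final_diagnosis": downgrade_sketch("pods in CrashLoopBackOff")},
    {"final_diagnosis": "Image pull failure on api-0"},
]
print(sum(b2_fired_sketch(cell) for cell in cells), "of", len(cells), "cells fired")
```

With a shared constant, changing the downgrade wording updates detection at the same time instead of silently zeroing the activation count.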

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CLI: _cmd_run] --> B{config.min_tool_calls not None AND benchmark == cloudopsbench?}
    B -- Yes --> C[BenchInvestigationAgent.MIN_TOOL_CALLS = config.min_tool_calls]
    B -- No --> D[Agent uses default / env-var floor]
    C --> E[BenchmarkRunner.run]
    D --> E
    E --> F[RunResult with evidence_entries]
    F --> G[_cell_to_dict]
    G --> H[_truncate_evidence_entries - cap data keys to 2000 chars]
    H --> I[Cell artifact JSON: evidence_entries_count + truncated evidence_entries]
    F --> J[B2 guard reads full in-memory evidence_entries at runtime]
    I --> K[analyze_validation.py: _b2_fired matches downgrade signature in persisted final_diagnosis]

Reviews (5): Last reviewed commit: "move configs/ + cloudopsbench AWS docs i..."

Comment thread tests/benchmarks/cloudopsbench/false_healthy_guard.py Outdated
Comment thread tests/benchmarks/cloudopsbench/false_healthy_guard.py Outdated
Comment thread tests/benchmarks/cloudopsbench/false_healthy_guard.py Outdated
Comment thread tests/benchmarks/cloudopsbench/test_predictor.py Outdated
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 8, 2026 13:48
@YauhenBichel YauhenBichel changed the title fix(bench): false-healthy guard with plumbing and bridge contract tests fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools, refactoring Jun 8, 2026
@YauhenBichel YauhenBichel changed the title fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools, refactoring fix(bench): experiments: false-healthy guard with plumbing and bridge contract tests, floor 0 tools; refactoring Jun 8, 2026
@YauhenBichel YauhenBichel merged commit c0f07fa into main Jun 8, 2026
16 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-running-openai-fixa-validation branch June 8, 2026 13:50
github-actions Bot commented Jun 8, 2026

Contributor

💼 Interviewer: describe a time you shipped something impactful.

@YauhenBichel: points at this PR

Interviewer: you're hired. 🤝




Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)
