feat(bench): investigation_a1 + translation_loss metrics #2798

Merged
YauhenBichel merged 4 commits into main from fix/2074-bench-investigation-score
Jun 11, 2026

@YauhenBichel

Collaborator

Fixes #2074

Describe the changes you have made in this PR -

Adds metrics that separate opensre's investigation quality from the LLM predictor that wraps its prose into top_3_predictions. Today's headline a1 measures the whole pipeline; a lift on a1 can't tell us whether opensre or the predictor got better.

  • investigation_a1 — single triple parsed directly from opensre's prose (report + root_cause + final_state) via the keyword bridge. Uses include_predictor_output=False (new flag on infer_final_answer_from_opensre_text) so the predictor's structured JSON cannot feed back through and contaminate the metric.
  • investigation_partial_a1 / investigation_object_a1 — partial + object-only variants on the same path.
  • translation_loss — fires when investigation_a1 == 1 AND a1 == 0. The predictor lost what opensre named.
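
The translation_loss rule in the last bullet can be sketched as a small predicate. This is a minimal illustration, not the code in scoring.py: the `CaseScores` container and both function names are hypothetical, and only the firing condition (investigation_a1 == 1 AND a1 == 0) comes from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScores:
    """Per-case binary scores (hypothetical container for this sketch)."""

    investigation_a1: int  # 1 if the triple parsed from opensre's prose is correct
    a1: int  # 1 if the predictor-mediated top answer is correct


def translation_loss(scores: CaseScores) -> int:
    """Return 1 when the investigation was right but the final answer was not."""
    return int(scores.investigation_a1 == 1 and scores.a1 == 0)


def translation_loss_rate(cases: list[CaseScores]) -> float:
    """Fraction of cases where translation loss fired; 0.0 for an empty list."""
    if not cases:
        return 0.0
    return sum(translation_loss(c) for c in cases) / len(cases)
```

Note that the flag stays 0 when both metrics are 0: a case where opensre itself missed the root cause is not counted against the predictor.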

analyze_validation.py gets a new "L0 vs L1" panel that shows, for each arm, inv_a1, a1, the gap between them, and the translation-loss rate.

The experiment chain to date is predictor-mediated. This PR lets future experiments answer "is opensre getting better?" honestly:

| Pattern | What it means |
| --- | --- |
| both metrics up | opensre got better (real win) |
| only a1 up | LLM predictor got better, not opensre |
| only investigation_a1 up | opensre got better, predictor dropping it |
| translation_loss up | predictor bleeding investigation gains |
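
One way to read the pattern table mechanically is a helper that maps metric deltas between two runs to the table's verdicts. The function name, the `eps` threshold, and the use of raw deltas are assumptions for illustration; a real comparison would more likely rest on confidence intervals than on the sign of a point estimate.

```python
from __future__ import annotations


def read_deltas(
    d_investigation_a1: float,
    d_a1: float,
    d_translation_loss: float = 0.0,
    eps: float = 0.0,
) -> list[str]:
    """Map metric deltas (new run minus baseline) to the verdicts in the table.

    `eps` is a hypothetical noise threshold: a delta counts as "up" only above it.
    """
    inv_up = d_investigation_a1 > eps
    a1_up = d_a1 > eps
    verdicts: list[str] = []
    if inv_up and a1_up:
        verdicts.append("opensre got better (real win)")
    elif a1_up:
        verdicts.append("LLM predictor got better, not opensre")
    elif inv_up:
        verdicts.append("opensre got better, predictor dropping it")
    if d_translation_loss > eps:
        verdicts.append("predictor bleeding investigation gains")
    return verdicts
```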

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

YauhenBichel marked this pull request as ready for review June 11, 2026 11:49
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR separates opensre's investigation quality from the downstream LLM predictor by adding investigation_a1 / investigation_partial_a1 / investigation_object_a1 metrics (keyword-parsed from opensre's prose with include_predictor_output=False) and a translation_loss flag (set when investigation_a1=1 but a1=0). A new "L0 vs L1" panel in analyze_validation.py prints both scores side-by-side per arm so future experiments can cleanly attribute gains to the investigator vs. the formalizer.

  • scoring.py: Adds _score_investigation_native, extends CloudOpsMetrics, imports and length-sorts _FAULT_OBJECT_SERVICES from vocabulary (addressing the per-call sorting noted in prior review), and restores the "namespace" in text precision guard with explicit documentation.
  • analyze_validation.py: Adds _investigation_a1_hit with a scored-metric primary path and a legacy-artifact fallback, expands arm detection to handle llm_alone_pure, and adds the L0/L1 panel plus paired contrast on both metrics.
  • test_scoring_investigation.py (new): Covers contamination guard, backward-compat default behavior, longest-name-first service matching, and the namespace anchor-word guard; test_suite.py adds two end-to-end translation_loss cases.

Confidence Score: 5/5

Safe to merge — the new metrics are additive, all existing callers keep backward-compatible defaults, and the contamination guard is well-tested.

The core scoring change is isolated to a new helper that cannot affect existing metrics. The include_predictor_output flag defaults to True, preserving all pre-existing caller behavior. The vocabulary import and length-sort are computed at module load. The namespace anchor guard is explicitly restored and covered by a regression test. The only finding is a NaN/misleading-verdict edge case in the exploratory analysis script when two arms share no case IDs in a stratum.
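
The backward-compatible default described above can be illustrated with a reduced sketch. `build_parse_text` and its parameters are hypothetical; only the flag name `include_predictor_output` and its default of True come from this PR, and the real infer_final_answer_from_opensre_text is not reproduced here.

```python
from __future__ import annotations


def build_parse_text(
    report: str,
    root_cause: str,
    predictor_json: str | None = None,
    include_predictor_output: bool = True,
) -> str:
    """Assemble the text that a keyword parser would scan (hypothetical helper).

    With the default (True), existing callers keep seeing the predictor's
    output. Passing False restricts the text to the investigation's own prose,
    so the predictor's answer cannot leak into an investigation-only score.
    """
    parts = [report, root_cause]
    if include_predictor_output and predictor_json:
        parts.append(predictor_json)
    return "\n".join(p for p in parts if p)
```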

No files require special attention; the analysis script NaN edge case is the only non-trivial item.
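
The NaN edge case arises when a paired contrast is computed over two arms that share no case IDs in a stratum, leaving an empty list of deltas. A guard along the following lines would make that state explicit; the function names and the dict-of-hits input shape are assumptions, not the script's actual structure.

```python
from __future__ import annotations

from statistics import mean


def paired_deltas(arm_a: dict[str, int], arm_b: dict[str, int]) -> list[int]:
    """Per-case differences (b minus a) over the case IDs both arms scored."""
    shared = sorted(arm_a.keys() & arm_b.keys())
    return [arm_b[case_id] - arm_a[case_id] for case_id in shared]


def mean_delta(deltas: list[int]) -> float | None:
    """Mean paired delta, or None when the arms share no cases.

    Returning None forces the caller to print "no shared cases" instead of a
    NaN or a verdict computed from nothing.
    """
    if not deltas:
        return None
    return mean(deltas)
```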

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/scoring.py | Adds investigation-native scoring path and translation_loss metric; imports vocabulary constants at module level and precomputes length-sorted service list; restores namespace anchor guard in _infer_fault_object with clear documentation. |
| tests/benchmarks/cloudopsbench/analyze_validation.py | Adds L0 vs L1 panel, dynamic arm detection, and paired contrast on both a1 and investigation_a1; the deltas empty-list edge produces misleading NaN output. |
| tests/benchmarks/cloudopsbench/tests/test_scoring_investigation.py | New test file covering contamination guard, backward-compat default, longest-name-first matching, namespace anchor-word guard, and conservative-floor contract. |
| tests/benchmarks/cloudopsbench/tests/test_suite.py | Adds two end-to-end translation_loss tests confirming the separation works correctly in the full score_case path. |
| tests/benchmarks/_framework/runner.py | Adds optional inv_a1 display to the per-case print line; safe, no logic changes. |
| tests/benchmarks/cloudopsbench/adapter.py | Registers the four new metrics in MetricSchema; translation_loss correctly marked higher_is_better=False. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_pod_logs_smoke_openai.yml | Updates smoke-gate comment to make investigation_a1 the primary gate criterion; documentation-only change. |

Reviews (4): Last reviewed commit: "fixed greptile issues" | Re-trigger Greptile

Comment thread tests/benchmarks/cloudopsbench/scoring.py Outdated
Comment on lines +193 to +206
for seen in (True, False, None):
    label = {True: "seen", False: "unseen", None: "all"}[seen]

    def scen_hit(
        arm: str,
        seen: bool | None = seen,
        hit_fn: Callable[[dict], int] = hit_fn,
    ) -> dict[str, float]:
        by: dict[str, list[int]] = {}
        for r in rows:
            if r["run"]["mode"] != arm:
                continue
            if seen is not None and r["case"].get("seen_shape") is not seen:
                continue

Contributor

P2 scen_hit is correctly capturing hit_fn and seen via default-arg bindings, so there is no runtime closure bug. However, redefining the function on every iteration of the for seen loop makes the capture semantics non-obvious to the next reader. Extracting scen_hit as a module-level helper with explicit parameters would make the intent immediately clear.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
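
The capture semantics discussed in this comment can be shown in isolation. The sketch below is a generic Python illustration, not code from analyze_validation.py: it contrasts a closure that reads the loop variable late, the default-argument binding the PR relies on, and the explicit-parameter helper the review suggests.

```python
from __future__ import annotations

_LABELS = {True: "seen", False: "unseen", None: "all"}


def late_bound_labels() -> list[str]:
    """Closures read `seen` when called, so all see the loop's final value."""
    fns = []
    for seen in (True, False, None):
        def label() -> str:
            return _LABELS[seen]
        fns.append(label)
    return [fn() for fn in fns]


def default_bound_labels() -> list[str]:
    """A default argument is evaluated at definition time, freezing each value."""
    fns = []
    for seen in (True, False, None):
        def label(seen: bool | None = seen) -> str:
            return _LABELS[seen]
        fns.append(label)
    return [fn() for fn in fns]


def stratum_label(seen: bool | None) -> str:
    """Module-level alternative: the value is an explicit parameter."""
    return _LABELS[seen]
```

Here `late_bound_labels()` returns `["all", "all", "all"]` while `default_bound_labels()` returns `["seen", "unseen", "all"]`. The difference only shows up when the function is called after the loop has advanced, which is why the default-arg form is safe but easy to misread.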

Comment thread tests/benchmarks/cloudopsbench/analyze_validation.py
@YauhenBichel

Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR adds three investigation-native accuracy metrics (investigation_a1, investigation_partial_a1, investigation_object_a1) and a translation_loss diagnostic that decouple opensre's investigation quality from the LLM predictor that formalizes its prose into top_3_predictions. The separation is achieved via a new include_predictor_output=False flag on infer_final_answer_from_opensre_text, which prevents the predictor's structured JSON from feeding back through the keyword parser and contaminating the investigation score.

  • _infer_fault_object is refactored to pull its closed vocabulary from predictor.vocabulary (single source of truth), and the namespace matching guard requiring "namespace" in the text is dropped — replaced by a simple substring match on the cluster name, which improves recall for namespace-GT cases where opensre never uses the word "namespace".
  • analyze_validation.py gains a new L0 vs L1 panel showing inv_a1, a1, the gap, and translation-loss rate per arm, plus paired bootstrap CIs on both metrics.
  • New tests in test_scoring_investigation.py cover the contamination guard, the zero-floor contract, and backwards-compat for existing callers; test_suite.py gets end-to-end translation-loss firing/non-firing tests.
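
For readers unfamiliar with paired bootstrap CIs, a minimal percentile-bootstrap sketch over per-case deltas follows. The summary does not say which bootstrap variant analyze_validation.py uses, so the percentile method, the resample count, and the function name are all assumptions.

```python
from __future__ import annotations

import random


def paired_bootstrap_ci(
    deltas: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-case paired deltas.

    Each delta is (metric on arm B) minus (metric on arm A) for the same case,
    so resampling deltas keeps the pairing intact.
    """
    if not deltas:
        raise ValueError("need at least one paired delta")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]
```

Running it once on a1 deltas and once on investigation_a1 deltas for the same pair of arms gives the two intervals that the panel sets side by side.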

Confidence Score: 4/5

Safe to merge; the new metrics are additive and the contamination guard is well-tested. The main change worth watching is the removal of the "namespace" keyword guard in the fault-object matcher.

The core logic — separating investigation scoring from the predictor via include_predictor_output=False — is correct and well-covered by the new test cases, including an explicit contamination guard test. The one behavioral change worth understanding is the dropped namespace-in-text guard in _infer_fault_object: non-namespace investigations where prose mentions 'boutique' or 'train-ticket' generically will now infer a namespace object instead of returning no triple, subtly changing what 'conservative lower bound' means. The vocabulary import inside the function body and the repeated sorted() call per invocation are minor quality nits with no correctness impact.

The namespace-matching change in scoring.py lines 280-282 is worth a second read to confirm the new lower-bound semantics are acceptable.
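
The behavioral difference at stake can be made concrete with a toy matcher. The two namespace names are the ones quoted in the review above; `infer_namespace` and its `require_anchor_word` switch are hypothetical and exist only to put the guarded and unguarded behaviors side by side.

```python
from __future__ import annotations

# Namespace names as quoted in the review; the real vocabulary lives elsewhere.
_NAMESPACES = ("boutique", "train-ticket")


def infer_namespace(text: str, require_anchor_word: bool = True) -> str | None:
    """Return a namespace fault object if the text names one.

    With the guard on, the word "namespace" must appear in the text, which
    trades recall for precision. With the guard off, any mention of a known
    name matches, even when the prose is about a service inside it.
    """
    lowered = text.lower()
    if require_anchor_word and "namespace" not in lowered:
        return None
    for name in _NAMESPACES:
        if name in lowered:
            return name
    return None
```

For prose such as "The boutique frontend returned 500s", the guarded form returns None while the unguarded form returns "boutique", which is the shift in lower-bound semantics the review asks the author to confirm.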

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/scoring.py | Adds investigation_a1/partial/object and translation_loss metrics; refactors _infer_fault_object to use vocabulary constants with a lazy intra-function import and re-sorts the service list on every call; removes the "namespace" keyword guard for namespace fault-object matching. |
| tests/benchmarks/cloudopsbench/analyze_validation.py | Adds L0 vs L1 panel, dynamic arm detection, paired contrasts on both a1 and investigation_a1; nested scen_hit function correctly uses default-arg capture to freeze seen/hit_fn per iteration. |
| tests/benchmarks/cloudopsbench/adapter.py | Registers new metrics in schema; translation_loss correctly has higher_is_better=False; placed in outcome_metrics alongside the accuracy metrics it's derived from. |
| tests/benchmarks/cloudopsbench/tests/test_scoring_investigation.py | New test file; comprehensive coverage of contamination guard, empty-input floor, default/flag-off behavior, namespace vocabulary, and root-cause mismatch floor contract. |
| tests/benchmarks/cloudopsbench/tests/test_suite.py | Adds two end-to-end translation_loss tests (fires and does not fire) using a canned cartservice case with correct and incorrect predictor output. |
| tests/benchmarks/_framework/runner.py | Minor display change: prints inv_a1 alongside a1 in the per-case run log when the metric is present. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_pod_logs_smoke_openai.yml | Smoke gate (a) updated to use investigation_a1 as the PRIMARY threshold; post-run analysis command added for the L0 vs L1 panel. |

Reviews (2): Last reviewed commit: "fixed namespace match issue" | Re-trigger Greptile

Comment thread tests/benchmarks/cloudopsbench/scoring.py Outdated
Comment thread tests/benchmarks/cloudopsbench/scoring.py Outdated
Comment thread tests/benchmarks/cloudopsbench/scoring.py Outdated
@YauhenBichel

Collaborator Author

@greptile review

YauhenBichel merged commit 49e81dc into main Jun 11, 2026
17 checks passed
YauhenBichel deleted the fix/2074-bench-investigation-score branch June 11, 2026 12:28
@github-actions

Contributor

💜 One more reason the project grows. Thanks @YauhenBichel — your contribution just landed!


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

1 participant