feat(bench): investigation_a1 + translation_loss metrics by YauhenBichel · Pull Request #2798 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-11T11:38:57Z

Fixes #2074

Describe the changes you have made in this PR -

Adds metrics that separate opensre's investigation quality from the LLM predictor that wraps its prose into top_3_predictions. Today's headline a1 measures the whole pipeline; a lift on a1 can't tell us whether opensre or the predictor got better.

investigation_a1 — single triple parsed directly from opensre's prose (report + root_cause + final_state) via the keyword bridge. Uses include_predictor_output=False (new flag on infer_final_answer_from_opensre_text) so the predictor's structured JSON cannot feed back through and contaminate the metric.
investigation_partial_a1 / investigation_object_a1 — partial + object-only variants on the same path.
translation_loss — fires when investigation_a1 == 1 AND a1 == 0. The predictor lost what opensre named.
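
The translation_loss rule above can be sketched as a small predicate. This is an illustration only, not the code in scoring.py: the function name `translation_loss_flag` is hypothetical, and both inputs are assumed to be 0/1 integers.

```python
def translation_loss_flag(investigation_a1: int, a1: int) -> int:
    """Return 1 when opensre's prose named the right triple but the
    predictor-mediated a1 missed it, else 0."""
    return int(investigation_a1 == 1 and a1 == 0)


# The four possible combinations of (investigation_a1, a1).
cases = [(1, 0), (1, 1), (0, 1), (0, 0)]
flags = [translation_loss_flag(inv, a1) for inv, a1 in cases]
print(flags)
```

Only the first combination sets the flag; a case where both metrics miss is an investigation failure, not a translation loss.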

analyze_validation.py gets a new "L0 vs L1" panel: for each arm it shows inv_a1, a1, the gap, and the translation-loss rate.
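
A per-arm aggregation of this kind might look like the sketch below. The flat row layout (`arm`, `investigation_a1`, `a1`) and the function name `l0_vs_l1_panel` are assumptions made for the example, and the gap is assumed to mean inv_a1 minus a1; the real panel in analyze_validation.py may define these differently.

```python
from collections import defaultdict
from typing import Dict, List


def l0_vs_l1_panel(rows: List[dict]) -> Dict[str, Dict[str, float]]:
    """Per arm: mean inv_a1, mean a1, their gap, and translation-loss rate."""
    by_arm: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        by_arm[row["arm"]].append(row)

    panel: Dict[str, Dict[str, float]] = {}
    for arm, arm_rows in by_arm.items():
        n = len(arm_rows)
        inv = sum(r["investigation_a1"] for r in arm_rows) / n
        a1 = sum(r["a1"] for r in arm_rows) / n
        # A case counts as a translation loss when L0 hit and L1 missed.
        lost = sum(
            1 for r in arm_rows if r["investigation_a1"] == 1 and r["a1"] == 0
        )
        panel[arm] = {
            "inv_a1": inv,
            "a1": a1,
            "gap": inv - a1,  # assumed sign convention
            "translation_loss_rate": lost / n,
        }
    return panel


# Hypothetical scored rows for a single arm.
rows = [
    {"arm": "opensre", "investigation_a1": 1, "a1": 1},
    {"arm": "opensre", "investigation_a1": 1, "a1": 0},
    {"arm": "opensre", "investigation_a1": 1, "a1": 0},
    {"arm": "opensre", "investigation_a1": 0, "a1": 0},
]
panel = l0_vs_l1_panel(rows)
print(panel["opensre"])
```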

Every experiment in the chain so far has been predictor-mediated. This PR lets future experiments answer "is opensre getting better?" honestly:

Pattern	What it means
both metrics up	opensre got better (real win)
only `a1` up	LLM predictor got better, not opensre
only `investigation_a1` up	opensre got better, predictor dropping it
translation_loss up	predictor bleeding investigation gains
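
The table can be read mechanically, as in the sketch below. The function name `read_pattern` and the use of a plain "greater than zero" test are assumptions for the example; a real comparison would first check that each change is outside the noise (for instance with a confidence interval) before reading it as "up".

```python
from typing import List


def read_pattern(d_a1: float, d_inv_a1: float, d_translation_loss: float) -> List[str]:
    """Map metric deltas between two runs to the readings in the table."""
    readings: List[str] = []
    a1_up = d_a1 > 0
    inv_up = d_inv_a1 > 0
    if a1_up and inv_up:
        readings.append("opensre got better (real win)")
    elif a1_up:
        readings.append("LLM predictor got better, not opensre")
    elif inv_up:
        readings.append("opensre got better, predictor dropping it")
    if d_translation_loss > 0:
        readings.append("predictor bleeding investigation gains")
    return readings


print(read_pattern(0.05, 0.04, 0.0))
print(read_pattern(0.0, 0.06, 0.03))
```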

github-actions · 2026-06-11T11:39:06Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-11T11:39:28Z

@greptile review

greptile-apps · 2026-06-11T11:52:24Z

Greptile Summary

This PR separates opensre's investigation quality from the downstream LLM predictor by adding investigation_a1 / investigation_partial_a1 / investigation_object_a1 metrics (keyword-parsed from opensre's prose with include_predictor_output=False) and a translation_loss flag (set when investigation_a1=1 but a1=0). A new "L0 vs L1" panel in analyze_validation.py prints both scores side-by-side per arm so future experiments can cleanly attribute gains to the investigator vs. the formalizer.

scoring.py: Adds _score_investigation_native, extends CloudOpsMetrics, imports and length-sorts _FAULT_OBJECT_SERVICES from vocabulary (addressing the per-call sorting noted in prior review), and restores the "namespace" in text precision guard with explicit documentation.
analyze_validation.py: Adds _investigation_a1_hit with a scored-metric primary path and a legacy-artifact fallback, expands arm detection to handle llm_alone_pure, and adds the L0/L1 panel plus paired contrast on both metrics.
test_scoring_investigation.py (new): Covers contamination guard, backward-compat default behavior, longest-name-first service matching, and the namespace anchor-word guard; test_suite.py adds two end-to-end translation_loss cases.

Confidence Score: 5/5

Safe to merge — the new metrics are additive, all existing callers keep backward-compatible defaults, and the contamination guard is well-tested.

The core scoring change is isolated to a new helper that cannot affect existing metrics. The include_predictor_output flag defaults to True, preserving all pre-existing caller behavior. The vocabulary import and length-sort are computed at module load. The namespace anchor guard is explicitly restored and covered by a regression test. The only finding is a NaN/misleading-verdict edge case in the exploratory analysis script when two arms share no case IDs in a stratum.

No files require special attention; the analysis script NaN edge case is the only non-trivial item.
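
One way to avoid a NaN or misleading verdict when two arms share no case IDs is to report the empty pairing explicitly. The sketch below is not the script's code: the function names and the `{case_id: hit}` input shape are assumptions.

```python
from statistics import mean
from typing import Dict, List


def paired_deltas(hits_a: Dict[str, int], hits_b: Dict[str, int]) -> List[int]:
    """Per-case differences over the case IDs both arms actually ran."""
    shared = sorted(hits_a.keys() & hits_b.keys())
    return [hits_a[case_id] - hits_b[case_id] for case_id in shared]


def describe_contrast(hits_a: Dict[str, int], hits_b: Dict[str, int]) -> str:
    deltas = paired_deltas(hits_a, hits_b)
    if not deltas:
        # No overlap: say so instead of averaging an empty list.
        return "n/a (no shared case IDs)"
    return f"mean delta {mean(deltas):+.2f} over {len(deltas)} paired cases"


print(describe_contrast({"c1": 1}, {"c2": 0}))
print(describe_contrast({"c1": 1, "c2": 0}, {"c1": 0, "c2": 0}))
```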

Important Files Changed

Filename	Overview
tests/benchmarks/cloudopsbench/scoring.py	Adds investigation-native scoring path and translation_loss metric; imports vocabulary constants at module level and precomputes length-sorted service list; restores namespace anchor guard in _infer_fault_object with clear documentation.
tests/benchmarks/cloudopsbench/analyze_validation.py	Adds L0 vs L1 panel, dynamic arm detection, and paired contrast on both a1 and investigation_a1; the deltas empty-list edge produces misleading NaN output.
tests/benchmarks/cloudopsbench/tests/test_scoring_investigation.py	New test file covering contamination guard, backward-compat default, longest-name-first matching, namespace anchor-word guard, and conservative-floor contract.
tests/benchmarks/cloudopsbench/tests/test_suite.py	Adds two end-to-end translation_loss tests confirming the separation works correctly in the full score_case path.
tests/benchmarks/_framework/runner.py	Adds optional inv_a1 display to the per-case print line; safe, no logic changes.
tests/benchmarks/cloudopsbench/adapter.py	Registers the four new metrics in MetricSchema; translation_loss correctly marked higher_is_better=False.
tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_pod_logs_smoke_openai.yml	Updates smoke-gate comment to make investigation_a1 the primary gate criterion; documentation-only change.

Reviews (4): Last reviewed commit: "fixed greptile issues"

greptile-apps · 2026-06-11T11:52:29Z

+            for seen in (True, False, None):
+                label = {True: "seen", False: "unseen", None: "all"}[seen]
+
+                def scen_hit(
+                    arm: str,
+                    seen: bool | None = seen,
+                    hit_fn: Callable[[dict], int] = hit_fn,
+                ) -> dict[str, float]:
+                    by: dict[str, list[int]] = {}
+                    for r in rows:
+                        if r["run"]["mode"] != arm:
+                            continue
+                        if seen is not None and r["case"].get("seen_shape") is not seen:
+                            continue


scen_hit is correctly capturing hit_fn and seen via default-arg bindings, so there is no runtime closure bug. However, redefining the function on every iteration of the for seen loop makes the capture semantics non-obvious to the next reader. Extracting scen_hit as a module-level helper with explicit parameters would make the intent immediately clear.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
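
A module-level version of the helper, as suggested, might look like the sketch below. The diff excerpt is truncated, so everything after the two filters is an assumption: the grouping key `r["case"]["scenario"]`, the per-scenario mean, and the `metrics` field used by the example `hit_fn` are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def scen_hit(
    rows: List[dict],
    arm: str,
    seen: Optional[bool],
    hit_fn: Callable[[dict], int],
) -> Dict[str, float]:
    """Per-scenario hit rate for one arm, optionally filtered by seen_shape."""
    by: Dict[str, List[int]] = {}
    for r in rows:
        if r["run"]["mode"] != arm:
            continue
        if seen is not None and r["case"].get("seen_shape") is not seen:
            continue
        # Assumed grouping key; the original excerpt ends before this point.
        by.setdefault(r["case"]["scenario"], []).append(hit_fn(r))
    return {scenario: sum(hits) / len(hits) for scenario, hits in by.items()}


def a1_hit(r: dict) -> int:
    return r["metrics"]["a1"]  # hypothetical field layout


rows = [
    {"run": {"mode": "opensre"}, "case": {"scenario": "s1", "seen_shape": True}, "metrics": {"a1": 1}},
    {"run": {"mode": "opensre"}, "case": {"scenario": "s1", "seen_shape": True}, "metrics": {"a1": 0}},
    {"run": {"mode": "opensre"}, "case": {"scenario": "s2", "seen_shape": False}, "metrics": {"a1": 1}},
    {"run": {"mode": "llm_alone"}, "case": {"scenario": "s1", "seen_shape": True}, "metrics": {"a1": 0}},
]

for seen in (True, False, None):
    print(seen, scen_hit(rows, "opensre", seen, a1_hit))
```

Passing `seen` and `hit_fn` as ordinary arguments removes the need for default-argument capture inside the loop.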

YauhenBichel · 2026-06-11T11:58:54Z

@greptile review

greptile-apps · 2026-06-11T12:01:38Z

Greptile Summary

This PR adds three investigation-native accuracy metrics (investigation_a1, investigation_partial_a1, investigation_object_a1) and a translation_loss diagnostic that decouple opensre's investigation quality from the LLM predictor that formalizes its prose into top_3_predictions. The separation is achieved via a new include_predictor_output=False flag on infer_final_answer_from_opensre_text, which prevents the predictor's structured JSON from feeding back through the keyword parser and contaminating the investigation score.
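
The contamination guard can be pictured as follows. This is a simplified stand-in, not the signature or body of infer_final_answer_from_opensre_text: the function `gather_text`, the result-dict layout, and the way predictions are serialized are all assumptions for the example.

```python
import json


def gather_text(result: dict, include_predictor_output: bool = True) -> str:
    """Collect the text a keyword parser would scan.

    With include_predictor_output=False only opensre's own prose is
    returned, so the predictor's structured output cannot leak in.
    """
    parts = [str(result.get("report", "")), str(result.get("root_cause", ""))]
    if include_predictor_output:
        parts.append(json.dumps(result.get("top_3_predictions", [])))
    return "\n".join(parts)


# Hypothetical case: the prose never names the service, the predictor does.
result = {
    "report": "Requests are failing with timeouts.",
    "root_cause": "unknown",
    "top_3_predictions": [{"fault_object": "cartservice"}],
}

print("cartservice" in gather_text(result))
print("cartservice" in gather_text(result, include_predictor_output=False))
```

With the flag off, a parser run over this text cannot credit the investigation for a name that only the predictor supplied.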

_infer_fault_object is refactored to pull its closed vocabulary from predictor.vocabulary (single source of truth), and the namespace matching guard requiring "namespace" in the text is dropped — replaced by a simple substring match on the cluster name, which improves recall for namespace-GT cases where opensre never uses the word "namespace".
analyze_validation.py gains a new L0 vs L1 panel showing inv_a1, a1, the gap, and translation-loss rate per arm, plus paired bootstrap CIs on both metrics.
New tests in test_scoring_investigation.py cover the contamination guard, the zero-floor contract, and backwards-compat for existing callers; test_suite.py gets end-to-end translation-loss firing/non-firing tests.
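
For readers unfamiliar with paired bootstrap intervals, a minimal percentile-bootstrap sketch follows. It is a generic illustration under assumed defaults (2000 resamples, 95% interval, fixed seed), not the implementation in analyze_validation.py.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    deltas: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-case paired differences."""
    if not deltas:
        raise ValueError("need at least one paired delta")
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Each delta is (hit in arm A) - (hit in arm B) on the same case.
lo, hi = paired_bootstrap_ci([1, 0, -1, 1, 0, 1])
print(lo, hi)
```

Resampling the per-case differences, rather than each arm separately, keeps the pairing between arms intact.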

Confidence Score: 4/5

Safe to merge; the new metrics are additive and the contamination guard is well-tested. The main change worth watching is the removal of the "namespace" keyword guard in the fault-object matcher.

The core logic — separating investigation scoring from the predictor via include_predictor_output=False — is correct and well-covered by the new test cases, including an explicit contamination guard test. The one behavioral change worth understanding is the dropped namespace-in-text guard in _infer_fault_object: non-namespace investigations where prose mentions 'boutique' or 'train-ticket' generically will now infer a namespace object instead of returning no triple, subtly changing what 'conservative lower bound' means. The vocabulary import inside the function body and the repeated sorted() call per invocation are minor quality nits with no correctness impact.

The namespace-matching change in scoring.py lines 280-282 is worth a second read to confirm the new lower-bound semantics are acceptable.
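
The trade-off described in this review can be seen in a toy matcher. The namespace names come from the comment above; the function `infer_namespace` and its `require_anchor_word` switch are hypothetical and only contrast the two behaviors.

```python
from typing import Optional

NAMESPACES = ("boutique", "train-ticket")


def infer_namespace(text: str, require_anchor_word: bool) -> Optional[str]:
    """Return a namespace fault object, optionally only when the prose
    also contains the word "namespace"."""
    lowered = text.lower()
    if require_anchor_word and "namespace" not in lowered:
        return None
    for ns in NAMESPACES:
        if ns in lowered:
            return ns
    return None


generic = "The boutique frontend returned 500s after the deploy."
explicit = "Quota was exhausted in namespace boutique."

print(infer_namespace(generic, require_anchor_word=False))
print(infer_namespace(generic, require_anchor_word=True))
print(infer_namespace(explicit, require_anchor_word=True))
```

Plain substring matching improves recall but turns a generic mention into a namespace object; the anchor-word guard trades some recall for precision.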

Important Files Changed

Filename	Overview
tests/benchmarks/cloudopsbench/scoring.py	Adds investigation_a1/partial/object and translation_loss metrics; refactors _infer_fault_object to use vocabulary constants with a lazy intra-function import and re-sorts the service list on every call; removes the "namespace" keyword guard for namespace fault-object matching.
tests/benchmarks/cloudopsbench/analyze_validation.py	Adds L0 vs L1 panel, dynamic arm detection, paired contrasts on both a1 and investigation_a1; nested scen_hit function correctly uses default-arg capture to freeze seen/hit_fn per iteration.
tests/benchmarks/cloudopsbench/adapter.py	Registers new metrics in schema; translation_loss correctly has higher_is_better=False; placed in outcome_metrics alongside the accuracy metrics it's derived from.
tests/benchmarks/cloudopsbench/tests/test_scoring_investigation.py	New test file; comprehensive coverage of contamination guard, empty-input floor, default/flag-off behavior, namespace vocabulary, and root-cause mismatch floor contract.
tests/benchmarks/cloudopsbench/tests/test_suite.py	Adds two end-to-end translation_loss tests (fires and does not fire) using a canned cartservice case with correct and incorrect predictor output.
tests/benchmarks/_framework/runner.py	Minor display change: prints inv_a1 alongside a1 in the per-case run log when the metric is present.
tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_pod_logs_smoke_openai.yml	Smoke gate (a) updated to use investigation_a1 as the PRIMARY threshold; post-run analysis command added for the L0 vs L1 panel.

Reviews (2): Last reviewed commit: "fixed namespace match issue"

YauhenBichel · 2026-06-11T12:10:37Z

@greptile review

github-actions · 2026-06-11T12:28:15Z

💜 One more reason the project grows. Thanks @YauhenBichel — your contribution just landed!

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

added L0 level for opensre score (8dac68c)

fix format (459848d)

YauhenBichel marked this pull request as ready for review June 11, 2026 11:49

greptile-apps Bot reviewed Jun 11, 2026

fixed namespace match issue (0086adc)

greptile-apps Bot reviewed Jun 11, 2026

Comment thread tests/benchmarks/cloudopsbench/scoring.py Outdated

Comment thread tests/benchmarks/cloudopsbench/scoring.py Outdated

Comment thread tests/benchmarks/cloudopsbench/scoring.py Outdated

fixed greptile issues (bbdd0a9)

YauhenBichel merged commit 49e81dc into main Jun 11, 2026
17 checks passed

YauhenBichel deleted the fix/2074-bench-investigation-score branch June 11, 2026 12:28

YauhenBichel mentioned this pull request Jun 12, 2026

feat(bench): full-N readiness + SHA capture fix #2799

Merged

13 tasks

YauhenBichel merged 4 commits into main from fix/2074-bench-investigation-score
