feat(bench): investigation_a1 + translation_loss metrics#2798
Conversation
Greptile code reviewThis repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md. Run a review — add a PR comment with: Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5. Optional: automate with the greploop skill. |
|
@greptile review |
Greptile SummaryThis PR separates opensre's investigation quality from the downstream LLM predictor by adding
Confidence Score: 5/5Safe to merge — the new metrics are additive, all existing callers keep backward-compatible defaults, and the contamination guard is well-tested. The core scoring change is isolated to a new helper that cannot affect existing metrics. The include_predictor_output flag defaults to True, preserving all pre-existing caller behavior. The vocabulary import and length-sort are computed at module load. The namespace anchor guard is explicitly restored and covered by a regression test. The only finding is a NaN/misleading-verdict edge case in the exploratory analysis script when two arms share no case IDs in a stratum. No files require special attention; the analysis script NaN edge case is the only non-trivial item. Important Files Changed
Reviews (4): Last reviewed commit: "fixed greptile issues" | Re-trigger Greptile |
| for seen in (True, False, None): | ||
| label = {True: "seen", False: "unseen", None: "all"}[seen] | ||
|
|
||
| def scen_hit( | ||
| arm: str, | ||
| seen: bool | None = seen, | ||
| hit_fn: Callable[[dict], int] = hit_fn, | ||
| ) -> dict[str, float]: | ||
| by: dict[str, list[int]] = {} | ||
| for r in rows: | ||
| if r["run"]["mode"] != arm: | ||
| continue | ||
| if seen is not None and r["case"].get("seen_shape") is not seen: | ||
| continue |
There was a problem hiding this comment.
scen_hit is correctly capturing hit_fn and seen via default-arg bindings, so there is no runtime closure bug. However, redefining the function on every iteration of the for seen loop makes the capture semantics non-obvious to the next reader. Extracting scen_hit as a module-level helper with explicit parameters would make the intent immediately clear.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
@greptile review |
Greptile SummaryThis PR adds three investigation-native accuracy metrics (
Confidence Score: 4/5Safe to merge; the new metrics are additive and the contamination guard is well-tested. The main change worth watching is the removal of the "namespace" keyword guard in the fault-object matcher. The core logic — separating investigation scoring from the predictor via include_predictor_output=False — is correct and well-covered by the new test cases, including an explicit contamination guard test. The one behavioral change worth understanding is the dropped namespace-in-text guard in _infer_fault_object: non-namespace investigations where prose mentions 'boutique' or 'train-ticket' generically will now infer a namespace object instead of returning no triple, subtly changing what 'conservative lower bound' means. The vocabulary import inside the function body and the repeated sorted() call per invocation are minor quality nits with no correctness impact. The namespace-matching change in scoring.py lines 280-282 is worth a second read to confirm the new lower-bound semantics are acceptable. Important Files Changed
Reviews (2): Last reviewed commit: "fixed namespace match issue" | Re-trigger Greptile |
|
@greptile review |
|
💜 One more reason the project grows. Thanks @YauhenBichel — your contribution just landed! 👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome. |

Fixes #2074
Describe the changes you have made in this PR -
Adds metrics that separate opensre's investigation quality from the LLM predictor that wraps its prose into
top_3_predictions. Today's headlinea1measures the whole pipeline; a lift ona1can't tell us whether opensre or the predictor got better.investigation_a1— single triple parsed directly from opensre's prose (report+root_cause+final_state) via the keyword bridge. Usesinclude_predictor_output=False(new flag oninfer_final_answer_from_opensre_text) so the predictor's structured JSON cannot feed back through and contaminate the metric.investigation_partial_a1/investigation_object_a1— partial + object-only variants on the same path.translation_loss— fires wheninvestigation_a1 == 1ANDa1 == 0. The predictor lost what opensre named.analyze_validation.pygets a new "L0 vs L1" panel: per arm showsinv_a1,a1, the gap, and translation-loss rate.The experiment chain to date is predictor-mediated; this
PR is what makes future experiments answer "is opensre getting better?"
honestly:
a1upinvestigation_a1upCode Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Checklist before requesting a review
Note: Please check Allow edits from maintainers if you would like us to assist in the PR.