fix(bench): fixed vocabulary and prepared config for next experiment by YauhenBichel · Pull Request #2787 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-09T12:07:24Z

Fixes #2074

- Fix predictor vocabulary completeness bug capping A@1 on 33.7% of cells (Runtime/Performance/Startup strata)
- Fix two Greptile P1 findings: CaseFilters.seed silent drop + silent ImportError in adapter registry
- Add missing llm_alone arm to the trimmed-prompt definitive config
- Land the framework split + predictor package refactor that surfaced the registry / seed bugs during review
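
The description does not show how the registry fix was done. As a general illustration of the "silent ImportError" failure mode, here is a minimal sketch of a loader that surfaces import failures instead of swallowing them. The function name `load_adapter` and the module names are hypothetical and are not taken from the opensre codebase.

```python
import importlib


def load_adapter(module_name):
    """Import an adapter module, surfacing failures instead of hiding them.

    Hypothetical helper for illustration; not the registry code from this PR.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with context rather than returning None or skipping the
        # adapter, so a missing dependency cannot go unnoticed.
        raise RuntimeError(
            f"adapter module {module_name!r} failed to import"
        ) from exc


# A module that exists loads normally.
adapter = load_adapter("json")
print(adapter.__name__)

# A module that does not exist fails loudly, with the original error attached.
try:
    load_adapter("definitely_missing_adapter_module")
except RuntimeError as err:
    print(f"{err} (cause: {type(err.__cause__).__name__})")
```

Chaining with `from exc` keeps the original traceback available, which helps when the underlying cause is a missing optional dependency rather than a missing adapter.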

Describe the changes you have made in this PR -

Root cause — vocabulary completeness

Loss-mode triage on the n=2118 case-level results from dev-2026-06-09T10-09-04Z_cloudopsbench (full N, trimmed prompt) found that 62% of opensre+llm Runtime losses are OBJECT_MISS (the predicted service is wrong entirely). Drilling in, GT=app/tsdb-mysql accounts for 90 cells at A@1=0.00: the predictor substituted the nearest in-vocab service (ts-inside-payment-service) every time.

_FAULT_OBJECT_SERVICES listed 19 services, but the dataset has 41 distinct fault_object values. The system prompt presents those 19 as "service is one of: ...", so the LLM treats them as a closed set. As a result, 714 of 2118 cells (33.7%) had a GT not in the vocab.

| Stratum | Cells with GT not in vocab | % of stratum affected |
| --- | --- | --- |
| Admission | 0 | 0.0% |
| Performance | 180 | 44.1% |
| Runtime | 426 | 50.4% |
| Startup | 108 | 20.9% |
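
The coverage check behind this table can be sketched as follows. This is a minimal illustration, not the code in the PR: the `vocab_gaps` helper, the record layout, and the tiny `cases` and `vocab` samples are made up for the example, and the service names are written without the `app/` prefix for simplicity.

```python
from collections import Counter


def vocab_gaps(cases, vocab):
    """Return (services missing from vocab, per-stratum (missed, total))."""
    missing = set()
    missed = Counter()
    total = Counter()
    for case in cases:
        total[case["stratum"]] += 1
        if case["fault_object"] not in vocab:
            missing.add(case["fault_object"])
            missed[case["stratum"]] += 1
    return missing, {s: (missed[s], total[s]) for s in total}


# Hypothetical sample; the real run has 2118 cells and a much larger vocab.
cases = [
    {"stratum": "Runtime", "fault_object": "tsdb-mysql"},
    {"stratum": "Runtime", "fault_object": "ts-inside-payment-service"},
    {"stratum": "Admission", "fault_object": "ts-inside-payment-service"},
]
vocab = {"ts-inside-payment-service"}

missing, per_stratum = vocab_gaps(cases, vocab)
print("missing from vocab:", sorted(missing))
for stratum, (n_missed, n_total) in sorted(per_stratum.items()):
    print(f"{stratum}: {n_missed}/{n_total} ({n_missed / n_total:.1%})")
```

Running a check like this against the full dataset before each experiment would flag any ground-truth value the predictor cannot emit.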

Predicted impact

| Metric | Current | Predicted post-fix |
| --- | --- | --- |
| opensre+llm aggregate A@1 | 0.65 | 0.70-0.73 |
| llm_alone_pure aggregate A@1 | 0.62 | 0.67-0.70 |
| opensre+llm Runtime A@1 | 0.43 | 0.55-0.65 |
| Δ(opensre+llm − llm_alone_pure) | +0.03 (ns) | ~+0.03 (both arms gain equally) |

The vocab fix is a scoring-artifact correction, not an opensre-side improvement. Both arms gain equally. The structural opensre-vs-LLM-alone question remains for the Tier 1 experiments scoped in the prior issue comment.
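
One way to read these numbers is against the ceiling the old vocabulary imposed. The sketch below assumes that a cell whose GT is outside the vocabulary can never score a hit, so the best achievable A@1 is the in-vocab share of cells. That assumption is a simplification, and the function name is illustrative.

```python
def a1_ceiling(total_cells, out_of_vocab_cells):
    """Upper bound on A@1 if every out-of-vocab cell scores zero."""
    return (total_cells - out_of_vocab_cells) / total_cells


# Aggregate: 714 of 2118 cells had a GT outside the 19-service vocab.
aggregate_ceiling = a1_ceiling(2118, 714)

# The per-stratum table reports percentages rather than stratum sizes,
# so the Runtime bound is taken directly as 1 - 0.504.
runtime_ceiling = 1 - 0.504

print(f"aggregate ceiling: {aggregate_ceiling:.3f}")
print(f"Runtime ceiling:   {runtime_ceiling:.3f}")
```

Under this assumption the reported current values for opensre+llm (0.65 aggregate, 0.43 Runtime) sit below the bounds of roughly 0.663 and 0.496.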

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-09T12:07:34Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-09T12:07:46Z

@greptile review

greptile-apps · 2026-06-09T12:10:36Z

Greptile Summary

This PR fixes a vocabulary completeness bug in the cloudopsbench predictor where 33.7% of benchmark cells (714/2118) had a ground-truth fault_object not present in _FAULT_OBJECT_SERVICES. On those cells the LLM was forced to substitute the nearest in-vocab service, which pinned A@1 at zero. It also adds the missing llm_alone arm to the definitive trimmed-prompt experiment config.

- vocabulary.py: Expands _FAULT_OBJECT_SERVICES from 19 to 43 entries by adding the full train-ticket service mesh (24 entries including tsdb-mysql), eliminating the Runtime/Performance/Startup stratum penalty from vocab-miss substitution.
- cloudopsbench_definitive_trimmed_prompt_openai.yml: Adds llm_alone as a third experiment arm alongside opensre+llm and llm_alone_pure, enabling the (llm_alone) − (llm_alone_pure) contrast that isolates opensre's prompt contribution.

Confidence Score: 5/5

Safe to merge — vocabulary expansion is a pure additive correction to a static tuple, and the YAML arm addition is isolated to experiment configuration with no runtime side effects on existing arms.

Both changes are narrow and well-justified: the vocabulary tuple is extended with services already present in the benchmark dataset, and the config YAML gains one extra mode that does not alter the existing two arms. No logic, data flow, or scoring paths are modified.

No files require special attention — the only open item is a stale header comment in the YAML config that could mislead experiment-design readers.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/predictor/vocabulary.py | Expands _FAULT_OBJECT_SERVICES from 19 to 43 entries by adding all missing train-ticket services and tsdb-mysql; fixes the 33.7% GT-not-in-vocab bug that capped A@1 on Runtime/Performance/Startup strata |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_definitive_trimmed_prompt_openai.yml | Adds the missing llm_alone mode arm; header comment becomes stale (claims single-variable change from floor0 baseline but two variables now differ); cost comment on line 58 also stale with the third arm added |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LLM emits fault_object] --> B{snap_fault_object}
    B --> C{Service in vocab?}
    C -->|exact match| D[Return canonical service]
    C -->|no match - fuzzy snap| E[Nearest in-vocab substitute]
    E --> F[Wrong service scored - A1 zero on affected cells]
    D --> G[Correct score]

    subgraph Before - 19 services
        E
        F
    end

    subgraph After - 43 services
        D
        G
    end
```
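
The snapping behaviour in the flowchart can be illustrated with a small sketch. This is a hypothetical re-implementation using `difflib` from the standard library, not the project's actual `snap_fault_object`; the two-entry vocabulary is a placeholder, and `ts-order-service` is included only as an example entry.

```python
import difflib


def snap_fault_object(raw, vocab):
    """Map an LLM-emitted service name onto the vocabulary.

    Illustrative stand-in only; the real function may match differently.
    """
    if raw in vocab:
        return raw  # exact match: keep the canonical name
    # cutoff=0.0 forces a substitute even when the best match is poor,
    # which is how an out-of-vocab GT ends up scored as the wrong service.
    return difflib.get_close_matches(raw, vocab, n=1, cutoff=0.0)[0]


old_vocab = ("ts-inside-payment-service", "ts-order-service")
new_vocab = old_vocab + ("tsdb-mysql",)

before = snap_fault_object("tsdb-mysql", old_vocab)
after = snap_fault_object("tsdb-mysql", new_vocab)
print(f"before fix: {before}")
print(f"after fix:  {after}")
```

With the smaller vocabulary the function has no choice but to return some other service, so the prediction can never equal the ground truth. Once the name is in the vocabulary it passes through unchanged.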

Reviews (2): Last reviewed commit: "fixed vocabulary and prepared config for..."

github-actions · 2026-06-09T13:18:38Z

🏄 Some PRs rot in review for six weeks. @YauhenBichel's PR said "not today" and merged like it owned the place. 🌊

👋 Join us on Discord - OpenSRE: hang out, contribute, or hunt for features and issues. Everyone's welcome.

fixed vocabulary and prepared config for next experiment

8e691c5

YauhenBichel marked this pull request as ready for review June 9, 2026 13:18

YauhenBichel merged commit e2aadf2 into main Jun 9, 2026
22 checks passed

YauhenBichel deleted the fix/2074-bench-exp-runtime-case-data branch June 9, 2026 13:18
