fix(bench): fixed vocabulary and prepared config for next experiment #2787

Merged
YauhenBichel merged 1 commit into main from fix/2074-bench-exp-runtime-case-data
Jun 9, 2026

Conversation

@YauhenBichel

Collaborator

Fixes #2074

  • Fix predictor vocabulary completeness bug capping A@1 on 33.7% of cells (Runtime/Performance/Startup strata)
  • Fix two greptile P1 findings: CaseFilters.seed silent drop + silent ImportError in adapter registry
  • Add missing llm_alone arm to the trimmed-prompt definitive config
  • Land the framework split + predictor package refactor that surfaced the registry / seed bugs during review
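A minimal sketch of the bug class behind the "silent ImportError in adapter registry" finding, not the repository's code: a registry lookup that re-raises import failures with context instead of swallowing them. The names `load_adapter`, `AdapterImportError`, and the registry layout are hypothetical; only the standard-library `importlib.import_module` call is real.

```python
import importlib


class AdapterImportError(RuntimeError):
    """Raised when a registered adapter module cannot be imported."""


def load_adapter(name, registry):
    """Import the module registered under `name`, failing loudly on errors.

    `registry` maps adapter names to importable module paths (hypothetical layout).
    """
    if name not in registry:
        raise KeyError(f"unknown adapter {name!r}; known: {sorted(registry)}")
    module_path = registry[name]
    try:
        return importlib.import_module(module_path)
    except ImportError as exc:
        # Surface the failure instead of returning None or skipping the adapter.
        raise AdapterImportError(
            f"adapter {name!r} ({module_path}) failed to import: {exc}"
        ) from exc


# Toy registry: one importable stdlib module, one deliberately missing module.
registry = {"json": "json", "broken": "no_such_adapter_module_xyz"}
json_adapter = load_adapter("json", registry)

try:
    load_adapter("broken", registry)
except AdapterImportError as exc:
    failure = str(exc)
    print(failure)
```

Raising a dedicated exception keeps a missing optional dependency from looking like "adapter not registered" further down the pipeline.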

Describe the changes you have made in this PR -

Root cause — vocabulary completeness

Loss-mode triage on the n=2118 case-level results from dev-2026-06-09T10-09-04Z_cloudopsbench (full N, trimmed prompt) found that 62% of opensre+llm Runtime losses are OBJECT_MISS (wrong service entirely). Drilling in: GT=app/tsdb-mysql accounts for 90 cells at A@1=0.00 — the predictor was substituting the nearest in-vocab service (ts-inside-payment-service) every time.

_FAULT_OBJECT_SERVICES listed 19 services. The dataset has 41 distinct fault_object values. The system prompt declares those 19 as "service is one of: ..." so the LLM treats it as a closed set. 714 of 2118 cells (33.7%) had a GT not in the vocab.

| Stratum | Cells with GT not in vocab | % affected |
|---|---|---|
| Admission | 0 | 0.0% |
| Performance | 180 | 44.1% |
| Runtime | 426 | 50.4% |
| Startup | 108 | 20.9% |
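The per-stratum counts above amount to a set-membership check of each ground-truth `fault_object` against the vocabulary. A rough sketch of that check follows; the record keys (`stratum`, `fault_object`), the helper name `vocab_coverage`, and the toy rows are assumptions for illustration, not the benchmark's real schema or data.

```python
from collections import Counter


def vocab_coverage(cases, vocab):
    """Return {stratum: (missing, total)} counting ground truths outside `vocab`."""
    vocab = set(vocab)
    total = Counter()
    missing = Counter()
    for case in cases:
        total[case["stratum"]] += 1
        if case["fault_object"] not in vocab:
            missing[case["stratum"]] += 1
    return {stratum: (missing[stratum], total[stratum]) for stratum in total}


# Toy rows only; the real run had 2118 cells.
cases = [
    {"stratum": "Runtime", "fault_object": "tsdb-mysql"},
    {"stratum": "Runtime", "fault_object": "ts-inside-payment-service"},
    {"stratum": "Admission", "fault_object": "ts-inside-payment-service"},
]
vocab = ["ts-inside-payment-service"]

report = vocab_coverage(cases, vocab)
for stratum, (miss, n) in sorted(report.items()):
    print(f"{stratum}: {miss}/{n} ({miss / n:.1%}) ground truths not in vocab")
```

Running a check like this before each experiment would flag a vocabulary that lags behind the dataset before any LLM calls are made.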

Predicted impact

| Metric | Current | Predicted post-fix |
|---|---|---|
| opensre+llm aggregate A@1 | 0.65 | 0.70-0.73 |
| llm_alone_pure aggregate A@1 | 0.62 | 0.67-0.70 |
| opensre+llm Runtime A@1 | 0.43 | 0.55-0.65 |
| Δ(opensre+llm − llm_alone_pure) | +0.03 (ns) | ~+0.03 (both arms gain equally) |

The vocab fix is a scoring-artifact correction, not an opensre-side improvement. Both arms gain equally. The structural opensre-vs-LLM-alone question remains for the Tier 1 experiments scoped in the prior issue comment.
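The PR does not say which test produced the "ns" label on the +0.03 delta. One common option for two arms scored on the same cells is a paired bootstrap over per-cell hit differences, sketched here under that assumption; the function name and the toy hit vectors are hypothetical.

```python
import random


def paired_bootstrap_delta(hits_a, hits_b, n_resamples=2000, seed=0):
    """Return (observed delta, 95% CI low, 95% CI high) for mean(a) - mean(b).

    `hits_a` and `hits_b` are 0/1 A@1 outcomes for the same cells, in the same order.
    """
    if len(hits_a) != len(hits_b) or not hits_a:
        raise ValueError("arms must be scored on the same non-empty set of cells")
    n = len(hits_a)
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample cells with replacement, keeping each cell's pairing intact.
        sample = rng.choices(diffs, k=n)
        deltas.append(sum(sample) / n)
    deltas.sort()
    low = deltas[int(0.025 * n_resamples)]
    high = deltas[int(0.975 * n_resamples) - 1]
    return observed, low, high


# Toy outcomes for ten cells, not benchmark data.
arm_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
arm_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
observed, low, high = paired_bootstrap_delta(arm_a, arm_b)
print(f"delta={observed:+.2f}, 95% CI=[{low:+.2f}, {high:+.2f}]")
```

If the interval contains zero, the delta would be read as not significant. Because the vocabulary fix shifts both arms on the same cells, the paired differences, and therefore this interval, would be expected to change little.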

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@YauhenBichel

Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This PR fixes a vocabulary completeness bug in the cloudopsbench predictor where 33.7% of benchmark cells (714/2118) had a ground-truth fault_object not present in _FAULT_OBJECT_SERVICES, forcing the LLM to substitute the nearest in-vocab service and capping A@1 to zero on those cells. It also adds the missing llm_alone arm to the definitive trimmed-prompt experiment config.

  • vocabulary.py: Expands _FAULT_OBJECT_SERVICES from 19 to 43 entries by adding the full train-ticket service mesh (24 entries including tsdb-mysql), eliminating the Runtime/Performance/Startup stratum penalty from vocab-miss substitution.
  • cloudopsbench_definitive_trimmed_prompt_openai.yml: Adds llm_alone as a third experiment arm alongside opensre+llm and llm_alone_pure, enabling the (llm_alone) − (llm_alone_pure) contrast that isolates opensre's prompt contribution.

Confidence Score: 5/5

Safe to merge — vocabulary expansion is a pure additive correction to a static tuple, and the YAML arm addition is isolated to experiment configuration with no runtime side effects on existing arms.

Both changes are narrow and well-justified: the vocabulary tuple is extended with services already present in the benchmark dataset, and the config YAML gains one extra mode that does not alter the existing two arms. No logic, data flow, or scoring paths are modified.

No files require special attention — the only open item is a stale header comment in the YAML config that could mislead experiment-design readers.

Important Files Changed

| Filename | Overview |
|---|---|
| tests/benchmarks/cloudopsbench/predictor/vocabulary.py | Expands _FAULT_OBJECT_SERVICES from 19 to 43 entries by adding all missing train-ticket services and tsdb-mysql; fixes the 33.7% GT-not-in-vocab bug that capped A@1 on Runtime/Performance/Startup strata |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_definitive_trimmed_prompt_openai.yml | Adds the missing llm_alone mode arm; header comment becomes stale (claims single-variable change from floor0 baseline but two variables now differ); cost comment on line 58 also stale with the third arm added |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LLM emits fault_object] --> B{snap_fault_object}
    B --> C{Service in vocab?}
    C -->|exact match| D[Return canonical service]
    C -->|no match - fuzzy snap| E[Nearest in-vocab substitute]
    E --> F[Wrong service scored - A1 zero on affected cells]
    D --> G[Correct score]

    subgraph Before - 19 services
        E
        F
    end

    subgraph After - 43 services
        D
        G
    end
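The flowchart's failure mode can be reproduced with a small fuzzy-snap sketch. This is not the repository's `snap_fault_object`; it assumes nearest-match snapping via the standard-library `difflib.get_close_matches`, and the two-entry vocabulary is a toy stand-in (the second service name is illustrative only).

```python
import difflib


def snap_to_vocab(predicted, vocab):
    """Return `predicted` if it is in `vocab`, else the closest in-vocab string."""
    if not vocab:
        raise ValueError("vocab must not be empty")
    if predicted in vocab:
        return predicted
    # cutoff=0.0 forces a substitute even when nothing is a good match,
    # which is the behaviour that hides a missing vocabulary entry.
    return difflib.get_close_matches(predicted, vocab, n=1, cutoff=0.0)[0]


small_vocab = ["ts-inside-payment-service", "ts-order-service"]
full_vocab = small_vocab + ["tsdb-mysql"]

before = snap_to_vocab("tsdb-mysql", small_vocab)  # some in-vocab substitute
after = snap_to_vocab("tsdb-mysql", full_vocab)    # exact match survives
print(f"before fix: {before!r}, after fix: {after!r}")
```

With the smaller vocabulary a correct prediction can never be scored as correct, since the snap step rewrites it to a different service before scoring.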

Reviews (2): Last reviewed commit: "fixed vocabulary and prepared config for..." | Re-trigger Greptile

@YauhenBichel YauhenBichel marked this pull request as ready for review June 9, 2026 13:18
@YauhenBichel YauhenBichel merged commit e2aadf2 into main Jun 9, 2026
22 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-exp-runtime-case-data branch June 9, 2026 13:18
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

🏄 Some PRs rot in review for six weeks. @YauhenBichel's said "not today" and merged like it owned the place. 🌊


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)
