fix(bench): predictor DB-localization rule (Runtime gap) by YauhenBichel · Pull Request #2788 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-09T17:35:21Z

Fixes #2074

Describe the changes you have made in this PR -

Add a Database-localization rule to the predictor system prompt to fix the dominant Runtime-stratum failure pattern surfaced by the loss-mode triage on dev-2026-06-09T13:10:13Z. When DB symptoms appear (MySQL connection refused, port mismatch, auth failure, pool exhaustion), the predictor now localizes onto the DB service (app/tsdb-mysql, app/redis-cart) instead of the upstream caller that surfaces the failure.

Root cause from the triage

Runtime loss-mode breakdown on the post-vocab-fix full-N (n=423 cells × 3 modes):

OBJECT_MISS (wrong service): 57.9% of opensre+llm Runtime losses
OBJECT_HIT_RC_MISS (predictor drift): 24.0%
TOP3_a3: 18.0%

Within OBJECT_MISS, tsdb-mysql accounted for 37 of 135 cells (27%). In 20 of those 37, the predictor picked app/ts-inside-payment-service (the immediate DB caller). Within OBJECT_HIT_RC_MISS, the dominant pattern was mysql_invalid_port → db_connection_exhaustion (25 cells): the same DB-localization problem at the root_cause level, where the LLM reaches for the generic exhaustion bucket rather than the specific port-misconfig cause.

Combined: 62 of 233 Runtime losses (27%) are a form of DB-localization error.
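The percentages above can be re-derived from the raw counts quoted in the triage. The snippet below is only a worked check of that arithmetic; the variable names and the `share` helper are illustrative and not part of the benchmark code.

```python
# Counts quoted in the triage above; nothing here is re-measured.
runtime_losses = 233         # total Runtime losses
object_miss = 135            # OBJECT_MISS cells
tsdb_mysql_object_miss = 37  # OBJECT_MISS cells attributed to tsdb-mysql
port_to_exhaustion = 25      # mysql_invalid_port -> db_connection_exhaustion cells


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The combined DB-localization figure is the sum of the two patterns.
db_localization = tsdb_mysql_object_miss + port_to_exhaustion

print(share(object_miss, runtime_losses))         # OBJECT_MISS share of Runtime losses
print(share(tsdb_mysql_object_miss, object_miss)) # tsdb-mysql share of OBJECT_MISS
print(share(db_localization, runtime_losses))     # DB-localization share of Runtime losses
```

The three printed values round to the 57.9%, 27%, and 27% figures cited in the description.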

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

| File | Change |
| --- | --- |
| `tests/benchmarks/cloudopsbench/predictor/llm_call.py` | Add Database-localization rule block between the namespace-scope rule and the performance-fault disambiguation rules in `_build_system_prompt()`. Symmetric framing to the existing namespace-scope rule (faults live at their origin level). |

The rule has three parts:

Localization principle: when DB symptoms appear and a DB service is named, fault_object MUST be the DB service, not the upstream caller. The rule explicitly calls out the ts-inside-payment-service substitution as the wrong-localization pattern.
MySQL root_cause disambiguation: distinguishes mysql_invalid_port (port mismatch, "connection refused" on a non-3306 port), mysql_invalid_credentials (auth failure), and db_connection_exhaustion (actual pool saturation). Targets the 25-cell mysql_invalid_port → db_connection_exhaustion confusion.
Tiebreaker: when uncertain between mysql_invalid_port and db_connection_exhaustion, prefer mysql_invalid_port, because db_connection_exhaustion is the over-fired generic bucket on this corpus.
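As a rough illustration of the placement described above, the sketch below joins three rule blocks in the stated order. The block texts, constant names, and `build_system_prompt_sketch` are hypothetical stand-ins; the real wording lives in `_build_system_prompt()` and is not reproduced here.

```python
# Hypothetical placeholder texts: the actual rule wording in llm_call.py
# is not shown in this PR description.
NAMESPACE_SCOPE_RULE = "Namespace-scope rule: ..."

DB_LOCALIZATION_RULE = "\n".join([
    "Database-localization rule:",
    "- When DB symptoms appear and a DB service is named, fault_object MUST be "
    "the DB service (e.g. app/tsdb-mysql), not the upstream caller.",
    "- mysql_invalid_port: port mismatch, connection refused on a non-3306 port.",
    "- mysql_invalid_credentials: authentication failure.",
    "- db_connection_exhaustion: actual pool saturation.",
    "- If unsure between mysql_invalid_port and db_connection_exhaustion, "
    "prefer mysql_invalid_port.",
])

PERFORMANCE_FAULT_RULES = "Performance-fault disambiguation rules: ..."


def build_system_prompt_sketch() -> str:
    """Join the rule blocks in the order described in the PR."""
    blocks = [NAMESPACE_SCOPE_RULE, DB_LOCALIZATION_RULE, PERFORMANCE_FAULT_RULES]
    return "\n\n".join(blocks)


prompt = build_system_prompt_sketch()
```

Keeping each rule as its own block makes the ordering explicit and lets a test assert that the DB rule sits between its two neighbours.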

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-09T17:35:36Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

YauhenBichel · 2026-06-09T17:35:46Z

@greptile review

greptile-apps · 2026-06-09T17:39:02Z

Greptile Summary

Adds a "Database-localization rule" block to the predictor system prompt in _build_system_prompt() to fix the dominant Runtime-stratum failure pattern identified by loss-mode triage: the LLM was localizing DB faults onto upstream callers instead of the DB service itself. The new block is inserted between the namespace-scope rule and the performance-fault disambiguation rules, following the same structural pattern as the existing rules.

DB-localization principle: when DB symptoms are present and a DB service (tsdb-mysql, redis-cart) is named, fault_object must be the DB service — not the upstream caller experiencing the downstream effect.
MySQL root_cause disambiguation: provides explicit evidence cues to distinguish mysql_invalid_port, mysql_invalid_credentials, and db_connection_exhaustion, with a tiebreaker preferring mysql_invalid_port over the over-fired generic exhaustion bucket.
Redis root_cause disambiguation: mirrors the MySQL section with cues for missing_secret_binding, db_readonly_mode, and db_connection_exhaustion.

Confidence Score: 5/5

Safe to merge — the change is purely additive prompt text with no modifications to parsing, validation, or any runtime code path.

All changes are confined to the static string returned by _build_system_prompt(). No control-flow, schema validation, scoring logic, or external calls are touched. The new rule block follows the exact structural pattern of existing rules (namespace-scope, performance-fault), and the MySQL tiebreaker is grounded in the triage data cited in the PR description.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/benchmarks/cloudopsbench/predictor/llm_call.py | Adds 52 lines of prompt text for DB-localization, MySQL root-cause disambiguation with tiebreaker, and Redis root-cause disambiguation; no logic changes to parsing, validation, or runtime code paths. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LLM receives alert + investigation summary] --> B{Investigation summary present?}
    B -- Yes --> C[Treat as AUTHORITATIVE for rank-1]
    B -- No --> D[Reason from alert alone]
    C --> E{Root cause type?}
    D --> E

    E -- namespace_* token --> F["fault_object = namespace/X - Scope Rule"]
    E -- DB symptoms named --> G{Which DB service?}
    E -- Performance anomaly --> H["fault_object = service with highest saturation / latency - Performance-fault Rule"]
    E -- Other --> I[Standard ranking]

    G -- tsdb-mysql / MySQL error --> J["fault_object = app/tsdb-mysql - DB-localization Rule"]
    G -- redis-cart / Redis error --> K["fault_object = app/redis-cart - DB-localization Rule"]

    J --> L{MySQL root_cause?}
    L -- port mismatch / refused on non-3306 --> M[mysql_invalid_port]
    L -- access denied / auth failed --> N[mysql_invalid_credentials]
    L -- too many connections / pool at limit --> O[db_connection_exhaustion]
    L -- Uncertain --> M

    K --> P{Redis root_cause?}
    P -- NOAUTH / WRONGPASS / no requirepass --> Q[missing_secret_binding]
    P -- READONLY replica --> R[db_readonly_mode]
    P -- max clients reached / maxclients --> S[db_connection_exhaustion]

Reviews (3): Last reviewed commit: "added redis"
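The DB branch of the flowchart can be read as a small decision procedure. The sketch below encodes it deterministically for illustration only: the PR itself changes prompt text for an LLM, not code like this. The function name, the substring matching, and the simplified "connection refused" cue (which does not check the port number) are assumptions; the cue strings themselves are taken from the flowchart.

```python
from typing import Optional, Tuple

# Evidence cues copied from the flowchart; matching by lowercase substring
# is an assumption made for this sketch.
MYSQL_CUES = [
    ("mysql_invalid_credentials", ("access denied", "auth failed")),
    ("db_connection_exhaustion", ("too many connections", "pool at limit")),
    ("mysql_invalid_port", ("port mismatch", "connection refused")),
]
REDIS_CUES = [
    ("missing_secret_binding", ("noauth", "wrongpass", "requirepass")),
    ("db_readonly_mode", ("readonly",)),
    ("db_connection_exhaustion", ("max clients reached", "maxclients")),
]


def localize_db_fault(evidence: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (fault_object, root_cause) for DB evidence, or None if no DB is named."""
    text = evidence.lower()
    if "mysql" in text:  # also matches "tsdb-mysql"
        for root_cause, cues in MYSQL_CUES:
            if any(cue in text for cue in cues):
                return "app/tsdb-mysql", root_cause
        # Tiebreaker from the flowchart: Uncertain -> mysql_invalid_port.
        return "app/tsdb-mysql", "mysql_invalid_port"
    if "redis" in text:  # also matches "redis-cart"
        for root_cause, cues in REDIS_CUES:
            if any(cue in text for cue in cues):
                return "app/redis-cart", root_cause
        # The flowchart defines no Redis tiebreaker, so leave root_cause open.
        return "app/redis-cart", None
    return None


# The caller surfaces the error, but the fault is localized onto the DB service.
print(localize_db_fault("ts-inside-payment-service: connection refused to tsdb-mysql:3307"))
```

Note that the upstream caller's name in the evidence never becomes the fault_object here, which is the behaviour the localization rule asks of the predictor.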

YauhenBichel · 2026-06-09T17:43:37Z

@greptile review

github-actions · 2026-06-09T17:49:21Z

🧠 @YauhenBichel opened a PR. Maintainers feared them. CI genuflected. It merged. 🚨

added DB-localization rule

48e5086

greptile-apps Bot reviewed Jun 9, 2026

Comment thread tests/benchmarks/cloudopsbench/predictor/llm_call.py Outdated

added redis

17f38d3

YauhenBichel marked this pull request as ready for review June 9, 2026 17:47

YauhenBichel merged commit be6d2a5 into main Jun 9, 2026
17 checks passed

YauhenBichel deleted the fix/2074-bench-runtime-lost-mode-triage branch June 9, 2026 17:49
