feat(bench): full-N readiness + SHA capture fix by YauhenBichel · Pull Request #2799 · Tracer-Cloud/opensre

YauhenBichel · 2026-06-12T10:10:59Z

Fixes #2074

Describe the changes you have made in this PR -

Three things in one PR, all in service of getting the DB-evidence
pipeline full-N runnable and promotable on #2074:

1. V2 prompt: alert-anchored upstream-attribution rule. When the
   alert names a service and describes a performance / latency /
   network / resource problem, trust the alert's named service.
   Downstream services with loud error logs are usually victims of the
   slow upstream, not causes. Added to the trimmed bench prompt
   alongside the existing dependency-traversal rule.
2. Full-N pre-registration + run config. Locks predictions for each
   fault category (Admission +20pp, Performance -5pp, aggregate +5pp)
   before seeing data. Decision matrix: ship if all four gates pass;
   roll back if Admission or Performance hardens beyond the threshold.
   Includes the explicit "do not iterate prompts" rollback rule learned
   from the V3 smoke regression.
3. SHA capture bug fix. The previous full-N stamped
   opensre_sha=(no-git) despite dev_mode=false. The Fargate
   container has no .git directory, and the integrity gate didn't
   enforce the pre-reg's committed_checkout_required: true. Now:
   - _git_sha() reads the OPENSRE_SHA env var first.
   - The image build workflow passes --build-arg OPENSRE_SHA=<tag>.
   - Dockerfile.bench exposes it as a runtime env var.
   - BenchmarkRunner.run() rejects (no-git) / (unknown) SHAs at
     start so unverifiable artifacts cannot reach a report.
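
The env-first lookup described in the SHA capture fix can be sketched roughly as follows. This is a minimal illustration, not the code in runner.py: the function name resolve_sha, the env parameter, and the NO_GIT constant are assumptions made for the example; only the OPENSRE_SHA variable, the `git rev-parse --short HEAD` fallback, and the (no-git) sentinel come from this PR.

```python
import os
import subprocess

# Sentinel the PR says was stamped when no commit could be determined.
NO_GIT = "(no-git)"


def resolve_sha(env=None):
    """Return the commit SHA, preferring the OPENSRE_SHA env var.

    Hypothetical stand-in for _git_sha(); `env` is injectable only so the
    example can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env

    # 1. Container path: the image build is expected to bake the SHA in.
    from_env = env.get("OPENSRE_SHA", "").strip()
    if from_env:
        return from_env

    # 2. Local path: ask git, and fall back to the sentinel on any failure
    #    (git not installed, no .git directory, timeout).
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
            timeout=5,
        )
    except (OSError, subprocess.SubprocessError):
        return NO_GIT
    return result.stdout.strip() or NO_GIT
```

Calling resolve_sha({"OPENSRE_SHA": "  fa5cd71\n"}) returns "fa5cd71"; with the variable unset, the result depends on whether the working directory is a git checkout.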

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

No, I wrote all the code myself
Yes, I used AI assistance (continue below)

If you used AI assistance:

I have reviewed every single line of the AI-generated code
I can explain the purpose and logic of each function/component I added
I have tested edge cases and understand how the code handles them
I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

- The V3 root-cause-shape clause was added, then rolled back when V3 made
  the score worse. The V2 rule (this PR) is what survived. The test
  pins that V3 stays gone.
- The investigation-native scoring (investigation_a1 /
  translation_loss) shipped separately in feat(bench): investigation_a1 + translation_loss metrics #2798.
- The Admission cascading writeup lives in ~/DevBox/tracer-cloud/opensre-notes/
  as a draft; it is not in this PR.

Checklist before requesting a review

I have added proper PR title and linked to the issue
I have performed a self-review of my code
I can explain the purpose of every function, class, and logic block I added
I understand why my changes work and have tested them thoroughly
I have considered potential edge cases and how my code handles them
If it is a core feature, I have added thorough tests
My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

github-actions · 2026-06-12T10:11:08Z

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

github-actions · 2026-06-12T10:11:38Z

terraform-bench plan

step	outcome
fmt	✅ `success`
init	✅ `success`
validate	✅ `success`
plan	✅ `success`

Plan output

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_ecs_task_definition.bench must be replaced
-/+ resource "aws_ecs_task_definition" "bench" {
      ~ arn                      = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:34" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ image            = "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:3d20513" -> "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:bootstrap"
                  - mountPoints      = []
                    name             = "bench"
                  - portMappings     = []
                  - systemControls   = []
                  - volumesFrom      = []
                    # (4 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "opensre-bench" -> (known after apply)
      ~ revision                 = 34 -> (known after apply)
      - tags                     = {} -> null
        # (10 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Changes to Outputs:
  ~ task_definition_arn     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:34" -> (known after apply)

Updated by terraform-bench.yml.

YauhenBichel · 2026-06-12T10:12:16Z

@greptile review

greptile-apps · 2026-06-12T10:17:31Z

Greptile Summary

This PR fixes the SHA capture regression that caused Fargate runs to stamp (no-git) into provenance artifacts, adds a V2 alert-anchored upstream-attribution rule to the trimmed bench prompt, and wires the full-N pre-registration + run config for the DB-evidence pipeline experiment.

- SHA integrity fix: _git_sha() now reads OPENSRE_SHA env var first; the Dockerfile exposes it from a build-arg stamped by github.sha; and BenchmarkRunner.run() gates the promotable path with a ^[0-9a-f]{7,40}$ regex before any cells run. All three layers are covered by a dedicated new test file.
- V2 prompt rule: Alert-anchored upstream-attribution clause added to _TRIMMED_BENCH_SYSTEM_PROMPT, complementing the existing dependency-traversal rule; prompt-pinning tests enforce both rules are present and that no corpus-specific tokens leaked in.
- Full-N config + pre-registration: exp_db_evidence_pipeline_v1.yml locks predictions, decision matrix, and rollback rules before the run; cloudopsbench_db_evidence_pipeline_full_openai.yml wires the full 452-case corpus against the pre-reg.

Confidence Score: 5/5

Safe to merge; all changes are scoped to the benchmark framework and its configuration — no production application code is touched.

The three-layer SHA fix (env var resolution, Dockerfile ARG/ENV, runtime regex gate) is coherent and well-tested. The pre-registration and run configs are documentation-heavy and self-contained. The only findings are a stale docstring example, a cheap-before-expensive check ordering preference, and a missing dirty-suffix test case — none affect runtime correctness.

No files require special attention; the runner.py changes have dedicated test coverage in the new test_runner_sha_integrity.py.

Important Files Changed

Filename	Overview
.github/workflows/benchmark-image.yml	Passes github.sha as OPENSRE_SHA build-arg instead of the user-supplied image tag, decoupling ECR tag naming from reproducibility provenance.
infra/bench/Dockerfile.bench	Adds ARG OPENSRE_SHA with a (no-git) default and promotes it to a runtime ENV so the running container exposes the build-time commit SHA.
tests/benchmarks/_framework/runner.py	Adds _validate_promotable_sha (7-40 hex regex gate) called in run() before _run_inner, and updates _git_sha() to read OPENSRE_SHA env var first; minor ordering note: the cheap SHA check runs after the heavier pre_flight call.
tests/benchmarks/_framework/tests/test_runner_sha_integrity.py	New test file covering env-var precedence, whitespace stripping, rejection of bad SHA shapes (including sentinel strings), acceptance of valid 7-40 hex SHAs, and dev-path bypass.
tests/benchmarks/cloudopsbench/bench_agent.py	Adds the V2 alert-anchored upstream-attribution rule to _TRIMMED_BENCH_SYSTEM_PROMPT, complementing the existing dependency-traversal rule for performance/latency alert patterns.
tests/benchmarks/cloudopsbench/configs/preregistrations/exp_db_evidence_pipeline_v1.yml	New pre-registration for the full-N DB-evidence pipeline run; opensre_sha field carries a descriptive note (not a real SHA) which is expected as the value is stamped at run time by the bench image workflow.
tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_evidence_pipeline_full_openai.yml	New full-N run config (n=452, gpt-4o, cost_budget_usd=120) wired to exp_db_evidence_pipeline_v1 pre-registration with explicit decision matrix copied from pre-reg.
tests/benchmarks/cloudopsbench/configs/cloudopsbench_perf_only_smoke_openai.yml	New Performance-only smoke config (n=25, dev_mode=true) with committed hypotheses and decision matrix for diagnosing the -14.8pp Performance regression observed in the prior joint smoke.
tests/benchmarks/cloudopsbench/tests/test_bench_agent.py	Adds two prompt-pinning tests: dependency-traversal rule presence, and alert-anchored rule presence with a negative assertion that root-cause-shape (V3) is absent and no corpus tokens leaked into the prompt.
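
A prompt-pinning check of the kind described for test_bench_agent.py might look roughly like the sketch below. Everything here is hypothetical: the helper check_prompt_pins, the EXAMPLE_PROMPT text, and the marker phrases are placeholders chosen for illustration, since the real contents of _TRIMMED_BENCH_SYSTEM_PROMPT are not shown in this PR conversation.

```python
def check_prompt_pins(prompt, required, forbidden):
    """Return (missing, leaked) phrase lists for a case-insensitive check.

    `missing` holds required phrases absent from the prompt; `leaked`
    holds forbidden phrases that are present.
    """
    text = prompt.lower()
    missing = [phrase for phrase in required if phrase.lower() not in text]
    leaked = [phrase for phrase in forbidden if phrase.lower() in text]
    return missing, leaked


# Placeholder prompt standing in for the real trimmed bench prompt.
EXAMPLE_PROMPT = (
    "Follow the dependency graph before blaming a service. "
    "When the alert names a service for a latency problem, "
    "trust the alert's named service."
)

missing, leaked = check_prompt_pins(
    EXAMPLE_PROMPT,
    required=["dependency graph", "trust the alert's named service"],
    forbidden=["root-cause-shape"],  # the rolled-back V3 clause must stay out
)
assert not missing, f"required rules missing: {missing}"
assert not leaked, f"forbidden tokens present: {leaked}"
```

Returning both lists instead of asserting inside the helper keeps the failure message specific about which rule disappeared or which token leaked.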

Sequence Diagram

sequenceDiagram
    participant WF as GitHub Actions Workflow
    participant Docker as Dockerfile.bench
    participant Runner as BenchmarkRunner.run()
    participant GitSHA as _git_sha()
    participant Gate as _validate_promotable_sha()

    WF->>Docker: "--build-arg OPENSRE_SHA=github.sha"
    Docker->>Docker: ARG OPENSRE_SHA / ENV OPENSRE_SHA
    Note over Docker: Runtime container has OPENSRE_SHA set

    Runner->>GitSHA: _git_sha() called in __init__
    GitSHA->>GitSHA: os.environ.get(OPENSRE_SHA)
    alt OPENSRE_SHA set (Fargate path)
        GitSHA-->>Runner: returns full 40-char SHA
    else OPENSRE_SHA empty (local dev path)
        GitSHA->>GitSHA: git rev-parse --short HEAD
        GitSHA-->>Runner: returns short SHA or (no-git)
    end

    Runner->>Runner: integrity.pre_flight()
    Runner->>Gate: _validate_promotable_sha(self._opensre_sha)
    alt "SHA matches ^[0-9a-f]{7,40}$"
        Gate-->>Runner: passes
        Runner->>Runner: "_run_inner(dev_mode=False)"
    else Invalid SHA
        Gate-->>Runner: raises IntegrityViolation
    end
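
The gate step in the diagram can be sketched as below, assuming a standalone function instead of the real BenchmarkRunner method. The IntegrityViolation class, the function signature, and the dev_mode parameter are stand-ins written for this example; the 7-40 lowercase hex rule and the dev-path bypass are taken from the review summary.

```python
import re


class IntegrityViolation(Exception):
    """Stand-in for the framework's exception of the same name."""


# 7 to 40 lowercase hex characters: a short or full git commit SHA.
_PROMOTABLE_SHA = re.compile(r"[0-9a-f]{7,40}")


def validate_promotable_sha(sha, dev_mode=False):
    """Raise IntegrityViolation unless `sha` looks like a real commit SHA.

    Dev runs are never promoted, so they skip the check.
    """
    if dev_mode:
        return
    # fullmatch rejects a trailing newline, which `^...$` with re.match
    # would accept because `$` also matches just before a final "\n".
    if not _PROMOTABLE_SHA.fullmatch(sha):
        raise IntegrityViolation(
            f"opensre_sha={sha!r} is not a 7-40 character hex commit SHA"
        )
```

With this sketch, validate_promotable_sha("fa5cd71") passes silently, while "(no-git)", "(unknown)", and an uppercase SHA all raise unless dev_mode=True is passed.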

Reviews (3): Last reviewed commit: "fixed git SHA related issues"

YauhenBichel · 2026-06-12T10:25:33Z

@greptile review

github-actions · 2026-06-12T10:31:26Z

🛸 Aliens watching our repo just upgraded @YauhenBichel's threat level to: do not engage — too competent. 👽

👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.

YauhenBichel added 5 commits June 11, 2026 13:51

added a new config for perf smoke running

1ae5c83

perf experiment fixed

ec2c675

perf experiment V3

65f69e0

full test using openai

3d20513

fixed dockerfile and runner

cdf508e

greptile-apps Bot reviewed Jun 12, 2026

Comment thread .github/workflows/benchmark-image.yml Outdated

Comment thread tests/benchmarks/cloudopsbench/configs/preregistrations/exp_db_evidence_pipeline_v1.yml Outdated

fixed git SHA related issues

fa5cd71

YauhenBichel marked this pull request as ready for review June 12, 2026 10:30

YauhenBichel merged commit c27bee9 into main Jun 12, 2026
21 checks passed

YauhenBichel deleted the fix/2074-bench-performance-smoke branch June 12, 2026 10:31

YauhenBichel merged 6 commits into main from fix/2074-bench-performance-smoke