feat(bench): full-N readiness + SHA capture fix #2799

Merged
YauhenBichel merged 6 commits into main from fix/2074-bench-performance-smoke
Jun 12, 2026

Conversation

@YauhenBichel

Collaborator

Fixes #2074

Describe the changes you have made in this PR -

Three things in one PR, all in service of getting the DB-evidence
pipeline full-N runnable and promotable on #2074:

  1. V2 prompt — alert-anchored upstream-attribution rule. When the
    alert names a service and describes a performance / latency /
    network / resource problem, trust the alert's named service.
    Downstream services with loud error logs are usually victims of the
    slow upstream, not causes. Added to the trimmed bench prompt
    alongside the existing dependency-traversal rule.

  2. Full-N pre-registration + run config. Locks predictions for each
    fault category (Admission +20pp, Performance -5pp, aggregate +5pp)
    before seeing data. Decision matrix: ship if all four gates pass;
    roll back if Admission or Performance hardens beyond the threshold.
    Includes the explicit "do not iterate prompts" rollback rule learned
    from the V3 smoke regression.

  3. SHA capture bug fix. The previous full-N stamped
    opensre_sha=(no-git) despite dev_mode=false. The Fargate
    container has no .git directory, and the integrity gate didn't
    enforce the pre-reg's committed_checkout_required: true. Now:

    • _git_sha() reads the OPENSRE_SHA env var first.
    • The image build workflow passes --build-arg OPENSRE_SHA=<tag>.
    • Dockerfile.bench exposes it as a runtime env var.
    • BenchmarkRunner.run() rejects (no-git) / (unknown) SHAs at
      start, so unverifiable artifacts cannot reach a report.
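
The env-var-first resolution described in item 3 can be sketched roughly as follows. This is a minimal illustration, not the code in runner.py: the function name `git_sha`, the `env` parameter, and the 5-second timeout are assumptions made so the example is self-contained and testable; only the `OPENSRE_SHA` variable, the `git rev-parse --short HEAD` fallback, and the `(no-git)` sentinel come from this PR.

```python
import os
import subprocess

NO_GIT = "(no-git)"  # sentinel named in the PR for "no SHA could be resolved"


def git_sha(env=None) -> str:
    """Resolve the commit SHA: OPENSRE_SHA first, then git, else the sentinel.

    `env` is a hypothetical injection point for tests; it defaults to os.environ.
    """
    env = os.environ if env is None else env

    # 1. Prefer the value baked into the image at build time.
    from_env = env.get("OPENSRE_SHA", "").strip()
    if from_env:
        return from_env

    # 2. Fall back to the local checkout (developer machines).
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
            timeout=5,  # assumed value, not taken from the PR
        )
    except (OSError, subprocess.SubprocessError):
        # No git binary, no .git directory, or the command failed.
        return NO_GIT

    return result.stdout.strip() or NO_GIT
```

Note that a container built without the build-arg would still carry the Dockerfile's `(no-git)` default in `OPENSRE_SHA`; this sketch returns it unchanged and leaves rejection to the start-of-run check.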

Code Understanding and AI Usage

Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?

  • No, I wrote all the code myself
  • Yes, I used AI assistance (continue below)

If you used AI assistance:

  • I have reviewed every single line of the AI-generated code
  • I can explain the purpose and logic of each function/component I added
  • I have tested edge cases and understand how the code handles them
  • I have modified the AI output to follow this project's coding standards and conventions

Explain your implementation approach:

  • The V3 root-cause-shape clause was added, then rolled back after it
    lowered the score. The V2 rule (this PR) is what survived. A test
    pins the prompt so that the V3 clause stays out.
  • The investigation-native scoring (investigation_a1 /
    translation_loss) shipped separately in feat(bench): investigation_a1 + translation_loss metrics #2798.
  • Admission cascading writeup lives in ~/DevBox/tracer-cloud/opensre-notes/
    as a draft; not in this PR.
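
A prompt-pinning check of the kind described in the first bullet might look like the sketch below. Everything in it is illustrative: the prompt text is a stand-in rather than the real `_TRIMMED_BENCH_SYSTEM_PROMPT`, and the marker strings and the `check_prompt` helper are hypothetical names chosen for the example.

```python
# Stand-in prompt; the real wording lives in bench_agent.py and is not reproduced here.
TRIMMED_BENCH_SYSTEM_PROMPT = (
    "Dependency traversal: follow the dependency chain before naming a cause.\n"
    "Alert-anchored attribution: when the alert names a service and describes "
    "a performance, latency, network, or resource problem, trust the named service."
)

# Hypothetical markers: phrases that must appear, and phrases that must not.
REQUIRED_MARKERS = ("dependency traversal", "alert-anchored")
FORBIDDEN_MARKERS = ("root-cause-shape",)  # the rolled-back V3 clause


def check_prompt(prompt: str) -> tuple[list[str], list[str]]:
    """Return (missing required markers, leaked forbidden markers)."""
    lowered = prompt.lower()
    missing = [m for m in REQUIRED_MARKERS if m not in lowered]
    leaked = [m for m in FORBIDDEN_MARKERS if m in lowered]
    return missing, leaked


def test_prompt_is_pinned() -> None:
    missing, leaked = check_prompt(TRIMMED_BENCH_SYSTEM_PROMPT)
    assert missing == [], f"rules dropped from prompt: {missing}"
    assert leaked == [], f"rolled-back clauses reappeared: {leaked}"
```

Pinning on short marker phrases rather than the full prompt text keeps the test stable under wording tweaks while still failing if a rule is removed or the V3 clause returns.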

Checklist before requesting a review

  • I have added proper PR title and linked to the issue
  • I have performed a self-review of my code
  • I can explain the purpose of every function, class, and logic block I added
  • I understand why my changes work and have tested them thoroughly
  • I have considered potential edge cases and how my code handles them
  • If it is a core feature, I have added thorough tests
  • My code follows the project's style guidelines and conventions

Note: Please check Allow edits from maintainers if you would like us to assist in the PR.

@github-actions

Contributor

Greptile code review

This repo uses Greptile for automated review. Before merge, aim for Confidence Score: 5/5 with zero unresolved review threads — see CONTRIBUTING.md.

Run a review — add a PR comment with:

@greptile review

Give it ~5-10 minutes (sometimes longer) for results, then fix feedback and re-trigger until you reach Confidence Score: 5/5.

Optional: automate with the greploop skill.

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

terraform-bench plan

| step | outcome |
| --- | --- |
| fmt | success |
| init | success |
| validate | success |
| plan | success |
Plan output
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_ecs_task_definition.bench must be replaced
-/+ resource "aws_ecs_task_definition" "bench" {
      ~ arn                      = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:34" -> (known after apply)
      ~ arn_without_revision     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench" -> (known after apply)
      ~ container_definitions    = jsonencode(
          ~ [
              ~ {
                  ~ image            = "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:3d20513" -> "395261708130.dkr.ecr.us-east-1.amazonaws.com/opensre-bench:bootstrap"
                  - mountPoints      = []
                    name             = "bench"
                  - portMappings     = []
                  - systemControls   = []
                  - volumesFrom      = []
                    # (4 unchanged attributes hidden)
                },
            ] # forces replacement
        )
      ~ enable_fault_injection   = false -> (known after apply)
      ~ id                       = "opensre-bench" -> (known after apply)
      ~ revision                 = 34 -> (known after apply)
      - tags                     = {} -> null
        # (10 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

Changes to Outputs:
  ~ task_definition_arn     = "arn:aws:ecs:us-east-1:395261708130:task-definition/opensre-bench:34" -> (known after apply)

Updated by terraform-bench.yml.

@YauhenBichel

Collaborator Author

@greptile review

@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Contributor

Greptile Summary

This PR fixes the SHA capture regression that caused Fargate runs to stamp (no-git) into provenance artifacts, adds a V2 alert-anchored upstream-attribution rule to the trimmed bench prompt, and wires the full-N pre-registration + run config for the DB-evidence pipeline experiment.

  • SHA integrity fix: _git_sha() now reads OPENSRE_SHA env var first; the Dockerfile exposes it from a build-arg stamped by github.sha; and BenchmarkRunner.run() gates the promotable path with a ^[0-9a-f]{7,40}$ regex before any cells run — all three layers are covered by a dedicated new test file.
  • V2 prompt rule: Alert-anchored upstream-attribution clause added to _TRIMMED_BENCH_SYSTEM_PROMPT, complementing the existing dependency-traversal rule; prompt-pinning tests enforce both rules are present and that no corpus-specific tokens leaked in.
  • Full-N config + pre-registration: exp_db_evidence_pipeline_v1.yml locks predictions, decision matrix, and rollback rules before the run; cloudopsbench_db_evidence_pipeline_full_openai.yml wires the full 452-case corpus against the pre-reg.
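
To make the "locked before the run" idea concrete, here is a minimal sketch of evaluating observed deltas against pre-registered thresholds. It rests on several assumptions: the `Gate` type and `decide` function are hypothetical, the predicted deltas from the PR description (Admission +20pp, Performance -5pp, aggregate +5pp) are reused as minimum pass thresholds purely for illustration, and the fourth gate mentioned in the description is omitted because the PR text does not name it. The real decision matrix lives in exp_db_evidence_pipeline_v1.yml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    name: str
    min_delta_pp: float  # smallest acceptable change vs. baseline, in percentage points


# Assumption: thresholds mirror the pre-registered predictions; the real
# pre-registration may define its gates differently.
GATES = (
    Gate("Admission", 20.0),
    Gate("Performance", -5.0),
    Gate("aggregate", 5.0),
)


def decide(observed_pp: dict[str, float], gates: tuple[Gate, ...] = GATES) -> str:
    """Return "ship" only if every gate passes; otherwise list the failures."""
    failed = [
        g.name
        for g in gates
        # A category with no observation counts as a failure.
        if observed_pp.get(g.name, float("-inf")) < g.min_delta_pp
    ]
    return "ship" if not failed else "roll back: " + ", ".join(failed)
```

Because the thresholds are frozen data fixed before the run, the outcome cannot be adjusted after seeing results without an explicit, reviewable change to the pre-registration.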

Confidence Score: 5/5

Safe to merge; all changes are scoped to the benchmark framework and its configuration — no production application code is touched.

The three-layer SHA fix (env var resolution, Dockerfile ARG/ENV, runtime regex gate) is coherent and well-tested. The pre-registration and run configs are documentation-heavy and self-contained. The only findings are a stale docstring example, a cheap-before-expensive check ordering preference, and a missing dirty-suffix test case — none affect runtime correctness.

No files require special attention; the runner.py changes have dedicated test coverage in the new test_runner_sha_integrity.py.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/benchmark-image.yml | Passes github.sha as OPENSRE_SHA build-arg instead of the user-supplied image tag, decoupling ECR tag naming from reproducibility provenance. |
| infra/bench/Dockerfile.bench | Adds ARG OPENSRE_SHA with a (no-git) default and promotes it to a runtime ENV so the running container exposes the build-time commit SHA. |
| tests/benchmarks/_framework/runner.py | Adds _validate_promotable_sha (7-40 hex regex gate) called in run() before _run_inner, and updates _git_sha() to read OPENSRE_SHA env var first; minor ordering note: the cheap SHA check runs after the heavier pre_flight call. |
| tests/benchmarks/_framework/tests/test_runner_sha_integrity.py | New test file covering env-var precedence, whitespace stripping, rejection of bad SHA shapes (including sentinel strings), acceptance of valid 7-40 hex SHAs, and dev-path bypass. |
| tests/benchmarks/cloudopsbench/bench_agent.py | Adds the V2 alert-anchored upstream-attribution rule to _TRIMMED_BENCH_SYSTEM_PROMPT, complementing the existing dependency-traversal rule for performance/latency alert patterns. |
| tests/benchmarks/cloudopsbench/configs/preregistrations/exp_db_evidence_pipeline_v1.yml | New pre-registration for the full-N DB-evidence pipeline run; opensre_sha field carries a descriptive note (not a real SHA) which is expected as the value is stamped at run time by the bench image workflow. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_db_evidence_pipeline_full_openai.yml | New full-N run config (n=452, gpt-4o, cost_budget_usd=120) wired to exp_db_evidence_pipeline_v1 pre-registration with explicit decision matrix copied from pre-reg. |
| tests/benchmarks/cloudopsbench/configs/cloudopsbench_perf_only_smoke_openai.yml | New Performance-only smoke config (n=25, dev_mode=true) with committed hypotheses and decision matrix for diagnosing the -14.8pp Performance regression observed in the prior joint smoke. |
| tests/benchmarks/cloudopsbench/tests/test_bench_agent.py | Adds two prompt-pinning tests: dependency-traversal rule presence, and alert-anchored rule presence with a negative assertion that root-cause-shape (V3) is absent and no corpus tokens leaked into the prompt. |

Sequence Diagram

sequenceDiagram
    participant WF as GitHub Actions Workflow
    participant Docker as Dockerfile.bench
    participant Runner as BenchmarkRunner.run()
    participant GitSHA as _git_sha()
    participant Gate as _validate_promotable_sha()

    WF->>Docker: "--build-arg OPENSRE_SHA=github.sha"
    Docker->>Docker: ARG OPENSRE_SHA / ENV OPENSRE_SHA
    Note over Docker: Runtime container has OPENSRE_SHA set

    Runner->>GitSHA: _git_sha() called in __init__
    GitSHA->>GitSHA: os.environ.get(OPENSRE_SHA)
    alt OPENSRE_SHA set (Fargate path)
        GitSHA-->>Runner: returns full 40-char SHA
    else OPENSRE_SHA empty (local dev path)
        GitSHA->>GitSHA: git rev-parse --short HEAD
        GitSHA-->>Runner: returns short SHA or (no-git)
    end

    Runner->>Runner: integrity.pre_flight()
    Runner->>Gate: _validate_promotable_sha(self._opensre_sha)
    alt "SHA matches ^[0-9a-f]{7,40}$"
        Gate-->>Runner: passes
        Runner->>Runner: "_run_inner(dev_mode=False)"
    else Invalid SHA
        Gate-->>Runner: raises IntegrityViolation
    end
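
The gate step in the diagram above can be sketched as a small validator. This is an illustration under assumptions rather than the merged implementation: the `IntegrityViolation` class here is a local stand-in, and `validate_promotable_sha` / `is_promotable` are hypothetical names; only the `^[0-9a-f]{7,40}$` pattern and the rejected sentinel strings come from the review.

```python
import re


class IntegrityViolation(Exception):
    """Local stand-in for the framework's integrity error type."""


# Pattern quoted in the review: 7 to 40 lowercase hex characters.
_PROMOTABLE_SHA = re.compile(r"^[0-9a-f]{7,40}$")


def validate_promotable_sha(sha: str) -> str:
    """Return `sha` unchanged if it looks like a commit SHA, else raise."""
    # fullmatch (rather than match) also rejects a trailing newline,
    # which `$` alone would tolerate.
    if not _PROMOTABLE_SHA.fullmatch(sha):
        raise IntegrityViolation(
            f"opensre_sha={sha!r} is not a 7-40 character hex commit SHA; "
            "refusing to start a promotable run"
        )
    return sha


def is_promotable(sha: str) -> bool:
    """Boolean convenience wrapper around the validator."""
    try:
        validate_promotable_sha(sha)
    except IntegrityViolation:
        return False
    return True
```

Under this pattern the sentinels `(no-git)` and `(unknown)` fail because of the parentheses, and a dirty-tree suffix such as `abc1234-dirty` fails because of the hyphen, which corresponds to the dirty-suffix case the review flags as lacking a dedicated test.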

Reviews (3): Last reviewed commit: "fixed git SHA related issues" | Re-trigger Greptile

Comment thread .github/workflows/benchmark-image.yml Outdated
@YauhenBichel

Collaborator Author

@greptile review

@YauhenBichel YauhenBichel marked this pull request as ready for review June 12, 2026 10:30
@YauhenBichel YauhenBichel merged commit c27bee9 into main Jun 12, 2026
21 checks passed
@YauhenBichel YauhenBichel deleted the fix/2074-bench-performance-smoke branch June 12, 2026 10:31
@github-actions

Contributor

🛸 Aliens watching our repo just upgraded @YauhenBichel's threat level to: do not engage — too competent. 👽


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.


Development

Successfully merging this pull request may close these issues.

Benchmark opensre+LLM vs LLM-alone (Cloudopsbench)

1 participant