feat: Regression gates — baseline comparison with thresholds and statistical confidence #364

Description

@spboyer

Problem

waza compare shows diffs, but there's no opinionated gate for CI:

  • No "fail the build if pass rate dropped >5%."
  • No first-class concept of a golden task that must always pass regardless of aggregate score.
  • No machine-readable exit codes or summary that CI annotators can parse.
  • Adding or removing tasks between baseline and current breaks naive comparison.

Skill authors copy-paste shell glue to make this work.
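
To make the task-set problem concrete, here is a small arithmetic sketch. The `{task_id: passed}` maps are a hypothetical stand-in for illustration, not the actual results.json schema, and restricting the comparison to shared tasks is just one possible remedy, not something the issue prescribes.

```python
def pass_rate(results):
    """Fraction of tasks that passed."""
    return sum(results.values()) / len(results)


# Hypothetical baseline: 3 of 4 tasks pass (75%).
baseline = {"t1": True, "t2": True, "t3": True, "t4": False}

# Same outcomes for the original tasks, plus two new tasks that fail (3 of 6 = 50%).
current = {**baseline, "t5": False, "t6": False}

# Naive comparison: aggregate over whatever tasks each run contains.
naive_drop = (pass_rate(baseline) - pass_rate(current)) * 100

# Comparison restricted to tasks present in both runs.
shared = baseline.keys() & current.keys()
shared_drop = (
    pass_rate({t: baseline[t] for t in shared})
    - pass_rate({t: current[t] for t in shared})
) * 100

print(naive_drop, shared_drop)  # prints: 25.0 0.0
```

No existing task changed outcome, yet the naive aggregate reports a 25-point drop purely because the task set grew.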

Proposal

Add waza gate, a CI-first command that consumes a baseline results.json and a current results.json:

waza gate \
  --baseline baseline.json \
  --current results.json \
  --max-regression-pct 5 \
  --golden-must-pass \
  --on-new-tasks=allow \
  --on-removed-tasks=warn \
  --format=github-actions

  • Regression threshold: --max-regression-pct N fails the gate when the aggregate pass rate drops by more than N%.
  • Golden tasks: tasks marked golden: true in eval.yaml are a hard fail on any failure, independent of the aggregate.
  • Task set changes: --on-new-tasks and --on-removed-tasks each accept allow|warn|fail.
  • Exit codes: documented and stable (0 = pass, 1 = regression, 2 = golden failure, 3 = config error).
  • Output formats: human, json, markdown, github-actions (creates inline annotations).
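
The decision logic described above could be sketched roughly as follows. This is a minimal illustration, not the waza implementation. The `gate` function, the `GateResult` type, and the `{task_id: passed}` input maps are all hypothetical. Several behaviors are assumptions the proposal does not spell out: the drop is measured in percentage points over tasks present in both runs, a `fail` task-set policy exits 1, a golden failure takes precedence over everything except a config error, and a golden task missing from the current run is left to the removed-tasks policy.

```python
from dataclasses import dataclass, field

# Exit codes as listed in the proposal.
EXIT_PASS, EXIT_REGRESSION, EXIT_GOLDEN, EXIT_CONFIG = 0, 1, 2, 3
POLICIES = {"allow", "warn", "fail"}


@dataclass
class GateResult:
    exit_code: int
    messages: list = field(default_factory=list)


def pass_rate(results):
    """Fraction of tasks that passed; an empty set counts as 0.0."""
    return sum(results.values()) / len(results) if results else 0.0


def gate(baseline, current, golden=frozenset(), max_regression_pct=5.0,
         on_new_tasks="allow", on_removed_tasks="warn"):
    """Compare two hypothetical {task_id: passed} maps and pick an exit code."""
    if (on_new_tasks not in POLICIES or on_removed_tasks not in POLICIES
            or max_regression_pct < 0):
        return GateResult(EXIT_CONFIG, ["invalid gate configuration"])

    messages, failed = [], False

    # Task-set deltas: each policy is allow, warn, or fail.
    for label, tasks, policy in (
        ("new", sorted(current.keys() - baseline.keys()), on_new_tasks),
        ("removed", sorted(baseline.keys() - current.keys()), on_removed_tasks),
    ):
        if tasks and policy != "allow":
            messages.append(f"{policy}: {label} tasks: {', '.join(tasks)}")
            failed = failed or policy == "fail"  # assumption: exits 1

    # Golden tasks: any failure is a hard fail, regardless of the aggregate.
    golden_failures = sorted(t for t in golden if t in current and not current[t])
    if golden_failures:
        messages.append(f"fail: golden tasks failed: {', '.join(golden_failures)}")
        return GateResult(EXIT_GOLDEN, messages)

    # Aggregate check over shared tasks only (assumption), in percentage points.
    shared = baseline.keys() & current.keys()
    drop = (pass_rate({t: baseline[t] for t in shared})
            - pass_rate({t: current[t] for t in shared})) * 100
    if drop > max_regression_pct:
        messages.append(
            f"fail: pass rate dropped {drop:.1f} points (limit {max_regression_pct})")
        failed = True

    return GateResult(EXIT_REGRESSION if failed else EXIT_PASS, messages)
```

A caller would pass `result.exit_code` to `sys.exit` and render `result.messages` in whichever output format was requested.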

Statistical significance (p-values from multiple trials) is out of scope for the MVP. It can be added later if multi-trial data (trials > 1) is consistently available.

Why this matters for agentic-first

Agentic eval results are noisy. Without a gate that knows about golden tasks and task-set deltas, CI is either too strict (blocks on noise) or useless (always green). This issue makes "ship safely" a one-liner.

Acceptance criteria

  • waza gate command implemented.
  • golden: true task field, absorbed from the former issue #359 (feat: Eval dataset versioning, sharing, and golden-set management); any golden failure exits 2 regardless of aggregate.
  • Configurable behavior for added/removed tasks.
  • Stable, documented exit codes.
  • github-actions output emits inline annotations and a job summary.
  • Docs in site/ with a CI snippet (GitHub Actions + Azure DevOps).
  • Tests cover: regression threshold, golden hard-fail, task-set delta behaviors, exit codes.
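
For the github-actions output, one possible shape is sketched below. It relies on two documented GitHub Actions mechanisms: workflow commands of the form `::error title=...::message` printed to stdout, and Markdown appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable. The helper names `annotation` and `write_summary` are hypothetical and are not part of waza.

```python
import os


def _escape_data(text):
    """Escape a workflow-command message (%, CR, LF)."""
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(text):
    """Escape a workflow-command property value (also ':' and ',')."""
    return _escape_data(text).replace(":", "%3A").replace(",", "%2C")


def annotation(level, message, title=None, file=None):
    """Build an annotation line; level is 'error', 'warning', or 'notice'."""
    props = []
    if title:
        props.append(f"title={_escape_property(title)}")
    if file:
        props.append(f"file={_escape_property(file)}")
    head = f"::{level} {','.join(props)}" if props else f"::{level}"
    return f"{head}::{_escape_data(message)}"


def write_summary(markdown, env=os.environ):
    """Append Markdown to the job summary; returns False outside GitHub Actions."""
    path = env.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(markdown + "\n")
    return True


if __name__ == "__main__":
    # Printing the line to stdout is what makes the runner create the annotation.
    print(annotation("error", "golden task 'login' failed", title="waza gate"))
    write_summary("## waza gate\n\nGolden failure: `login`")
```

An annotation only attaches inline to a file when a `file` property is supplied; otherwise it appears on the workflow run page.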

Non-goals

Related
