feat: regression gate command (closes #364) #384

Merged
spboyer merged 2 commits into main from spboyer-issue-364-gate
Jun 28, 2026

@spboyer spboyer commented Jun 28, 2026

Closes #364.

Adds a new waza gate command for CI regression gates, plus a golden: true task field (absorbing #359) that gate enforces as a hard requirement.

What's new

waza gate

waza gate --baseline baseline.json --current results.json \
  [--max-regression-pct 5] \
  [--golden-must-pass] \
  [--on-new-tasks allow|warn|fail] \
  [--on-removed-tasks allow|warn|fail] \
  [--format human|json|markdown|github-actions]

Stable exit codes

| Code | Meaning |
| --- | --- |
| 0 | Pass |
| 1 | Regression: success rate dropped beyond --max-regression-pct, or a task-set fail policy triggered |
| 2 | Golden failure: at least one task marked golden: true did not pass (takes precedence over regression) |
| 3 | Config error: bad flags, missing/unparseable files |
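
The precedence between these exit codes can be sketched as follows. This is a minimal illustration, not the code in cmd/waza/cmd_gate.go: the `gateInput` struct, its fields, and `decide` are hypothetical names, and the sketch assumes the threshold is compared against the drop in success rate measured in percentage points. Config errors (exit 3) would be raised earlier, while parsing flags and files, so they are not modeled here.

```go
package main

import "fmt"

// Exit codes as listed in the table above.
const (
	exitPass       = 0
	exitRegression = 1
	exitGolden     = 2
	exitConfig     = 3 // returned during flag/file parsing, not by decide
)

// gateInput is a hypothetical summary of a baseline/current comparison.
type gateInput struct {
	baselineRate     float64 // baseline success rate, in percent
	currentRate      float64 // current success rate, in percent
	maxRegressionPct float64 // tolerated drop (assumed: percentage points)
	goldenFailed     bool    // at least one golden task did not pass
	goldenMustPass   bool    // value of --golden-must-pass
	taskSetFail      bool    // a new/removed-task policy set to "fail" triggered
}

// decide applies the documented precedence: golden failure first,
// then regression or task-set failure, otherwise pass.
func decide(in gateInput) int {
	if in.goldenMustPass && in.goldenFailed {
		return exitGolden
	}
	drop := in.baselineRate - in.currentRate
	if drop > in.maxRegressionPct || in.taskSetFail {
		return exitRegression
	}
	return exitPass
}

func main() {
	in := gateInput{baselineRate: 90, currentRate: 80, maxRegressionPct: 5, goldenMustPass: true}
	fmt.Println("exit code:", decide(in))
}
```

Checking the golden condition first is what makes exit 2 win when a run both regresses and fails a golden task.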

Golden tasks

New golden: true field on tasks in eval YAML. It's propagated all the way through to results.json (TestOutcome.Golden) so waza gate can enforce it from results alone, without needing to re-read the eval YAML.

Conservative detection: a task is treated as golden if either the baseline or the current run marks it golden. This avoids regressions slipping through when an older baseline predates the field.
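
A rough sketch of this union rule is shown below. The `outcome` type is a hypothetical, trimmed-down stand-in for an entry in results.json (the real TestOutcome has more fields), and `goldenSet` and `failedGolden` are illustrative names rather than functions from this PR.

```go
package main

import "fmt"

// outcome is a hypothetical, simplified stand-in for one results.json entry.
type outcome struct {
	Name   string
	Golden bool
	Passed bool
}

// goldenSet returns the names of tasks marked golden in either run.
func goldenSet(baseline, current []outcome) map[string]bool {
	golden := make(map[string]bool)
	for _, o := range baseline {
		if o.Golden {
			golden[o.Name] = true
		}
	}
	for _, o := range current {
		if o.Golden {
			golden[o.Name] = true
		}
	}
	return golden
}

// failedGolden lists golden tasks that did not pass in the current run.
// Golden tasks that are missing from the current run are not handled here;
// how the real command treats them is not stated in this description.
func failedGolden(baseline, current []outcome) []string {
	golden := goldenSet(baseline, current)
	var failed []string
	for _, o := range current {
		if golden[o.Name] && !o.Passed {
			failed = append(failed, o.Name)
		}
	}
	return failed
}

func main() {
	baseline := []outcome{{Name: "login", Golden: true, Passed: true}}
	// The current run lacks the golden flag, yet the task is still enforced.
	current := []outcome{{Name: "login", Passed: false}, {Name: "search", Passed: false}}
	fmt.Println(failedGolden(baseline, current))
}
```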

Output formats

  • human (default) — colored summary with regression table, golden status, task-set deltas
  • json — machine-readable GateReport
  • markdown — PR-comment-friendly report
  • github-actions — emits ::error:: / ::warning:: / ::notice:: annotations on stdout and appends a markdown summary to $GITHUB_STEP_SUMMARY when set
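
For readers unfamiliar with the github-actions format, the sketch below shows the general shape of a workflow-command annotation and of appending to the step summary file. The helper names `formatAnnotation` and `appendSummary` are hypothetical, and the escaping that GitHub requires for special characters in messages and properties is left out for brevity.

```go
package main

import (
	"fmt"
	"os"
)

// formatAnnotation builds one workflow command line, for example
// "::error title=Regression::success rate dropped".
// Escaping of %, newlines, ':' and ',' is omitted in this sketch.
func formatAnnotation(level, title, msg string) string {
	return fmt.Sprintf("::%s title=%s::%s", level, title, msg)
}

// appendSummary appends markdown to the file named by GITHUB_STEP_SUMMARY.
// It does nothing when the variable is unset, for example outside of CI.
func appendSummary(markdown string) error {
	path := os.Getenv("GITHUB_STEP_SUMMARY")
	if path == "" {
		return nil
	}
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(markdown)
	return err
}

func main() {
	// Annotations go to stdout, where the Actions runner picks them up.
	fmt.Println(formatAnnotation("error", "Regression", "success rate dropped"))
	if err := appendSummary("## Gate report\n\nRegression detected.\n"); err != nil {
		fmt.Fprintln(os.Stderr, "could not write step summary:", err)
	}
}
```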

Tests

cmd/waza/cmd_gate_test.go covers all acceptance criteria:

  • Pass when no regression
  • Regression exceeds threshold
  • Golden failure takes precedence over regression
  • --golden-must-pass=false allows golden to fail without exit 2
  • Golden detected from baseline even when missing in current
  • New/removed task policies (allow/warn/fail combinations)
  • All four output formats render correctly
  • Config errors return exit 3
  • golden YAML roundtrip
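
The new/removed task policies exercised by these tests can be pictured roughly as below. This is an assumption-laden sketch: the `policy` type, `diffTasks`, and `applyPolicy` are hypothetical names, and tasks are identified by plain name strings, which may not match how the real command keys them.

```go
package main

import (
	"fmt"
	"strings"
)

// policy mirrors the allow|warn|fail values accepted by the task-set flags.
type policy string

const (
	policyAllow policy = "allow"
	policyWarn  policy = "warn"
	policyFail  policy = "fail"
)

// diffTasks returns task names present only in current (added)
// and only in baseline (removed), preserving input order.
func diffTasks(baseline, current []string) (added, removed []string) {
	inBaseline := make(map[string]bool, len(baseline))
	for _, name := range baseline {
		inBaseline[name] = true
	}
	inCurrent := make(map[string]bool, len(current))
	for _, name := range current {
		inCurrent[name] = true
		if !inBaseline[name] {
			added = append(added, name)
		}
	}
	for _, name := range baseline {
		if !inCurrent[name] {
			removed = append(removed, name)
		}
	}
	return added, removed
}

// applyPolicy returns a message to report (empty for allow or no tasks)
// and whether the gate should fail because of this task-set change.
func applyPolicy(p policy, kind string, tasks []string) (message string, fail bool) {
	if len(tasks) == 0 || p == policyAllow {
		return "", false
	}
	message = fmt.Sprintf("%d %s task(s): %s", len(tasks), kind, strings.Join(tasks, ", "))
	return message, p == policyFail
}

func main() {
	added, removed := diffTasks([]string{"a", "b", "c"}, []string{"b", "c", "d"})
	// Defaults from the design notes: new tasks allowed, removed tasks warned.
	if msg, _ := applyPolicy(policyAllow, "new", added); msg != "" {
		fmt.Println(msg)
	}
	if msg, fail := applyPolicy(policyWarn, "removed", removed); msg != "" {
		fmt.Println(msg, "fail:", fail)
	}
}
```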

Docs

  • site/src/content/docs/reference/cli.mdx — full ## waza gate section
  • site/src/content/docs/guides/ci-cd.mdx — GitHub Actions + Azure DevOps snippets
  • site/src/content/docs/guides/eval-yaml.mdx: golden field in task fields table

Design notes (simple choices for ambiguous spec items)

  • Golden detection: union across baseline/current (described above) — safer for older baselines.
  • Default policy: --on-new-tasks=allow (additive growth is good), --on-removed-tasks=warn (visibility without breaking PRs that intentionally prune tasks). Both can be overridden to fail for stricter CI.
  • --max-regression-pct=0 default: no regression tolerated unless explicitly allowed.
  • Exit-code plumbing: introduced ExitCodeError in cmd/waza/main.go so subcommands can request specific exit codes without leaking that concern across the rest of the CLI.
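
The exit-code plumbing in the last note can be illustrated with a typed error that the entry point unwraps. Only the name ExitCodeError comes from this PR; the `Code` and `Msg` fields, `exitCodeFor`, and `runGate` are assumptions made for the sketch, and the real definition in cmd/waza/main.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ExitCodeError lets a subcommand request a specific process exit code.
// The field layout here is hypothetical.
type ExitCodeError struct {
	Code int
	Msg  string
}

func (e *ExitCodeError) Error() string { return e.Msg }

// exitCodeFor maps an error returned by a command to an exit code.
// Errors that do not carry a code fall back to 1 (an assumption).
func exitCodeFor(err error) int {
	if err == nil {
		return 0
	}
	var ece *ExitCodeError
	if errors.As(err, &ece) {
		return ece.Code
	}
	return 1
}

// runGate stands in for the gate command returning a golden failure.
// Wrapping with %w keeps the typed error reachable through errors.As.
func runGate() error {
	return fmt.Errorf("gate: %w", &ExitCodeError{Code: 2, Msg: "golden failure (exit 2)"})
}

func main() {
	err := runGate()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	// A real main would call os.Exit(exitCodeFor(err)) here.
	fmt.Println("would exit with", exitCodeFor(err))
}
```

Keeping the mapping in one place means other subcommands can keep returning ordinary errors without knowing about exit codes.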

Files

  • New: cmd/waza/cmd_gate.go, cmd/waza/cmd_gate_test.go
  • Modified: cmd/waza/main.go (ExitCodeError), cmd/waza/root.go (register command), internal/models/{testcase,outcome}.go (Golden field), internal/orchestration/runner.go (propagate Golden through 3 emit paths), site docs.

Add 'waza gate' for CI regression gates: compares baseline vs current
results.json with configurable thresholds and stable exit codes.

- New 'golden' field on TestCase (YAML) and TestOutcome (JSON), propagated
  through runner so gate can read it without re-reading the eval YAML.
- Stable exit codes: 0 pass, 1 regression, 2 golden failure, 3 config error.
  Golden failure takes precedence over plain regression.
- Configurable policies for new/removed tasks (allow/warn/fail).
- Output formats: human, json, markdown, github-actions (annotations +
  $GITHUB_STEP_SUMMARY).
- Conservative golden detection: treat task as golden if either side marks
  it, so older baselines without the field don't bypass enforcement.
- Tests cover all acceptance criteria (regression threshold, golden hard-fail,
  task-set policies, all four output formats, config errors).
- Docs: new CLI reference section, CI/CD guide snippets for GitHub Actions
  and Azure DevOps, golden field documented in eval YAML guide.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 11:15

Copilot AI left a comment

Pull request overview

Adds a new CI-focused regression gate to waza, introducing a waza gate command that compares results.json files (baseline vs current), enforces regression and “golden task must pass” policies, and emits stable exit codes + CI-friendly output formats. This fits into the CLI’s existing results tooling (alongside waza compare) and extends the results schema to carry golden metadata end-to-end.

Changes:

  • Introduces waza gate (human/json/markdown/github-actions output, stable exit codes, task-set delta policies).
  • Adds golden: true support on eval tasks and propagates it into results.json (TestCase.Golden → TestOutcome.Golden).
  • Updates site docs to document waza gate and the new golden task field.

| File | Description |
| --- | --- |
| cmd/waza/cmd_gate.go | Implements the new waza gate command, report model, gating logic, and renderers. |
| cmd/waza/cmd_gate_test.go | Adds acceptance-criteria tests for gating behavior, exit codes, and output formats. |
| cmd/waza/main.go | Adds ExitCodeError plumbing to support stable subcommand-selected exit codes. |
| cmd/waza/root.go | Registers the new gate subcommand. |
| internal/models/testcase.go | Adds TestCase.Golden YAML/JSON field to mark golden tasks. |
| internal/models/outcome.go | Adds TestOutcome.Golden JSON field to persist golden status into results.json. |
| internal/orchestration/runner.go | Propagates Golden into emitted TestOutcomes across execution paths. |
| site/src/content/docs/reference/cli.mdx | Documents waza gate flags, exit codes, and examples. |
| site/src/content/docs/guides/eval-yaml.mdx | Documents the new golden task field. |
| site/src/content/docs/guides/ci-cd.mdx | Adds CI wiring examples for waza gate (GitHub Actions + Azure DevOps). |

Review details

  • Files reviewed: 10/10 changed files
  • Comments generated: 6
  • Review effort level: Low

- Default --max-regression-pct now 0 (was 5.0); explicit threshold required
  to tolerate any drop in success rate
- Help examples updated: separate examples for default (zero tolerance) and
  for tolerating a 5pp drop
- Set SilenceUsage/SilenceErrors on gate cobra cmd; ExitCodeError now
  carries a meaningful message (e.g. 'waza gate: regression (exit 1)')
- GitHub Actions formatter demotes golden annotations to ::warning::
  with title 'Golden task failed (non-blocking)' when --golden-must-pass=false;
  preserves ::error:: only when goldens are required
- Tests: defaultOpts maxRegressionPct=0; TestGate_RegressionWithinThresholdPasses
  now sets the threshold explicitly; new TestGate_DefaultZeroThresholdFailsAnyRegression
  and TestGate_FormatGitHubActionsDemotesGoldenWhenPolicyRelaxed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
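
The annotation demotion described in this commit could look roughly like the following. The function name `goldenAnnotation` is hypothetical, and the title used for the blocking case is an assumption; only the non-blocking title and the two levels are taken from the commit message.

```go
package main

import "fmt"

// goldenAnnotation picks the GitHub Actions annotation level and title
// for a failed golden task, depending on --golden-must-pass.
func goldenAnnotation(goldenMustPass bool) (level, title string) {
	if goldenMustPass {
		// Title for the blocking case is assumed, not taken from the PR.
		return "error", "Golden task failed"
	}
	return "warning", "Golden task failed (non-blocking)"
}

func main() {
	for _, mustPass := range []bool{true, false} {
		level, title := goldenAnnotation(mustPass)
		fmt.Printf("::%s title=%s::task did not pass\n", level, title)
	}
}
```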
@spboyer spboyer merged commit 169ade0 into main Jun 28, 2026
10 checks passed
@spboyer spboyer deleted the spboyer-issue-364-gate branch June 28, 2026 11:30

Development

Successfully merging this pull request may close these issues.

feat: Regression gates — baseline comparison with thresholds and statistical confidence