Problem
waza compare shows diffs, but there's no opinionated gate for CI:
- No "fail the build if pass rate dropped >5%."
- No first-class concept of a golden task that must always pass regardless of aggregate score.
- No machine-readable exit codes or summary that CI annotators can parse.
- Adding or removing tasks between baseline and current breaks naive comparison.
Skill authors copy-paste shell glue to make this work.
Proposal
Add waza gate, a CI-first command that consumes a baseline results.json and a current results.json:
waza gate \
  --baseline baseline.json \
  --current results.json \
  --max-regression-pct 5 \
  --golden-must-pass \
  --on-new-tasks=allow \
  --on-removed-tasks=warn \
  --format=github-actions
- Regression threshold: --max-regression-pct N fails when aggregate pass rate drops > N%.
- Golden tasks: eval.yaml task field golden: true; any failure is a hard fail, independent of aggregate.
- Task set changes: --on-new-tasks and --on-removed-tasks each accept allow|warn|fail.
- Exit codes: documented and stable (0 = pass, 1 = regression, 2 = golden failure, 3 = config error).
- Output formats: human, json, markdown, github-actions (creates inline annotations).
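The gate semantics listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the `TaskResult` shape, the `gate` function, and its parameter names are hypothetical (the real results.json schema is not specified here). It also assumes the threshold is measured in percentage points over each run's own task set, that golden failures take precedence over regressions, and that a task-set change under the `fail` policy maps to exit code 1. The issue does not settle any of these.

```python
from dataclasses import dataclass
from enum import IntEnum


class ExitCode(IntEnum):
    # Values taken from the proposal's exit-code list.
    PASS = 0
    REGRESSION = 1
    GOLDEN_FAILURE = 2
    CONFIG_ERROR = 3


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical per-task record; not the actual results.json schema.
    passed: bool
    golden: bool = False


POLICIES = {"allow", "warn", "fail"}


def pass_rate(results):
    """Aggregate pass rate in percent; an empty run counts as 0."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results.values()) / len(results)


def gate(baseline, current, max_regression_pct=5.0, golden_must_pass=True,
         on_new_tasks="allow", on_removed_tasks="warn"):
    """Return (exit_code, messages) for two name -> TaskResult mappings."""
    if (max_regression_pct < 0 or on_new_tasks not in POLICIES
            or on_removed_tasks not in POLICIES):
        return ExitCode.CONFIG_ERROR, ["invalid gate configuration"]

    # Assumption: a golden failure is a hard fail, checked before the aggregate.
    if golden_must_pass:
        broken = sorted(n for n, r in current.items() if r.golden and not r.passed)
        if broken:
            return ExitCode.GOLDEN_FAILURE, [f"golden task failed: {n}" for n in broken]

    messages = []
    failed = False

    # Task-set deltas are handled by policy instead of breaking the comparison.
    deltas = (
        ("new", current.keys() - baseline.keys(), on_new_tasks),
        ("removed", baseline.keys() - current.keys(), on_removed_tasks),
    )
    for label, names, policy in deltas:
        if names and policy != "allow":
            messages.append(f"{policy}: {label} tasks: {', '.join(sorted(names))}")
            failed = failed or policy == "fail"

    # Assumption: the drop is measured in percentage points.
    drop = pass_rate(baseline) - pass_rate(current)
    if drop > max_regression_pct:
        messages.append(
            f"fail: pass rate dropped {drop:.1f} points (limit {max_regression_pct})"
        )
        failed = True

    return (ExitCode.REGRESSION if failed else ExitCode.PASS), messages
```

A CI wrapper would pass the returned code to `sys.exit` and print the messages in the chosen output format. Whether the rate should instead be computed only over tasks present in both runs is an open design question that this sketch does not answer.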
Statistical significance (p-values from multiple trials) is out of scope for MVP; add it later if multi-trial data (trials > 1) is consistently available.
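For the github-actions output format, one plausible approach is to print GitHub's workflow commands (`::error`, `::warning`, `::notice`) and to append Markdown to the file named by the `GITHUB_STEP_SUMMARY` environment variable. Both mechanisms are documented GitHub Actions features; the helper names `format_annotation` and `write_job_summary` below are hypothetical, and how waza gate would actually emit annotations is not specified in this issue.

```python
import os


def _escape_data(text):
    # Workflow-command message escaping: percent signs and line breaks.
    return text.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(text):
    # Property values additionally escape ':' and ','.
    return _escape_data(text).replace(":", "%3A").replace(",", "%2C")


def format_annotation(level, message, title="waza gate"):
    """Build one workflow-command line; printing it to stdout creates the annotation."""
    if level not in ("notice", "warning", "error"):
        raise ValueError(f"unsupported annotation level: {level!r}")
    return f"::{level} title={_escape_property(title)}::{_escape_data(message)}"


def write_job_summary(markdown, env=os.environ):
    """Append Markdown to the job summary file; return False outside GitHub Actions."""
    path = env.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(markdown + "\n")
    return True
```

Because `write_job_summary` returns `False` when the variable is unset, the same code path can run locally without side effects and fall back to the human format.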
Why this matters for agentic-first
Agentic eval results are noisy. Without a gate that knows about golden tasks and task-set deltas, CI is either too strict (blocks on noise) or useless (always green). This issue makes "ship safely" a one-liner.
Acceptance criteria
- waza gate command implemented.
- golden: true task field absorbed from former feat: Eval dataset versioning, sharing, and golden-set management #359; any golden failure exits 2 regardless of aggregate.
- github-actions output emits inline annotations and a job summary.
- site/ with a CI snippet (GitHub Actions + Azure DevOps).
Non-goals
Related
- waza compare.