Changelog

@spboyer

What's Changed

chore: Release v0.37.0 registry update by @spboyer in #336
docs: add eval registry design by @spboyer in #338
docs: Eval & Grader Registry design doc (#13) by @spboyer in #337
chore: update project dependencies by @spboyer in #339
chore(deps): Bump docker/setup-buildx-action from 3 to 4 by @dependabot[bot] in #340
chore(deps): Bump codecov/codecov-action from 4 to 7 by @dependabot[bot] in #342
chore(deps): Bump golangci/golangci-lint-action from 7 to 9 by @dependabot[bot] in #344
chore(deps-dev): Bump globals from 17.6.0 to 17.7.0 in /web by @dependabot[bot] in #346
chore(deps-dev): Bump vite from 8.0.16 to 8.1.0 in /web by @dependabot[bot] in #348
chore(deps): Bump actions/setup-go from 5 to 6 by @dependabot[bot] in #341
chore(deps-dev): Bump @vitejs/plugin-react from 6.0.2 to 6.0.3 in /web by @dependabot[bot] in #349
chore(deps): Bump @tanstack/react-query from 5.101.0 to 5.101.1 in /web by @dependabot[bot] in #350
chore(deps): Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #345
feat: rubric preset library (closes #360) by @spboyer in #381
feat: regression gate command (closes #364) by @spboyer in #384
feat: OpenTelemetry trace export (closes #362) by @spboyer in #383
feat: focused test generation in waza suggest (closes #357) by @spboyer in #380
feat: spec verify command (closes #361) by @spboyer in #385
feat: schema versioning policy (closes #368) by @spboyer in #382
feat: per-turn checkpoint graders (closes #358) by @spboyer in #386
feat: add MCP server mocks (closes #363) by @spboyer in #387
feat: per-task tool metrics with structured arg matchers (closes #366) by @spboyer in #388
feat: snapshot/replay for deterministic eval reproduction (closes #367) by @spboyer in #391
feat: adversarial / fault-injection harness (closes #365) by @spboyer in #392
chore(deps): update uncovered dependencies by @spboyer in #395
chore(deps-dev): Bump typescript-eslint from 8.61.1 to 8.62.1 in /web by @dependabot[bot] in #343
refactor: decouple ExecutionResponse from Copilot SDK events (Phase 1, #10) by @spboyer in #396
feat: add run SSE events by @spboyer in #397
docs: v0.38.0 feature coverage by @spboyer in #398
test: coverage backfill for v0.38.0 features by @spboyer in #399

Full Changelog: v0.37.0...v0.38.0

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.38.0] - 2026-06-30

Added

Focused eval suggestions — waza suggest now supports targeted generation with --count, --focus, --dry-run, --apply, and --force (#357, #380)
Per-turn checkpoint graders — Task YAML can run inline graders after specific conversation turns with checkpoints[] and on_failure policies (#358, #386)
Rubric preset library — Prompt graders can reuse built-in rubric presets for common judge dimensions; no separate waza rubric subcommand ships in this release (#360, #381)
Spec verification — Added waza spec verify to report eval coverage against SKILL.md requirements (#361, #385)
OpenTelemetry trace export — Added waza run --otel-exporter, --otel-endpoint, --otel-headers, --otel-file, and --otel-include-payloads (#362, #383)
MCP server mocks — Added eval-level mcp_mocks: for hermetic Copilot SDK tool-call evals (#363, #387)
Regression gates — Added waza gate with stable exit codes for pass, regression, golden failure, and config errors (#364, #384)
Adversarial harness — Added waza adversarial and eval-level adversarial: pack configuration for prompt-injection and scope-bypass checks (#365, #392)
Tool metrics and structured argument matchers — Results now include normalized tool_events[]; tool graders can assert structured argument matchers through expect_tools[].args and tool_calls.expect[].args (#366, #388)
Snapshot and replay — Added waza run --snapshot and waza replay with a self-contained snapshot artifact format (#367, #391)
Schema version policy — Documented and enforced MAJOR.MINOR schemaVersion compatibility for public artifacts (#368, #382)
Dashboard SSE resume — Added Last-Event-ID / lastEventId resume support for dashboard event streams, including legacy /api/events (#178, #397)
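The SSE resume entry above relies on the standard server-sent events reconnect convention: the client remembers the last `id:` field it saw and sends it back in a `Last-Event-ID` header. The Go sketch below illustrates that convention only. The helper names, the event names, and the host and port are hypothetical, and it is not taken from waza's dashboard code.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// lastEventID scans a text/event-stream body and returns the value of the
// most recent "id:" field, or "" if no id was seen.
func lastEventID(stream string) string {
	last := ""
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		if v, ok := strings.CutPrefix(sc.Text(), "id:"); ok {
			// SSE strips a single leading space after the colon.
			last = strings.TrimPrefix(v, " ")
		}
	}
	return last
}

// resumeRequest builds a reconnect request that carries Last-Event-ID so the
// server can replay events the client missed.
func resumeRequest(url, lastID string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "text/event-stream")
	if lastID != "" {
		req.Header.Set("Last-Event-ID", lastID)
	}
	return req, nil
}

func main() {
	// Hypothetical event names; only the id/data framing is standard SSE.
	stream := "id: 41\ndata: run.started\n\nid: 42\ndata: task.completed\n\n"
	last := lastEventID(stream)

	// Host and port are assumptions; /api/events is the legacy path named above.
	req, err := resumeRequest("http://localhost:3000/api/events", last)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Header.Get("Last-Event-ID"))
}
```

A client that cannot set headers would need another channel for the same value, which is presumably what the `lastEventId` form in the entry covers.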

Changed

Phase 1 internal refactor — Decoupled ExecutionResponse from Copilot SDK events, with no user-facing CLI, schema, or site behavior changes (#10, #396)

[0.37.0] - 2026-06-18

Added

Interactive skill responder — Eval runs can now drive interactive skills with an LLM responder for more realistic conversational workflows (#304, closes #303)
Triage automation and regression loop — Added triage automation and regression-loop support for Squad workflow validation (#326)

Fixed

Task-level context fixtures — Task-level context fixtures now materialize in workspaces before execution (#329)
waza suggest engine failures — Engine failures are now surfaced by waza suggest instead of being hidden behind success-shaped output (#330)
Session idle hang — Upgraded the Copilot SDK to v1.0.2 and re-bundled Copilot CLI 1.0.64-0 to fix session idle hangs (#333)

Dependencies

Bump astro from 6.3.2 to 6.4.7 in /site (#331)
Bump js-yaml from 4.1.1 to 4.2.0 in /site (#327)

[0.36.0] - 2026-06-15

Added

Squad framework v0.10.0 upgrade — Upgraded Squad from 0.8.25 to 0.10.0 (#322, #323)
Squad workflow failure detection — Added Squad workflow and failure detection infrastructure (#322, #324)

Fixed

Prompt grader timeout configuration — Prompt grader timeout can now be configured with WAZA_PROMPT_GRADER_TIMEOUT (#319)
Session-start hang detection — Added a first-event watchdog to catch session-start hangs (#321)
Non-Squad coordinator canary handling — Clarified the Squad coordinator canary guard so non-Squad sessions can continue without using Squad (#325)

Dependencies

Bump esbuild, @tailwindcss/vite, @vitejs/plugin-react, and vite in /web (#317)

[0.35.0] - 2026-06-06

Added

Copilot SDK v1.0.0 upgrade — Upgraded github.com/github/copilot-sdk/go to v1.0.0 and surfaced premium-request credits on the dashboard (#311)
Model-aware dashboard pricing — Dashboard cost calculation now applies per-model pricing for more accurate run cost reporting (#310)
Git worktree resources in task inputs — Tasks can now reference git worktree resources as inputs (#121, #302)

Fixed

BYOK + --model startup arg — The Copilot CLI validates the startup --model flag against the Copilot catalog before BYOK provider config is applied, so provider-only model IDs would fail. The --model startup arg is now skipped when a BYOK provider is configured (#305, #306)
Model override propagation — --model is now passed via CLIArgs so it correctly overrides user settings and experiment flights (#263)
Copilot CLI PATH fallback — waza no longer falls back silently to a Copilot CLI on PATH when the bundled binary is unavailable (#300)
Installer latest-release selection — Installer now correctly selects the latest standalone waza release (#299)
Skill best practices doc link — Fixed the broken skill best practices reference (#295, #298)
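The two `--model` fixes above combine into one rule: pass the model as a CLI startup argument unless a BYOK provider is configured. The Go sketch below is a minimal illustration of that rule. `startupArgs` and its parameters are hypothetical, and the model IDs are placeholders.

```go
package main

import "fmt"

// startupArgs returns the CLI arguments to pass at startup.
// The --model flag is omitted when no model is requested, and also when a
// BYOK provider is configured, because the CLI would validate the model ID
// against the Copilot catalog before the provider config is applied.
func startupArgs(model string, byokConfigured bool) []string {
	if model == "" || byokConfigured {
		return nil
	}
	return []string{"--model", model}
}

func main() {
	fmt.Println(startupArgs("some-catalog-model", false))
	fmt.Println(startupArgs("provider-only-model", true))
}
```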

Changed

AgentEngine cancellation — Simplified AgentEngine cancellation handling around caller contexts to make shutdown semantics more predictable (#290)

[0.34.0] - 2026-05-23

Added

BYOK provider wiring — Added bring-your-own-key provider support for configured model providers (#240)
waza update command — Added an update command for upgrading local Waza installations (#288)
Skill injection opt-out — Added an option to run evals without injecting the target skill body (#285, #292)
Forbidden skills grading — skill_invocation graders can now assert that specific skills must not be invoked (#286, #291)
Per-trial usage reporting — Results JSON now includes per-trial usage details for deeper run analysis (#277)
Agent-friendly GitHub templates — Added issue and pull request templates tuned for agent-authored work (#293)

Fixed

Tool approval handling — Tool permission handling now uses the SDK approval kind (#268)
Signal cancellation — waza run now respects cancellation signals more reliably (#279)
Sandbox prompt handling — Empty sandbox prompts are guarded before execution (#273, #278)
Custom agent example schema — Fixed the custom-agent eval example to match the supported schema (#282)
Binary release links — Fixed binary release documentation links (#276, #284)
Agent path guidance — Corrected AGENTS root path guidance (#267, #269)

Changed

Run concurrency — waza run now reuses a shared Copilot client and auto-sizes parallel workers when --workers is unset (#135, #221)
Documentation — Updated integration testing, custom-agent eval, and OpenAI Evals model-graded YAML documentation (#281, #283, #14, #280)
Release workflow — GitHub Pages deployment now runs after the release workflow (#265)

[0.33.0] - 2026-05-21

Note: This release includes the changes previously prepared under 0.32.0, which was not published.

Added

Configurable eval file naming — .waza.yaml can now configure files.evalFile, files.taskGlob, and files.taskFileSuffix, with the new naming carried through scaffolding, workspace discovery, discovery mode, schemas, and docs while preserving the existing eval.yaml and tasks/*.yaml defaults (#254, closes #232)
Instruction files in eval runs — Eval-level config.instruction_files and task-level instruction_files now copy files from the active context into task workspaces and append path-labeled contents to the Copilot system message (#248, closes #239)

Fixed

Prompt graders use the execution engine — Prompt graders now route judge turns through CopilotEngine instead of constructing a Copilot client directly, keeping grader execution aligned with engine configuration and preserving follow-up recovery behavior (#258, closes #54)
Prompt grader follow-up recovery — Prompt grading now preserves collected grades when a follow-up turn fails after successful grader collection (#251)
Bundled Copilot CLI updated — Embedded copilot-cli bundles are updated from 1.0.2 to 1.0.49 across supported platforms, with reproducible pinned bundle generation via COPILOT_CLI_VERSION (#260, closes #244)
Spec-aligned skill scaffolding — waza new skill no longer asks for a nonstandard skill type or emits type: frontmatter, and the wizard now rejects early exits that omit required name or description fields (#261, closes #243)
waza check eval discovery — Nested skills and separated evals are discovered consistently in multi-skill workspaces (#247, closes #238)
Skill body routing markers — Compliance scoring now detects trigger, anti-trigger, and routing markers in SKILL.md body sections as well as frontmatter descriptions (#236, closes #223)

Changed

Copilot SDK v0.3.0 migration — Updated github.com/github/copilot-sdk/go to v0.3.0, migrated session event handling to typed payloads, and refreshed transcript, logging, web API, usage collection, suggestion trace, and test coverage for the new API (#255, closes #253)
Dashboard validation coverage — Added coverage for dashboard lint and end-to-end validation (#249)
Install documentation — Replaced unsupported go install guidance and clarified Windows/WSL install behavior (#246, closes #242; #245, closes #241)
Dependencies — Bump devalue in /site, postcss in /web, and astro in /site (#237, #235, #234)

@spboyer

What's Changed

ci: Deploy Pages after release workflow by @spboyer in #265
docs: fix AGENTS root path guidance by @drvoss in #269
fix: guard empty sandbox prompts by @spboyer in #278
fix: respect signal cancellation in waza run by @spboyer in #279
docs: map OpenAI Evals modelgraded YAML to waza graders by @spboyer in #280
Fix custom-agent eval example schema by @spboyer in #282
docs: fix binary release links for #276 by @spboyer in #284
docs: align custom agent eval docs with skill field by @spboyer in #283
docs: update integration testing guide by @spboyer in #281
feat: add per-trial usage to results JSON by @spboyer in #277
fix: use SDK approval kind for tool permissions by @drvoss in #268
feat: wire BYOK providers by @slbug in #240
Add waza update command by @spboyer in #288
Add forbidden skills to skill invocation grader by @spboyer in #291
Add skill body injection opt-out by @spboyer in #292
Add agent-friendly PR and issue templates by @spboyer in #293
Improve waza run concurrency: shared Copilot client + auto-sized workers (#135) by @spboyer with @Copilot in #221
Release v0.34.0 by @spboyer in #294
Fix skill best practices reference by @spboyer in #298
Fix installer latest release selection by @spboyer in #299
fix: prevent Copilot CLI PATH fallback by @spboyer in #300
Simplify AgentEngine cancellation around caller contexts by @spboyer with @Copilot in #290
fix: pass --model via CLIArgs to override user settings and experiment flights by @sebastienlevert in #263
feat: support git worktree resources in task inputs (#121) by @spboyer in #302
fix: skip --model CLI startup arg when BYOK provider is configured (#305) by @spboyer in #306
feat(pricing): model-aware cost calculation for dashboard by @spboyer in #310
Upgrade Copilot SDK to v1.0.0 and surface premium-request credits on the dashboard by @spboyer in #311
Release v0.35.0 by @spboyer in #312
chore(deps): Bump esbuild, @tailwindcss/vite, @vitejs/plugin-react and vite in /web by @dependabot[bot] in #317
fix(execution): add a first-event watchdog to catch session-start hangs by @sebastienlevert in #321
fix(graders): make prompt-grader timeout configurable via WAZA_PROMPT_GRADER_TIMEOUT by @sebastienlevert in #319
feat: Upgrade Squad framework from 0.8.25 to 0.10.0 by @spboyer in #323
Phase 1 & 2: Squad workflows + failure detection infrastructure #322 by @spboyer in #324
[WIP] Add support for custom agents in waza check by @spboyer with @Copilot in #314
Release v0.36.0 by @spboyer in #325
feat: complete issue #322 with triage automation and regression loop by @spboyer in #326
Surface engine failures in waza suggest by @spboyer with @Copilot in #330
Materialize task-level context fixtures in workspaces by @spboyer with @Copilot in #329
chore(deps): Bump js-yaml from 4.1.1 to 4.2.0 in /site by @dependabot[bot] in #327
chore(deps): Bump astro from 6.3.2 to 6.4.7 in /site by @dependabot[bot] in #331
feat: drive interactive skills via an LLM responder (#303) by @adamdougal in #304
chore: upgrade copilot-sdk to v1.0.2 and re-bundle embedded CLI to 1.0.64-0 (fixes session.idle hang) by @sebastienlevert in #333
chore: refresh dependencies by @spboyer in #335
Release v0.37.0 by @spboyer in #334

New Contributors

@slbug made their first contribution in #240
@adamdougal made their first contribution in #304

Full Changelog: v0.33.0...v0.37.0

[0.31.0] - 2026-04-28

Added

Custom agent (.agent.md) eval support — Discover .agent.md files alongside SKILL.md, parse agent-specific frontmatter (tools, model, handoffs, mcp-servers, agents), auto-inject tool_constraint grader from agent tools: field, complete worked example under examples/custom-agent/, and new "Evaluating Custom Agents" docs guide (#226, closes #225)

Fixed

Mock engine echoes file content — _output_contains expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
waza serve no longer crashes when stdin isn't a terminal — MCP stdio server only starts when term.IsTerminal() is true; piped input or background mode no longer kills the HTTP dashboard (#224)

Changed

Vocabulary renames — Internal types renamed: BenchmarkSpec → EvalSpec, TestRunner → EvalRunner. Not a breaking change for external consumers (types live in internal/) (#222)

Documentation

Cross-reference audit for recent renames + custom agent feature: added .agent.md coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)

Dependencies

Bump postcss from 8.5.6 to 8.5.12 in /site (#229)

[0.30.1] - 2026-04-22

Documentation

Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)

[0.30.0] - 2026-04-22

Added

waza quality command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
Scope-reduction advisory check — waza check now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)

@drvoss

What's Changed

chore(deps): Bump astro from 6.1.8 to 6.3.2 in /site by @dependabot[bot] in #234
chore(deps): Bump postcss from 8.5.6 to 8.5.14 in /web by @dependabot[bot] in #235
chore(deps): Bump devalue from 5.6.4 to 5.8.1 in /site by @dependabot[bot] in #237
fix: detect SKILL body routing markers by @drvoss in #236
docs: clarify Windows install guidance by @spboyer in #245
docs: fix unsupported go install guidance by @spboyer in #246
fix: align check eval discovery by @spboyer in #247
feat: include instruction files in eval runs by @spboyer in #248
test: cover dashboard lint and e2e validation by @spboyer in #249
fix: prompt grader gracefully recovers when follow-up turn fails after grades collected by @sebastienlevert in #251
Release v0.32.0 by @spboyer in #252
feat: configure eval file naming by @spboyer in #254
Migrate Copilot SDK to v0.3.0 by @spboyer in #255
fix: route prompt graders through CopilotEngine by @spboyer in #258
fix: bump bundled copilot-cli by @spboyer in #260
fix: remove nonstandard skill type prompt by @spboyer in #261
Release v0.33.0 by @spboyer in #264

New Contributors

@drvoss made their first contribution in #236
@sebastienlevert made their first contribution in #251

Full Changelog: v0.31.0...v0.33.0

[0.29.0] - 2026-04-22

Added

--keep-workspace flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
--no-skills flag and disabled_skills config — Disable specific skills during evaluation to isolate behavior (#126, #216)
Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
Per-task skill_directories — Specify different skill directories for individual tasks in eval YAML (#156, #215)

Dependencies

Bump astro and @astrojs/starlight in /site (#212)

[0.28.0] - 2026-04-21

Added

Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
waza models command — List all available models supported by the configured engine (#208)
Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)

Fixed

Stricter YAML validation — Audited all YAML parsers; unknown fields in TestCase definitions are now properly rejected (#132, #206)
Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)

Documentation

Added Quick Start guide to the documentation site (#205)

[0.27.0] - 2026-04-21

Added

output_contains_any expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
max_response_time_ms behavior rule — Enforce maximum response time constraints on agent execution (#201)
Task prompt from file — Task prompt field can now reference an external file path instead of inline text (#157, #200)
tool_calls grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
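The `output_contains_any` rule above passes when the response contains at least one of the listed strings. The Go sketch below shows that check in isolation. `containsAny` is a hypothetical helper, and case-sensitive matching is an assumption, not confirmed behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether response contains at least one candidate.
// Empty candidates are skipped, since every string contains "".
func containsAny(response string, candidates []string) bool {
	for _, c := range candidates {
		if c != "" && strings.Contains(response, c) {
			return true
		}
	}
	return false
}

func main() {
	response := "Deployment finished successfully."
	fmt.Println(containsAny(response, []string{"succeeded", "successfully"})) // true
	fmt.Println(containsAny(response, []string{"failed", "error"}))           // false
}
```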

Fixed

Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)

[0.26.0] - 2026-04-21

Changed

Timestamped output directories — run --output-dir now groups result files by timestamp for cleaner organization (#153)
Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)

Fixed

--discover finds eval.yaml in nested layout — Skill discovery now correctly locates eval.yaml files in evals/{name}/ directories at the project root (#44)
Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)

Documentation

Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
Updated demo guide and added CI/CD integration guide (#112, #89, #194)

Dependencies

Bump defu from 6.1.4 to 6.1.6 in /site (#181)
Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
Bump astro from 5.17.3 to 5.18.1 in /site (#163)
Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)

[0.25.0] - 2026-04-21

Added

Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)

Fixed

SKILL.md injection and trigger fixture loading — waza run now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)

Dependencies

Bump h3 from 1.15.5 to 1.15.8 in /site (#144)

[0.24.0] - 2026-03-25

Changed

Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)

Fixed

Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
Multi-trial flakiness detection — Detec...

@spboyer

What's Changed

refactor: complete vocabulary renames — BenchmarkSpec→EvalSpec, TestRunner→EvalRunner (#166) by @spboyer in #222
feat: support custom agent (.agent.md) file discovery and parsing #225 by @spboyer in #226
fix: mock engine echoes file content for CI evals (#227) by @spboyer in #228
fix: waza serve crashes when stdin is not a terminal by @spboyer in #224
chore(deps): Bump postcss from 8.5.6 to 8.5.12 in /site by @dependabot[bot] in #229
docs: cross-reference audit for recent renames and feature additions by @spboyer in #230
Release v0.31.0 by @spboyer in #231

Full Changelog: v0.30.1...v0.31.0

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.31.0] - 2026-04-28

Added

Custom agent (.agent.md) eval support — waza now discovers .agent.md files alongside SKILL.md, parses agent-specific frontmatter (tools, model, handoffs, mcp-servers, agents), and auto-injects a tool_constraint grader from the agent's tools: field. Includes a complete worked example under examples/custom-agent/ and a new "Evaluating Custom Agents" docs guide (#226, closes #225)

Fixed

Mock engine echoes file content — _output_contains expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
waza serve no longer crashes when stdin isn't a terminal — MCP stdio server only starts when term.IsTerminal() is true; piped input or background mode no longer kills the HTTP dashboard (#224)

Changed

Vocabulary renames — Internal types renamed: BenchmarkSpec → EvalSpec, TestRunner → EvalRunner. Not a breaking change for external consumers (types live in internal/) (#222)

Documentation

Cross-reference audit for the recent renames and the custom agent feature: added .agent.md coverage to the quickstart, getting-started, GUIDE, TUTORIAL, and examples README; updated mock engine descriptions in INTEGRATION-TESTING and the eval-yaml guide (#230)

Dependencies

Bump postcss from 8.5.6 to 8.5.12 in /site (#229)

[0.30.1] - 2026-04-22

Documentation

Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)

[0.30.0] - 2026-04-22

Added

waza quality command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
Scope-reduction advisory check — waza check now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)

[0.29.0] - 2026-04-22

Added

--keep-workspace flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
--no-skills flag and disabled_skills config — Disable specific skills during evaluation to isolate behavior (#126, #216)
Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
Per-task skill_directories — Specify different skill directories for individual tasks in eval YAML (#156, #215)

Dependencies

Bump astro and @astrojs/starlight in /site (#212)

[0.28.0] - 2026-04-21

Added

Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
waza models command — List all available models supported by the configured engine (#208)
Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)

Fixed

Stricter YAML validation — Audited all YAML parsers; unknown fields in TestCase definitions are now properly rejected (#132, #206)
Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)

Documentation

Added Quick Start guide to the documentation site (#205)

[0.27.0] - 2026-04-21

Added

output_contains_any expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
max_response_time_ms behavior rule — Enforce maximum response time constraints on agent execution (#201)
Task prompt from file — Task prompt field can now reference an external file path instead of inline text (#157, #200)
tool_calls grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
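
As a rough illustration of the output_contains_any semantics described above, the check can be sketched in Go as a pass on the first matching substring. The function name containsAny and the case-sensitive comparison are assumptions made for illustration, not waza's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether response contains at least one of the
// candidate substrings. Case-sensitive matching is an assumption here.
func containsAny(response string, candidates []string) bool {
	for _, c := range candidates {
		if strings.Contains(response, c) {
			return true
		}
	}
	return false
}

func main() {
	resp := "Deployed the app with azd up."
	fmt.Println(containsAny(resp, []string{"terraform apply", "azd up"}))
	fmt.Println(containsAny(resp, []string{"kubectl"}))
}
```

An empty candidate list never passes in this sketch, which seems the safer default for an expectation.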

Fixed

Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)

[0.26.0] - 2026-04-21

Changed

Timestamped output directories — run --output-dir now groups result files by timestamp for cleaner organization (#153)
Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)

Fixed

--discover finds eval.yaml in nested layout — Skill discovery now correctly locates eval.yaml files in evals/{name}/ directories at the project root (#44)
Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)

Documentation

Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
Updated demo guide and added CI/CD integration guide (#112, #89, #194)

Dependencies

Bump defu from 6.1.4 to 6.1.6 in /site (#181)
Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
Bump astro from 5.17.3 to 5.18.1 in /site (#163)
Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)

[0.25.0] - 2026-04-21

Added

Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)

Fixed

SKILL.md injection and trigger fixture loading — waza run now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)

Dependencies

Bump h3 from 1.15.5 to 1.15.8 in /site (#144)

[0.24.0] - 2026-03-25

Changed

Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)
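
The effect of strict decoding can be sketched with the Go standard library. This example uses encoding/json's DisallowUnknownFields as an analogue of the YAML decoder's KnownFields(true); the Config type and the decodeStrict helper are hypothetical. It also suggests why the max_workers rename is breaking: under strict decoding, the old key would be rejected rather than silently ignored, assuming no compatibility alias is kept.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config is a hypothetical stand-in for a waza config type.
type Config struct {
	Workers int `json:"workers"`
}

// decodeStrict rejects any key that does not map to a Config field.
func decodeStrict(src string) (Config, error) {
	var cfg Config
	dec := json.NewDecoder(strings.NewReader(src))
	dec.DisallowUnknownFields() // JSON counterpart of KnownFields(true)
	err := dec.Decode(&cfg)
	return cfg, err
}

func main() {
	// The old key is now an unknown field and produces an error.
	if _, err := decodeStrict(`{"max_workers": 4}`); err != nil {
		fmt.Println("rejected:", err)
	}
	// The renamed key decodes normally.
	cfg, err := decodeStrict(`{"workers": 4}`)
	fmt.Println(cfg.Workers, err)
}
```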

Fixed

Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
Skill-aware thresholds — waza tokens compare supports skill-specific threshold configuration (#93)
Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)
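
One simple way to think about multi-trial flakiness detection is that an eval is flaky when repeated trials of the same task disagree. The sketch below assumes that definition and a hypothetical isFlaky helper; waza's actual criteria may differ.

```go
package main

import "fmt"

// isFlaky reports whether repeated trials of one task disagree,
// i.e. at least one trial passed and at least one failed.
func isFlaky(passed []bool) bool {
	for i := 1; i < len(passed); i++ {
		if passed[i] != passed[0] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isFlaky([]bool{true, false, true})) // mixed outcomes
	fmt.Println(isFlaky([]bool{false, false}))      // consistently failing, not flaky
}
```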

Fixed

waza suggest deadlock — Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
--output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
Web dashboard build order — Build dashboard assets before Go compilation (#107)
Test file leak — Fixed test that leaked files into the repo (#120)
Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
Skill discovery path — Discover skills under .github/skills/ directory (#69)
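
The ordering fix for waza suggest can be illustrated with a small Go sketch: derive the deadline first, then make the blocking call, so the call is already bounded when it runs. The start and execute functions below are hypothetical stand-ins, not waza's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// start is a hypothetical stand-in for a Start() call that may never
// return on its own. It waits for ready or for the context to end.
func start(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// execute applies the timeout before calling start, so a hung startup
// surfaces as a deadline error instead of blocking forever.
func execute(timeout time.Duration, ready <-chan struct{}) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return start(ctx, ready)
}

func main() {
	never := make(chan struct{}) // never closed: simulates a hung startup
	err := execute(50*time.Millisecond, never)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```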

Changed

Renamed config node max_workers to workers for consistency across all config types (breaking change)
Custom YAML deserializers for config types (#106)
Validate only known fields in YAML decoders (#132)
Token limits priority inverted to .waza.yaml first (#64)
@wbreza added to CODEOWNERS (#111)
Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks...

v0.30.1

Documentation

README updated — Added missing waza models command documentation with usage examples and flags (#220)

Full Changelog: v0.30.0...v0.30.1

@spboyer

What's New in v0.30.0

New Features

waza quality command (#98) — LLM-as-Judge skill quality scoring. Evaluates SKILL.md across 5 dimensions (clarity, completeness, trigger precision, scope coverage, anti-patterns) using the Copilot SDK. Scored 1-5 with visual bar output. Supports --format json for CI integration. (@spboyer)
Scope reduction advisory (#183) — waza check now warns when a skill has low capability scope, detecting potential token-limit compression loss. Parses USE FOR phrases, headings, and numbered procedures as capability signals. (@diberry)

Housekeeping

Closed 5 stale issues that were already implemented: #59 (token limits priority), #86 (per-file budgets), #81 (tokens diff), #83 (eval scaffolding), #162 (TypeSpec user query — answered)

Full Changelog: v0.29.0...v0.30.0

Releases: microsoft/waza

Waza v0.38.0

Waza azd Extension v0.38.0

Changelog

[Unreleased]

[0.38.0] - 2026-06-30
[0.37.0] - 2026-06-18
[0.36.0] - 2026-06-15
[0.35.0] - 2026-06-06
[0.34.0] - 2026-05-23
[0.33.0] - 2026-05-21
[0.31.0] - 2026-04-28

Waza v0.37.0

Waza azd Extension v0.37.0

Waza v0.33.0

What's Changed