Releases: microsoft/waza
Release list
Waza v0.38.0
What's Changed
- chore: Release v0.37.0 registry update by @spboyer in #336
- docs: add eval registry design by @spboyer in #338
- docs: Eval & Grader Registry design doc (#13) by @spboyer in #337
- chore: update project dependencies by @spboyer in #339
- chore(deps): Bump docker/setup-buildx-action from 3 to 4 by @dependabot[bot] in #340
- chore(deps): Bump codecov/codecov-action from 4 to 7 by @dependabot[bot] in #342
- chore(deps): Bump golangci/golangci-lint-action from 7 to 9 by @dependabot[bot] in #344
- chore(deps-dev): Bump globals from 17.6.0 to 17.7.0 in /web by @dependabot[bot] in #346
- chore(deps-dev): Bump vite from 8.0.16 to 8.1.0 in /web by @dependabot[bot] in #348
- chore(deps): Bump actions/setup-go from 5 to 6 by @dependabot[bot] in #341
- chore(deps-dev): Bump @vitejs/plugin-react from 6.0.2 to 6.0.3 in /web by @dependabot[bot] in #349
- chore(deps): Bump @tanstack/react-query from 5.101.0 to 5.101.1 in /web by @dependabot[bot] in #350
- chore(deps): Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #345
- feat: rubric preset library (closes #360) by @spboyer in #381
- feat: regression gate command (closes #364) by @spboyer in #384
- feat: OpenTelemetry trace export (closes #362) by @spboyer in #383
- feat: focused test generation in waza suggest (closes #357) by @spboyer in #380
- feat: spec verify command (closes #361) by @spboyer in #385
- feat: schema versioning policy (closes #368) by @spboyer in #382
- feat: per-turn checkpoint graders (closes #358) by @spboyer in #386
- feat: add MCP server mocks (closes #363) by @spboyer in #387
- feat: per-task tool metrics with structured arg matchers (closes #366) by @spboyer in #388
- feat: snapshot/replay for deterministic eval reproduction (closes #367) by @spboyer in #391
- feat: adversarial / fault-injection harness (closes #365) by @spboyer in #392
- chore(deps): update uncovered dependencies by @spboyer in #395
- chore(deps-dev): Bump typescript-eslint from 8.61.1 to 8.62.1 in /web by @dependabot[bot] in #343
- refactor: decouple ExecutionResponse from Copilot SDK events (Phase 1, #10) by @spboyer in #396
- feat: add run SSE events by @spboyer in #397
- docs: v0.38.0 feature coverage by @spboyer in #398
- test: coverage backfill for v0.38.0 features by @spboyer in #399
Full Changelog: v0.37.0...v0.38.0
Waza azd Extension v0.38.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.38.0] - 2026-06-30
Added
- Focused eval suggestions — `waza suggest` now supports targeted generation with `--count`, `--focus`, `--dry-run`, `--apply`, and `--force` (#357, #380)
- Per-turn checkpoint graders — Task YAML can run inline graders after specific conversation turns with `checkpoints[]` and `on_failure` policies (#358, #386)
- Rubric preset library — Prompt graders can reuse built-in rubric presets for common judge dimensions; no separate `waza rubric` subcommand ships in this release (#360, #381)
- Spec verification — Added `waza spec verify` to report eval coverage against `SKILL.md` requirements (#361, #385)
- OpenTelemetry trace export — Added `waza run --otel-exporter`, `--otel-endpoint`, `--otel-headers`, `--otel-file`, and `--otel-include-payloads` (#362, #383)
- MCP server mocks — Added eval-level `mcp_mocks:` for hermetic Copilot SDK tool-call evals (#363, #387)
- Regression gates — Added `waza gate` with stable exit codes for pass, regression, golden failure, and config errors (#364, #384)
- Adversarial harness — Added `waza adversarial` and eval-level `adversarial:` pack configuration for prompt-injection and scope-bypass checks (#365, #392)
- Tool metrics and structured argument matchers — Results now include normalized `tool_events[]`; tool graders can assert structured argument matchers through `expect_tools[].args` and `tool_calls.expect[].args` (#366, #388)
- Snapshot and replay — Added `waza run --snapshot` and `waza replay` with a self-contained snapshot artifact format (#367, #391)
- Schema version policy — Documented and enforced MAJOR.MINOR `schemaVersion` compatibility for public artifacts (#368, #382)
- Dashboard SSE resume — Added `Last-Event-ID`/`lastEventId` resume support for dashboard event streams, including legacy `/api/events` (#178, #397)
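The SSE resume item above builds on the standard Server-Sent Events mechanism: each event may carry an `id:` field, and a reconnecting client sends the last id it saw in a `Last-Event-ID` request header. The following is a minimal client-side sketch of that bookkeeping in Python. The sample stream contents are invented for illustration and are not waza's actual event schema, and no request is made to a real dashboard.

```python
def parse_sse(lines):
    """Yield (last_event_id, data) for each event in an iterable of SSE lines."""
    last_id, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield last_id, "\n".join(data)
            data = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "id":
            last_id = value
        elif field == "data":
            data.append(value)


def resume_headers(last_event_id):
    """Build request headers for reconnecting after the given event id."""
    headers = {"Accept": "text/event-stream"}
    if last_event_id is not None:
        headers["Last-Event-ID"] = last_event_id
    return headers


# Hypothetical stream contents, for illustration only.
stream = ["id: 41\n", "data: task started\n", "\n",
          "id: 42\n", "data: task finished\n", "\n"]
last_seen = None
for last_seen, payload in parse_sse(stream):
    pass  # a real client would update its UI here

print(resume_headers(last_seen))
```

On reconnect, a client would pass the headers from `resume_headers(last_seen)` so the server can replay only the events that follow that id.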
Changed
- Phase 1 internal refactor — Internal cleanup with no user-facing CLI, schema, or site behavior changes (#10)
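The changelog does not spell out the exact MAJOR.MINOR `schemaVersion` rules, so the sketch below shows only one common interpretation, stated as an assumption: a reader accepts an artifact when the MAJOR numbers match and the artifact's MINOR is not newer than the reader's. The function names are hypothetical and are not part of waza.

```python
def parse_schema_version(text):
    """Split a 'MAJOR.MINOR' string into two integers."""
    major, minor = text.split(".")  # raises ValueError for other shapes
    return int(major), int(minor)


def is_compatible(artifact_version, reader_version):
    """Assumed rule: same MAJOR, and artifact MINOR <= reader MINOR."""
    a_major, a_minor = parse_schema_version(artifact_version)
    r_major, r_minor = parse_schema_version(reader_version)
    return a_major == r_major and a_minor <= r_minor


print(is_compatible("1.2", "1.3"))  # older minor, same major
print(is_compatible("2.0", "1.3"))  # different major
```

Under this assumed rule, MINOR bumps stay readable by newer tools while a MAJOR bump signals a break; the project's documented policy is the authority on the real behavior.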
Waza v0.37.0
What's Changed
- ci: Deploy Pages after release workflow by @spboyer in #265
- docs: fix AGENTS root path guidance by @drvoss in #269
- fix: guard empty sandbox prompts by @spboyer in #278
- fix: respect signal cancellation in waza run by @spboyer in #279
- docs: map OpenAI Evals modelgraded YAML to waza graders by @spboyer in #280
- Fix custom-agent eval example schema by @spboyer in #282
- docs: fix binary release links for #276 by @spboyer in #284
- docs: align custom agent eval docs with skill field by @spboyer in #283
- docs: update integration testing guide by @spboyer in #281
- feat: add per-trial usage to results JSON by @spboyer in #277
- fix: use SDK approval kind for tool permissions by @drvoss in #268
- feat: wire BYOK providers by @slbug in #240
- Add waza update command by @spboyer in #288
- Add forbidden skills to skill invocation grader by @spboyer in #291
- Add skill body injection opt-out by @spboyer in #292
- Add agent-friendly PR and issue templates by @spboyer in #293
- Improve `waza run` concurrency: shared Copilot client + auto-sized workers (#135) by @spboyer with @Copilot in #221
- Release v0.34.0 by @spboyer in #294
- Fix skill best practices reference by @spboyer in #298
- Fix installer latest release selection by @spboyer in #299
- fix: prevent Copilot CLI PATH fallback by @spboyer in #300
- Simplify AgentEngine cancellation around caller contexts by @spboyer with @Copilot in #290
- fix: pass --model via CLIArgs to override user settings and experiment flights by @sebastienlevert in #263
- feat: support git worktree resources in task inputs (#121) by @spboyer in #302
- fix: skip --model CLI startup arg when BYOK provider is configured (#305) by @spboyer in #306
- feat(pricing): model-aware cost calculation for dashboard by @spboyer in #310
- Upgrade Copilot SDK to v1.0.0 and surface premium-request credits on the dashboard by @spboyer in #311
- Release v0.35.0 by @spboyer in #312
- chore(deps): Bump esbuild, @tailwindcss/vite, @vitejs/plugin-react and vite in /web by @dependabot[bot] in #317
- fix(execution): add a first-event watchdog to catch session-start hangs by @sebastienlevert in #321
- fix(graders): make prompt-grader timeout configurable via WAZA_PROMPT_GRADER_TIMEOUT by @sebastienlevert in #319
- feat: Upgrade Squad framework from 0.8.25 to 0.10.0 by @spboyer in #323
- Phase 1 & 2: Squad workflows + failure detection infrastructure #322 by @spboyer in #324
- [WIP] Add support for custom agents in waza check by @spboyer with @Copilot in #314
- Release v0.36.0 by @spboyer in #325
- feat: complete issue #322 with triage automation and regression loop by @spboyer in #326
- Surface engine failures in `waza suggest` by @spboyer with @Copilot in #330
- Materialize task-level context fixtures in workspaces by @spboyer with @Copilot in #329
- chore(deps): Bump js-yaml from 4.1.1 to 4.2.0 in /site by @dependabot[bot] in #327
- chore(deps): Bump astro from 6.3.2 to 6.4.7 in /site by @dependabot[bot] in #331
- feat: drive interactive skills via an LLM responder (#303) by @adamdougal in #304
- chore: upgrade copilot-sdk to v1.0.2 and re-bundle embedded CLI to 1.0.64-0 (fixes session.idle hang) by @sebastienlevert in #333
- chore: refresh dependencies by @spboyer in #335
- Release v0.37.0 by @spboyer in #334
New Contributors
- @slbug made their first contribution in #240
- @adamdougal made their first contribution in #304
Full Changelog: v0.33.0...v0.37.0
Waza azd Extension v0.37.0
Changelog
[0.37.0] - 2026-06-18
Added
- Interactive skill responder — Eval runs can now drive interactive skills with an LLM responder for more realistic conversational workflows (#304, closes #303)
- Triage automation and regression loop — Added triage automation and regression-loop support for Squad workflow validation (#326)
Fixed
- Task-level context fixtures — Task-level context fixtures now materialize in workspaces before execution (#329)
- `waza suggest` engine failures — Engine failures are now surfaced by `waza suggest` instead of being hidden behind success-shaped output (#330)
- Session idle hang — Upgraded the Copilot SDK to v1.0.2 and re-bundled Copilot CLI 1.0.64-0 to fix session idle hangs (#333)
Dependencies
[0.36.0] - 2026-06-15
Added
- Squad framework v0.10.0 upgrade — Upgraded Squad from 0.8.25 to 0.10.0 (#322, #323)
- Squad workflow failure detection — Added Squad workflow and failure detection infrastructure (#322, #324)
Fixed
- Prompt grader timeout configuration — Prompt grader timeout can now be configured with `WAZA_PROMPT_GRADER_TIMEOUT` (#319)
- Session-start hang detection — Added a first-event watchdog to catch session-start hangs (#321)
- Non-Squad coordinator canary handling — Clarified the Squad coordinator canary guard so non-Squad sessions can continue without using Squad (#325)
Dependencies
- Bump esbuild, @tailwindcss/vite, @vitejs/plugin-react, and vite (#317)
[0.35.0] - 2026-06-06
Added
- Copilot SDK v1.0.0 upgrade — Upgraded `github.com/github/copilot-sdk/go` to v1.0.0 and surfaced premium-request credits on the dashboard (#311)
- Model-aware dashboard pricing — Dashboard cost calculation now applies per-model pricing for more accurate run cost reporting (#310)
- Git worktree resources in task inputs — Tasks can now reference git worktree resources as inputs (#121, #302)
Fixed
- BYOK + `--model` startup arg — The Copilot CLI validates the startup `--model` flag against the Copilot catalog before BYOK provider config is applied, so provider-only model IDs would fail. The `--model` startup arg is now skipped when a BYOK provider is configured (#305, #306)
- Model override propagation — `--model` is now passed via `CLIArgs` so it correctly overrides user settings and experiment flights (#263)
- Copilot CLI PATH fallback — Prevent silent fallback to a Copilot CLI on `PATH` when the bundled binary is unavailable (#300)
- Installer latest-release selection — Installer now correctly selects the latest standalone waza release (#299)
- Skill best practices doc link — Fixed the broken skill best practices reference (#295, #298)
Changed
- AgentEngine cancellation — Simplified `AgentEngine` cancellation handling around caller contexts to make shutdown semantics more predictable (#290)
[0.34.0] - 2026-05-23
Added
- BYOK provider wiring — Added bring-your-own-key provider support for configured model providers (#240)
- `waza update` command — Added an update command for upgrading local Waza installations (#288)
- Skill injection opt-out — Added an option to run evals without injecting the target skill body (#285, #292)
- Forbidden skills grading — `skill_invocation` graders can now assert that specific skills must not be invoked (#286, #291)
- Per-trial usage reporting — Results JSON now includes per-trial usage details for deeper run analysis (#277)
- Agent-friendly GitHub templates — Added issue and pull request templates tuned for agent-authored work (#293)
Fixed
- Tool approval handling — Tool permission handling now uses the SDK approval kind (#240)
- Signal cancellation — `waza run` now respects cancellation signals more reliably (#279)
- Sandbox prompt handling — Empty sandbox prompts are guarded before execution (#273, #278)
- Custom agent example schema — Fixed the custom-agent eval example to match the supported schema (#282)
- Binary release links — Fixed binary release documentation links (#276, #284)
- Agent path guidance — Corrected AGENTS root path guidance (#267, #269)
Changed
- Run concurrency — `waza run` now reuses a shared Copilot client and auto-sizes parallel workers when `--workers` is unset (#135, #221)
- Documentation — Updated integration testing, custom-agent eval, and OpenAI Evals model-graded YAML documentation (#281, #283, #14, #280)
- Release workflow — GitHub Pages deployment now runs after the release workflow (#265)
Waza v0.33.0
What's Changed
- chore(deps): Bump astro from 6.1.8 to 6.3.2 in /site by @dependabot[bot] in #234
- chore(deps): Bump postcss from 8.5.6 to 8.5.14 in /web by @dependabot[bot] in #235
- chore(deps): Bump devalue from 5.6.4 to 5.8.1 in /site by @dependabot[bot] in #237
- fix: detect SKILL body routing markers by @drvoss in #236
- docs: clarify Windows install guidance by @spboyer in #245
- docs: fix unsupported go install guidance by @spboyer in #246
- fix: align check eval discovery by @spboyer in #247
- feat: include instruction files in eval runs by @spboyer in #248
- test: cover dashboard lint and e2e validation by @spboyer in #249
- fix: prompt grader gracefully recovers when follow-up turn fails after grades collected by @sebastienlevert in #251
- Release v0.32.0 by @spboyer in #252
- feat: configure eval file naming by @spboyer in #254
- Migrate Copilot SDK to v0.3.0 by @spboyer in #255
- fix: route prompt graders through CopilotEngine by @spboyer in #258
- fix: bump bundled copilot-cli by @spboyer in #260
- fix: remove nonstandard skill type prompt by @spboyer in #261
- Release v0.33.0 by @spboyer in #264
New Contributors
- @drvoss made their first contribution in #236
- @sebastienlevert made their first contribution in #251
Full Changelog: v0.31.0...v0.33.0
Waza azd Extension v0.33.0
Changelog
[0.33.0] - 2026-05-21
Note: This release includes the changes previously prepared under 0.32.0, which was not published.
Added
- Configurable eval file naming — `.waza.yaml` can now configure `files.evalFile`, `files.taskGlob`, and `files.taskFileSuffix`, with the new naming carried through scaffolding, workspace discovery, discovery mode, schemas, and docs while preserving the existing `eval.yaml` and `tasks/*.yaml` defaults (#254, closes #232)
- Instruction files in eval runs — Eval-level `config.instruction_files` and task-level `instruction_files` now copy files from the active context into task workspaces and append path-labeled contents to the Copilot system message (#248, closes #239)
Fixed
- Prompt graders use the execution engine — Prompt graders now route judge turns through `CopilotEngine` instead of constructing a Copilot client directly, keeping grader execution aligned with engine configuration and preserving follow-up recovery behavior (#258, closes #54)
- Prompt grader follow-up recovery — Prompt grading now preserves collected grades when a follow-up turn fails after successful grader collection (#251)
- Bundled Copilot CLI updated — Embedded `copilot-cli` bundles are updated from 1.0.2 to 1.0.49 across supported platforms, with reproducible pinned bundle generation via `COPILOT_CLI_VERSION` (#260, closes #244)
- Spec-aligned skill scaffolding — `waza new skill` no longer asks for a nonstandard skill type or emits `type:` frontmatter, and the wizard now rejects early exits that omit required name or description fields (#261, closes #243)
- `waza check` eval discovery — Nested skills and separated evals are discovered consistently in multi-skill workspaces (#247, closes #238)
- Skill body routing markers — Compliance scoring now detects trigger, anti-trigger, and routing markers in `SKILL.md` body sections as well as frontmatter descriptions (#236, closes #223)
Changed
- Copilot SDK v0.3.0 migration — Updated `github.com/github/copilot-sdk/go` to v0.3.0, migrated session event handling to typed payloads, and refreshed transcript, logging, web API, usage collection, suggestion trace, and test coverage for the new API (#255, closes #253)
- Dashboard validation coverage — Added coverage for dashboard lint and end-to-end validation (#249)
- Install documentation — Replaced unsupported `go install` guidance and clarified Windows/WSL install behavior (#246, closes #242; #245, closes #241)
- Dependencies — Bump devalue in /site, postcss in /web, and astro in /site (#237, #235, #234)
[0.31.0] - 2026-04-28
Added
- Custom agent (`.agent.md`) eval support — Discover `.agent.md` files alongside `SKILL.md`, parse agent-specific frontmatter (`tools`, `model`, `handoffs`, `mcp-servers`, `agents`), auto-inject `tool_constraint` grader from agent `tools:` field, complete worked example under `examples/custom-agent/`, and new "Evaluating Custom Agents" docs guide (#226, closes #225)
Fixed
- Mock engine echoes file content — `_output_contains` expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
- `waza serve` no longer crashes when stdin isn't a terminal — MCP stdio server only starts when `term.IsTerminal()` is true; piped input or background mode no longer kills the HTTP dashboard (#224)
Changed
- Vocabulary renames — Internal types renamed: `BenchmarkSpec` → `EvalSpec`, `TestRunner` → `EvalRunner`. Not a breaking change for external consumers (types live in `internal/`) (#222)
Documentation
- Cross-reference audit for recent renames + custom agent feature: added `.agent.md` coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)
Dependencies
- Bump postcss from 8.5.6 to 8.5.12 in /site (#229)
[0.30.1] - 2026-04-22
Documentation
- Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)
[0.30.0] - 2026-04-22
Added
- `waza quality` command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
- Scope-reduction advisory check — `waza check` now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)
[0.29.0] - 2026-04-22
Added
- `--keep-workspace` flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
- `--no-skills` flag and `disabled_skills` config — Disable specific skills during evaluation to isolate behavior (#126, #216)
- Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
- Per-task `skill_directories` — Specify different skill directories for individual tasks in eval YAML (#156, #215)
Dependencies
- Bump astro and @astrojs/starlight in /site (#212)
[0.28.0] - 2026-04-21
Added
- Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
- `waza models` command — List all available models supported by the configured engine (#208)
- Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)
Fixed
- Stricter YAML validation — Audited all YAML parsers; unknown fields in `TestCase` definitions are now properly rejected (#132, #206)
- Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
- CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)
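Rejecting unknown fields, as in the stricter YAML validation fix above, turns a silent typo into an immediate error. The Python sketch below illustrates the idea on an already-parsed mapping. The `ExampleCase` type and its fields are hypothetical stand-ins, not waza's real `TestCase` schema, and waza itself is a Go project that may implement this differently.

```python
from dataclasses import dataclass, fields


@dataclass
class ExampleCase:
    # Hypothetical fields, chosen only for the illustration.
    name: str
    prompt: str = ""


def load_strict(cls, raw):
    """Build cls from a mapping, refusing any key the dataclass does not define."""
    allowed = {f.name for f in fields(cls)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise ValueError(
            f"unknown field(s) in {cls.__name__}: {', '.join(unknown)}"
        )
    return cls(**raw)


ok = load_strict(ExampleCase, {"name": "greets-user", "prompt": "Say hello"})

try:
    # 'promt' is a typo; a lenient loader would silently drop it.
    load_strict(ExampleCase, {"name": "greets-user", "promt": "Say hello"})
except ValueError as err:
    message = str(err)
```

A lenient loader would have accepted the second mapping and run the case with an empty prompt, which is the kind of misconfiguration strict parsing is meant to surface early.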
Documentation
- Added Quick Start guide to the documentation site (#205)
[0.27.0] - 2026-04-21
Added
- `output_contains_any` expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
- `max_response_time_ms` behavior rule — Enforce maximum response time constraints on agent execution (#201)
- Task prompt from file — Task `prompt` field can now reference an external file path instead of inline text (#157, #200)
- `tool_calls` grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
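The `output_contains_any` semantics described above ("passes when the agent response contains any one of the specified strings") can be sketched in a few lines of Python. This is an illustration of the stated rule only: the function name mirrors the field for readability, and case-sensitive substring matching is an assumption, since the changelog does not say how waza compares strings.

```python
def output_contains_any(response, candidates):
    """Return True when at least one candidate string occurs in the response.

    Assumption: plain, case-sensitive substring matching.
    """
    return any(candidate in response for candidate in candidates)


response = "Deployment finished: 3 resources created."
print(output_contains_any(response, ["created", "provisioned"]))
print(output_contains_any(response, ["failed", "error"]))
```

An empty list of candidates never passes under this sketch, because `any()` over no items is false.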
Fixed
- Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)
[0.26.0] - 2026-04-21
Changed
- Timestamped output directories — `run --output-dir` now groups result files by timestamp for cleaner organization (#153)
- Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)
Fixed
- `--discover` finds eval.yaml in nested layout — Skill discovery now correctly locates `eval.yaml` files in `evals/{name}/` directories at the project root (#44)
- Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
- Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
- macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)
Documentation
- Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
- Updated demo guide and added CI/CD integration guide (#112, #89, #194)
Dependencies
- Bump defu from 6.1.4 to 6.1.6 in /site (#181)
- Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
- Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
- Bump astro from 5.17.3 to 5.18.1 in /site (#163)
- Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
- Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)
[0.25.0] - 2026-04-21
Added
- Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)
Fixed
- SKILL.md injection and trigger fixture loading — `waza run` now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)
Dependencies
- Bump h3 from 1.15.5 to 1.15.8 in /site (#144)
[0.24.0] - 2026-03-25
Changed
- Strict YAML validation — All YAML parsers now use `KnownFields(true)` to reject unknown fields, catching typos and misconfigurations early (#132, #133)
- `max_workers` renamed to `workers` — Config YAML key renamed for consistency across all config types (breaking change)
- Unified token counting — `waza check` and `waza tokens count` now share the same counting logic for consistent results (#146)
Fixed
- Typo in prompt grader — Fixed "prmopt" → "prompt" in error message
Dependencies
- Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
- Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)
[0.21.0] - 2026-03-12
Added
- `waza new task from-prompt` command — Record Copilot sessions into task YAML files for eval creation (#110)
- Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
- Eval scaffolding command — `waza eval new` generates eval.yaml scaffolding for skills (#94)
- Multi-trial flakiness detection — Detec...
Waza v0.31.0
What's Changed
- refactor: complete vocabulary renames — BenchmarkSpec→EvalSpec, TestRunner→EvalRunner (#166) by @spboyer in #222
- feat: support custom agent (.agent.md) file discovery and parsing #225 by @spboyer in #226
- fix: mock engine echoes file content for CI evals (#227) by @spboyer in #228
- fix: waza serve crashes when stdin is not a terminal by @spboyer in #224
- chore(deps): Bump postcss from 8.5.6 to 8.5.12 in /site by @dependabot[bot] in #229
- docs: cross-reference audit for recent renames and feature additions by @spboyer in #230
- Release v0.31.0 by @spboyer in #231
Full Changelog: v0.30.1...v0.31.0
Waza azd Extension v0.31.0
Changelog
All notable changes to waza will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.
[Unreleased]
[0.31.0] - 2026-04-28
Added
- Custom agent (`.agent.md`) eval support — Discover `.agent.md` files alongside `SKILL.md`, parse agent-specific frontmatter (`tools`, `model`, `handoffs`, `mcp-servers`, `agents`), auto-inject `tool_constraint` grader from agent `tools:` field, complete worked example under `examples/custom-agent/`, and new "Evaluating Custom Agents" docs guide (#226, closes #225)
Fixed
- Mock engine echoes file content — `_output_contains` expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
- `waza serve` no longer crashes when stdin isn't a terminal — MCP stdio server only starts when `term.IsTerminal()` is true; piped input or background mode no longer kills the HTTP dashboard (#224)
Changed
- Vocabulary renames — Internal types renamed: `BenchmarkSpec` → `EvalSpec`, `TestRunner` → `EvalRunner`. Not a breaking change for external consumers (types live in `internal/`) (#222)
Documentation
- Cross-reference audit for recent renames + custom agent feature: added `.agent.md` coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)
Dependencies
- Bump postcss from 8.5.6 to 8.5.12 in /site (#229)
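The `waza serve` fix in this release gates the MCP stdio server on a terminal check. The changelog names Go's `term.IsTerminal()`; the sketch below is only a rough Python analogue built on `isatty()`, and `should_start_stdio_server` is a hypothetical name, not part of waza.

```python
import io
import sys


def should_start_stdio_server(stream=None) -> bool:
    """Return True only when the stream is attached to an interactive terminal."""
    stream = sys.stdin if stream is None else stream
    return stream.isatty()


# Piped or in-memory input is not a terminal, so the stdio server would be
# skipped while the HTTP dashboard keeps running.
piped = io.StringIO("piped input")
print(should_start_stdio_server(piped))
```

With an in-memory stream standing in for piped input, the check reports that the stdio server should not start.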
[0.30.1] - 2026-04-22
Documentation
- Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)
[0.30.0] - 2026-04-22
Added
- `waza quality` command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
- Scope-reduction advisory check — `waza check` now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)
[0.29.0] - 2026-04-22
Added
- `--keep-workspace` flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
- `--no-skills` flag and `disabled_skills` config — Disable specific skills during evaluation to isolate behavior (#126, #216)
- Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
- Per-task `skill_directories` — Specify different skill directories for individual tasks in eval YAML (#156, #215)
Dependencies
- Bump astro and @astrojs/starlight in /site (#212)
[0.28.0] - 2026-04-21
Added
- Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
- `waza models` command — List all available models supported by the configured engine (#208)
- Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)
Fixed
- Stricter YAML validation — Audited all YAML parsers; unknown fields in `TestCase` definitions are now properly rejected (#132, #206)
- Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
- CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)
Documentation
- Added Quick Start guide to the documentation site (#205)
[0.27.0] - 2026-04-21
Added
- `output_contains_any` expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
- `max_response_time_ms` behavior rule — Enforce maximum response time constraints on agent execution (#201)
- Task prompt from file — Task `prompt` field can now reference an external file path instead of inline text (#157, #200)
- `tool_calls` grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
Fixed
- Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)
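The `output_contains_any` expectation added in this release passes when the response contains any one of the listed strings. A minimal sketch of that rule follows; it assumes a case-sensitive substring match and treats an empty list as a failure, neither of which the changelog specifies.

```python
def output_contains_any(response: str, candidates: list[str]) -> bool:
    """Pass when at least one candidate string occurs in the response."""
    # Assumption: plain, case-sensitive substring matching.
    return any(candidate in response for candidate in candidates)


print(output_contains_any("Deployed to staging", ["staging", "production"]))
print(output_contains_any("Build failed", ["staging", "production"]))
```

The first call passes because "staging" appears in the response; the second fails because neither string does.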
[0.26.0] - 2026-04-21
Changed
- Timestamped output directories — `run --output-dir` now groups result files by timestamp for cleaner organization (#153)
- Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)
Fixed
- `--discover` finds eval.yaml in nested layout — Skill discovery now correctly locates `eval.yaml` files in `evals/{name}/` directories at the project root (#44)
- Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
- Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
- macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)
Documentation
- Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
- Updated demo guide and added CI/CD integration guide (#112, #89, #194)
Dependencies
- Bump defu from 6.1.4 to 6.1.6 in /site (#181)
- Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
- Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
- Bump astro from 5.17.3 to 5.18.1 in /site (#163)
- Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
- Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)
[0.25.0] - 2026-04-21
Added
- Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)
Fixed
- SKILL.md injection and trigger fixture loading — `waza run` now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)
Dependencies
- Bump h3 from 1.15.5 to 1.15.8 in /site (#144)
[0.24.0] - 2026-03-25
Changed
- Strict YAML validation — All YAML parsers now use `KnownFields(true)` to reject unknown fields, catching typos and misconfigurations early (#132, #133)
- `max_workers` renamed to `workers` — Config YAML key renamed for consistency across all config types (breaking change)
- Unified token counting — `waza check` and `waza tokens count` now share the same counting logic for consistent results (#146)
Fixed
- Typo in prompt grader — Fixed "prmopt" → "prompt" in error message
Dependencies
- Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
- Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)
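Strict validation and the `max_workers` to `workers` rename interact: once unknown fields are rejected, a config that still uses the old key fails to load. The sketch below illustrates that behavior in Python as an analogue of the Go `KnownFields(true)` setting named above. `RunConfig` and `load_strict` are hypothetical names, and only the `workers` key comes from the changelog.

```python
from dataclasses import dataclass, fields


@dataclass
class RunConfig:
    """Hypothetical config type; only the `workers` key comes from the changelog."""
    workers: int = 1


def load_strict(cls, raw: dict):
    """Build cls from raw, rejecting any key that is not a declared field."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise ValueError(f"unknown field(s) for {cls.__name__}: {', '.join(unknown)}")
    return cls(**raw)


print(load_strict(RunConfig, {"workers": 4}))
try:
    load_strict(RunConfig, {"max_workers": 4})  # old key name is now rejected
except ValueError as err:
    print(err)
```

Failing loudly on the stale key is what surfaces typos and outdated configs early instead of silently ignoring them.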
[0.21.0] - 2026-03-12
Added
- `waza new task from-prompt` command — Record Copilot sessions into task YAML files for eval creation (#110)
- Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
- Eval scaffolding command — `waza eval new` generates eval.yaml scaffolding for skills (#94)
- Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
- Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
- Per-file token budget configuration — Configure token budgets per-file in `.waza.yaml` (#96)
- Skill-aware thresholds — `waza tokens compare` supports skill-specific threshold configuration (#93)
- Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
- CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
- FileWriter service — Refactored `waza init` inventory with FileWriter abstraction (#63)
Fixed
- `waza suggest` deadlock — `Execute()` now applies the request timeout before calling `Start()`, preventing goroutine deadlock (#43)
- `ResourceFile.Content` type — Changed from `string` to `[]byte` for proper binary file handling (#117)
- `tokens compare` in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
- `--output-dir` ignored — Fixed `--output-dir` having no effect for single-skill runs (#109)
- Web dashboard build order — Build dashboard assets before Go compilation (#107)
- Test file leak — Fixed test that leaked files into the repo (#120)
- Config schema defaults — Aligned `config.schema.json` defaults with Go source of truth (#65)
- Skill discovery path — Discover skills under `.github/skills/` directory (#69)
Changed
- Renamed `config` node `max_workers` to `workers` for consistency across all config types
  - This is a breaking change
- Custom YAML deserializers for config types (#106)
- Validate only known fields in YAML decoders (#132)
- Token limits priority inverted to `.waza.yaml` first (#64)
- `@wbreza` added to CODEOWNERS (#111)
- Go 1.26+ noted in agent instruction files (#108)
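The multi-trial flakiness detection listed under Added for this release can be pictured as comparing outcomes of the same task across trials. The sketch below is one plausible reading, not waza's actual criteria: a task counts as flaky when its trials disagree. `classify_trials` and the label names are hypothetical.

```python
def classify_trials(results: dict[str, list[bool]]) -> dict[str, str]:
    """Label each task by whether its pass/fail outcomes agree across trials."""
    labels = {}
    for task, outcomes in results.items():
        if not outcomes:
            labels[task] = "no-data"
        elif all(outcomes):
            labels[task] = "stable-pass"
        elif not any(outcomes):
            labels[task] = "stable-fail"
        else:
            labels[task] = "flaky"  # mixed outcomes across trials
    return labels


trials = {
    "summarize": [True, True, True],
    "refactor": [True, False, True],
    "deploy": [False, False, False],
}
print(classify_trials(trials))
```

Only `refactor` is labeled flaky here, since it is the one task whose trials disagree.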
[0.9.0] - 2026-02-23
Added
- A/B baseline testing — `--baseline` flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
- Pairwise LLM judging — `pairwise` mode on `prompt` grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
- Tool constraint grader — New `tool_constraint` grader type with `expect_tools`, `reject_tools`, `max_turns`, `max_tokens` constraints. Validates agent tool usage behavior (#391)
- Auto skill discovery — `--discover` flag walks directory trees for SKILL.md + eval.yaml pairs. `--strict` mode fails if any skill lacks...
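Position-swap bias mitigation, mentioned for pairwise LLM judging above, generally means asking the judge twice with the two responses in opposite order and trusting only a verdict that survives the swap. The sketch below shows that general idea under simplifying assumptions: verdicts are reduced to "first", "second", or "tie" rather than the magnitude scale the changelog describes, and `judge_with_swap` and both toy judges are hypothetical stand-ins for a real model call.

```python
from typing import Callable

# A judge sees two responses in order and answers "first", "second", or "tie".
Judge = Callable[[str, str], str]


def judge_with_swap(judge: Judge, a: str, b: str) -> str:
    """Return "a" or "b" only if the same response wins in both orderings."""
    forward = {"first": "a", "second": "b"}.get(judge(a, b), "tie")
    reverse = {"first": "b", "second": "a"}.get(judge(b, a), "tie")
    return forward if forward == reverse else "tie"


def always_first(x: str, y: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def prefers_longer(x: str, y: str) -> str:
    """Toy judge whose verdict depends on content, not position."""
    if len(x) == len(y):
        return "tie"
    return "first" if len(x) > len(y) else "second"


print(judge_with_swap(always_first, "short", "much longer answer"))
print(judge_with_swap(prefers_longer, "short", "much longer answer"))
```

The position-biased judge collapses to a tie because its verdict flips with the ordering, while the content-based judge picks `b` both times.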
v0.30.1
v0.30.0
What's New in v0.30.0
New Features
- `waza quality` command (#98) — LLM-as-Judge skill quality scoring. Evaluates SKILL.md across 5 dimensions (clarity, completeness, trigger precision, scope coverage, anti-patterns) using the Copilot SDK. Scored 1-5 with visual bar output. Supports `--format json` for CI integration. (@spboyer)
- Scope reduction advisory (#183) — `waza check` now warns when a skill has low capability scope, detecting potential token-limit compression loss. Parses USE FOR phrases, headings, and numbered procedures as capability signals. (@diberry)
Housekeeping
- Closed 5 stale issues that were already implemented: #59 (token limits priority), #86 (per-file budgets), #81 (tokens diff), #83 (eval scaffolding), #162 (TypeSpec user query — answered)
Full Changelog: v0.29.0...v0.30.0