Releases: microsoft/waza

Waza v0.38.0

@github-actions released this 30 Jun 16:03
6157165

What's Changed

  • chore: Release v0.37.0 registry update by @spboyer in #336
  • docs: add eval registry design by @spboyer in #338
  • docs: Eval & Grader Registry design doc (#13) by @spboyer in #337
  • chore: update project dependencies by @spboyer in #339
  • chore(deps): Bump docker/setup-buildx-action from 3 to 4 by @dependabot[bot] in #340
  • chore(deps): Bump codecov/codecov-action from 4 to 7 by @dependabot[bot] in #342
  • chore(deps): Bump golangci/golangci-lint-action from 7 to 9 by @dependabot[bot] in #344
  • chore(deps-dev): Bump globals from 17.6.0 to 17.7.0 in /web by @dependabot[bot] in #346
  • chore(deps-dev): Bump vite from 8.0.16 to 8.1.0 in /web by @dependabot[bot] in #348
  • chore(deps): Bump actions/setup-go from 5 to 6 by @dependabot[bot] in #341
  • chore(deps-dev): Bump @vitejs/plugin-react from 6.0.2 to 6.0.3 in /web by @dependabot[bot] in #349
  • chore(deps): Bump @tanstack/react-query from 5.101.0 to 5.101.1 in /web by @dependabot[bot] in #350
  • chore(deps): Bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #345
  • feat: rubric preset library (closes #360) by @spboyer in #381
  • feat: regression gate command (closes #364) by @spboyer in #384
  • feat: OpenTelemetry trace export (closes #362) by @spboyer in #383
  • feat: focused test generation in waza suggest (closes #357) by @spboyer in #380
  • feat: spec verify command (closes #361) by @spboyer in #385
  • feat: schema versioning policy (closes #368) by @spboyer in #382
  • feat: per-turn checkpoint graders (closes #358) by @spboyer in #386
  • feat: add MCP server mocks (closes #363) by @spboyer in #387
  • feat: per-task tool metrics with structured arg matchers (closes #366) by @spboyer in #388
  • feat: snapshot/replay for deterministic eval reproduction (closes #367) by @spboyer in #391
  • feat: adversarial / fault-injection harness (closes #365) by @spboyer in #392
  • chore(deps): update uncovered dependencies by @spboyer in #395
  • chore(deps-dev): Bump typescript-eslint from 8.61.1 to 8.62.1 in /web by @dependabot[bot] in #343
  • refactor: decouple ExecutionResponse from Copilot SDK events (Phase 1, #10) by @spboyer in #396
  • feat: add run SSE events by @spboyer in #397
  • docs: v0.38.0 feature coverage by @spboyer in #398
  • test: coverage backfill for v0.38.0 features by @spboyer in #399

Full Changelog: v0.37.0...v0.38.0

Waza azd Extension v0.38.0

@github-actions released this 30 Jun 16:05
6157165

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.38.0] - 2026-06-30

Added

  • Focused eval suggestions — waza suggest now supports targeted generation with --count, --focus, --dry-run, --apply, and --force (#357, #380)
  • Per-turn checkpoint graders — Task YAML can run inline graders after specific conversation turns with checkpoints[] and on_failure policies (#358, #386)
  • Rubric preset library — Prompt graders can reuse built-in rubric presets for common judge dimensions; no separate waza rubric subcommand ships in this release (#360, #381)
  • Spec verification — Added waza spec verify to report eval coverage against SKILL.md requirements (#361, #385)
  • OpenTelemetry trace export — Added waza run --otel-exporter, --otel-endpoint, --otel-headers, --otel-file, and --otel-include-payloads (#362, #383)
  • MCP server mocks — Added eval-level mcp_mocks: for hermetic Copilot SDK tool-call evals (#363, #387)
  • Regression gates — Added waza gate with stable exit codes for pass, regression, golden failure, and config errors (#364, #384)
  • Adversarial harness — Added waza adversarial and eval-level adversarial: pack configuration for prompt-injection and scope-bypass checks (#365, #392)
  • Tool metrics and structured argument matchers — Results now include normalized tool_events[]; tool graders can assert structured argument matchers through expect_tools[].args and tool_calls.expect[].args (#366, #388)
  • Snapshot and replay — Added waza run --snapshot and waza replay with a self-contained snapshot artifact format (#367, #391)
  • Schema version policy — Documented and enforced MAJOR.MINOR schemaVersion compatibility for public artifacts (#368, #382)
  • Dashboard SSE resume — Added Last-Event-ID / lastEventId resume support for dashboard event streams, including legacy /api/events (#178, #397)

Changed

  • Phase 1 internal refactor — Internal cleanup with no user-facing CLI, schema, or site behavior changes (#10)
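
The MAJOR.MINOR schemaVersion policy listed under Added can be pictured as a small compatibility check. The Go sketch below is an illustration under stated assumptions, not waza's actual rule: it assumes a reader accepts an artifact only when the MAJOR parts match and the artifact's MINOR is not newer than the reader's. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSchemaVersion splits a "MAJOR.MINOR" string into two non-negative integers.
func parseSchemaVersion(v string) (major, minor int, err error) {
	parts := strings.Split(v, ".")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("schemaVersion %q is not MAJOR.MINOR", v)
	}
	major, err = strconv.Atoi(parts[0])
	if err != nil || major < 0 {
		return 0, 0, fmt.Errorf("schemaVersion %q has an invalid MAJOR part", v)
	}
	minor, err = strconv.Atoi(parts[1])
	if err != nil || minor < 0 {
		return 0, 0, fmt.Errorf("schemaVersion %q has an invalid MINOR part", v)
	}
	return major, minor, nil
}

// canRead applies the ASSUMED rule: same MAJOR, and the artifact's MINOR
// must not be newer than the reader's MINOR.
func canRead(artifact, reader string) (bool, error) {
	aMajor, aMinor, err := parseSchemaVersion(artifact)
	if err != nil {
		return false, err
	}
	rMajor, rMinor, err := parseSchemaVersion(reader)
	if err != nil {
		return false, err
	}
	return aMajor == rMajor && aMinor <= rMinor, nil
}

func main() {
	const reader = "1.3" // hypothetical version supported by the reader
	for _, artifact := range []string{"1.2", "1.3", "1.4", "2.0", "1"} {
		ok, err := canRead(artifact, reader)
		fmt.Printf("artifact %s, reader %s: ok=%v err=%v\n", artifact, reader, ok, err)
	}
}
```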

Waza v0.37.0

@github-actions released this 18 Jun 18:13
4d32191

What's Changed

  • ci: Deploy Pages after release workflow by @spboyer in #265
  • docs: fix AGENTS root path guidance by @drvoss in #269
  • fix: guard empty sandbox prompts by @spboyer in #278
  • fix: respect signal cancellation in waza run by @spboyer in #279
  • docs: map OpenAI Evals modelgraded YAML to waza graders by @spboyer in #280
  • Fix custom-agent eval example schema by @spboyer in #282
  • docs: fix binary release links for #276 by @spboyer in #284
  • docs: align custom agent eval docs with skill field by @spboyer in #283
  • docs: update integration testing guide by @spboyer in #281
  • feat: add per-trial usage to results JSON by @spboyer in #277
  • fix: use SDK approval kind for tool permissions by @drvoss in #268
  • feat: wire BYOK providers by @slbug in #240
  • Add waza update command by @spboyer in #288
  • Add forbidden skills to skill invocation grader by @spboyer in #291
  • Add skill body injection opt-out by @spboyer in #292
  • Add agent-friendly PR and issue templates by @spboyer in #293
  • Improve waza run concurrency: shared Copilot client + auto-sized workers (#135) by @spboyer with @Copilot in #221
  • Release v0.34.0 by @spboyer in #294
  • Fix skill best practices reference by @spboyer in #298
  • Fix installer latest release selection by @spboyer in #299
  • fix: prevent Copilot CLI PATH fallback by @spboyer in #300
  • Simplify AgentEngine cancellation around caller contexts by @spboyer with @Copilot in #290
  • fix: pass --model via CLIArgs to override user settings and experiment flights by @sebastienlevert in #263
  • feat: support git worktree resources in task inputs (#121) by @spboyer in #302
  • fix: skip --model CLI startup arg when BYOK provider is configured (#305) by @spboyer in #306
  • feat(pricing): model-aware cost calculation for dashboard by @spboyer in #310
  • Upgrade Copilot SDK to v1.0.0 and surface premium-request credits on the dashboard by @spboyer in #311
  • Release v0.35.0 by @spboyer in #312
  • chore(deps): Bump esbuild, @tailwindcss/vite, @vitejs/plugin-react and vite in /web by @dependabot[bot] in #317
  • fix(execution): add a first-event watchdog to catch session-start hangs by @sebastienlevert in #321
  • fix(graders): make prompt-grader timeout configurable via WAZA_PROMPT_GRADER_TIMEOUT by @sebastienlevert in #319
  • feat: Upgrade Squad framework from 0.8.25 to 0.10.0 by @spboyer in #323
  • Phase 1 & 2: Squad workflows + failure detection infrastructure #322 by @spboyer in #324
  • [WIP] Add support for custom agents in waza check by @spboyer with @Copilot in #314
  • Release v0.36.0 by @spboyer in #325
  • feat: complete issue #322 with triage automation and regression loop by @spboyer in #326
  • Surface engine failures in waza suggest by @spboyer with @Copilot in #330
  • Materialize task-level context fixtures in workspaces by @spboyer with @Copilot in #329
  • chore(deps): Bump js-yaml from 4.1.1 to 4.2.0 in /site by @dependabot[bot] in #327
  • chore(deps): Bump astro from 6.3.2 to 6.4.7 in /site by @dependabot[bot] in #331
  • feat: drive interactive skills via an LLM responder (#303) by @adamdougal in #304
  • chore: upgrade copilot-sdk to v1.0.2 and re-bundle embedded CLI to 1.0.64-0 (fixes session.idle hang) by @sebastienlevert in #333
  • chore: refresh dependencies by @spboyer in #335
  • Release v0.37.0 by @spboyer in #334

Full Changelog: v0.33.0...v0.37.0

Waza azd Extension v0.37.0

@github-actions released this 18 Jun 18:14
4d32191

Changelog

[0.37.0] - 2026-06-18

Added

  • Interactive skill responder — Eval runs can now drive interactive skills with an LLM responder for more realistic conversational workflows (#304, closes #303)
  • Triage automation and regression loop — Added triage automation and regression-loop support for Squad workflow validation (#326)

Fixed

  • Task-level context fixtures — Task-level context fixtures now materialize in workspaces before execution (#329)
  • waza suggest engine failures — Engine failures are now surfaced by waza suggest instead of being hidden behind success-shaped output (#330)
  • Session idle hang — Upgraded the Copilot SDK to v1.0.2 and re-bundled Copilot CLI 1.0.64-0 to fix session idle hangs (#333)

Dependencies

  • Bump astro from 6.3.2 to 6.4.7 in /site (#331)
  • Bump js-yaml from 4.1.1 to 4.2.0 in /site (#327)

[0.36.0] - 2026-06-15

Added

  • Squad framework v0.10.0 upgrade — Upgraded Squad from 0.8.25 to 0.10.0 (#322, #323)
  • Squad workflow failure detection — Added Squad workflow and failure detection infrastructure (#322, #324)

Fixed

  • Prompt grader timeout configuration — Prompt grader timeout can now be configured with WAZA_PROMPT_GRADER_TIMEOUT (#319)
  • Session-start hang detection — Added a first-event watchdog to catch session-start hangs (#321)
  • Non-Squad coordinator canary handling — Clarified the Squad coordinator canary guard so non-Squad sessions can continue without using Squad (#325)

Dependencies

  • Bump esbuild, @tailwindcss/vite, @vitejs/plugin-react, and vite (#317)

[0.35.0] - 2026-06-06

Added

  • Copilot SDK v1.0.0 upgrade — Upgraded github.com/github/copilot-sdk/go to v1.0.0 and surfaced premium-request credits on the dashboard (#311)
  • Model-aware dashboard pricing — Dashboard cost calculation now applies per-model pricing for more accurate run cost reporting (#310)
  • Git worktree resources in task inputs — Tasks can now reference git worktree resources as inputs (#121, #302)

Fixed

  • BYOK + --model startup arg — The Copilot CLI validates the startup --model flag against the Copilot catalog before BYOK provider config is applied, so provider-only model IDs would fail. The --model startup arg is now skipped when a BYOK provider is configured (#305, #306)
  • Model override propagation — --model is now passed via CLIArgs so it correctly overrides user settings and experiment flights (#263)
  • Copilot CLI PATH fallback — Prevent silent fallback to a Copilot CLI on PATH when the bundled binary is unavailable (#300)
  • Installer latest-release selection — Installer now correctly selects the latest standalone waza release (#299)
  • Skill best practices doc link — Fixed the broken skill best practices reference (#295, #298)

Changed

  • AgentEngine cancellation — Simplified AgentEngine cancellation handling around caller contexts to make shutdown semantics more predictable (#290)

[0.34.0] - 2026-05-23

Added

  • BYOK provider wiring — Added bring-your-own-key provider support for configured model providers (#240)
  • waza update command — Added an update command for upgrading local Waza installations (#288)
  • Skill injection opt-out — Added an option to run evals without injecting the target skill body (#285, #292)
  • Forbidden skills grading — skill_invocation graders can now assert that specific skills must not be invoked (#286, #291)
  • Per-trial usage reporting — Results JSON now includes per-trial usage details for deeper run analysis (#277)
  • Agent-friendly GitHub templates — Added issue and pull request templates tuned for agent-authored work (#293)

Fixed

  • Tool approval handling — Tool permission handling now uses the SDK approval kind (#240)
  • Signal cancellation — waza run now respects cancellation signals more reliably (#279)
  • Sandbox prompt handling — Empty sandbox prompts are guarded before execution (#273, #278)
  • Custom agent example schema — Fixed the custom-agent eval example to match the supported schema (#282)
  • Binary release links — Fixed binary release documentation links (#276, #284)
  • Agent path guidance — Corrected AGENTS root path guidance (#267, #269)

Changed

  • Run concurrency — waza run now reuses a shared Copilot client and auto-sizes parallel workers when --workers is unset (#135, #221)
  • Documentation — Updated integration testing, custom-agent eval, and OpenAI Evals model-graded YAML documentation (#281, #283, #14, #280)
  • Release workflow — GitHub Pages deployment now runs after the release workflow (#265)

Waza v0.33.0

@github-actions released this 21 May 19:54
0db0473

Full Changelog: v0.31.0...v0.33.0

Waza azd Extension v0.33.0

@github-actions released this 21 May 19:54
0db0473

Changelog

[0.33.0] - 2026-05-21

Note: This release includes the changes previously prepared under 0.32.0, which was not published.

Added

  • Configurable eval file naming — .waza.yaml can now configure files.evalFile, files.taskGlob, and files.taskFileSuffix, with the new naming carried through scaffolding, workspace discovery, discovery mode, schemas, and docs while preserving the existing eval.yaml and tasks/*.yaml defaults (#254, closes #232)
  • Instruction files in eval runs — Eval-level config.instruction_files and task-level instruction_files now copy files from the active context into task workspaces and append path-labeled contents to the Copilot system message (#248, closes #239)

Fixed

  • Prompt graders use the execution engine — Prompt graders now route judge turns through CopilotEngine instead of constructing a Copilot client directly, keeping grader execution aligned with engine configuration and preserving follow-up recovery behavior (#258, closes #54)
  • Prompt grader follow-up recovery — Prompt grading now preserves collected grades when a follow-up turn fails after successful grader collection (#251)
  • Bundled Copilot CLI updated — Embedded copilot-cli bundles are updated from 1.0.2 to 1.0.49 across supported platforms, with reproducible pinned bundle generation via COPILOT_CLI_VERSION (#260, closes #244)
  • Spec-aligned skill scaffolding — waza new skill no longer asks for a nonstandard skill type or emits type: frontmatter, and the wizard now rejects early exits that omit required name or description fields (#261, closes #243)
  • waza check eval discovery — Nested skills and separated evals are discovered consistently in multi-skill workspaces (#247, closes #238)
  • Skill body routing markers — Compliance scoring now detects trigger, anti-trigger, and routing markers in SKILL.md body sections as well as frontmatter descriptions (#236, closes #223)

Changed

  • Copilot SDK v0.3.0 migration — Updated github.com/github/copilot-sdk/go to v0.3.0, migrated session event handling to typed payloads, and refreshed transcript, logging, web API, usage collection, suggestion trace, and test coverage for the new API (#255, closes #253)
  • Dashboard validation coverage — Added coverage for dashboard lint and end-to-end validation (#249)
  • Install documentation — Replaced unsupported go install guidance and clarified Windows/WSL install behavior (#246, closes #242; #245, closes #241)
  • Dependencies — Bump devalue in /site, postcss in /web, and astro in /site (#237, #235, #234)

[0.31.0] - 2026-04-28

Added

  • Custom agent (.agent.md) eval support — Discover .agent.md files alongside SKILL.md, parse agent-specific frontmatter (tools, model, handoffs, mcp-servers, agents), auto-inject tool_constraint grader from agent tools: field, complete worked example under examples/custom-agent/, and new "Evaluating Custom Agents" docs guide (#226, closes #225)

Fixed

  • Mock engine echoes file content — output_contains expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
  • waza serve no longer crashes when stdin isn't a terminal — MCP stdio server only starts when term.IsTerminal() is true; piped input or background mode no longer kills the HTTP dashboard (#224)

Changed

  • Vocabulary renames — Internal types renamed: BenchmarkSpec → EvalSpec, TestRunner → EvalRunner. Not a breaking change for external consumers (types live in internal/) (#222)

Documentation

  • Cross-reference audit for recent renames + custom agent feature: added .agent.md coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)

Dependencies

  • Bump postcss from 8.5.6 to 8.5.12 in /site (#229)

[0.30.1] - 2026-04-22

Documentation

  • Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)

[0.30.0] - 2026-04-22

Added

  • waza quality command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
  • Scope-reduction advisory check — waza check now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)

[0.29.0] - 2026-04-22

Added

  • --keep-workspace flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
  • --no-skills flag and disabled_skills config — Disable specific skills during evaluation to isolate behavior (#126, #216)
  • Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
  • Per-task skill_directories — Specify different skill directories for individual tasks in eval YAML (#156, #215)

Dependencies

  • Bump astro and @astrojs/starlight in /site (#212)

[0.28.0] - 2026-04-21

Added

  • Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
  • waza models command — List all available models supported by the configured engine (#208)
  • Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)

Fixed

  • Stricter YAML validation — Audited all YAML parsers; unknown fields in TestCase definitions are now properly rejected (#132, #206)
  • Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
  • CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)
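
Rejecting unknown fields, as described under Stricter YAML validation, means a misspelled key fails loudly instead of being silently ignored. The Go sketch below demonstrates the idea with the standard library's JSON decoder as a stand-in, because a YAML parser would need a third-party package. The testCase struct and its fields are hypothetical and do not reflect waza's real TestCase schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// testCase is a hypothetical, simplified task definition.
type testCase struct {
	Name   string `json:"name"`
	Prompt string `json:"prompt"`
}

// decodeStrict parses data into a testCase and fails on any unknown field.
func decodeStrict(data []byte) (testCase, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()

	var tc testCase
	if err := dec.Decode(&tc); err != nil {
		return testCase{}, err
	}
	return tc, nil
}

func main() {
	valid := []byte(`{"name": "greets user", "prompt": "Say hello"}`)
	typo := []byte(`{"name": "greets user", "promt": "Say hello"}`)

	tc, err := decodeStrict(valid)
	fmt.Printf("valid input: %+v, err=%v\n", tc, err)

	_, err = decodeStrict(typo)
	fmt.Printf("misspelled key: err=%v\n", err)
}
```

Without DisallowUnknownFields, the second input would decode successfully with an empty Prompt, which is the kind of silent failure strict validation is meant to prevent.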

Documentation

  • Added Quick Start guide to the documentation site (#205)

[0.27.0] - 2026-04-21

Added

  • output_contains_any expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
  • max_response_time_ms behavior rule — Enforce maximum response time constraints on agent execution (#201)
  • Task prompt from file — Task prompt field can now reference an external file path instead of inline text (#157, #200)
  • tool_calls grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)
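
The any-of semantics described for output_contains_any can be sketched in a few lines of Go. This is only an illustration of the matching rule, not waza's implementation: the containsAny helper is hypothetical, and case-sensitive substring matching is an assumption the changelog does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether response contains at least one candidate
// as a substring. Hypothetical helper for illustration only; matching is
// assumed to be case-sensitive.
func containsAny(response string, candidates []string) bool {
	for _, c := range candidates {
		if strings.Contains(response, c) {
			return true
		}
	}
	return false
}

func main() {
	response := "Deployment succeeded in 42s"
	fmt.Println(containsAny(response, []string{"succeeded", "completed"}))
	fmt.Println(containsAny(response, []string{"failed", "error"}))
}
```

Under these assumptions the first call matches on "succeeded" and the second matches nothing, so an expectation built this way passes as soon as one listed string appears.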

Fixed

  • Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)

[0.26.0] - 2026-04-21

Changed

  • Timestamped output directories — run --output-dir now groups result files by timestamp for cleaner organization (#153)
  • Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)

Fixed

  • --discover finds eval.yaml in nested layout — Skill discovery now correctly locates eval.yaml files in evals/{name}/ directories at the project root (#44)
  • Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
  • Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
  • macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)

Documentation

  • Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
  • Updated demo guide and added CI/CD integration guide (#112, #89, #194)

Dependencies

  • Bump defu from 6.1.4 to 6.1.6 in /site (#181)
  • Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
  • Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
  • Bump astro from 5.17.3 to 5.18.1 in /site (#163)
  • Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
  • Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)

[0.25.0] - 2026-04-21

Added

  • Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)

Fixed

  • SKILL.md injection and trigger fixture loading — waza run now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)

Dependencies

  • Bump h3 from 1.15.5 to 1.15.8 in /site (#144)

[0.24.0] - 2026-03-25

Changed

  • Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
  • max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
  • Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)
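
The effect of strict decoding, combined with the max_workers to workers rename, can be sketched as follows. The changelog does not name the YAML library; KnownFields(true) matches the decoder option in gopkg.in/yaml.v3, which is an assumption here. To stay runnable with only the standard library, this sketch swaps in the analogous encoding/json option, DisallowUnknownFields, and uses a hypothetical Config type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Config is a hypothetical config type used only for this illustration.
type Config struct {
	Workers int `json:"workers"`
}

// decodeStrict decodes input and rejects any field that Config does not
// declare. DisallowUnknownFields is the encoding/json analogue of the
// KnownFields(true) option mentioned in the changelog.
func decodeStrict(input string) (Config, error) {
	var cfg Config
	dec := json.NewDecoder(strings.NewReader(input))
	dec.DisallowUnknownFields()
	err := dec.Decode(&cfg)
	return cfg, err
}

func main() {
	// The old key is no longer a known field, so strict decoding fails.
	_, err := decodeStrict(`{"max_workers": 4}`)
	fmt.Println("old key rejected:", err != nil)

	cfg, err := decodeStrict(`{"workers": 4}`)
	fmt.Println(cfg.Workers, err)
}
```

With a lenient decoder the stale max_workers key would be silently ignored and the worker count would fall back to its default, which is the kind of misconfiguration strict validation is meant to surface.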

Fixed

  • Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

  • Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
  • Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

  • waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
  • Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
  • Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
  • Multi-trial flakiness detection — Detec...

Waza v0.31.0

@github-actions github-actions released this 28 Apr 20:11
bf77c75

What's Changed

  • refactor: complete vocabulary renames — BenchmarkSpec→EvalSpec, TestRunner→EvalRunner (#166) by @spboyer in #222
  • feat: support custom agent (.agent.md) file discovery and parsing #225 by @spboyer in #226
  • fix: mock engine echoes file content for CI evals (#227) by @spboyer in #228
  • fix: waza serve crashes when stdin is not a terminal by @spboyer in #224
  • chore(deps): Bump postcss from 8.5.6 to 8.5.12 in /site by @dependabot[bot] in #229
  • docs: cross-reference audit for recent renames and feature additions by @spboyer in #230
  • Release v0.31.0 by @spboyer in #231

Full Changelog: v0.30.1...v0.31.0

Waza azd Extension v0.31.0

@github-actions github-actions released this 28 Apr 20:08
bf77c75

Changelog

All notable changes to waza will be documented in this file.

The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning.

[Unreleased]

[0.31.0] - 2026-04-28

Added

  • Custom agent (.agent.md) eval support — Discover .agent.md files alongside SKILL.md, parse agent-specific frontmatter (tools, model, handoffs, mcp-servers, agents), auto-inject tool_constraint grader from agent tools: field, complete worked example under examples/custom-agent/, and new "Evaluating Custom Agents" docs guide (#226, closes #225)

Fixed

  • Mock engine echoes file content — output_contains expectations against file contents now work in CI without a real model. Mock response includes task metadata, file paths, and a 1KB content preview per resource (#228, closes #227)
  • waza serve no longer crashes when stdin isn't a terminal — MCP stdio server only starts when term.IsTerminal() is true; piped input or background mode no longer kills the HTTP dashboard (#224)

Changed

  • Vocabulary renames — Internal types renamed: BenchmarkSpec → EvalSpec, TestRunner → EvalRunner. Not a breaking change for external consumers (types live in internal/) (#222)

Documentation

  • Cross-reference audit for recent renames + custom agent feature: added .agent.md coverage to quickstart, getting-started, GUIDE, TUTORIAL, examples README; updated mock engine descriptions in INTEGRATION-TESTING and eval-yaml guide (#230)

Dependencies

  • Bump postcss from 8.5.6 to 8.5.12 in /site (#229)

[0.30.1] - 2026-04-22

Documentation

  • Updated README with missing CLI commands — Added documentation for recently-added CLI commands that were missing from the README (#220)

[0.30.0] - 2026-04-22

Added

  • waza quality command — LLM-as-Judge skill quality scoring that evaluates skill output quality using a configurable judge model (#218)
  • Scope-reduction advisory check — waza check now includes an advisory that flags skills with overly broad scope, helping authors tighten skill definitions (#219)

[0.29.0] - 2026-04-22

Added

  • --keep-workspace flag — Preserve the temporary workspace after task execution for debugging agent output (#123, #217)
  • --no-skills flag and disabled_skills config — Disable specific skills during evaluation to isolate behavior (#126, #216)
  • Non-blocking version update check — CLI now checks for newer waza versions in the background without slowing startup (#104, #214)
  • Per-task skill_directories — Specify different skill directories for individual tasks in eval YAML (#156, #215)

Dependencies

  • Bump astro and @astrojs/starlight in /site (#212)

[0.28.0] - 2026-04-21

Added

  • Follow-up prompts in eval YAML — Tasks can now include pre-written follow-up prompts for multi-turn evaluation conversations (#189, #209)
  • waza models command — List all available models supported by the configured engine (#208)
  • Early termination for trigger tests — Trigger tests can now stop early once the target skill is invoked, reducing evaluation time (#207)

Fixed

  • Stricter YAML validation — Audited all YAML parsers; unknown fields in TestCase definitions are now properly rejected (#132, #206)
  • Test fixture assertion syntax — Fixed invalid Python expression in a test fixture assertion (#197)
  • CI integration test stability — CI integration tests now correctly handle expected eval failures when using the mock executor (#210)

Documentation

  • Added Quick Start guide to the documentation site (#205)

[0.27.0] - 2026-04-21

Added

  • output_contains_any expectation — New expectation field that passes when the agent response contains any one of the specified strings (#203)
  • max_response_time_ms behavior rule — Enforce maximum response time constraints on agent execution (#201)
  • Task prompt from file — Task prompt field can now reference an external file path instead of inline text (#157, #200)
  • tool_calls grader — New grader type that validates the specific tool calls an agent makes during execution (#187, #202)

Fixed

  • Webserver test resilience — Webserver tests now skip gracefully when frontend assets are not built (#204)

[0.26.0] - 2026-04-21

Changed

  • Timestamped output directories — run --output-dir now groups result files by timestamp for cleaner organization (#153)
  • Improved debug logging — Debug output is now more structured and useful for troubleshooting (#152)

Fixed

  • --discover finds eval.yaml in nested layout — Skill discovery now correctly locates eval.yaml files in evals/{name}/ directories at the project root (#44)
  • Diff grader reads post-execution workspace — The diff grader now reads files from the workspace after agent execution completes, not before (#165, #196)
  • Grader config validation — Required grader configuration fields are now validated before evaluation starts (#195)
  • macOS install and trigger test count — Fixed macOS binary installation and an off-by-one error in trigger test counting (#164, #184, #193)

Documentation

  • Added cache command reference, prompt mode documentation, and complete YAML schema reference (#198)
  • Updated demo guide and added CI/CD integration guide (#112, #89, #194)

Dependencies

  • Bump defu from 6.1.4 to 6.1.6 in /site (#181)
  • Bump vite from 6.4.1 to 6.4.2 in /site and /web (#182, #192)
  • Bump go.opentelemetry.io/otel/sdk from 1.42.0 to 1.43.0 (#185)
  • Bump astro from 5.17.3 to 5.18.1 in /site (#163)
  • Bump picomatch from 4.0.3 to 4.0.4 in /site and /web (#159, #160)
  • Bump smol-toml from 1.6.0 to 1.6.1 in /site (#158)

[0.25.0] - 2026-04-21

Added

  • Eval coverage grid generator — New coverage output that visualizes which skills have eval coverage across grader types (#92)

Fixed

  • SKILL.md injection and trigger fixture loading — waza run now correctly injects SKILL.md content into the evaluation context, loads trigger test fixtures, and passes MCP server configuration to the engine (#191)

Dependencies

  • Bump h3 from 1.15.5 to 1.15.8 in /site (#144)

[0.24.0] - 2026-03-25

Changed

  • Strict YAML validation — All YAML parsers now use KnownFields(true) to reject unknown fields, catching typos and misconfigurations early (#132, #133)
  • max_workers renamed to workers — Config YAML key renamed for consistency across all config types (breaking change)
  • Unified token counting — waza check and waza tokens count now share the same counting logic for consistent results (#146)

Fixed

  • Typo in prompt grader — Fixed "prmopt" → "prompt" in error message

Dependencies

  • Bump h3 from 1.15.8 to 1.15.9 in /site (#155)
  • Bump github.com/buger/jsonparser from 1.1.1 to 1.1.2 (#149)

[0.21.0] - 2026-03-12

Added

  • waza new task from-prompt command — Record Copilot sessions into task YAML files for eval creation (#110)
  • Trigger heuristic grader — New grader type that scores based on trigger/anti-trigger matching heuristics (#90)
  • Eval scaffolding command — waza eval new generates eval.yaml scaffolding for skills (#94)
  • Multi-trial flakiness detection — Detect flaky evals across multiple trial runs (#103)
  • Snapshot auto-update workflow — Diff grader can now auto-update snapshot files on mismatch (#95)
  • Per-file token budget configuration — Configure token budgets per-file in .waza.yaml (#96)
  • Skill-aware thresholds — waza tokens compare supports skill-specific threshold configuration (#93)
  • Sensei scoring parity — WHEN triggers, spec-security, invalid level, and advisory checks 16-18 (#79)
  • CI/CD integration guide — GitHub Actions and Azure DevOps integration documentation (#100)
  • FileWriter service — Refactored waza init inventory with FileWriter abstraction (#63)
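
One simple way to read "flaky across multiple trial runs" is that a task is flaky when its trials disagree. The sketch below encodes that rule; it is an assumption about the idea, not waza's detection logic, and the isFlaky helper is hypothetical.

```go
package main

import "fmt"

// isFlaky reports whether the pass/fail outcomes of repeated trials for
// one task disagree. Fewer than two trials cannot show disagreement.
// Hypothetical rule for illustration only.
func isFlaky(passed []bool) bool {
	if len(passed) < 2 {
		return false
	}
	for _, p := range passed[1:] {
		if p != passed[0] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isFlaky([]bool{true, true, true}))  // consistent pass
	fmt.Println(isFlaky([]bool{true, false, true})) // mixed outcomes
}
```

A consistently failing task is not flaky under this rule; it is simply failing, which keeps the two signals separate.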

Fixed

  • waza suggest deadlock — Execute() now applies the request timeout before calling Start(), preventing goroutine deadlock (#43)
  • ResourceFile.Content type — Changed from string to []byte for proper binary file handling (#117)
  • tokens compare in subdirectory — No longer shows all files as "added" when run from a subdirectory (#105)
  • --output-dir ignored — Fixed --output-dir having no effect for single-skill runs (#109)
  • Web dashboard build order — Build dashboard assets before Go compilation (#107)
  • Test file leak — Fixed test that leaked files into the repo (#120)
  • Config schema defaults — Aligned config.schema.json defaults with Go source of truth (#65)
  • Skill discovery path — Discover skills under .github/skills/ directory (#69)

Changed

  • Renamed config node max_workers to workers for consistency across all config types
    • This is a breaking change
  • Custom YAML deserializers for config types (#106)
  • Validate only known fields in YAML decoders. (#132)
  • Token limits priority inverted to .waza.yaml first (#64)
  • @wbreza added to CODEOWNERS (#111)
  • Go 1.26+ noted in agent instruction files (#108)

[0.9.0] - 2026-02-23

Added

  • A/B baseline testing — --baseline flag runs each task with and without skill, computes weighted improvement scores across quality, tokens, turns, time, and task completion (#307)
  • Pairwise LLM judging — pairwise mode on prompt grader with position-swap bias mitigation. Three modes: pairwise, independent, both. Magnitude scoring from much-better to much-worse (#310)
  • Tool constraint grader — New tool_constraint grader type with expect_tools, reject_tools, max_turns, max_tokens constraints. Validates agent tool usage behavior (#391)
  • Auto skill discovery — --discover flag walks directory trees for SKILL.md + eval.yaml pairs. --strict mode fails if any skill lacks...

v0.30.1

@spboyer spboyer released this 22 Apr 20:53
47a3d9c

v0.30.1

Documentation

  • README updated — Added missing waza models command documentation with usage examples and flags (#220)

Full Changelog: v0.30.0...v0.30.1

v0.30.0

@spboyer spboyer released this 22 Apr 19:57
6aaebec

What's New in v0.30.0

New Features

  • waza quality command (#98) — LLM-as-Judge skill quality scoring. Evaluates SKILL.md across 5 dimensions (clarity, completeness, trigger precision, scope coverage, anti-patterns) using the Copilot SDK. Scored 1-5 with visual bar output. Supports --format json for CI integration. (@spboyer)

  • Scope reduction advisory (#183) — waza check now warns when a skill has low capability scope, detecting potential token-limit compression loss. Parses USE FOR phrases, headings, and numbered procedures as capability signals. (@diberry)

Housekeeping

  • Closed 5 stale issues that were already implemented: #59 (token limits priority), #86 (per-file budgets), #81 (tokens diff), #83 (eval scaffolding), #162 (TypeSpec user query — answered)

Full Changelog: v0.29.0...v0.30.0