chore: Release v0.12.0 — registry and version sync by spboyer · Pull Request #36 · microsoft/waza

spboyer · 2026-03-02T18:18:48Z

Automated update from the Release workflow. Updates registry.json with v0.12.0 download URLs and syncs version.txt.

- Created ceremonies.md to outline team meeting structures including Design Review, Code Review, and Retrospective.
- Established decisions.md to document key team decisions, including model selection directives and documentation maintenance routing.
- Added identity files (now.md, wisdom.md) for tracking current focus areas and team insights.
- Logged multiple sessions detailing implementation progress, performance audits, and planning activities.
- Introduced routing.md to clarify work assignment processes and issue routing for team members.
- Added orchestration-log directory for session logs and .gitkeep files for empty directories.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@spboyer

E2 sensei engine (#32, #33) shipped but SKILL.md still listed `waza dev` as "coming soon". Updated documentation to reflect implemented features.

## Changes

**Help text block**
- Added DEV FLAGS section: `--target`, `--max-iterations`, `--auto`
- Updated command summary to include `[skill-path]` argument

**waza dev command section**
- Usage examples for scoring workflows
- Compliance levels: Low → Medium → Medium-High → High with progression criteria
- Scoring checks: description length (150+ chars, 1024 max), trigger phrases, anti-trigger phrases, routing clarity markers, token budget (500 soft / 5000 hard)

**Future features section**
- Marked trigger accuracy tests (#36), `--skip-integration` (#37), `--fast` (#38), improvement suggestions engine (#34) as "Coming Soon"

```bash
# Score and improve to high compliance
waza dev skills/my-skill --target high

# Auto-apply without prompts
waza dev skills/my-skill --auto --max-iterations 3
```

Current state: waza skill at High compliance (638 chars, 13 triggers, 3 anti-triggers), 2097 tokens.

<details>
<summary>Original prompt</summary>

> ----
>
> *This section details on the original issue you should resolve*
>
> <issue_title>[E5] Update SKILL.md with sensei engine features when E2 completes</issue_title>
> <issue_description>The waza SKILL.md (#52) was created with `waza dev` listed as 'coming soon'. Once Charles completes E2 (#32-38), update the SKILL.md to:
> - Document `waza dev` command usage
> - Add compliance scoring examples
> - Add trigger accuracy test examples
> - Update the help text block
>
> Blocked on: spboyer/waza#33, spboyer/waza#32</issue_description>
>
> ## Comments on the Issue (you are @copilot in this section)
>
> <comments>
> <comment_new><author>@spboyer</author><body>
> @copilot Please pick up this issue. Here's the scope:
>
> ## What's ready to document (blockers spboyer/waza#32 and spboyer/waza#33 are closed)
>
> 1. **`waza dev` command usage** — The sensei dev loop is implemented. Document the command, its flags (`--verbose`, `--target-score`, `--max-iterations`), and a usage example in the SKILL.md.
>
> 2. **Compliance scoring** — Document how compliance scoring works (the scoring system from spboyer/waza#33 is merged).
>
> ## What's still 'coming soon' (keep as placeholders)
>
> - Trigger accuracy tests (#36 — still open)
> - `--skip-integration` flag (#37 — still open)
> - `--fast` flag (#38 — still open)
> - Improvement suggestions engine (#34 — still open)
>
> ## Files to update
> - `skills/waza/SKILL.md` — the main SKILL.md for the waza skill
>
> Read the existing SKILL.md first, then update the sections that reference `waza dev` as 'coming soon' with actual documentation. Keep the token count reasonable (under 500 tokens for the SKILL.md frontmatter+body).</body></comment_new>
> </comments>

</details>

- Fixes spboyer/waza#93

---

💬 We'd love your input! Share your thoughts on Copilot coding agent in our [2 minute survey](https://gh.io/copilot-coding-agent-survey).

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: spboyer <7681382+spboyer@users.noreply.github.com>
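
The description-length and token-budget limits listed under "Scoring checks" can be pictured as plain threshold checks. The following is a minimal Python sketch, not waza's implementation: the function name `check_limits` and its message strings are hypothetical, and only the numeric limits (150 and 1024 characters, 500 soft and 5000 hard tokens) come from the commit message above. How these checks feed into the Low to High compliance levels is not stated here, so the sketch does not model it.

```python
# Hypothetical illustration of the documented limits; not waza's actual code.
MIN_DESCRIPTION_CHARS = 150   # "150+ chars"
MAX_DESCRIPTION_CHARS = 1024  # "1024 max"
SOFT_TOKEN_BUDGET = 500       # soft limit
HARD_TOKEN_BUDGET = 5000      # hard limit


def check_limits(description_chars: int, token_count: int) -> list[str]:
    """Return findings for the length and token-budget checks (empty if none)."""
    findings = []
    if description_chars < MIN_DESCRIPTION_CHARS:
        findings.append("description too short")
    elif description_chars > MAX_DESCRIPTION_CHARS:
        findings.append("description too long")
    if token_count > HARD_TOKEN_BUDGET:
        findings.append("token count over hard limit")
    elif token_count > SOFT_TOKEN_BUDGET:
        findings.append("token count over soft limit")
    return findings


# The "current state" figures quoted above: 638 chars, 2097 tokens.
print(check_limits(638, 2097))
```

Under these assumed rules, the quoted state (638 characters, 2097 tokens) sits inside the description limits and between the soft and hard token budgets.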

## Implement trigger accuracy tests

Trigger tests measure whether a skill activates for the right prompts and stays silent for the wrong ones. They run automatically when a `trigger_tests.yaml` file exists alongside `eval.yaml`.

### Example output

```
waza run eval.yaml
Running benchmark: test-skill-eval
Skill: test-skill
Engine: copilot-sdk
Model: gpt-5.1-codex-mini

✓ [1/1] Test that tests testing

===================================================
BENCHMARK RESULTS
===================================================
Total Tests: 1
Succeeded: 1
Failed: 0
Errors: 0
Success Rate: 100.0%
Aggregate Score: 0.00
Min Score: 0.00
Max Score: 0.00
Std Dev: 0.0000
Duration: 4.675s

---------------------------------------------------
PER-TASK BREAKDOWN
---------------------------------------------------
✓ Trigger test for test-skill [passed] pass_rate=100.0% avg=0.00 min=0.00 max=0.00 stddev=0.0000 avg_dur=4671ms

---------------------------------------------------
TRIGGER ACCURACY
---------------------------------------------------
Accuracy: 50.0% (2/4)
Precision: 0.0%
Recall: 0.0%
F1: 0.0%
TP: 0 FP: 0 FN: 2 TN: 2
```

### What's included

**New `internal/trigger` package** with three components:

- **Spec parsing** (`spec.go`) — defines the `trigger_tests.yaml` format: a `skill` name paired with `should_trigger_prompts` and `should_not_trigger_prompts` lists. Each prompt supports an optional `confidence` level (`high` or `medium`) and a `reason` for documentation.
- **Discovery** (`discover.go`) — searches the eval spec directory for `trigger_tests.yaml` and returns the parsed spec (or nil if absent).
- **Runner** (`runner.go`) — executes all trigger prompts against the agent engine concurrently (respecting the configured worker count), checks whether the target skill appeared in the response's `SkillInvocations`, and computes classification metrics. Engine errors count as incorrect classifications, with the error count tracked separately.
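
The runner behavior described above can be sketched in a few lines. This is an illustrative Python model, not the Go code in `runner.go`: the spec is written as an in-memory dictionary rather than parsed from YAML, the per-prompt `prompt` key is an assumption (the description names only `confidence` and `reason`), the stub engine and the helper `run_trigger_tests` are hypothetical, and the rule that an errored should-trigger prompt counts as FN while an errored should-not-trigger prompt counts as FP is one reading of "engine errors count as incorrect classifications".

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory form of a trigger_tests.yaml file. The top-level keys
# come from the PR description; the per-prompt "prompt" key is an assumption.
SPEC = {
    "skill": "test-skill",
    "should_trigger_prompts": [
        {"prompt": "run my skill eval", "confidence": "high"},
        {"prompt": "score this skill", "confidence": "medium", "reason": "borderline"},
    ],
    "should_not_trigger_prompts": [
        {"prompt": "what is the weather"},
        {"prompt": "write a poem"},
    ],
}


def never_trigger_engine(prompt):
    """Stub engine: returns the list of invoked skills (always empty here)."""
    return []


def run_trigger_tests(spec, engine, workers=2):
    """Classify every prompt and return confusion-matrix counts plus errors."""
    cases = [(p["prompt"], True) for p in spec["should_trigger_prompts"]]
    cases += [(p["prompt"], False) for p in spec["should_not_trigger_prompts"]]

    def classify(case):
        prompt, expected = case
        try:
            triggered = spec["skill"] in engine(prompt)
        except Exception:
            # Assumption: an engine error is scored as the wrong answer.
            return ("FN" if expected else "FP", True)
        if expected:
            return ("TP" if triggered else "FN", False)
        return ("FP" if triggered else "TN", False)

    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0, "errors": 0}
    # Prompts run concurrently, bounded by the worker count.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for cell, errored in pool.map(classify, cases):
            counts[cell] += 1
            counts["errors"] += int(errored)
    return counts


counts = run_trigger_tests(SPEC, never_trigger_engine)
print(counts)
```

With an engine that never invokes the skill, two should-trigger and two should-not-trigger prompts give TP 0, FP 0, FN 2, TN 2, the same confusion matrix as the example output above.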
**Trigger metrics (`internal/models/trigger_metrics.go`)** — moved from `internal/metrics` to `internal/models` and extended with:

- Confidence weighting: `high` (default) = 1.0, `medium` = 0.5, so borderline prompts don't dominate the score.
- `Errors` field for tracking engine failures distinct from misclassifications.
- Standard classification metrics: accuracy, precision, recall, F1, and the TP/FP/TN/FN confusion matrix.

**Integration with `cmd_run.go`:**

- After the benchmark completes, the runner discovers and executes trigger tests if present.
- If `trigger_accuracy` is listed in the eval spec's `metrics` section, the accuracy value is recorded as a `MeasureResult` and checked against its cutoff.
- Trigger results appear in the printed summary and are included in the `EvaluationOutcome` JSON output.
- The failure message now aggregates benchmark failures and trigger accuracy threshold violations.

**`EvaluationOutcome` model** gains a `TriggerMetrics` field (`trigger_metrics` in JSON), populated when trigger tests run.

### Tests

- `spec_test.go` — YAML parsing, validation of required fields, and confidence validation.
- `runner_test.go` — end-to-end runner tests using stub engines (always-trigger, never-trigger, partial-error, all-error scenarios), plus a test that discovers and parses the example fixture.
- `trigger_metrics_test.go` — confidence weighting cases (medium half-weight, empty defaults to high, all-medium).
- Updated `cmd_run_test.go` and `cmd_run_shutdown_test.go` for the new error message format.

### Documentation

Updated `docs/GRADERS.md` with a full section covering trigger test file format, confidence weighting, metrics, the `trigger_accuracy` metric configuration, and error handling behavior.

Closes #36
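
To make the confidence weighting concrete, here is a small Python sketch of one way weighted classification metrics could be computed. It is an assumption-laden illustration, not the code in `trigger_metrics.go`: the function `weighted_metrics`, the tuple layout of `results`, and the idea that each prompt adds its weight (rather than 1) to its confusion-matrix cell are all hypothetical. Only the weights (`high` = 1.0, `medium` = 0.5, empty defaults to `high`) and the metric names come from the description above. The cutoff value is likewise made up for the example.

```python
WEIGHTS = {"high": 1.0, "medium": 0.5}  # values from the PR description


def weighted_metrics(results):
    """Compute weighted accuracy, precision, recall, and F1.

    results: iterable of (expected, triggered, confidence) tuples.
    An empty confidence defaults to "high".
    """
    tp = fp = fn = tn = 0.0
    for expected, triggered, confidence in results:
        w = WEIGHTS[confidence or "high"]
        if expected and triggered:
            tp += w
        elif expected:
            fn += w
        elif triggered:
            fp += w
        else:
            tn += w
    total = tp + fp + fn + tn
    # Ratios with an empty denominator are reported as 0.0.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


results = [
    (True, True, "high"),     # correct, weight 1.0
    (True, False, "medium"),  # missed, weight 0.5
    (False, False, ""),       # correct, empty confidence defaults to high
]
metrics = weighted_metrics(results)
print(metrics["accuracy"])  # (1.0 + 1.0) / 2.5

# Hypothetical cutoff check, mirroring the trigger_accuracy threshold idea.
cutoff = 0.75
print("pass" if metrics["accuracy"] >= cutoff else "fail")
```

In this sketch the missed `medium` prompt costs half as much as a missed `high` prompt would, which is the stated purpose of the weighting: borderline prompts should not dominate the score.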

* feat: Add squad ceremonies and decisions documentation

  - Created ceremonies.md to outline team meeting structures including Design Review, Code Review, and Retrospective.
  - Established decisions.md to document key team decisions, including model selection directives and documentation maintenance routing.
  - Added identity files (now.md, wisdom.md) for tracking current focus areas and team insights.
  - Logged multiple sessions detailing implementation progress, performance audits, and planning activities.
  - Introduced routing.md to clarify work assignment processes and issue routing for team members.
  - Added orchestration-log directory for session logs and .gitkeep files for empty directories.

* chore: bump version to 0.12.0

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: Update registry and sync versions for v0.12.0

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

spboyer and others added 3 commits February 28, 2026 08:43

chore: bump version to 0.12.0

c3cff15

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

chore: Update registry and sync versions for v0.12.0

c9cb8f7

spboyer requested review from chlowell and richardpark-msft as code owners March 2, 2026 18:18

spboyer merged commit d7dd1aa into main Mar 2, 2026
5 checks passed

spboyer deleted the release/v0.12.0 branch March 2, 2026 18:24

spboyer mentioned this pull request Feb 28, 2026

🎯 Waza Platform Roadmap - Tracking Issue #8

Closed

spboyer merged 3 commits into main from release/v0.12.0
