chore: Release v0.12.0 — registry and version sync #36
Merged
- Created ceremonies.md to outline team meeting structures including Design Review, Code Review, and Retrospective.
- Established decisions.md to document key team decisions, including model selection directives and documentation maintenance routing.
- Added identity files (now.md, wisdom.md) for tracking current focus areas and team insights.
- Logged multiple sessions detailing implementation progress, performance audits, and planning activities.
- Introduced routing.md to clarify work assignment processes and issue routing for team members.
- Added orchestration-log directory for session logs and .gitkeep files for empty directories.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
spboyer added a commit that referenced this pull request on Mar 3, 2026
E2 sensei engine (#32, #33) shipped but SKILL.md still listed `waza dev` as "coming soon". Updated documentation to reflect implemented features.

## Changes

**Help text block**
- Added DEV FLAGS section: `--target`, `--max-iterations`, `--auto`
- Updated command summary to include `[skill-path]` argument

**waza dev command section**
- Usage examples for scoring workflows
- Compliance levels: Low → Medium → Medium-High → High with progression criteria
- Scoring checks: description length (150+ chars, 1024 max), trigger phrases, anti-trigger phrases, routing clarity markers, token budget (500 soft / 5000 hard)

**Future features section**
- Marked trigger accuracy tests (#36), `--skip-integration` (#37), `--fast` (#38), improvement suggestions engine (#34) as "Coming Soon"

```bash
# Score and improve to high compliance
waza dev skills/my-skill --target high

# Auto-apply without prompts
waza dev skills/my-skill --auto --max-iterations 3
```

Current state: waza skill at High compliance (638 chars, 13 triggers, 3 anti-triggers), 2097 tokens.

<details>
<summary>Original prompt</summary>

> ----
>
> *This section details the original issue you should resolve*
>
> <issue_title>[E5] Update SKILL.md with sensei engine features when E2 completes</issue_title>
> <issue_description>The waza SKILL.md (#52) was created with `waza dev` listed as 'coming soon'. Once Charles completes E2 (#32-38), update the SKILL.md to:
> - Document `waza dev` command usage
> - Add compliance scoring examples
> - Add trigger accuracy test examples
> - Update the help text block
>
> Blocked on: spboyer/waza#33, spboyer/waza#32</issue_description>
>
> ## Comments on the Issue (you are @copilot in this section)
>
> <comments>
> <comment_new><author>@spboyer</author><body>
> @copilot Please pick up this issue. Here's the scope:
>
> ## What's ready to document (blockers spboyer/waza#32 and spboyer/waza#33 are closed)
>
> 1. **`waza dev` command usage**: The sensei dev loop is implemented. Document the command, its flags (`--verbose`, `--target-score`, `--max-iterations`), and a usage example in the SKILL.md.
>
> 2. **Compliance scoring**: Document how compliance scoring works (the scoring system from spboyer/waza#33 is merged).
>
> ## What's still 'coming soon' (keep as placeholders)
>
> - Trigger accuracy tests (#36, still open)
> - `--skip-integration` flag (#37, still open)
> - `--fast` flag (#38, still open)
> - Improvement suggestions engine (#34, still open)
>
> ## Files to update
> - `skills/waza/SKILL.md`: the main SKILL.md for the waza skill
>
> Read the existing SKILL.md first, then update the sections that reference `waza dev` as 'coming soon' with actual documentation. Keep the token count reasonable (under 500 tokens for the SKILL.md frontmatter+body).</body></comment_new>
> </comments>

</details>

- Fixes spboyer/waza#93

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: spboyer <7681382+spboyer@users.noreply.github.com>
spboyer pushed a commit that referenced this pull request on Mar 3, 2026
## Implement trigger accuracy tests
Trigger tests measure whether a skill activates for the right prompts
and stays silent for the wrong ones. They run automatically when a
`trigger_tests.yaml` file exists alongside `eval.yaml`.
### Example output
```
waza run eval.yaml
Running benchmark: test-skill-eval
Skill: test-skill
Engine: copilot-sdk
Model: gpt-5.1-codex-mini
✓ [1/1] Test that tests testing
===================================================
BENCHMARK RESULTS
===================================================
Total Tests: 1
Succeeded: 1
Failed: 0
Errors: 0
Success Rate: 100.0%
Aggregate Score: 0.00
Min Score: 0.00
Max Score: 0.00
Std Dev: 0.0000
Duration: 4.675s
---------------------------------------------------
PER-TASK BREAKDOWN
---------------------------------------------------
✓ Trigger test for test-skill [passed]
pass_rate=100.0% avg=0.00 min=0.00 max=0.00 stddev=0.0000 avg_dur=4671ms
---------------------------------------------------
TRIGGER ACCURACY
---------------------------------------------------
Accuracy: 50.0% (2/4)
Precision: 0.0% Recall: 0.0% F1: 0.0%
TP: 0 FP: 0 FN: 2 TN: 2
```
### What's included
**New `internal/trigger` package** with three components:
- **Spec parsing** (`spec.go`) — defines the `trigger_tests.yaml`
format: a `skill` name paired with `should_trigger_prompts` and
`should_not_trigger_prompts` lists. Each prompt supports an optional
`confidence` level (`high` or `medium`) and a `reason` for
documentation.
- **Discovery** (`discover.go`) — searches the eval spec directory for
`trigger_tests.yaml` and returns the parsed spec (or nil if absent).
- **Runner** (`runner.go`) — executes all trigger prompts against the
agent engine concurrently (respecting the configured worker count),
checks whether the target skill appeared in the response's
`SkillInvocations`, and computes classification metrics. Engine errors
count as incorrect classifications, with the error count tracked
separately.
**Trigger metrics (`internal/models/trigger_metrics.go`)** — moved from
`internal/metrics` to `internal/models` and extended with:
- Confidence weighting: `high` (default) = 1.0, `medium` = 0.5, so
borderline prompts don't dominate the score.
- `Errors` field for tracking engine failures distinct from
misclassifications.
- Standard classification metrics: accuracy, precision, recall, F1, and
the TP/FP/TN/FN confusion matrix.
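
As a rough illustration of how confidence weighting could feed the classification metrics, the sketch below accumulates a weighted confusion matrix and derives accuracy, precision, recall, and F1 from it. The type and function names (`Outcome`, `Metrics`, `Tally`) are hypothetical, and applying the weight directly to the confusion-matrix cells is an assumption; the actual `trigger_metrics.go` may apply the weights differently. Zero denominators are reported as 0, which matches the 0.0% precision and recall in the example output above.

```go
package main

import "fmt"

// Outcome is one classified trigger prompt (hypothetical shape).
type Outcome struct {
	ShouldTrigger bool   // expected: the skill should activate
	DidTrigger    bool   // observed: the skill was invoked
	Confidence    string // "high", "medium", or "" (treated as high)
}

// weight maps a confidence level to its contribution:
// high (the default) = 1.0, medium = 0.5.
func weight(confidence string) float64 {
	if confidence == "medium" {
		return 0.5
	}
	return 1.0
}

// Metrics is a confidence-weighted confusion matrix.
type Metrics struct {
	TP, FP, TN, FN float64
}

// Tally accumulates each outcome into the matching cell, scaled by weight.
func Tally(outcomes []Outcome) Metrics {
	var m Metrics
	for _, o := range outcomes {
		w := weight(o.Confidence)
		switch {
		case o.ShouldTrigger && o.DidTrigger:
			m.TP += w
		case o.ShouldTrigger && !o.DidTrigger:
			m.FN += w
		case !o.ShouldTrigger && o.DidTrigger:
			m.FP += w
		default:
			m.TN += w
		}
	}
	return m
}

// ratio returns num/den, or 0 when the denominator is zero.
func ratio(num, den float64) float64 {
	if den == 0 {
		return 0
	}
	return num / den
}

func (m Metrics) Accuracy() float64  { return ratio(m.TP+m.TN, m.TP+m.FP+m.TN+m.FN) }
func (m Metrics) Precision() float64 { return ratio(m.TP, m.TP+m.FP) }
func (m Metrics) Recall() float64    { return ratio(m.TP, m.TP+m.FN) }

// F1 is the harmonic mean of precision and recall.
func (m Metrics) F1() float64 {
	p, r := m.Precision(), m.Recall()
	return ratio(2*p*r, p+r)
}

func main() {
	// Two should-trigger and two should-not-trigger prompts, none of which
	// activated the skill: the same shape as TP 0, FP 0, FN 2, TN 2.
	m := Tally([]Outcome{
		{ShouldTrigger: true},
		{ShouldTrigger: true},
		{ShouldTrigger: false},
		{ShouldTrigger: false},
	})
	fmt.Printf("Accuracy: %.1f%% Precision: %.1f%% Recall: %.1f%% F1: %.1f%%\n",
		100*m.Accuracy(), 100*m.Precision(), 100*m.Recall(), 100*m.F1())
	// prints Accuracy: 50.0% Precision: 0.0% Recall: 0.0% F1: 0.0%
}
```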
**Integration with `cmd_run.go`:**
- After the benchmark completes, the runner discovers and executes
trigger tests if present.
- If `trigger_accuracy` is listed in the eval spec's `metrics` section,
the accuracy value is recorded as a `MeasureResult` and checked against
its cutoff.
- Trigger results appear in the printed summary and are included in the
`EvaluationOutcome` JSON output.
- The failure message now aggregates benchmark failures and trigger
accuracy threshold violations.
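
One way to picture the aggregated failure message is the sketch below. It is an assumption-laden illustration rather than the logic in `cmd_run.go`: `checkRun`, its parameters, and the message wording are hypothetical, and whether the real comparison against the cutoff is strict is not stated here.

```go
package main

import (
	"errors"
	"fmt"
)

// checkRun (hypothetical) combines benchmark failures and a trigger
// accuracy cutoff violation into a single error, or returns nil when
// both checks pass.
func checkRun(failedTasks int, triggerAccuracy, cutoff float64) error {
	var errs []error
	if failedTasks > 0 {
		errs = append(errs, fmt.Errorf("%d benchmark task(s) failed", failedTasks))
	}
	// Assumption: an accuracy strictly below the cutoff is a violation.
	if triggerAccuracy < cutoff {
		errs = append(errs, fmt.Errorf("trigger_accuracy %.2f is below cutoff %.2f", triggerAccuracy, cutoff))
	}
	// errors.Join returns nil when there is nothing to report and
	// otherwise joins the messages with newlines.
	return errors.Join(errs...)
}

func main() {
	if err := checkRun(1, 0.5, 0.8); err != nil {
		fmt.Println(err)
	}
}
```

`errors.Join` (Go 1.20+) keeps each cause inspectable with `errors.Is` and `errors.As`, which a plain concatenated string would not.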
**`EvaluationOutcome` model** gains a `TriggerMetrics` field
(`trigger_metrics` in JSON), populated when trigger tests run.
### Tests
- `spec_test.go` — YAML parsing, validation of required fields, and
confidence validation.
- `runner_test.go` — end-to-end runner tests using stub engines
(always-trigger, never-trigger, partial-error, all-error scenarios),
plus a test that discovers and parses the example fixture.
- `trigger_metrics_test.go` — confidence weighting cases (medium
half-weight, empty defaults to high, all-medium).
- Updated `cmd_run_test.go` and `cmd_run_shutdown_test.go` for the new
error message format.
### Documentation
Updated `docs/GRADERS.md` with a full section covering trigger test file
format, confidence weighting, metrics, the `trigger_accuracy` metric
configuration, and error handling behavior.
Closes #36
spboyer added a commit that referenced this pull request on Mar 3, 2026
* feat: Add squad ceremonies and decisions documentation
  - Created ceremonies.md to outline team meeting structures including Design Review, Code Review, and Retrospective.
  - Established decisions.md to document key team decisions, including model selection directives and documentation maintenance routing.
  - Added identity files (now.md, wisdom.md) for tracking current focus areas and team insights.
  - Logged multiple sessions detailing implementation progress, performance audits, and planning activities.
  - Introduced routing.md to clarify work assignment processes and issue routing for team members.
  - Added orchestration-log directory for session logs and .gitkeep files for empty directories.
* chore: bump version to 0.12.0
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* chore: Update registry and sync versions for v0.12.0

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Automated update from the Release workflow. Updates registry.json with v0.12.0 download URLs and syncs version.txt.