chore: Release v0.12.0 — registry and version sync #36

Merged: spboyer merged 3 commits into main from release/v0.12.0 on Mar 2, 2026

@spboyer spboyer commented Mar 2, 2026

Automated update from the Release workflow. Updates registry.json with v0.12.0 download URLs and syncs version.txt.

spboyer and others added 3 commits February 28, 2026 08:43
- Created ceremonies.md to outline team meeting structures including Design Review, Code Review, and Retrospective.
- Established decisions.md to document key team decisions, including model selection directives and documentation maintenance routing.
- Added identity files (now.md, wisdom.md) for tracking current focus areas and team insights.
- Logged multiple sessions detailing implementation progress, performance audits, and planning activities.
- Introduced routing.md to clarify work assignment processes and issue routing for team members.
- Added orchestration-log directory for session logs and .gitkeep files for empty directories.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer merged commit d7dd1aa into main Mar 2, 2026
5 checks passed
@spboyer spboyer deleted the release/v0.12.0 branch March 2, 2026 18:24
spboyer added a commit that referenced this pull request Mar 3, 2026
E2 sensei engine (#32, #33) shipped but SKILL.md still listed `waza dev`
as "coming soon". Updated documentation to reflect implemented features.

## Changes

**Help text block**
- Added DEV FLAGS section: `--target`, `--max-iterations`, `--auto`
- Updated command summary to include `[skill-path]` argument

**waza dev command section**
- Usage examples for scoring workflows
- Compliance levels: Low → Medium → Medium-High → High with progression
criteria
- Scoring checks: description length (150+ chars, 1024 max), trigger
phrases, anti-trigger phrases, routing clarity markers, token budget
(500 soft / 5000 hard)

**Future features section**
- Marked trigger accuracy tests (#36), `--skip-integration` (#37),
`--fast` (#38), improvement suggestions engine (#34) as "Coming Soon"

```bash
# Score and improve to high compliance
waza dev skills/my-skill --target high

# Auto-apply without prompts
waza dev skills/my-skill --auto --max-iterations 3
```

Current state: waza skill at High compliance (638 chars, 13 triggers, 3
anti-triggers), 2097 tokens.
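
The two numeric checks listed above (description length and token budget) can be illustrated with a minimal Go sketch. This is not waza's implementation: the function names, the status strings, and the choice of inclusive bounds are assumptions made for illustration, and the mapping from checks to compliance levels is not modeled because the progression criteria are not spelled out here.

```go
package main

import "fmt"

// Limits taken from the change summary above.
const (
	minDescriptionChars = 150  // "150+ chars"
	maxDescriptionChars = 1024 // "1024 max"
	softTokenLimit      = 500  // soft budget
	hardTokenLimit      = 5000 // hard budget
)

// descriptionStatus is a hypothetical helper that classifies a
// description by its character count. Bounds are assumed inclusive.
func descriptionStatus(chars int) string {
	switch {
	case chars < minDescriptionChars:
		return "too short"
	case chars > maxDescriptionChars:
		return "too long"
	default:
		return "ok"
	}
}

// tokenStatus is a hypothetical helper that classifies a token count
// against the soft and hard budgets.
func tokenStatus(tokens int) string {
	switch {
	case tokens > hardTokenLimit:
		return "over hard limit"
	case tokens > softTokenLimit:
		return "over soft limit"
	default:
		return "within budget"
	}
}

func main() {
	// Values reported for the waza skill: 638 chars, 2097 tokens.
	fmt.Println(descriptionStatus(638)) // ok
	fmt.Println(tokenStatus(2097))      // over soft limit
}
```

Under these assumptions the reported state passes the description check and exceeds only the soft token budget, which is consistent with a soft limit acting as a warning rather than a failure.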

<!-- START COPILOT ORIGINAL PROMPT -->



<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details the original issue you should resolve*
> 
> <issue_title>[E5] Update SKILL.md with sensei engine features when E2
completes</issue_title>
> <issue_description>The waza SKILL.md (#52) was created with `waza dev`
listed as 'coming soon'. Once Charles completes E2 (#32-38), update the
SKILL.md to:
> - Document `waza dev` command usage
> - Add compliance scoring examples
> - Add trigger accuracy test examples
> - Update the help text block
> 
> Blocked on: spboyer/waza#33, spboyer/waza#32</issue_description>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> <comment_new><author>@spboyer</author><body>
> @copilot Please pick up this issue. Here's the scope:
> 
> ## What's ready to document (blockers spboyer/waza#32 and
spboyer/waza#33 are closed)
> 
> 1. **`waza dev` command usage** — The sensei dev loop is implemented.
Document the command, its flags (`--verbose`, `--target-score`,
`--max-iterations`), and a usage example in the SKILL.md.
> 
> 2. **Compliance scoring** — Document how compliance scoring works (the
scoring system from spboyer/waza#33 is merged).
> 
> ## What's still 'coming soon' (keep as placeholders)
> 
> - Trigger accuracy tests (#36 — still open)
> - `--skip-integration` flag (#37 — still open)  
> - `--fast` flag (#38 — still open)
> - Improvement suggestions engine (#34 — still open)
> 
> ## Files to update
> - `skills/waza/SKILL.md` — the main SKILL.md for the waza skill
> 
> Read the existing SKILL.md first, then update the sections that
reference `waza dev` as 'coming soon' with actual documentation. Keep
the token count reasonable (under 500 tokens for the SKILL.md
frontmatter+body).</body></comment_new>
> </comments>
> 


</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes spboyer/waza#93

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: spboyer <7681382+spboyer@users.noreply.github.com>
spboyer pushed a commit that referenced this pull request Mar 3, 2026
## Implement trigger accuracy tests

Trigger tests measure whether a skill activates for the right prompts
and stays silent for the wrong ones. They run automatically when a
`trigger_tests.yaml` file exists alongside `eval.yaml`.

### Example output

```
waza run eval.yaml
Running benchmark: test-skill-eval
Skill: test-skill
Engine: copilot-sdk
Model: gpt-5.1-codex-mini

✓ [1/1] Test that tests testing
===================================================
 BENCHMARK RESULTS
===================================================

Total Tests:    1
Succeeded:      1
Failed:         0
Errors:         0
Success Rate:   100.0%
Aggregate Score: 0.00
Min Score:      0.00
Max Score:      0.00
Std Dev:        0.0000
Duration:       4.675s

---------------------------------------------------
 PER-TASK BREAKDOWN
---------------------------------------------------
  ✓ Trigger test for test-skill [passed]
      pass_rate=100.0%  avg=0.00  min=0.00  max=0.00  stddev=0.0000  avg_dur=4671ms

---------------------------------------------------
 TRIGGER ACCURACY
---------------------------------------------------
  Accuracy:  50.0% (2/4)
  Precision: 0.0%  Recall: 0.0%  F1: 0.0%
  TP: 0  FP: 0  FN: 2  TN: 2
```
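
The TRIGGER ACCURACY figures in the output above follow from the confusion matrix by the standard definitions. The Go sketch below reproduces them. The `Confusion` type and its methods are illustrative names, not the types in `internal/models`, and returning 0 for a zero denominator is an assumption chosen to match the 0.0% precision shown when TP + FP = 0.

```go
package main

import "fmt"

// Confusion holds the four confusion-matrix counts (hypothetical type).
type Confusion struct {
	TP, FP, FN, TN float64
}

// ratio returns num/den, or 0 when the denominator is zero.
func ratio(num, den float64) float64 {
	if den == 0 {
		return 0
	}
	return num / den
}

// Accuracy is the share of all prompts classified correctly.
func (c Confusion) Accuracy() float64 {
	return ratio(c.TP+c.TN, c.TP+c.FP+c.FN+c.TN)
}

// Precision is the share of triggered prompts that should have triggered.
func (c Confusion) Precision() float64 { return ratio(c.TP, c.TP+c.FP) }

// Recall is the share of should-trigger prompts that did trigger.
func (c Confusion) Recall() float64 { return ratio(c.TP, c.TP+c.FN) }

// F1 is the harmonic mean of precision and recall.
func (c Confusion) F1() float64 {
	p, r := c.Precision(), c.Recall()
	return ratio(2*p*r, p+r)
}

func main() {
	// Counts from the example output: TP: 0  FP: 0  FN: 2  TN: 2
	c := Confusion{TP: 0, FP: 0, FN: 2, TN: 2}
	fmt.Printf("Accuracy: %.1f%%  Precision: %.1f%%  Recall: %.1f%%  F1: %.1f%%\n",
		100*c.Accuracy(), 100*c.Precision(), 100*c.Recall(), 100*c.F1())
	// Accuracy: 50.0%  Precision: 0.0%  Recall: 0.0%  F1: 0.0%
}
```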

### What's included

**New `internal/trigger` package** with three components:

- **Spec parsing** (`spec.go`) — defines the `trigger_tests.yaml`
format: a `skill` name paired with `should_trigger_prompts` and
`should_not_trigger_prompts` lists. Each prompt supports an optional
`confidence` level (`high` or `medium`) and a `reason` for
documentation.
- **Discovery** (`discover.go`) — searches the eval spec directory for
`trigger_tests.yaml` and returns the parsed spec (or nil if absent).
- **Runner** (`runner.go`) — executes all trigger prompts against the
agent engine concurrently (respecting the configured worker count),
checks whether the target skill appeared in the response's
`SkillInvocations`, and computes classification metrics. Engine errors
count as incorrect classifications, with the error count tracked
separately.
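
The runner behavior described above can be sketched as a small worker pool. This is a minimal illustration under stated assumptions, not the code in `runner.go`: the `Engine` interface, `Response`, `Case`, `Summary`, and `stubEngine` are hypothetical stand-ins, and only the `SkillInvocations` field name comes from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Response is a hypothetical stand-in for the engine's response type.
type Response struct {
	SkillInvocations []string
}

// Engine is a hypothetical interface for the agent engine.
type Engine interface {
	Execute(prompt string) (Response, error)
}

// Case pairs a prompt with the expected trigger behavior.
type Case struct {
	Prompt        string
	ShouldTrigger bool
}

// Summary counts outcomes. Errors are also counted as Incorrect.
type Summary struct {
	Correct, Incorrect, Errors int
}

// invoked reports whether skill appears in the response's invocations.
func invoked(resp Response, skill string) bool {
	for _, s := range resp.SkillInvocations {
		if s == skill {
			return true
		}
	}
	return false
}

// runTriggers fans the cases out to a fixed number of workers.
func runTriggers(e Engine, skill string, cases []Case, workers int) Summary {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan Case)
	var (
		mu  sync.Mutex
		sum Summary
		wg  sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range jobs {
				resp, err := e.Execute(c.Prompt)
				mu.Lock()
				switch {
				case err != nil:
					// Engine errors count as incorrect and are tracked separately.
					sum.Errors++
					sum.Incorrect++
				case invoked(resp, skill) == c.ShouldTrigger:
					sum.Correct++
				default:
					sum.Incorrect++
				}
				mu.Unlock()
			}
		}()
	}
	for _, c := range cases {
		jobs <- c
	}
	close(jobs)
	wg.Wait()
	return sum
}

// stubEngine triggers "test-skill" for one prompt and fails for another.
type stubEngine struct{}

func (stubEngine) Execute(prompt string) (Response, error) {
	switch prompt {
	case "fail":
		return Response{}, errors.New("engine unavailable")
	case "run my evals":
		return Response{SkillInvocations: []string{"test-skill"}}, nil
	default:
		return Response{}, nil
	}
}

func main() {
	cases := []Case{
		{Prompt: "run my evals", ShouldTrigger: true},
		{Prompt: "what's the weather", ShouldTrigger: false},
		{Prompt: "fail", ShouldTrigger: true},
	}
	fmt.Printf("%+v\n", runTriggers(stubEngine{}, "test-skill", cases, 2))
	// {Correct:2 Incorrect:1 Errors:1}
}
```

The counts are independent of scheduling order because each case updates the shared summary under the mutex, so the stub scenarios stay deterministic even with several workers.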

**Trigger metrics (`internal/models/trigger_metrics.go`)** — moved from
`internal/metrics` to `internal/models` and extended with:

- Confidence weighting: `high` (default) = 1.0, `medium` = 0.5, so
borderline prompts don't dominate the score.
- `Errors` field for tracking engine failures distinct from
misclassifications.
- Standard classification metrics: accuracy, precision, recall, F1, and
the TP/FP/TN/FN confusion matrix.
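
One way the confidence weights could feed into accuracy is sketched below in Go. The weights (`high` = 1.0, `medium` = 0.5, empty defaults to `high`) come from the description above; the `Result` type, the function names, and the exact way weights enter the accuracy calculation are assumptions, since the formula used in `trigger_metrics.go` is not shown here.

```go
package main

import "fmt"

// Result is a hypothetical record of one trigger prompt's outcome.
type Result struct {
	ShouldTrigger bool
	Triggered     bool
	Confidence    string // "high", "medium", or "" (treated as high)
}

// weight maps a confidence level to its weight; anything other than
// "medium" is treated as high in this sketch.
func weight(confidence string) float64 {
	if confidence == "medium" {
		return 0.5
	}
	return 1.0
}

// weightedAccuracy divides the weight of correct classifications by
// the total weight, so medium-confidence prompts count half as much.
func weightedAccuracy(results []Result) float64 {
	var correct, total float64
	for _, r := range results {
		w := weight(r.Confidence)
		total += w
		if r.ShouldTrigger == r.Triggered {
			correct += w
		}
	}
	if total == 0 {
		return 0
	}
	return correct / total
}

func main() {
	results := []Result{
		{ShouldTrigger: true, Triggered: true, Confidence: "high"},    // correct, weight 1.0
		{ShouldTrigger: true, Triggered: false, Confidence: "medium"}, // wrong, weight 0.5
		{ShouldTrigger: false, Triggered: false},                      // correct, default weight 1.0
	}
	fmt.Printf("%.2f\n", weightedAccuracy(results)) // 0.80
}
```

In this example the single miss is a medium-confidence prompt, so weighted accuracy is 2.0 / 2.5 = 0.80 instead of the unweighted 2 / 3.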

**Integration with `cmd_run.go`:**

- After the benchmark completes, the runner discovers and executes
trigger tests if present.
- If `trigger_accuracy` is listed in the eval spec's `metrics` section,
the accuracy value is recorded as a `MeasureResult` and checked against
its cutoff.
- Trigger results appear in the printed summary and are included in the
`EvaluationOutcome` JSON output.
- The failure message now aggregates benchmark failures and trigger
accuracy threshold violations.

**`EvaluationOutcome` model** gains a `TriggerMetrics` field
(`trigger_metrics` in JSON), populated when trigger tests run.

### Tests

- `spec_test.go` — YAML parsing, validation of required fields, and
confidence validation.
- `runner_test.go` — end-to-end runner tests using stub engines
(always-trigger, never-trigger, partial-error, all-error scenarios),
plus a test that discovers and parses the example fixture.
- `trigger_metrics_test.go` — confidence weighting cases (medium
half-weight, empty defaults to high, all-medium).
- Updated `cmd_run_test.go` and `cmd_run_shutdown_test.go` for the new
error message format.

### Documentation

Updated `docs/GRADERS.md` with a full section covering trigger test file
format, confidence weighting, metrics, the `trigger_accuracy` metric
configuration, and error handling behavior.

Closes #36
spboyer added a commit that referenced this pull request Mar 3, 2026
* feat: Add squad ceremonies and decisions documentation

- Created ceremonies.md to outline team meeting structures including Design Review, Code Review, and Retrospective.
- Established decisions.md to document key team decisions, including model selection directives and documentation maintenance routing.
- Added identity files (now.md, wisdom.md) for tracking current focus areas and team insights.
- Logged multiple sessions detailing implementation progress, performance audits, and planning activities.
- Introduced routing.md to clarify work assignment processes and issue routing for team members.
- Added orchestration-log directory for session logs and .gitkeep files for empty directories.

* chore: bump version to 0.12.0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: Update registry and sync versions for v0.12.0

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>