Problem
A SKILL.md is the contract — it tells the model when to invoke (USE FOR), when not to (DO NOT USE FOR), and what parameters mean. But there's no automated check that an eval.yaml actually covers those promises. Drift between SKILL.md claims and what's tested is invisible until users hit it.
Proposal
Add waza spec verify (verification first; generation is out of scope here):
- Parses SKILL.md deterministically: description, USE FOR triggers, DO NOT USE FOR triggers, parameter list. Emits machine-readable requirement IDs (req-use-001, req-dont-001, etc.) with source spans.
- Loads eval.yaml and computes coverage: which requirement IDs are exercised by which tasks.
- Optional semantic matching (LLM-assisted), gated by a flag: the deterministic parser runs first; the LLM only fills gaps.
- Reports uncovered requirements. CI gate supports --warn (exit 0 with warnings) and --fail (non-zero exit).
$ waza spec verify
✓ req-use-001 "summarize a PR diff" → covered by tasks: [pr-summary-basic, pr-summary-large]
✗ req-dont-002 "code review of security PRs" → no task exercises this
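The parse and coverage steps above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not waza's implementation: the SKILL.md layout (a `USE FOR:` / `DO NOT USE FOR:` line followed by `- ` bullets), the `covers` field on each task, and the pre-loaded task dicts standing in for eval.yaml are all hypothetical. The `line-by-line style nitpicks` trigger and the `style-nitpick-decline` task are invented so that req-dont-002 has a predecessor.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    req_id: str  # e.g. "req-use-001"
    text: str    # trigger text as written in SKILL.md
    line: int    # 1-based source line, the simplest possible source span


# Hypothetical section markers; the real SKILL.md layout may differ.
SECTIONS = {"USE FOR:": "use", "DO NOT USE FOR:": "dont"}


def parse_requirements(skill_md: str) -> list[Requirement]:
    """Deterministically turn trigger bullets into numbered requirements."""
    reqs: list[Requirement] = []
    counters = {"use": 0, "dont": 0}
    current = None
    for lineno, raw in enumerate(skill_md.splitlines(), start=1):
        line = raw.strip()
        if line in SECTIONS:
            current = SECTIONS[line]
            continue
        bullet = re.match(r"^[-*]\s+(.*)$", line)
        if bullet and current:
            counters[current] += 1
            req_id = f"req-{current}-{counters[current]:03d}"
            reqs.append(Requirement(req_id, bullet.group(1), lineno))
        elif line:
            # Any other non-blank line ends the current trigger section.
            current = None
    return reqs


def coverage(reqs: list[Requirement], tasks: list[dict]) -> dict[str, list[str]]:
    """Map each requirement ID to the names of the tasks that declare it."""
    covered: dict[str, list[str]] = {r.req_id: [] for r in reqs}
    for task in tasks:
        for req_id in task.get("covers", []):  # "covers" is an assumed field
            if req_id in covered:
                covered[req_id].append(task["name"])
    return covered


SKILL_MD = """\
# pr-summarizer

USE FOR:
- summarize a PR diff

DO NOT USE FOR:
- line-by-line style nitpicks
- code review of security PRs
"""

# Stand-in for the parsed contents of eval.yaml.
TASKS = [
    {"name": "pr-summary-basic", "covers": ["req-use-001"]},
    {"name": "pr-summary-large", "covers": ["req-use-001"]},
    {"name": "style-nitpick-decline", "covers": ["req-dont-001"]},
]

reqs = parse_requirements(SKILL_MD)
report = coverage(reqs, TASKS)

for req in reqs:
    names = report[req.req_id]
    mark = "✓" if names else "✗"
    detail = "covered by tasks: " + ", ".join(names) if names else "no task exercises this"
    print(f'{mark} {req.req_id} "{req.text}" (SKILL.md:{req.line}) → {detail}')
```

Explicit `covers` declarations keep the coverage check fully deterministic; any LLM-assisted matching would only need to run for requirements whose task list comes back empty.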
Why this matters for agentic-first
Agentic skills succeed or fail on when they engage. The router uses SKILL.md to make that decision, but evals usually only test the happy path. This issue makes negative-trigger coverage visible and gateable.
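To make "visible and gateable" concrete, here is a small Python sketch of how a gate might turn a coverage map into an exit code. The function names, the `mode` strings, and the reading of the threshold as a count of uncovered requirements are assumptions for illustration; the proposal only fixes that --warn exits 0 and --fail exits non-zero.

```python
from collections import defaultdict


def uncovered_by_kind(coverage: dict[str, list[str]]) -> dict[str, list[str]]:
    """Group uncovered requirement IDs by kind ("use", "dont", ...)."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for req_id, tasks in sorted(coverage.items()):
        if not tasks:
            kind = req_id.split("-")[1]  # "req-dont-002" -> "dont"
            grouped[kind].append(req_id)
    return dict(grouped)


def gate_exit_code(coverage: dict[str, list[str]],
                   mode: str = "warn",
                   threshold: int = 1) -> int:
    """Return the process exit code for a hypothetical CI gate.

    mode="warn": always 0, uncovered requirements are only printed.
    mode="fail": 1 once the number of uncovered requirements reaches
                 the threshold (assumed here to be a plain count).
    """
    uncovered = [req_id for req_id, tasks in sorted(coverage.items()) if not tasks]
    for req_id in uncovered:
        print(f"warning: {req_id} is not exercised by any task")
    if mode == "fail" and len(uncovered) >= threshold:
        return 1
    return 0


example = {
    "req-use-001": ["pr-summary-basic", "pr-summary-large"],
    "req-dont-002": [],
}
print(uncovered_by_kind(example))
print(gate_exit_code(example, mode="fail"))
```

Grouping by kind is what surfaces the negative-trigger gap: a skill can have every USE FOR requirement covered and still show a non-empty "dont" bucket.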
Acceptance criteria
- waza spec verify command emits requirement IDs with source spans (file:line ranges).
- Deterministic parser handles USE FOR / DO NOT USE FOR / parameter blocks; tested against the existing skills corpus.
- Semantic matching is gated behind the --semantic flag and uses the configured judge model.
- CI gate supports --warn (exit 0) and --fail (exit 1 when uncovered ≥ threshold).
- Docs added under site/ with a worked example and a CI snippet.
Non-goals (filed separately)
Related