Problem
waza suggest already bootstraps test cases from a SKILL.md, but in practice authors:
- Get a small fixed batch and have to re-run for more coverage.
- Can't ask for cases targeting a specific behavior (negative triggers, edge fixtures, "DO NOT USE FOR" cases).
- Have no signal for which generated cases are high vs. low confidence.
- Risk overwriting hand-curated
eval.yaml entries on regeneration.
So skills still ship with 3–5 hand-written cases and discover edge behavior in production.
Proposal
Enhance the existing waza suggest (and its internal/suggest/ pipeline) rather than introducing a new command:
--count N — how many cases to propose.
--focus <category> — triggers | negative-triggers | edge-fixtures | do-not-use-for | parameters.
- Emit per-case
confidence and rationale (which SKILL.md span it came from) so authors can triage.
--dry-run prints proposals; --apply merges into eval.yaml. Never overwrite an existing task id without --force.
- Generated tasks must validate against the existing eval schema before they're written.
Why this matters for agentic-first
Coverage of agentic intent (when to invoke, when to refuse, which tools, which parameters) is the long tail of skill quality. The current suggest is great for a first pass; this issue is about giving authors a steerable second/third pass.
Acceptance criteria
Non-goals (filed separately)
Related
Problem
waza suggestalready bootstraps test cases from a SKILL.md, but in practice authors:eval.yamlentries on regeneration.So skills still ship with 3–5 hand-written cases and discover edge behavior in production.
Proposal
Enhance the existing
waza suggest(and itsinternal/suggest/pipeline) rather than introducing a new command:--count N— how many cases to propose.--focus <category>—triggers|negative-triggers|edge-fixtures|do-not-use-for|parameters.confidenceandrationale(which SKILL.md span it came from) so authors can triage.--dry-runprints proposals;--applymerges intoeval.yaml. Never overwrite an existing task id without--force.Why this matters for agentic-first
Coverage of agentic intent (when to invoke, when to refuse, which tools, which parameters) is the long tail of skill quality. The current
suggestis great for a first pass; this issue is about giving authors a steerable second/third pass.Acceptance criteria
waza suggestaccepts--count,--focus,--dry-run,--apply,--force.confidence(0–1) and arationalefield referencing the SKILL.md span.--applyrefuses to overwrite an existing task id unless--forceis set; surface a clear diff.internal/validation/schemabefore being written.site/with a worked example.Non-goals (filed separately)
Related
internal/suggest/