feat: Synthetic test case generation from skill spec #357

@spboyer

Problem

waza suggest already bootstraps test cases from a SKILL.md, but in practice authors:

  1. Get a small fixed batch and have to re-run for more coverage.
  2. Can't ask for cases targeting a specific behavior (negative triggers, edge fixtures, "DO NOT USE FOR" cases).
  3. Have no signal for which generated cases are high vs. low confidence.
  4. Risk overwriting hand-curated eval.yaml entries on regeneration.

So skills still ship with 3–5 hand-written cases, and authors discover edge behavior in production.

Proposal

Enhance the existing waza suggest (and its internal/suggest/ pipeline) rather than introducing a new command:

  • --count N — how many cases to propose.
  • --focus <category>, where <category> is one of triggers | negative-triggers | edge-fixtures | do-not-use-for | parameters.
  • Emit per-case confidence and rationale (which SKILL.md span it came from) so authors can triage.
  • --dry-run prints proposals; --apply merges into eval.yaml. Never overwrite an existing task id without --force.
  • Generated tasks must validate against the existing eval schema before they're written.
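
The overwrite rule for --apply and --force could look roughly like the sketch below. It is a minimal Python illustration, not waza's implementation: the function name `merge_tasks`, the dict-based task shape, and the `id` key are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical task shape: a dict with at least an "id" key.
Task = Dict[str, object]


def merge_tasks(
    existing: List[Task], proposed: List[Task], force: bool = False
) -> Tuple[List[Task], List[object]]:
    """Merge proposed tasks into the existing ones.

    Returns (merged_tasks, skipped_ids). A proposed task whose id already
    exists is skipped unless force is True, in which case it replaces the
    existing entry in place. The input lists are not modified.
    """
    merged = list(existing)
    index_by_id = {task["id"]: i for i, task in enumerate(merged)}
    skipped: List[object] = []

    for task in proposed:
        task_id = task["id"]
        if task_id in index_by_id:
            if force:
                # Explicit opt-in: replace the existing entry.
                merged[index_by_id[task_id]] = task
            else:
                # Default: keep the hand-curated entry and report the conflict.
                skipped.append(task_id)
        else:
            index_by_id[task_id] = len(merged)
            merged.append(task)
    return merged, skipped
```

Returning the skipped ids separately, rather than raising on the first conflict, would let the caller report every collision in one pass and show a diff for each.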

Why this matters for agentic-first

Coverage of agentic intent (when to invoke, when to refuse, which tools, which parameters) is the long tail of skill quality. The current suggest is great for a first pass; this issue is about giving authors a steerable second/third pass.

Acceptance criteria

  • waza suggest accepts --count, --focus, --dry-run, --apply, --force.
  • Each proposed case carries confidence (0–1) and a rationale field referencing the SKILL.md span.
  • --apply refuses to overwrite an existing task id unless --force is set, and surfaces a clear diff.
  • Generated cases pass internal/validation/schema before being written.
  • Tests cover focus categories, overwrite safety, and schema validation.
  • Docs updated in site/ with a worked example.
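
To make the confidence and rationale criteria concrete, here is one possible shape for a proposed case, with a pre-write check and a simple triage split. This is a sketch under assumptions: the `ProposedCase` class, its field names, the `validate_case` and `triage` helpers, and the 0.7 threshold are hypothetical and do not describe the actual eval schema or internal/validation/schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Focus categories as listed in the proposal.
FOCUS_CATEGORIES = {
    "triggers",
    "negative-triggers",
    "edge-fixtures",
    "do-not-use-for",
    "parameters",
}


@dataclass
class ProposedCase:
    task_id: str
    focus: str
    confidence: float  # expected to lie in [0, 1]
    rationale: str     # hypothetical: free text naming the SKILL.md span


def validate_case(case: ProposedCase) -> List[str]:
    """Return a list of problems; an empty list means the case may be written."""
    errors: List[str] = []
    if not case.task_id:
        errors.append("task_id is empty")
    if case.focus not in FOCUS_CATEGORIES:
        errors.append(f"unknown focus category: {case.focus}")
    if not 0.0 <= case.confidence <= 1.0:
        errors.append(f"confidence out of range: {case.confidence}")
    if not case.rationale.strip():
        errors.append("rationale is empty")
    return errors


def triage(
    cases: List[ProposedCase], threshold: float = 0.7
) -> Tuple[List[ProposedCase], List[ProposedCase]]:
    """Split cases into (high, low) confidence, each sorted highest first.

    The threshold is an arbitrary value chosen for this example.
    """
    ordered = sorted(cases, key=lambda c: c.confidence, reverse=True)
    high = [c for c in ordered if c.confidence >= threshold]
    low = [c for c in ordered if c.confidence < threshold]
    return high, low
```

Collecting every validation error instead of stopping at the first would give authors a complete picture of why a generated case was rejected.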

Non-goals (filed separately)

Related
