feat: Synthetic test case generation from skill spec #357

## Problem

`waza suggest` already bootstraps test cases from a SKILL.md, but in practice authors:

1. Get a small fixed batch and have to re-run for more coverage.
2. Can't ask for cases targeting a specific behavior (negative triggers, edge fixtures, "DO NOT USE FOR" cases).
3. Have no signal for which generated cases are high vs. low confidence.
4. Risk overwriting hand-curated `eval.yaml` entries on regeneration.

As a result, skills still ship with 3–5 hand-written cases, and edge behavior is only discovered in production.

## Proposal

Enhance the existing `waza suggest` (and its `internal/suggest/` pipeline) rather than introducing a new command:

- `--count N` — how many cases to propose.
- `--focus <category>` — `triggers` | `negative-triggers` | `edge-fixtures` | `do-not-use-for` | `parameters`.
- Emit per-case `confidence` and `rationale` (which SKILL.md span it came from) so authors can triage.
- `--dry-run` prints proposals; `--apply` merges into `eval.yaml`. Never overwrite an existing task id without `--force`.
- Generated tasks must validate against the existing eval schema before they're written.
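
The overwrite rule for `--apply` can be sketched as a merge keyed by task id. This is a minimal illustration in Python, not the project's own code: `merge_tasks`, `MergeResult`, and the assumption that each task is a mapping with an `id` key are all hypothetical, since the real `eval.yaml` schema and the `internal/suggest/` internals are not shown in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class MergeResult:
    """Outcome of merging proposed tasks into existing ones (hypothetical type)."""
    merged: list
    added: list = field(default_factory=list)
    overwritten: list = field(default_factory=list)
    skipped: list = field(default_factory=list)


def merge_tasks(existing, proposed, force=False):
    """Merge proposed tasks into existing tasks, keyed by task id.

    Assumes every task is a dict with an "id" key (an assumption, not the
    confirmed eval.yaml schema). An existing id is replaced only when
    force=True; otherwise the proposal is skipped and reported.
    """
    # Build a new index so the caller's list is left untouched.
    by_id = {task["id"]: task for task in existing}
    result = MergeResult(merged=[])
    for task in proposed:
        task_id = task["id"]
        if task_id not in by_id:
            by_id[task_id] = task
            result.added.append(task_id)
        elif force:
            by_id[task_id] = task
            result.overwritten.append(task_id)
        else:
            result.skipped.append(task_id)
    # dicts preserve insertion order, so existing tasks keep their position.
    result.merged = list(by_id.values())
    return result
```

Tracking `skipped` ids separately from `added` and `overwritten` would let both `--dry-run` and `--apply` report exactly which proposals collided with hand-curated entries.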

## Why this matters for agentic-first

Coverage of agentic intent (when to invoke, when to refuse, which tools, which parameters) is the long tail of skill quality. The current `suggest` is great for a first pass; this issue is about giving authors a steerable second/third pass.

## Acceptance criteria

- [ ] `waza suggest` accepts `--count`, `--focus`, `--dry-run`, `--apply`, `--force`.
- [ ] Each proposed case carries `confidence` (0–1) and a `rationale` field referencing the SKILL.md span.
- [ ] `--apply` refuses to overwrite an existing task id unless `--force` is set, and surfaces a clear diff of the conflicting entries.
- [ ] Generated cases pass `internal/validation/schema` before being written.
- [ ] Tests cover focus categories, overwrite safety, and schema validation.
- [ ] Docs updated in `site/` with a worked example.
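
The `confidence` and `rationale` criteria above could be checked with something like the following sketch. It is written in Python for illustration only and is not a substitute for `internal/validation/schema`: the `check_proposal` helper, the dict shape of a proposed case, and the optional per-case `focus` key are assumptions. Only the 0–1 range, the requirement for a rationale, and the focus category names come from this issue.

```python
# Focus categories as listed in the proposal section of this issue.
FOCUS_CATEGORIES = {
    "triggers",
    "negative-triggers",
    "edge-fixtures",
    "do-not-use-for",
    "parameters",
}


def check_proposal(case):
    """Return a list of problems for one proposed case; empty means it passes.

    `case` is assumed to be a dict with "confidence" and "rationale" keys
    and an optional "focus" key (hypothetical shape).
    """
    problems = []

    confidence = case.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence must be a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be between 0 and 1")

    rationale = case.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        problems.append("rationale must reference a SKILL.md span")

    focus = case.get("focus")
    if focus is not None and focus not in FOCUS_CATEGORIES:
        problems.append(f"unknown focus category: {focus}")

    return problems
```

Returning a list of problems rather than raising on the first one is a design choice for this sketch: it would let a `--dry-run` report show every issue with a proposed case at once.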

## Non-goals (filed separately)

- Generating tests from captured traces — depends on #367 snapshot work.
- Coverage analysis against SKILL.md requirements — see #361.

## Related

- Existing module: `internal/suggest/`
- Roadmap: #66
- Coverage verification: #361
- Trace-derived generation: depends on #367

