feat: Natural-language requirements → executable evals (spec-to-test) #361

## Problem

A SKILL.md is the contract: it tells the model when to invoke a skill (`USE FOR`), when not to (`DO NOT USE FOR`), and what its parameters mean. But nothing automatically checks that an `eval.yaml` actually *covers* those promises, so drift between what SKILL.md claims and what is tested stays invisible until users hit it.

## Proposal

Add `waza spec verify` (verification first; generation is out of scope here):

- Parses SKILL.md deterministically: `description`, `USE FOR` triggers, `DO NOT USE FOR` triggers, parameter list. Emits machine-readable requirement IDs (`req-use-001`, `req-dont-001`, etc.) with source spans.
- Loads `eval.yaml` and computes coverage: which requirement IDs are exercised by which tasks.
- Optional semantic matching (LLM-assisted), gated by the `--semantic` flag: the deterministic parser runs first, and the LLM only fills the gaps it leaves.
- Reports uncovered requirements. The CI gate supports `--warn` (exit 0 with warnings) and `--fail` (exit 1 when the number of uncovered requirements reaches a threshold).

```
$ waza spec verify
✓ req-use-001  "summarize a PR diff"          → covered by tasks: [pr-summary-basic, pr-summary-large]
✗ req-dont-002 "code review of security PRs"  → no task exercises this
```
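To make the deterministic path concrete, here is a minimal Python sketch of the parse-then-coverage flow. It is illustrative only and not the planned implementation. It assumes a hypothetical SKILL.md layout (a `USE FOR:` or `DO NOT USE FOR:` header line followed by `- ` bullets) and assumes each eval task declares the requirement IDs it exercises; the names `Requirement`, `parse_requirements`, and `coverage` are made up for this example, and the `general code review` trigger is invented sample data.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: triggers live under these exact header lines, one "- " bullet each.
HEADERS = {"USE FOR:": "use", "DO NOT USE FOR:": "dont"}
BULLET = re.compile(r"^\s*[-*]\s+(.*\S)\s*$")


@dataclass(frozen=True)
class Requirement:
    req_id: str  # e.g. "req-use-001"
    text: str    # trigger text as written in SKILL.md
    line: int    # 1-based source line, usable as a file:line span


def parse_requirements(skill_md: str) -> list[Requirement]:
    """Walk the file line by line and number triggers per block kind."""
    reqs: list[Requirement] = []
    counters = {"use": 0, "dont": 0}
    kind = None  # which block we are currently inside, if any
    for lineno, raw in enumerate(skill_md.splitlines(), start=1):
        stripped = raw.strip()
        if stripped in HEADERS:
            kind = HEADERS[stripped]
            continue
        match = BULLET.match(raw)
        if kind and match:
            counters[kind] += 1
            req_id = f"req-{kind}-{counters[kind]:03d}"
            reqs.append(Requirement(req_id, match.group(1), lineno))
        elif stripped:
            kind = None  # any other non-blank line ends the block
    return reqs


def coverage(reqs, tasks):
    """Map each requirement ID to the tasks that claim to exercise it.

    ASSUMPTION: `tasks` is {task_name: [requirement IDs]}, e.g. read from a
    hypothetical per-task field in eval.yaml.
    """
    covered = {r.req_id: [] for r in reqs}
    for name, req_ids in tasks.items():
        for req_id in req_ids:
            if req_id in covered:
                covered[req_id].append(name)
    return covered


SKILL_MD = """\
USE FOR:
- summarize a PR diff

DO NOT USE FOR:
- general code review
- code review of security PRs
"""
TASKS = {"pr-summary-basic": ["req-use-001"], "pr-summary-large": ["req-use-001"]}

requirements = parse_requirements(SKILL_MD)
for req_id, task_names in coverage(requirements, TASKS).items():
    mark = "✓" if task_names else "✗"
    print(mark, req_id, task_names or "no task exercises this")
```

Numbering IDs per block in file order keeps them stable only as long as triggers are not reordered; whether IDs should instead be derived from the trigger text is a design question this sketch does not settle.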

## Why this matters for agentic-first

Agentic skills succeed or fail on *when* they engage. The router uses SKILL.md to make that decision, but evals usually only test the happy path. This proposal makes negative-trigger coverage visible and gateable.

## Acceptance criteria

- [ ] `waza spec verify` command emits requirement IDs with source spans (file:line ranges).
- [ ] Deterministic parser handles `USE FOR` / `DO NOT USE FOR` / parameter blocks; tested against the existing skills corpus.
- [ ] Semantic matching is opt-in via `--semantic` flag and uses the configured judge model.
- [ ] CI modes: `--warn` (exit 0), `--fail` (exit 1 when uncovered ≥ threshold).
- [ ] Output formats: human, JSON, GitHub Actions annotations.
- [ ] Docs in `site/` with worked example and a CI snippet.
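The CI-mode and annotation criteria can be sketched in a few lines of Python. This is a hedged illustration, not the intended implementation: the function names and the default threshold of 1 are assumptions, while the `::warning file=...,line=...::message` shape and its escaping follow GitHub Actions workflow-command syntax.

```python
def exit_code(uncovered: int, mode: str, threshold: int = 1) -> int:
    """Return the process exit code for a run.

    mode is "warn" or "fail"; threshold defaults to 1 here as an ASSUMPTION,
    since the issue does not state a default.
    """
    if mode == "fail" and uncovered >= threshold:
        return 1
    return 0  # "warn" always exits 0; warnings are only printed


def _escape_data(value: str) -> str:
    # Workflow-command message escaping: percent first, then line breaks.
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def _escape_property(value: str) -> str:
    # Property values additionally escape ':' and ','.
    return _escape_data(value).replace(":", "%3A").replace(",", "%2C")


def annotation(level: str, file: str, line: int, message: str) -> str:
    """Format one GitHub Actions annotation; level is "warning" or "error"."""
    return f"::{level} file={_escape_property(file)},line={line}::{_escape_data(message)}"


# Hypothetical uncovered requirement: (requirement ID, trigger text, file, line).
uncovered = [("req-dont-002", "code review of security PRs", "SKILL.md", 6)]
for req_id, text, file, line in uncovered:
    print(annotation("warning", file, line, f'{req_id} "{text}" is not exercised by any task'))
raise_code = exit_code(len(uncovered), mode="fail")
```

Pointing each annotation at the requirement's source line lets GitHub show the warning inline on the SKILL.md diff, which is one reason the first criterion asks for source spans.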

## Non-goals (filed separately)

- Generating cases to close coverage gaps — see #357.
- LLM-judged "did the agent follow the spirit of the spec" rubric — see #360.

## Related

- Roadmap: #66
- Test generation: #357
- Rubrics: #360

