feat: Natural-language requirements → executable evals (spec-to-test) #361

Description

@spboyer

Problem

A SKILL.md is the contract: it tells the model when to invoke the skill (USE FOR), when not to (DO NOT USE FOR), and what its parameters mean. But there is no automated check that an eval.yaml actually covers those promises. Drift between what SKILL.md claims and what is tested stays invisible until users hit it.

Proposal

Add waza spec verify (verification first; generation is out of scope here):

  • Parses SKILL.md deterministically: description, USE FOR triggers, DO NOT USE FOR triggers, parameter list. Emits machine-readable requirement IDs (req-use-001, req-dont-001, etc.) with source spans.
  • Loads eval.yaml and computes coverage: which requirement IDs are exercised by which tasks.
  • Optional LLM-assisted semantic matching, gated by a flag: the deterministic parser runs first, and the LLM only fills gaps.
  • Reports uncovered requirements. CI gate supports --warn (exit 0 with warnings) and --fail (non-zero exit).
Example output:

$ waza spec verify
✓ req-use-001  "summarize a PR diff"          → covered by tasks: [pr-summary-basic, pr-summary-large]
✗ req-dont-002 "code review of security PRs"  → no task exercises this
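
The parse-then-cover flow above could look roughly like the following. This is a minimal sketch, not the proposed implementation: it assumes SKILL.md carries single-line `USE FOR:` and `DO NOT USE FOR:` entries with comma-separated phrases, and that each eval task declares the requirement IDs it exercises in a hypothetical `covers` field. Neither the layout nor the field is confirmed by this issue, and `parse_requirements` and `coverage` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    req_id: str  # e.g. "req-use-001"
    text: str    # the trigger phrase as written in SKILL.md
    line: int    # 1-based source line, used as the source span


# Assumed SKILL.md layout: one line per block, phrases separated by commas.
BLOCKS = (("USE FOR:", "use"), ("DO NOT USE FOR:", "dont"))


def parse_requirements(skill_md: str) -> list[Requirement]:
    """Deterministically extract trigger requirements with stable IDs."""
    reqs: list[Requirement] = []
    counters = {"use": 0, "dont": 0}
    for lineno, raw in enumerate(skill_md.splitlines(), start=1):
        line = raw.strip()
        for prefix, kind in BLOCKS:
            if not line.startswith(prefix):
                continue
            for phrase in line[len(prefix):].split(","):
                phrase = phrase.strip().rstrip(".")
                if phrase:
                    counters[kind] += 1
                    req_id = f"req-{kind}-{counters[kind]:03d}"
                    reqs.append(Requirement(req_id, phrase, lineno))
            break
    return reqs


def coverage(reqs: list[Requirement], tasks: list[dict]) -> dict[str, list[str]]:
    """Map each requirement ID to the names of the tasks that claim to cover it.

    `tasks` stands in for the parsed eval.yaml; the `covers` key is hypothetical.
    """
    covered: dict[str, list[str]] = {r.req_id: [] for r in reqs}
    for task in tasks:
        for req_id in task.get("covers", []):
            if req_id in covered:
                covered[req_id].append(task["name"])
    return covered
```

Requirements whose task list comes back empty are the uncovered ones that the report would flag. Numbering IDs in document order keeps them stable as long as entries are only appended, which is one possible design choice rather than a stated requirement.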

Why this matters for agentic-first

Agentic skills succeed or fail on when they engage. The router uses SKILL.md to make that decision, but evals usually test only the happy path. This proposal makes negative-trigger coverage visible and gateable.

Acceptance criteria

  • waza spec verify command emits requirement IDs with source spans (file:line ranges).
  • Deterministic parser handles USE FOR / DO NOT USE FOR / parameter blocks; tested against the existing skills corpus.
  • Semantic matching is opt-in via --semantic flag and uses the configured judge model.
  • CI modes: --warn (exit 0), --fail (exit 1 when uncovered ≥ threshold).
  • Output formats: human, JSON, GitHub Actions annotations.
  • Docs in site/ with worked example and a CI snippet.
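
To make the CI-mode and annotation criteria concrete, here is a small sketch of the gating logic and of GitHub Actions annotation output. The `::warning file=...,line=...::message` workflow-command syntax is GitHub's own; everything else (the function names, the tuple shape, and a default threshold of 1) is an assumption for illustration, since the issue does not specify how the threshold is configured.

```python
def exit_code(uncovered_count: int, mode: str, threshold: int = 1) -> int:
    """Return the process exit code for a CI run.

    mode "warn": always 0, warnings are only reported.
    mode "fail": 1 once uncovered_count reaches the threshold.
    The default threshold of 1 is an assumption, not part of the proposal.
    """
    if mode == "fail" and uncovered_count >= threshold:
        return 1
    return 0


def annotations(uncovered: list[tuple[str, str, str, int]], level: str = "warning") -> list[str]:
    """Format uncovered requirements as GitHub Actions workflow commands.

    Each item is (req_id, text, file, line); this tuple shape is hypothetical.
    """
    return [
        f"::{level} file={path},line={line}::{req_id} '{text}' has no covering task"
        for req_id, text, path, line in uncovered
    ]


if __name__ == "__main__":
    missing = [("req-dont-002", "code review of security PRs", "SKILL.md", 14)]
    for message in annotations(missing):
        print(message)
    raise SystemExit(exit_code(len(missing), mode="fail"))
```

Printing these lines to stdout inside a workflow step is enough for GitHub to attach them to the referenced file and line; the line number 14 in the usage block is made up for the example.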

Non-goals (filed separately)

Related
