Summary
Add a waza eval new <skill-name> command that generates a starter eval.yaml with tasks and graders pre-configured based on the skill's SKILL.md content.
Motivation
Creating eval suites from scratch is tedious. Most skills follow similar patterns — trigger testing, output validation, token budgets. A scaffolding command can analyze the SKILL.md and generate a reasonable starting eval that covers the common validation patterns.
Proposed Implementation
Command: waza eval new <skill-name> [--output evals/<name>/eval.yaml]
- Read the skill's SKILL.md
- Analyze: description, triggers/anti-triggers, expected behaviors
- Generate eval.yaml with:
  - 2-3 positive trigger tasks (skill should activate)
  - 1-2 negative trigger tasks (skill should NOT activate)
  - Graders: text for keyword checks, behavior for token limits
  - Placeholder prompts based on skill description
- Write to evals/<skill-name>/eval.yaml (or custom path)
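The read and analyze steps above could be sketched roughly as follows. This is a minimal illustration, not waza's implementation: it assumes SKILL.md opens with a `---` delimited block of simple `key: value` lines containing `name` and `description`, and the helper names `parse_front_matter` and `extract_keywords`, along with the stopword list, are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real implementation would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "for", "from", "to", "of", "in", "on",
             "with", "use", "when", "this", "that", "or"}


def parse_front_matter(skill_md: str) -> dict[str, str]:
    """Read simple `key: value` pairs from a leading '---' delimited block.

    Assumption: SKILL.md starts with such a block; nested YAML is not handled.
    """
    lines = skill_md.splitlines()
    fields: dict[str, str] = {}
    if not lines or lines[0].strip() != "---":
        return fields
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields


def extract_keywords(description: str, limit: int = 3) -> list[str]:
    """Return the most frequent non-stopword terms, ties broken by first appearance."""
    words = [w for w in re.findall(r"[a-z][a-z0-9-]+", description.lower())
             if w not in STOPWORDS]
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: (-counts[w], words.index(w)))
    return ranked[:limit]
```

The extracted terms would feed the keyword graders, and the description would seed the placeholder prompts. Trigger and anti-trigger detection is left out here because the issue does not specify how SKILL.md expresses them.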
Generated eval structure
name: <skill-name>-eval
description: Auto-generated eval for <skill-name>
tasks:
  - name: positive-trigger
    prompt: "<generated from skill description>"
    graders:
      - type: text
        params:
          mode: keyword
          keywords: ["<extracted from SKILL.md>"]
  - name: negative-trigger
    prompt: "<unrelated prompt>"
    graders:
      - type: text
        params:
          mode: keyword_absent
          keywords: ["<skill-specific terms>"]
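One way the scaffold might emit this structure is plain string templating, sketched below. The function `render_eval` and its parameters are hypothetical; only the field names and grader modes shown in the structure above are taken from this proposal. It uses only the standard library and relies on JSON-style quoting, which YAML also accepts.

```python
import json


def render_eval(skill_name: str, positive_prompt: str,
                negative_prompt: str, keywords: list[str]) -> str:
    """Render a starter eval.yaml as text (hypothetical helper)."""
    # JSON-style quoting is also valid YAML, so json.dumps handles escaping.
    kw = json.dumps(keywords)

    def task(name: str, prompt: str, mode: str) -> list[str]:
        # One task entry with a single text grader, indented as in the proposal.
        return [
            f"  - name: {name}",
            f"    prompt: {json.dumps(prompt)}",
            "    graders:",
            "      - type: text",
            "        params:",
            f"          mode: {mode}",
            f"          keywords: {kw}",
        ]

    lines = [
        f"name: {skill_name}-eval",
        f"description: Auto-generated eval for {skill_name}",
        "tasks:",
        *task("positive-trigger", positive_prompt, "keyword"),
        *task("negative-trigger", negative_prompt, "keyword_absent"),
    ]
    return "\n".join(lines) + "\n"
```

For example, `render_eval("pdf-tools", "Extract the tables from report.pdf", "What is the capital of France?", ["pdf", "extract"])` returns text that can be written directly to evals/pdf-tools/eval.yaml. A behavior grader for token limits is omitted because its parameters are not defined in this issue.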
Acceptance Criteria
- waza eval new command reads SKILL.md and generates eval.yaml