feat: Eval scaffolding command — waza eval new #83

Description

@spboyer

Summary

Add a waza eval new <skill-name> command that generates a starter eval.yaml with tasks and graders pre-configured based on the skill's SKILL.md content.

Motivation

Creating eval suites from scratch is tedious. Most skills follow similar patterns — trigger testing, output validation, token budgets. A scaffolding command can analyze the SKILL.md and generate a reasonable starting eval that covers the common validation patterns.

Proposed Implementation

Command: waza eval new <skill-name> [--output evals/<skill-name>/eval.yaml]

  1. Read the skill's SKILL.md
  2. Analyze: description, triggers/anti-triggers, expected behaviors
  3. Generate eval.yaml with:
    • 2-3 positive trigger tasks (skill should activate)
    • 1-2 negative trigger tasks (skill should NOT activate)
    • Graders: text for keyword checks, behavior for token limits
    • Placeholder prompts based on skill description
  4. Write to evals/<skill-name>/eval.yaml (or custom path)
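As a rough illustration of step 2, keyword extraction from the skill description could start as simply as the sketch below. The function name `extractKeywords`, the stopword list, and the three-letter minimum are all assumptions made for illustration; none of them come from the waza codebase, and a real implementation would likely also use the trigger/anti-trigger sections of SKILL.md.

```go
package main

import (
	"strings"
	"unicode"
)

// stopwords is a deliberately tiny, illustrative list (an assumption,
// not something waza defines). A real list would be much longer.
var stopwords = map[string]bool{
	"and": true, "the": true, "for": true, "with": true,
	"that": true, "this": true, "use": true, "when": true,
	"from": true, "your": true, "are": true, "not": true,
}

// extractKeywords returns up to limit distinct lowercase words from
// description, in order of first appearance. Words shorter than three
// bytes and words in stopwords are skipped.
func extractKeywords(description string, limit int) []string {
	// Split on anything that is not a letter or digit.
	words := strings.FieldsFunc(strings.ToLower(description), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})

	seen := make(map[string]bool)
	var out []string
	for _, w := range words {
		if len(out) >= limit {
			break
		}
		if len(w) < 3 || stopwords[w] || seen[w] {
			continue
		}
		seen[w] = true
		out = append(out, w)
	}
	return out
}
```

Calling `extractKeywords("Deploy the app to Azure and deploy the API", 3)` yields `deploy`, `app`, `azure`. Keeping first-appearance order is a design choice that makes the generated eval.yaml deterministic for a given SKILL.md.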

Generated eval structure

name: <skill-name>-eval
description: Auto-generated eval for <skill-name>
tasks:
  - name: positive-trigger
    prompt: "<generated from skill description>"
    graders:
      - type: text
        params:
          mode: keyword
          keywords: ["<extracted from SKILL.md>"]
  - name: negative-trigger
    prompt: "<unrelated prompt>"
    graders:
      - type: text
        params:
          mode: keyword_absent
          keywords: ["<skill-specific terms>"]

Acceptance Criteria

  • waza eval new command reads SKILL.md and generates eval.yaml
  • Extracts keywords from skill description for trigger tasks
  • Generates both positive and negative test cases
  • Sensible defaults for graders based on skill content
  • Tests covering: basic generation, custom output path, missing SKILL.md error

Metadata

Labels

  • enhancement: New feature or request
  • go: Pull requests that update go code
  • priority:p1: This sprint
  • squad:linus: Assigned to Linus (Backend Developer)
