feat: Add eval scaffolding command (waza eval new)#94
Conversation
|
Should this go in |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #94 +/- ##
=======================================
Coverage ? 72.82%
=======================================
Files ? 130
Lines ? 14816
Branches ? 0
=======================================
Hits ? 10790
Misses ? 3221
Partials ? 805
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a new waza eval new <skill-name> CLI subcommand to scaffold an evaluation suite from a skill’s SKILL.md frontmatter, plus docs/tests to make it discoverable and verifiable.
Changes:
- Introduce
waza eval newcommand that parsesSKILL.mdtriggers and generateseval.yaml+ starter trigger/anti-trigger task YAMLs. - Add unit tests validating scaffold generation, custom output path behavior, and missing
SKILL.mderror handling. - Update CLI docs + README + command metadata expectations to include the new
evaltop-level command.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/cli.mdx | Documents the new waza eval new command in the CLI reference. |
| cmd/waza/root.go | Registers the new eval command in the root CLI. |
| cmd/waza/cmd_metadata_test.go | Updates metadata test to expect the new top-level eval command. |
| cmd/waza/cmd_eval_test.go | Adds tests covering scaffold generation, --output, and error cases. |
| cmd/waza/cmd_eval.go | Implements waza eval new scaffold generation logic. |
| README.md | Adds usage docs for waza eval new. |
| .squad/log/2026-03-05T00-36-issue-assignment-pipeline.md | Adds squad session log (non-functional change). |
| .squad/log/2026-03-05T00-26-rusty-token-diff-design.md | Adds squad session log (non-functional change). |
| .squad/decisions.md | Records squad decisions (non-functional change). |
Comments suppressed due to low confidence (2)
cmd/waza/cmd_eval.go:74
- When
--outputis not provided, the default output path is hard-coded toevals/<skill-name>/eval.yaml. If the project uses a custompaths.evalsin.waza.yaml, this will write scaffolding into the wrong directory. It would be more consistent with other commands to derive the default evals directory from project config/workspace detection.
if outputPath == "" {
outputPath = filepath.Join("evals", skillName, "eval.yaml")
}
tasksDir := filepath.Join(filepath.Dir(outputPath), "tasks")
site/src/content/docs/reference/cli.mdx:202
- Flag docs list
--outputwithout indicating it takes a path argument. For consistency with the README and the actual flag help, consider documenting it as--output <path>.
| Flag | Description |
|------|-------------|
| `--output` | Custom path for generated `eval.yaml` |
spboyer
left a comment
There was a problem hiding this comment.
LGTM — Rusty. waza eval new is a clean scaffolding command. Good use of SKILL.md frontmatter parsing for positive/negative trigger generation. extractKeywords with stop words is smart. Tests validate generated YAML through validation.ValidateEvalBytes/ValidateTaskBytes — nice. README + CLI reference updated. Ship it. (Self-authored PR — cannot self-approve via API.)
79893e4 to
02fb261
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f016c5f to
c00d9c3
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c28fc26 to
7b73e2d
Compare
wbreza
left a comment
There was a problem hiding this comment.
Code Review: PR #94 - feat: Add eval scaffolding command (waza eval new)
What Looks Good
- Smart SKILL.md resolution: workspace-aware detection with .waza.yaml config, fallback to conventional paths
- Safe overwrite protection and partial write cleanup
- Generated YAML validated in tests via ValidateEvalBytes/ValidateTaskBytes
- Custom output path works correctly
- Docs fully updated
Discussion: Command Overlap
Before approving, I want to flag a naming/scope discussion:
- waza new already creates a skill + eval scaffold together
- waza eval new (this PR) creates eval scaffold from an existing skill
- waza init detects skills missing evals and can create them following config
Questions worth resolving:
- Should this be waza new eval (verb-noun pattern) instead of waza eval new? The verb-noun pattern (waza new [skill|eval]) is more discoverable and consistent.
- Does waza init's gap-filling already cover the use case of adding evals to existing skills? If so, is a separate command needed or would init --skill=name suffice?
- If we keep eval as a subcommand group, what other subcommands go under it? If it's just new, the group may be premature.
Findings
Medium:
- Hardcoded 2 positive + 1 negative task structure may not suit skills with rich USE FOR sections (5+ phrases) or minimal descriptions.
Low:
2. Site docs use [skill-name] (optional) instead of (required).
3. extractKeywords stop words overlap with trigger_grader.go stop words.
Summary
| Priority | Count |
|---|---|
| Critical | 0 |
| High | 0 |
| Medium | 1 |
| Low | 2 |
Overall Assessment: Comment - implementation is solid, but the command naming and overlap with waza new and waza init should be discussed before merging.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
cmd/waza/cmd_eval.go:74
- There’s test coverage for custom
paths.skills, but no test verifying that the default output location respectspaths.evalsfrom.waza.yaml(or that output is anchored to the detected workspace root when invoked from inside a skill directory). Adding a regression test around the defaultoutputPathbehavior would help prevent generating evals in the wrong directory structure.
if outputPath == "" {
outputPath = filepath.Join("evals", skillName, "eval.yaml")
}
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@chlowell - good point. |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7b73e2d to
ea4edd6
Compare
|
I like |
We should consider merging |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…back Addresses @chlowell's feedback to use 'waza new eval' instead of a separate 'eval' top-level verb. The eval subcommand is now registered under the existing 'new' command: waza new [skill-name] → create full skill + eval (unchanged) waza new eval <skill-name> → scaffold eval-only for existing skill - Remove top-level 'eval' command wrapper - Register newEvalNewCommand() under newNewCommand() - Update all tests, README, and CLI docs Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
5d722be to
37f86f1
Compare
I made this change for now |
Closes #83
Working as Linus (Backend Developer)
⚠️ This task was flagged as "needs review" — please have a squad member review before merging.