Problem
Eval YAMLs live in skill repos with no notion of versioned datasets. There's no way to:
- Pin a skill to "eval pack v2.3" and update intentionally
- Share a curated test dataset across multiple skills (e.g., a common "Azure CLI usage" pack)
- Maintain a "golden set" of hard cases that should always pass
- Annotate tasks with human-reviewed expected outputs that get updated over time
Datasets are copy-pasted between repos and drift silently.
Proposal
Build on the existing registry design (#17, #15) to add dataset semantics:
- `waza dataset` commands: `add`, `pull`, `push`, `diff`, `freeze`
- Versioned references in `eval.yaml`:

      datasets:
        - ref: microsoft/skills-azure-cli@v2.3.0
        - ref: ./local-golden-set.yaml

- Lockfile (`waza.lock`) pins exact dataset content hashes for reproducibility
- Golden-set tagging: tasks can be tagged `golden: true`, and any regression on a golden task fails CI regardless of overall score
- Dataset diffs: `waza dataset diff v2.2 v2.3` shows added/changed/removed tasks for review
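To make the content-hash and diff ideas concrete, here is a minimal Python sketch, not waza's actual implementation. It assumes each task is a plain mapping with a unique `id` field. That field name, the `prompt` field, the `sha256:` prefix, and the helper names `content_hash` and `diff_datasets` are all hypothetical. Hashing a canonical JSON encoding means key order and whitespace do not affect the result, so the same function could plausibly produce the hashes pinned in a lockfile.

```python
import hashlib
import json


def content_hash(obj) -> str:
    """Hash a canonical JSON encoding so key order and whitespace don't matter."""
    canonical = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_datasets(old_tasks, new_tasks):
    """Classify tasks as added, removed, or changed, keyed by a hypothetical `id`."""
    old = {task["id"]: content_hash(task) for task in old_tasks}
    new = {task["id"]: content_hash(task) for task in new_tasks}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Illustrative task lists standing in for two versions of a dataset.
v2_2 = [
    {"id": "login", "prompt": "az login", "golden": True},
    {"id": "list-vms", "prompt": "az vm list"},
]
v2_3 = [
    {"id": "login", "prompt": "az login", "golden": True},
    {"id": "list-vms", "prompt": "az vm list --output table"},
    {"id": "create-rg", "prompt": "az group create"},
]

print(diff_datasets(v2_2, v2_3))
```

Keying the diff on a stable task identifier rather than list position keeps reordered tasks from showing up as changes. Whether waza tasks carry such an identifier is an assumption here.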
Why this matters
Without dataset hygiene, the entire eval ecosystem fragments. Shared, versioned datasets turn waza into a marketplace, not an island. Golden sets are the safety net that catches the regressions that matter even when overall pass rates look fine.
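The golden-set rule from the proposal can be sketched as a small gate function. This is an illustrative Python sketch under stated assumptions: the `TaskResult` type, the `gate` function, and the `min_pass_rate` threshold are hypothetical names, and a "regression" is taken to mean a golden task that passed in the baseline run and fails in the current one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool
    golden: bool = False  # mirrors the proposed `golden: true` tag


def gate(baseline, current, min_pass_rate=0.8):
    """Return (ok, reasons). Golden regressions fail the gate on their own."""
    previous = {r.task_id: r for r in baseline}
    reasons = []

    # A golden task that used to pass and now fails is always a failure.
    for result in current:
        before = previous.get(result.task_id)
        if result.golden and not result.passed and before is not None and before.passed:
            reasons.append(f"golden regression: {result.task_id}")

    # The overall score is checked separately and cannot mask the rule above.
    pass_rate = sum(r.passed for r in current) / len(current) if current else 0.0
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2f} below {min_pass_rate:.2f}")

    return (not reasons, reasons)


# Nine of ten tasks pass (90%), but the one failure is a golden task.
baseline = [TaskResult(f"t{i}", True, golden=(i == 0)) for i in range(10)]
current = [TaskResult(f"t{i}", i != 0, golden=(i == 0)) for i in range(10)]

ok, reasons = gate(baseline, current)
print(ok, reasons)
```

In this example the pass rate clears the assumed threshold, yet the gate still fails because the single failing task is golden.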
Acceptance criteria
- `waza dataset` command family implemented
- `ref:` syntax accepted in `eval.yaml` with version pinning
- `waza.lock` lockfile generated and validated
- `golden: true` tag honored by regression gate (#regression-gate issue)
- `site/` with publish-and-consume walkthrough
Related