feat: Eval dataset versioning, sharing, and golden-set management #359

## Problem

Eval YAMLs live in skill repos with no notion of versioned datasets. There's no way to:

- Pin a skill to "eval pack v2.3" and update intentionally
- Share a curated test dataset across multiple skills (e.g., a common "Azure CLI usage" pack)
- Maintain a "golden set" of hard cases that should always pass
- Annotate tasks with human-reviewed expected outputs that get updated over time

Datasets are copy-pasted between repos and drift silently.

## Proposal

Build on the existing registry design (#17, #15) to add dataset semantics:

- **`waza dataset` commands**: `add`, `pull`, `push`, `diff`, `freeze`
- **Versioned references** in `eval.yaml`:
  ```yaml
  datasets:
    - ref: microsoft/skills-azure-cli@v2.3.0
    - ref: ./local-golden-set.yaml
  ```
- **Lockfile** (`waza.lock`) pins exact dataset content hashes for reproducibility
- **Golden-set tagging**: tasks can be tagged `golden: true` and any regression on a golden task fails CI regardless of overall score
- **Dataset diffs**: `waza dataset diff v2.2 v2.3` shows added/changed/removed tasks for review
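
To make the ref, lockfile, and diff bullets concrete, here is a minimal Python sketch. Nothing in it is waza's actual implementation: the names `DatasetRef`, `parse_ref`, `content_hash`, and `diff_tasks` are hypothetical, as are the assumptions that local refs start with a path prefix, that the lockfile hash is a SHA-256 over canonical JSON, and that each task carries a unique `id` field.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetRef:
    source: str             # "owner/name" for registry refs, or a local path
    version: Optional[str]  # None for local files
    is_local: bool


def parse_ref(ref: str) -> DatasetRef:
    """Split a `ref:` value into source and version (assumed rules)."""
    if ref.startswith(("./", "../", "/")):
        return DatasetRef(source=ref, version=None, is_local=True)
    source, sep, version = ref.rpartition("@")
    if not sep or not source or not version:
        raise ValueError(f"expected owner/name@version, got {ref!r}")
    return DatasetRef(source=source, version=version, is_local=False)


def content_hash(tasks: list) -> str:
    """Hash tasks in canonical JSON form so key order does not affect the pin."""
    canonical = json.dumps(tasks, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_tasks(old: list, new: list) -> dict:
    """Classify tasks as added, removed, or changed, keyed by the assumed `id` field."""
    old_by_id = {t["id"]: t for t in old}
    new_by_id = {t["id"]: t for t in new}
    return {
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        "removed": sorted(old_by_id.keys() - new_by_id.keys()),
        "changed": sorted(
            i for i in old_by_id.keys() & new_by_id.keys()
            if old_by_id[i] != new_by_id[i]
        ),
    }
```

Under these assumptions, `waza.lock` would store the `content_hash` of each resolved dataset next to its ref, and validation would recompute the hash and compare. Hashing content rather than trusting the version tag is what would catch a tag that was moved or republished.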

## Why this matters

Without dataset hygiene, the eval ecosystem fragments: every repo carries its own copy of the same tasks, and the copies drift. Shared, versioned datasets turn waza into a marketplace, not an island. Golden sets are the safety net that catches the regressions that matter even when the overall pass rate looks fine.
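
The golden-set point can be illustrated with a small gate function. This is a sketch under assumptions, not the regression gate's real design: the function name `regression_gate`, the result shape (task id mapped to pass/fail), and the definition of "regression" as "passed in the baseline run, fails or is missing now" are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def regression_gate(
    baseline: Dict[str, bool],
    current: Dict[str, bool],
    golden: Set[str],
    min_pass_rate: float,
) -> Tuple[bool, List[str]]:
    """Return (ok, reasons). A golden regression fails the gate on its own."""
    reasons: List[str] = []

    # Golden tasks that passed before and do not pass now (missing counts as failing).
    regressed = sorted(
        t for t in golden
        if baseline.get(t, False) and not current.get(t, False)
    )
    if regressed:
        reasons.append("golden regression: " + ", ".join(regressed))

    # The overall score check is independent of the golden check.
    pass_rate = sum(current.values()) / len(current) if current else 0.0
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2f} below {min_pass_rate:.2f}")

    return (not reasons, reasons)
```

With ten tasks, nine passing, and a threshold of 0.8, the overall check passes; the gate still fails if the one failing task is tagged golden and passed in the baseline.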

## Acceptance criteria

- [ ] `waza dataset` command family implemented
- [ ] `ref:` syntax accepted in eval.yaml with version pinning
- [ ] `waza.lock` lockfile generated and validated
- [ ] `golden: true` tag honored by regression gate (#regression-gate issue)
- [ ] Documented in `site/` with publish-and-consume walkthrough
- [ ] Tests cover ref resolution, lockfile, and golden-set gating

## Related

- Roadmap: #66
- Builds on: #17, #15, #18 (registry & graders)
