
feat: Eval dataset versioning, sharing, and golden-set management #359

Description

@spboyer

Problem

Eval YAMLs live in skill repos with no notion of versioned datasets. There's no way to:

  • Pin a skill to "eval pack v2.3" and update intentionally
  • Share a curated test dataset across multiple skills (e.g., a common "Azure CLI usage" pack)
  • Maintain a "golden set" of hard cases that should always pass
  • Annotate tasks with human-reviewed expected outputs that get updated over time

Datasets are copy-pasted between repos and drift silently.

Proposal

Build on the existing registry design (#17, #15) to add dataset semantics:

  • waza dataset commands: add, pull, push, diff, freeze
  • Versioned references in eval.yaml:
    datasets:
      - ref: microsoft/skills-azure-cli@v2.3.0
      - ref: ./local-golden-set.yaml
  • Lockfile (waza.lock) pins exact dataset content hashes for reproducibility
  • Golden-set tagging: tasks can be tagged golden: true, and any regression on a golden task fails CI regardless of the overall score
  • Dataset diffs: waza dataset diff v2.2 v2.3 shows added/changed/removed tasks for review
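To make the diff semantics concrete, here is a minimal sketch of how added/changed/removed tasks could be computed between two dataset versions. It assumes each task is a mapping with a unique `id` field. That field, the `diff_datasets` helper, and the sample tasks are all hypothetical and not part of waza's current schema.

```python
from typing import Any

Task = dict[str, Any]


def diff_datasets(old: list[Task], new: list[Task]) -> dict[str, list[str]]:
    """Classify task ids as added, changed, or removed between two versions.

    Assumes every task carries a unique "id" key (hypothetical schema).
    """
    old_by_id = {task["id"]: task for task in old}
    new_by_id = {task["id"]: task for task in new}

    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    # A task counts as changed when the id survives but any field differs.
    changed = sorted(
        task_id
        for task_id in old_by_id.keys() & new_by_id.keys()
        if old_by_id[task_id] != new_by_id[task_id]
    )
    return {"added": added, "changed": changed, "removed": removed}


# Illustrative data only; these tasks are invented for the example.
v2_2 = [
    {"id": "login", "prompt": "az login", "expected": "ok"},
    {"id": "list-vms", "prompt": "az vm list", "expected": "[]"},
]
v2_3 = [
    {"id": "login", "prompt": "az login", "expected": "logged in"},
    {"id": "create-rg", "prompt": "az group create", "expected": "created"},
]

result = diff_datasets(v2_2, v2_3)
print(result)
```

Keying on a stable task id rather than list position keeps the diff readable when tasks are reordered, which matters if the output is meant for human review.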

Why this matters

Without dataset hygiene, the entire eval ecosystem fragments. Shared, versioned datasets turn waza into a marketplace, not an island. Golden sets are the safety net that catches the regressions that matter even when overall pass rates look fine.
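The golden-set argument can be illustrated with a small sketch of a gate that fails on any golden regression even when the aggregate pass rate clears its threshold. The `TaskResult` type, the `gate` function, and the 0.8 threshold are assumptions made for illustration; the actual regression gate is tracked in a separate issue and may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool
    golden: bool = False  # mirrors the proposed "golden: true" tag


def gate(
    baseline: list[TaskResult],
    current: list[TaskResult],
    min_pass_rate: float = 0.8,  # hypothetical threshold
) -> tuple[bool, list[str]]:
    """Return (ok, golden_regressions).

    A golden regression is a golden task that passed in the baseline run
    and fails in the current run. Any such regression fails the gate,
    independent of the overall pass rate.
    """
    previously_passing = {r.task_id for r in baseline if r.passed}
    regressions = sorted(
        r.task_id
        for r in current
        if r.golden and not r.passed and r.task_id in previously_passing
    )
    pass_rate = sum(r.passed for r in current) / len(current) if current else 0.0
    ok = not regressions and pass_rate >= min_pass_rate
    return ok, regressions


# Ten tasks, one of them golden. Everything passed in the baseline run.
baseline = [TaskResult(f"t{i}", True, golden=(i == 0)) for i in range(10)]
# In the current run only the golden task fails: 90% pass rate overall.
current = [TaskResult(f"t{i}", i != 0, golden=(i == 0)) for i in range(10)]

ok, regressions = gate(baseline, current)
print(ok, regressions)
```

Here the overall pass rate is 90%, above the assumed threshold, yet the gate still reports failure because the one golden task regressed.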

Acceptance criteria

  • waza dataset command family implemented
  • ref: syntax accepted in eval.yaml with version pinning
  • waza.lock lockfile generated and validated
  • golden: true tag honored by regression gate (#regression-gate issue)
  • Documented in site/ with publish-and-consume walkthrough
  • Tests cover ref resolution, lockfile, and golden-set gating
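As a starting point for the ref-resolution and lockfile tests, the sketch below parses the two `ref:` forms shown in the proposal and computes a content hash that a `waza.lock` entry could pin. The `DatasetRef` type, the function names, the `sha256:` prefix, and the canonical-JSON hashing scheme are all assumptions; the issue does not specify a lockfile format.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class DatasetRef:
    source: str             # registry name or local path
    version: Optional[str]  # None for local files
    is_local: bool


def parse_ref(ref: str) -> DatasetRef:
    """Parse 'owner/name@version' or a local path such as './file.yaml'."""
    if ref.startswith(("./", "../", "/")):
        return DatasetRef(source=ref, version=None, is_local=True)
    name, sep, version = ref.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"registry ref must look like name@version: {ref!r}")
    return DatasetRef(source=name, version=version, is_local=False)


def content_hash(tasks: list[dict[str, Any]]) -> str:
    """Hash the dataset content, independent of key order within a task."""
    canonical = json.dumps(tasks, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(tasks: list[dict[str, Any]], locked_hash: str) -> bool:
    """Check resolved content against the hash recorded in the lockfile."""
    return content_hash(tasks) == locked_hash


remote = parse_ref("microsoft/skills-azure-cli@v2.3.0")
local = parse_ref("./local-golden-set.yaml")

tasks = [{"id": "login", "expected": "ok", "golden": True}]
locked = content_hash(tasks)
print(remote, local, verify(tasks, locked))
```

Hashing a canonical serialization rather than the raw file bytes means cosmetic edits such as reordered keys do not invalidate the lock, while any change to a task's content does. Whether that trade-off is desirable is a design decision for the lockfile spec.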
