docs: Eval & Grader Registry design doc (#13) #337
Merged
Adds docs/design/13-eval-registry.md covering the design for a shared eval and grader registry. Design-only; no implementation.

Note: issue #13 asked for docs/research/, but that path is gitignored. Placed in docs/design/ to match existing convention (135-improve-concurrency.md, 194-baseline-skill-impact.md) and to answer the open question from the issue validation comment.

Decisions cover sub-issues:
- #15 Go-module-style refs: ref syntax, SemVer + lockfile, content-addressed cache, gh/env auth, flat transitive deps.
- #17 Composable eval construction: registry search/add/get/sync, deep-merge override rules, waza init --grader scaffolding.
- #18 Plugin extensibility: WGP/1 protocol over WASM (sandboxed) and program (bring-your-own-binary); Go plugins and embedded scripting rejected with rationale.

Includes security model, backward-compat impact, a 5-phase rollout (spec, local resolver, git backend, WASM runtime, hardening), open questions, rejected alternatives, and example end-state YAML + lockfile. Backend selection (#16) deferred to start of Phase 2.

Refs #13 #15 #17 #18
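The deep-merge override rules for #17 are not spelled out in this description. As a rough illustration only, the sketch below assumes that nested maps merge key by key while scalars and lists in the override replace the base value wholesale. The function name `deepMerge` and the sample keys are hypothetical and are not taken from the design doc.

```go
package main

import "fmt"

// deepMerge returns a new top-level map in which values from override take
// precedence over base. Assumption (not from the design doc): nested maps
// merge recursively; scalars and lists are replaced wholesale. Nested maps
// that the override does not touch are shared with base, not copied.
func deepMerge(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		baseMap, baseIsMap := out[k].(map[string]any)
		overMap, overIsMap := v.(map[string]any)
		if baseIsMap && overIsMap {
			// Both sides are maps: merge them key by key.
			out[k] = deepMerge(baseMap, overMap)
			continue
		}
		// Otherwise the override value wins outright.
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical grader config pulled from a registry.
	base := map[string]any{
		"name":   "exact-match",
		"config": map[string]any{"threshold": 0.8, "strict": true},
	}
	// Local override that only changes one nested value.
	override := map[string]any{
		"config": map[string]any{"threshold": 0.9},
	}
	fmt.Println(deepMerge(base, override))
}
```

Returning a fresh map rather than mutating `base` keeps a cached registry entry unchanged when a local eval overrides it, which seems to fit the content-addressed cache mentioned under #15.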
Contributor
Pull request overview
Adds a design document proposing a shared eval & grader registry for waza, covering reference syntax/lockfiles, registry discovery & composition UX, and an extensibility model (WASM + external program protocol) intended to close the “shared registry” competitive gap vs. OpenAI Evals.
Changes:
- Introduces a full design doc for registry refs (host/path@version#subpath), caching, lockfiles, auth, and transitive deps (#15).
- Specifies CLI UX for discovery/composition (waza registry search/add/get/sync/list) and deep-merge override rules (#17).
- Proposes a plugin model and security posture (WGP/1 + WASM sandbox) with a phased rollout plan (#18).
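To make the host/path@version#subpath shape concrete, here is a minimal parsing sketch. It assumes the version and subpath parts are optional, which this summary does not confirm. The `Ref` type, the `parseRef` function, and the example version and subpath are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Ref is a hypothetical parsed form of host/path@version#subpath.
type Ref struct {
	Host, Path, Version, Subpath string
}

// parseRef splits a ref string into its parts. Assumption: @version and
// #subpath are both optional; the design doc may require either.
func parseRef(s string) (Ref, error) {
	var r Ref
	// Split off #subpath first so an '@' inside it is not read as a version.
	s, r.Subpath, _ = strings.Cut(s, "#")
	s, r.Version, _ = strings.Cut(s, "@")
	host, path, ok := strings.Cut(s, "/")
	if !ok || host == "" || path == "" {
		return Ref{}, errors.New("ref must look like host/path[@version][#subpath]")
	}
	r.Host, r.Path = host, path
	return r, nil
}

func main() {
	// The version and subpath here are made up for illustration.
	r, err := parseRef("github.com/waza-evals/openai-compat@v1.2.0#graders/exact")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", r)
}
```

A real parser would also validate the version against SemVer and the subpath against the traversal rules raised in review; this sketch only separates the fields.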
Show a summary per file
| File | Description |
|---|---|
| docs/design/13-eval-registry.md | New design doc describing the eval/grader registry architecture, UX, security model, and rollout phases. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 4
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
spboyer (Member, Author) left a comment on Jun 19, 2026:
Adds a registry design doc; the structure is solid, but several spec details would mislead or weaken implementation.
Issues to address:
- docs/design/13-eval-registry.md:119 - subpath grammar lacks explicit traversal and symlink-escape rejection
- docs/design/13-eval-registry.md:441 - compromised-index mitigation overstates first-use integrity guarantees
- docs/design/13-eval-registry.md:158 - cache path is Linux/XDG-only instead of OS cache-dir based
- docs/design/13-eval-registry.md:176 - GitLab credential command does not return a token
- docs/design/13-eval-registry.md:348 - wasmtime-go static dependency claim misses CGO/cross-compile tradeoffs
- docs/design/13-eval-registry.md:477 - GitHub refs rely on remote git archive, which GitHub does not support
- docs/design/13-eval-registry.md:309 - summary says one runtime but design uses WASM plus program runtimes
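Two of the points above, traversal and symlink-escape rejection for subpaths and an OS-based cache location, can be sketched in a few lines of Go. This is an illustration under assumptions rather than the doc's specified behavior: the helper names and the waza/registry directory layout are hypothetical, and subpaths are assumed to be slash-separated.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path"
	"path/filepath"
	"strings"
)

// validateSubpath is the lexical half of the check: it rejects empty,
// absolute, backslash-containing, and parent-escaping subpaths.
func validateSubpath(sub string) (string, error) {
	if sub == "" || strings.HasPrefix(sub, "/") || strings.Contains(sub, "\\") {
		return "", errors.New("subpath must be a relative, slash-separated path")
	}
	clean := path.Clean(sub)
	if clean == ".." || strings.HasPrefix(clean, "../") {
		return "", errors.New("subpath escapes the module root")
	}
	return clean, nil
}

// resolveInside is the filesystem half: it resolves symlinks and confirms
// the result still lives under root, which catches symlink escapes.
func resolveInside(root, sub string) (string, error) {
	clean, err := validateSubpath(sub)
	if err != nil {
		return "", err
	}
	realRoot, err := filepath.EvalSymlinks(root)
	if err != nil {
		return "", err
	}
	resolved, err := filepath.EvalSymlinks(filepath.Join(realRoot, filepath.FromSlash(clean)))
	if err != nil {
		return "", err
	}
	rel, err := filepath.Rel(realRoot, resolved)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errors.New("subpath resolves outside the module root")
	}
	return resolved, nil
}

// cacheRoot derives the cache location from the OS user cache directory
// instead of hard-coding an XDG path. The waza/registry suffix is assumed.
func cacheRoot() (string, error) {
	base, err := os.UserCacheDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(base, "waza", "registry"), nil
}

func main() {
	for _, sub := range []string{"graders/exact", "../../etc/passwd"} {
		clean, err := validateSubpath(sub)
		fmt.Printf("%q -> %q, err=%v\n", sub, clean, err)
	}
	if dir, err := cacheRoot(); err == nil {
		fmt.Println("cache root:", dir)
	}
}
```

`os.UserCacheDir` honors `XDG_CACHE_HOME` on Linux and falls back to the platform conventions on macOS and Windows, so the Linux behavior stays the same while other platforms get a native location.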
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This was referenced Jun 23, 2026
Closes #13. Refs #15, #17, #18.
Design-only doc for a shared eval & grader registry — waza's #1 competitive gap vs. OpenAI Evals. No code changes.
What's inside
docs/design/13-eval-registry.md covers:
- host/path@version#subpath syntax, SemVer + eval.lock.yaml for reproducibility, content-addressed cache, gh auth token / env-var auth, flat transitive-dep resolution.
- waza registry search/add/get/sync/list CLI, federated index file, deep-merge override rules, waza init --grader scaffolding.
- ref: field on GraderConfig, schema update, results.json source field), and a 5-phase rollout where each phase is independently shippable:
  - program runtime (validates the contract without picking a backend)
- eval.yaml and eval.lock.yaml.

Path note
Issue body specified docs/research/waza-eval-registry-design.md, but docs/research/ is gitignored (.gitignore line 116, "Internal research docs"). The validation comment on #13 explicitly asked which location is canonical (docs/design/ vs docs/plans/). I placed the doc at docs/design/13-eval-registry.md to match existing convention (135-improve-concurrency.md, 194-baseline-skill-impact.md). Happy to move if the team prefers docs/plans/.

Out of scope (intentional)
- Resolver interface; concrete backend chosen at start of Phase 2 based on artifact-size benchmarks.
- github.com/waza-evals/openai-compat/* namespace) without committing to it here.

Review asks
- --frozen for CI approach matches how teams want to consume registry graders.
- gh auth token for GitHub by default, env-var overrides per host, never store secrets in ~/.waza/credentials.yaml.
- docs/design/ is my recommendation).
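The auth approach in the review asks (an env-var override per host, `gh auth token` as the GitHub default, nothing written to disk) could look roughly like the sketch below. The `WAZA_TOKEN_<HOST>` naming scheme and the function names are assumptions made for illustration; only the `gh auth token --hostname` invocation is an existing GitHub CLI command.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// envVarForHost maps a host to a hypothetical override variable, for example
// gitlab.example.com -> WAZA_TOKEN_GITLAB_EXAMPLE_COM. This naming scheme is
// an assumption, not something the design doc specifies.
func envVarForHost(host string) string {
	r := strings.NewReplacer(".", "_", "-", "_", ":", "_")
	return "WAZA_TOKEN_" + strings.ToUpper(r.Replace(host))
}

// ghToken shells out to the GitHub CLI, which prints the stored token.
func ghToken(host string) (string, error) {
	out, err := exec.Command("gh", "auth", "token", "--hostname", host).Output()
	if err != nil {
		return "", fmt.Errorf("gh auth token: %w", err)
	}
	return strings.TrimSpace(string(out)), nil
}

// resolveToken checks the per-host env override first and falls back to the
// gh CLI for github.com only. The token is returned to the caller and never
// persisted. getenv and gh are injected so the logic can be exercised
// without a real environment or a gh install.
func resolveToken(host string, getenv func(string) string, gh func(string) (string, error)) (string, error) {
	if tok := getenv(envVarForHost(host)); tok != "" {
		return tok, nil
	}
	if host == "github.com" {
		return gh(host)
	}
	return "", errors.New("no credentials configured for " + host)
}

func main() {
	tok, err := resolveToken("github.com", os.Getenv, ghToken)
	if err != nil {
		fmt.Println("auth failed:", err)
		return
	}
	// Print only the length so the token itself never reaches logs.
	fmt.Println("token length:", len(tok))
}
```

Checking the environment before the CLI lets CI jobs supply a token without having gh installed, while local developers keep using their existing gh login.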