Problem
Waza already has a prompt grader (LLM-as-judge) in internal/graders/prompt_grader.go. What's missing is a rubric library so authors don't hand-write the same rubric prompts across every skill:
- Every author re-invents "helpfulness," "groundedness," "instruction-following," "refusal-correctness" from scratch.
- Rubrics in the wild are inconsistent in scale (1–5 vs. pass/fail vs. 0–1), wording, and bias.
- No way to reference a rubric by name (rubric: groundedness) the way you reference a grader type.
Proposal
Ship a rubric preset library that the existing prompt grader can resolve by name:
graders:
  - type: prompt
    rubric: groundedness # resolves to a versioned rubric file
  - type: prompt
    rubric: ./rubrics/my-custom.md # local file
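One way the name-or-path resolution could work is sketched below in Go. This is only an illustration, not the actual code in internal/graders/prompt_grader.go: the names RubricSource, builtinRubrics, and ResolveRubric are hypothetical, and the rule "anything with a path separator or a file extension is a local path" is an assumption about how the two forms would be told apart.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// RubricSource says where a rubric reference points.
// Hypothetical type for illustration; not part of Waza today.
type RubricSource int

const (
	Unknown   RubricSource = iota // reference could not be resolved
	BuiltIn                       // a preset shipped with the tool
	LocalFile                     // a file on disk
)

// builtinRubrics mirrors the starter set named in the proposal.
var builtinRubrics = map[string]bool{
	"groundedness":             true,
	"helpfulness":              true,
	"instruction-following":    true,
	"refusal-correctness":      true,
	"tool-use-appropriateness": true,
}

// ResolveRubric classifies a `rubric:` value as a built-in name or a local path.
// Assumption: a value containing a path separator or a file extension is a path;
// relative paths are resolved against the directory of the eval config.
func ResolveRubric(ref, configDir string) (RubricSource, string, error) {
	ref = strings.TrimSpace(ref)
	if ref == "" {
		return Unknown, "", fmt.Errorf("rubric reference is empty")
	}
	if strings.ContainsAny(ref, `/\`) || filepath.Ext(ref) != "" {
		if filepath.IsAbs(ref) {
			return LocalFile, filepath.Clean(ref), nil
		}
		return LocalFile, filepath.Join(configDir, ref), nil
	}
	if builtinRubrics[ref] {
		return BuiltIn, ref, nil
	}
	return Unknown, "", fmt.Errorf("unknown built-in rubric %q", ref)
}
```

Under these assumptions, ResolveRubric("groundedness", "evals") reports a built-in preset, while ResolveRubric("./rubrics/my-custom.md", "evals") reports a local file under the config directory. Failing loudly on an unknown bare name, rather than falling back to treating it as a path, keeps a typo like "groundednes" from silently becoming a missing-file error later.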
- Rubrics live as plain markdown/YAML files with a defined frontmatter shape (scale, scoring guide, examples).
- Ship a starter set: groundedness, helpfulness, instruction-following, refusal-correctness, tool-use-appropriateness.
- Rubrics are versioned (semver in frontmatter).
- Each rubric has golden examples bundled so the rubric itself can be unit-tested.
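The bullets above could translate into something like the following Go sketch of a parsed rubric and a golden-example check. Everything here is an assumption made for illustration: the field names (Name, Version, Scale, Golden), the Judge function type standing in for the LLM call, and the simplified MAJOR.MINOR.PATCH version check (full semver also allows pre-release and build suffixes). Parsing the frontmatter itself is left out.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
)

// Scale is the numeric range a rubric scores on, e.g. 1 to 5 or 0 to 1.
type Scale struct{ Min, Max float64 }

// Normalize maps a raw score onto [0, 1] so rubrics with different
// scales can be compared. Assumes Max > Min (see Validate).
func (s Scale) Normalize(score float64) float64 {
	return (score - s.Min) / (s.Max - s.Min)
}

// GoldenExample pairs an output with the score the rubric should assign it.
type GoldenExample struct {
	Output    string
	WantScore float64
}

// Rubric is a hypothetical in-memory form of the rubric frontmatter.
type Rubric struct {
	Name    string
	Version string // semver, simplified here to MAJOR.MINOR.PATCH
	Scale   Scale
	Golden  []GoldenExample
}

var semverRe = regexp.MustCompile(`^\d+\.\d+\.\d+$`)

// Validate checks the frontmatter shape before the rubric is used.
func (r Rubric) Validate() error {
	if r.Name == "" {
		return fmt.Errorf("rubric has no name")
	}
	if !semverRe.MatchString(r.Version) {
		return fmt.Errorf("rubric %q: version %q is not MAJOR.MINOR.PATCH", r.Name, r.Version)
	}
	if r.Scale.Max <= r.Scale.Min {
		return fmt.Errorf("rubric %q: scale max must exceed min", r.Name)
	}
	for i, g := range r.Golden {
		if g.WantScore < r.Scale.Min || g.WantScore > r.Scale.Max {
			return fmt.Errorf("rubric %q: golden example %d is off the scale", r.Name, i)
		}
	}
	return nil
}

// Judge stands in for the LLM-as-judge call made by the prompt grader.
type Judge func(r Rubric, output string) (float64, error)

// CheckGolden scores every golden example and reports each one whose
// score differs from the expected score by more than tolerance.
func CheckGolden(r Rubric, judge Judge, tolerance float64) []string {
	var failures []string
	for i, g := range r.Golden {
		got, err := judge(r, g.Output)
		if err != nil {
			failures = append(failures, fmt.Sprintf("example %d: judge error: %v", i, err))
			continue
		}
		if math.Abs(got-g.WantScore) > tolerance {
			failures = append(failures, fmt.Sprintf("example %d: got %.2f, want %.2f", i, got, g.WantScore))
		}
	}
	return failures
}
```

A rubric's unit test would then load the file, call Validate, and assert that CheckGolden returns no failures. The tolerance parameter is there because a judge model is unlikely to reproduce golden scores exactly; what value is reasonable would depend on the scale and the model.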
Why this matters for agentic-first
Agentic skills need to be judged on behavior (did it refuse correctly? did it ground its answer in the right source?), not just output text. A shared rubric vocabulary makes evals comparable across skills and across teams.
Acceptance criteria
- prompt grader resolves rubric: by name (built-in) or path (local).
- [...] internal/graders/data/rubrics/ with golden examples.
- [...] prompt grader usage (inline criteria still works).
- [...] site/ listing rubrics with usage examples.
Non-goals (filed separately)
Related
internal/graders/prompt_grader.go