feat: rubric preset library (closes #360) by spboyer · Pull Request #381 · microsoft/waza

spboyer · 2026-06-28T10:58:35Z

Closes #360.

Adds a reusable rubric preset library so eval authors can reference a versioned LLM-as-judge rubric by name (`rubric: groundedness`) or by path (`rubric: ./my-rubric.md`) instead of hand-writing the same judge prompt across every skill. No breaking changes: the inline `prompt:` form still works exactly as before.

What's in the box

Five built-in rubrics

Shipped under internal/graders/data/rubrics/ and embedded via go:embed:

| Rubric | Scale | What it scores |
| --- | --- | --- |
| `groundedness` | pass-fail | Claims supported by provided source context? |
| `helpfulness` | pass-fail | Actually addresses the user's request with actionable content? |
| `instruction-following` | pass-fail | Respects explicit format & constraint instructions? |
| `refusal-correctness` | pass-fail | Refuses what it should and complies with what it should (catches over- and under-refusal)? |
| `tool-use-appropriateness` | pass-fail | Right tools, sensible args, no extraneous calls? |

Each rubric ships with goldens (at least one passing and one failing example) so the rubric body itself is regression-tested.
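
As a rough illustration of the built-in lookup, here is a minimal sketch. It assumes each built-in is stored as `<name>.md`; `lookupBuiltin` and `demoFS` are hypothetical names, and `fstest.MapFS` stands in for the `embed.FS` that a `//go:embed` directive would produce, so the example runs without the real rubric files.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// lookupBuiltin is a hypothetical helper that reads "<name>.md" from fsys.
// In the real package fsys would presumably be an embed.FS; any fs.FS works here.
func lookupBuiltin(fsys fs.FS, name string) (string, error) {
	data, err := fs.ReadFile(fsys, name+".md")
	if err != nil {
		return "", fmt.Errorf("unknown built-in rubric %q: %w", name, err)
	}
	return string(data), nil
}

// demoFS is an in-memory stand-in for the embedded rubric directory.
func demoFS() fs.FS {
	return fstest.MapFS{
		"groundedness.md": &fstest.MapFile{Data: []byte("# Groundedness judge prompt")},
	}
}

func main() {
	body, err := lookupBuiltin(demoFS(), "groundedness")
	fmt.Println(body, err)

	_, err = lookupBuiltin(demoFS(), "no-such-rubric")
	fmt.Println(err != nil)
}
```

Taking an `fs.FS` rather than a concrete `embed.FS` keeps the lookup testable with in-memory files.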

Schema

Markdown body + YAML frontmatter:

---
name: groundedness
version: 1.0.0
scale: pass-fail
description: ...
goldens:
  - { name: ..., output: ..., expected: pass | fail }
---
# Body — the actual judge prompt, ending with set_waza_grade_pass/fail instructions

Validated by ParseRubric + Rubric.Validate (semver version, known scale, required name/description/body).
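
The validation rules listed above could look roughly like the following. This is a sketch under assumptions, not the code in `internal/graders/rubric.go`: the struct layout, the plain `MAJOR.MINOR.PATCH` regex, and the single-entry scale set are all illustrative, since the PR only mentions the `pass-fail` scale.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Rubric mirrors the frontmatter fields shown above; the Go layout is assumed.
type Rubric struct {
	Name, Version, Scale, Description, Body string
}

// semverRe accepts plain MAJOR.MINOR.PATCH only (a simplification of semver).
var semverRe = regexp.MustCompile(`^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$`)

// knownScales lists only the scale named in this PR; others are unknown here.
var knownScales = map[string]bool{"pass-fail": true}

// Validate reports every problem it finds, joined into one error.
func (r Rubric) Validate() error {
	var errs []error
	required := []struct{ field, value string }{
		{"name", r.Name}, {"description", r.Description}, {"body", r.Body},
	}
	for _, f := range required {
		if strings.TrimSpace(f.value) == "" {
			errs = append(errs, fmt.Errorf("%s is required", f.field))
		}
	}
	if !semverRe.MatchString(r.Version) {
		errs = append(errs, fmt.Errorf("version %q is not MAJOR.MINOR.PATCH", r.Version))
	}
	if !knownScales[r.Scale] {
		errs = append(errs, fmt.Errorf("unknown scale %q", r.Scale))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	ok := Rubric{Name: "groundedness", Version: "1.0.0", Scale: "pass-fail",
		Description: "demo", Body: "# judge prompt"}
	fmt.Println(ok.Validate())

	bad := Rubric{Name: "groundedness", Version: "v1", Scale: "stars"}
	fmt.Println(bad.Validate())
}
```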

Wiring

models.PromptGraderParameters gains an optional Rubric string field.
NewPromptGrader resolves the rubric at construction time (built-in by name, or file by path). The rubric body becomes the seed prompt unless an inline prompt: is also supplied (inline wins).
At Grade time, if the grader is rubric-bound and the grading context carries a candidate Output, the rubric body is rendered with the task input + candidate output appended under ## Candidate output. continue_session flows are untouched.
Rubric metadata (name, version, scale, source) travels in GraderResults.Details["rubric"] so dashboards/reports can attribute the verdict.
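
The precedence and rendering rules above can be sketched as follows. The function signatures are assumptions made for illustration, and the `## Task input` heading is invented for the sketch; only `## Candidate output` is named in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// seedPrompt applies the precedence rule: an inline prompt wins over the rubric body.
func seedPrompt(inlinePrompt, rubricBody string) string {
	if strings.TrimSpace(inlinePrompt) != "" {
		return inlinePrompt
	}
	return rubricBody
}

// renderJudgePrompt appends the task input and candidate output to the body.
// It returns the body unchanged for continue_session flows or when there is
// no candidate output. The signature is assumed, not taken from the PR.
func renderJudgePrompt(body, input, output string, continueSession bool) string {
	if continueSession || output == "" {
		return body
	}
	var b strings.Builder
	b.WriteString(strings.TrimRight(body, "\n"))
	b.WriteString("\n\n## Task input\n\n") // assumed heading
	b.WriteString(input)
	b.WriteString("\n\n## Candidate output\n\n")
	b.WriteString(output)
	b.WriteString("\n")
	return b.String()
}

func main() {
	body := seedPrompt("", "# Groundedness rubric")
	fmt.Print(renderJudgePrompt(body, "Summarize the doc", "The doc says X.", false))
}
```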

Tests

TestRubricGoldens_OracleJudge (and friends) walks every shipped rubric through the prompt grader with a mocked LLM that returns the golden's expected outcome — fast, deterministic, and free.

--- PASS: TestRubricGoldens_OracleJudge (0.00s)
    --- PASS: groundedness               (3 goldens)
    --- PASS: helpfulness                (3 goldens)
    --- PASS: instruction-following      (3 goldens)
    --- PASS: refusal-correctness        (4 goldens)
    --- PASS: tool-use-appropriateness   (4 goldens)
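
To make the oracle-judge idea concrete, here is a minimal sketch with hypothetical types (`Golden`, `Judge`, `oracleJudge`, `mismatches`); the real test drives the prompt grader with a mocked LLM, and its details are not shown in this PR. Because the judge is canned, a test of this shape checks the wiring between goldens and grader, not the quality of the rubric text.

```go
package main

import "fmt"

// Golden mirrors one frontmatter golden entry; the Go field names are assumed.
type Golden struct {
	Name     string
	Output   string
	Expected string // "pass" or "fail"
}

// Judge is a hypothetical stand-in for the LLM call made by the prompt grader.
type Judge interface {
	Verdict(prompt, candidate string) (string, error)
}

// oracleJudge answers from a lookup table instead of calling a model.
type oracleJudge map[string]string

func (o oracleJudge) Verdict(_, candidate string) (string, error) {
	v, ok := o[candidate]
	if !ok {
		// Fail loudly rather than silently passing an unknown candidate.
		return "", fmt.Errorf("no canned verdict for %q", candidate)
	}
	return v, nil
}

// newOracle builds an oracle keyed by each golden's output.
func newOracle(goldens []Golden) oracleJudge {
	o := oracleJudge{}
	for _, g := range goldens {
		o[g.Output] = g.Expected
	}
	return o
}

// mismatches returns the names of goldens whose verdict differs from Expected.
func mismatches(j Judge, prompt string, goldens []Golden) ([]string, error) {
	var bad []string
	for _, g := range goldens {
		got, err := j.Verdict(prompt, g.Output)
		if err != nil {
			return nil, fmt.Errorf("golden %q: %w", g.Name, err)
		}
		if got != g.Expected {
			bad = append(bad, g.Name)
		}
	}
	return bad, nil
}

func main() {
	goldens := []Golden{
		{Name: "supported-claim", Output: "cites the source", Expected: "pass"},
		{Name: "invented-claim", Output: "makes things up", Expected: "fail"},
	}
	bad, err := mismatches(newOracle(goldens), "judge prompt", goldens)
	fmt.Println(len(bad), err)
}
```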

Full Go suite + lint both clean locally.

Docs

New site/src/content/docs/reference/rubrics.mdx: built-in list, full schema, rendering flow (mermaid), how to author your own. Wired into the Starlight sidebar.
Updated site/.../guides/graders.mdx prompt section: new rubric: config row and short usage examples.
README graders table updated to list the five built-ins.

Acceptance criteria (issue #360)

Rubric file schema defined and validated.
prompt grader resolves rubric: by name (built-in) or path (local).
Starter set of 5 rubrics under internal/graders/data/rubrics/ with golden examples.
Each shipped rubric has a test that runs goldens and asserts expected scores.
No breaking change to existing prompt grader usage.
Docs in site/ listing rubrics with usage examples.

Non-goals (per issue)

Judge calibration, multi-judge aggregation, and safety/adversarial rubrics are out of scope here — left to follow-ups (#365).

Adds a reusable rubric library so eval authors can reference a versioned LLM-as-judge rubric by name (`rubric: groundedness`) or by path (`rubric: ./my-rubric.md`) instead of hand-writing the same judge prompt across every skill.

- Rubric file schema: markdown body + YAML frontmatter (name, version, scale, description, optional goldens). Validated by ParseRubric + Rubric.Validate.
- Built-in rubrics shipped under internal/graders/data/rubrics/ and resolved via go:embed: groundedness, helpfulness, instruction-following, refusal-correctness, tool-use-appropriateness. Each ships with at least one passing and one failing golden.
- PromptGraderParameters gains an optional Rubric field. The existing inline `prompt:` form is unchanged. If both are set, the inline prompt wins but the rubric metadata still travels through.
- At Grade time, when a rubric is bound and a candidate Output is present, the rubric body is rendered with the task input + candidate output injected so independent-mode judges have something concrete to evaluate. continue_session flows are unaffected.
- TestRubricGoldens_OracleJudge drives every shipped rubric through the prompt grader with a mocked LLM that always returns the golden's expected outcome: fast, deterministic, free.
- Docs: new site/reference/rubrics page listing built-ins, schema, rendering flow (mermaid), and how to write your own. Linked from the Validators & Graders guide and the sidebar.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds a first-class rubric preset library to the existing prompt (LLM-as-judge) grader, enabling eval authors to reference versioned rubric markdown by built-in name or local path, with rubric metadata carried into grader results for attribution/reporting.

Changes:

Introduces rubric parsing/validation + built-in rubric embedding/lookup (internal/graders/rubric.go) and extends PromptGraderParameters with rubric:.
Wires the prompt grader to resolve rubrics at construction time and render the judge prompt with appended candidate output (internal/graders/prompt_grader.go).
Adds shipped rubric files + golden-driven unit tests, and updates docs/README to document the new rubric: option and built-in set.

Show a summary per file

| File | Description |
| --- | --- |
| site/src/content/docs/reference/rubrics.mdx | New end-user reference for rubric schema, built-ins, and rendering flow. |
| site/src/content/docs/guides/graders.mdx | Documents `rubric:` option for the `prompt` grader and usage examples. |
| site/astro.config.mjs | Adds “Rubric Library” to the docs sidebar. |
| README.md | Lists the built-in rubrics in the `prompt` grader row. |
| internal/models/grader_params.go | Adds `Rubric` field to `PromptGraderParameters`. |
| internal/graders/rubric.go | Implements rubric parsing, validation, built-in embedding, file loading, and rendering. |
| internal/graders/rubric_test.go | Adds coverage for resolving/validating/rendering rubrics and golden-based contract tests. |
| internal/graders/prompt_grader.go | Resolves `rubric:` and injects candidate output into the judge prompt + attaches rubric metadata in details. |
| internal/graders/data/rubrics/groundedness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/helpfulness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/instruction-following.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/refusal-correctness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/tool-use-appropriateness.md | Built-in rubric definition + goldens. |

Review details

Files reviewed: 13/13 changed files
Comments generated: 4
Review effort level: Low

Addresses 4 reviewer comments on PR #381:

1. LoadRubricFile: only expand leading "~/" (or bare "~") to home; leave "~name" literal. Avoids portable-shell ambiguity and the missing-separator concern on "~/..." paths.
2. renderJudgePrompt: skip Output injection when ContinueSession is true. In continue_session mode the judge resumes the agent's live session and reads conversation directly, so injecting Output is redundant (and potentially misleading if Output is a stale snapshot).
3. gradePairwise: attach rubric metadata to Details so the dashboard can attribute pairwise verdicts to the configured rubric, matching the independent grading path.
4. Oracle-judge golden test: assert that the expected tool was actually invoked. Loop-and-break previously silently succeeded if the tool wasn't present in req.Tools.

Adds covering tests for each fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
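
The tilde rule from the LoadRubricFile fix can be illustrated with a small sketch. `expandHome` is a hypothetical helper, not the function in the PR; the home directory is passed in (a caller could obtain it from `os.UserHomeDir`) so the behavior is easy to test.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// expandHome expands a bare "~" or a leading "~/" to home.
// Anything else, including "~name", is returned unchanged.
func expandHome(path, home string) string {
	if path == "~" {
		return home
	}
	if strings.HasPrefix(path, "~/") {
		// filepath.Join supplies the separator between home and the rest.
		return filepath.Join(home, path[2:])
	}
	return path
}

func main() {
	home := "/home/dev" // illustrative value only
	for _, p := range []string{"~", "~/rubrics/my-rubric.md", "~bob/rubric.md", "./my-rubric.md"} {
		fmt.Printf("%q -> %q\n", p, expandHome(p, home))
	}
}
```

Using `filepath.Join` rather than string concatenation sidesteps the missing-separator concern when the home directory has no trailing slash.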

Copilot AI review requested due to automatic review settings June 28, 2026 10:58

Copilot started reviewing on behalf of spboyer June 28, 2026 10:59

Copilot AI reviewed Jun 28, 2026

Comment thread internal/graders/rubric.go

Comment thread internal/graders/prompt_grader.go

Comment thread internal/graders/prompt_grader.go

Comment thread internal/graders/rubric_test.go

spboyer merged commit b38ca9b into main Jun 28, 2026
10 checks passed

spboyer deleted the spboyer-issue-360-rubric-library branch June 28, 2026 11:28

spboyer merged 2 commits into main from spboyer-issue-360-rubric-library
