
feat: rubric preset library (closes #360) #381

Merged: spboyer merged 2 commits into main from spboyer-issue-360-rubric-library on Jun 28, 2026

@spboyer (Member) commented Jun 28, 2026

Closes #360.

Adds a reusable rubric preset library so eval authors can reference a versioned LLM-as-judge rubric by name (rubric: groundedness) or by path (rubric: ./my-rubric.md) instead of hand-writing the same judge prompt across every skill. No breaking changes — the inline prompt: form still works exactly as before.

What's in the box

Five built-in rubrics

Shipped under internal/graders/data/rubrics/ and embedded via go:embed:

| Rubric | Scale | What it scores |
| --- | --- | --- |
| groundedness | pass-fail | Claims supported by provided source context? |
| helpfulness | pass-fail | Actually addresses the user's request with actionable content? |
| instruction-following | pass-fail | Respects explicit format & constraint instructions? |
| refusal-correctness | pass-fail | Refuses what it should and complies with what it should (catches over- and under-refusal)? |
| tool-use-appropriateness | pass-fail | Right tools, sensible args, no extraneous calls? |

Each rubric ships with goldens (at least one passing and one failing example) so the rubric body itself is regression-tested.

Schema

Markdown body + YAML frontmatter:

---
name: groundedness
version: 1.0.0
scale: pass-fail
description: ...
goldens:
  - { name: ..., output: ..., expected: pass | fail }
---
# Body — the actual judge prompt, ending with set_waza_grade_pass/fail instructions

Validated by ParseRubric + Rubric.Validate (semver version, known scale, required name/description/body).
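
A minimal sketch of the validation rules listed above (semver version, known scale, required name/description/body). The `Rubric` struct layout, the regex, and the `knownScales` set are assumptions for illustration and are not copied from `internal/graders/rubric.go`; only `pass-fail` is listed because it is the only scale this PR mentions.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Rubric mirrors the frontmatter fields in the schema; the layout is assumed.
type Rubric struct {
	Name        string
	Version     string
	Scale       string
	Description string
	Body        string
}

// semverRe accepts plain MAJOR.MINOR.PATCH; pre-release and build suffixes
// are left out to keep the sketch short.
var semverRe = regexp.MustCompile(`^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$`)

// knownScales holds only the scale named in this PR (assumption).
var knownScales = map[string]bool{"pass-fail": true}

// Validate collects every violation so an author sees them all at once.
func (r Rubric) Validate() error {
	var errs []error
	required := []struct{ field, value string }{
		{"name", r.Name},
		{"description", r.Description},
		{"body", r.Body},
	}
	for _, f := range required {
		if strings.TrimSpace(f.value) == "" {
			errs = append(errs, fmt.Errorf("%s is required", f.field))
		}
	}
	if !semverRe.MatchString(r.Version) {
		errs = append(errs, fmt.Errorf("version %q is not MAJOR.MINOR.PATCH", r.Version))
	}
	if !knownScales[r.Scale] {
		errs = append(errs, fmt.Errorf("unknown scale %q", r.Scale))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	good := Rubric{Name: "groundedness", Version: "1.0.0", Scale: "pass-fail",
		Description: "d", Body: "Judge the answer."}
	fmt.Println(good.Validate() == nil) // true

	bad := Rubric{Name: "draft", Version: "1.0", Scale: "likert"}
	fmt.Println(bad.Validate())
}
```

Returning all violations via `errors.Join` rather than the first one is a design choice of this sketch, not something the PR states.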

Wiring

  • models.PromptGraderParameters gains an optional Rubric string field.
  • NewPromptGrader resolves the rubric at construction time (built-in by name, or file by path). The rubric body becomes the seed prompt unless an inline prompt: is also supplied (inline wins).
  • At Grade time, if the grader is rubric-bound and the grading context carries a candidate Output, the rubric body is rendered with the task input + candidate output appended under ## Candidate output. continue_session flows are untouched.
  • Rubric metadata (name, version, scale, source) travels in GraderResults.Details["rubric"] so dashboards/reports can attribute the verdict.
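
The wiring above could look roughly like the following. This is a sketch under assumptions: `gradingContext`, `seedPrompt`, `rubricDetails`, the `## Task input` heading, and the `"builtin"` source value are hypothetical names chosen for illustration; only the `## Candidate output` heading, the inline-wins rule, and the `Details["rubric"]` key come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// gradingContext is a hypothetical stand-in for the grader's real context type.
type gradingContext struct {
	Input           string
	Output          string
	ContinueSession bool
}

// seedPrompt picks the judge prompt: a non-empty inline prompt wins over
// the rubric body.
func seedPrompt(inlinePrompt, rubricBody string) string {
	if strings.TrimSpace(inlinePrompt) != "" {
		return inlinePrompt
	}
	return rubricBody
}

// renderJudgePrompt appends the task input and candidate output to the seed.
// The seed is returned unchanged when no rubric is bound, when there is no
// candidate output, or when the judge resumes the agent's session.
func renderJudgePrompt(seed string, rubricBound bool, gc gradingContext) string {
	if !rubricBound || gc.Output == "" || gc.ContinueSession {
		return seed
	}
	var b strings.Builder
	b.WriteString(strings.TrimRight(seed, "\n"))
	b.WriteString("\n\n## Task input\n\n") // heading name is an assumption
	b.WriteString(gc.Input)
	b.WriteString("\n\n## Candidate output\n\n")
	b.WriteString(gc.Output)
	b.WriteString("\n")
	return b.String()
}

// rubricDetails builds the value stored under Details["rubric"].
func rubricDetails(name, version, scale, source string) map[string]string {
	return map[string]string{
		"name": name, "version": version, "scale": scale, "source": source,
	}
}

func main() {
	seed := seedPrompt("", "Judge whether every claim is supported by the context.")
	gc := gradingContext{Input: "Summarize the doc.", Output: "The doc says X."}
	fmt.Print(renderJudgePrompt(seed, true, gc))

	details := map[string]any{
		"rubric": rubricDetails("groundedness", "1.0.0", "pass-fail", "builtin"),
	}
	fmt.Println(details["rubric"])
}
```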

Tests

TestRubricGoldens_OracleJudge (and friends) walks every shipped rubric through the prompt grader with a mocked LLM that returns the golden's expected outcome — fast, deterministic, and free.

--- PASS: TestRubricGoldens_OracleJudge (0.00s)
    --- PASS: groundedness               (3 goldens)
    --- PASS: helpfulness                (3 goldens)
    --- PASS: instruction-following      (3 goldens)
    --- PASS: refusal-correctness        (4 goldens)
    --- PASS: tool-use-appropriateness   (4 goldens)

Full Go suite + lint both clean locally.
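
For readers unfamiliar with the oracle-judge pattern, here is a rough sketch of the idea. The `golden`, `judge`, `oracleJudge`, `grade`, and `runGoldens` names are hypothetical and much simpler than the real test, which goes through the prompt grader and its tool calls.

```go
package main

import "fmt"

// golden mirrors one entry of the frontmatter goldens list.
type golden struct {
	Name     string
	Output   string
	Expected string // "pass" or "fail"
}

// judge is a hypothetical seam for the LLM call.
type judge interface {
	Verdict(prompt string) (string, error)
}

// oracleJudge ignores the prompt and answers with a preset verdict.
type oracleJudge struct{ answer string }

func (o oracleJudge) Verdict(string) (string, error) { return o.answer, nil }

// grade is a stand-in for the prompt grader: it builds a prompt, asks the
// judge, and rejects anything that is not a pass-fail verdict.
func grade(j judge, body, output string) (string, error) {
	v, err := j.Verdict(body + "\n\n## Candidate output\n\n" + output)
	if err != nil {
		return "", err
	}
	if v != "pass" && v != "fail" {
		return "", fmt.Errorf("unknown verdict %q", v)
	}
	return v, nil
}

// runGoldens grades each golden with an oracle primed to its expected outcome
// and returns the names of goldens whose verdict did not round-trip.
func runGoldens(body string, goldens []golden) []string {
	var failed []string
	for _, g := range goldens {
		got, err := grade(oracleJudge{answer: g.Expected}, body, g.Output)
		if err != nil || got != g.Expected {
			failed = append(failed, g.Name)
		}
	}
	return failed
}

func main() {
	goldens := []golden{
		{Name: "cites-source", Output: "Per the doc, X.", Expected: "pass"},
		{Name: "invents-fact", Output: "Also, Y is true.", Expected: "fail"},
	}
	fmt.Println(len(runGoldens("Judge groundedness.", goldens))) // 0
}
```

Because the judge is mocked, a test of this shape exercises parsing and wiring rather than how well the rubric text steers a real model.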

Docs

  • New site/src/content/docs/reference/rubrics.mdx: built-in list, full schema, rendering flow (mermaid), how to author your own. Wired into the Starlight sidebar.
  • Updated site/.../guides/graders.mdx prompt section: new rubric: config row and short usage examples.
  • README graders table updated to list the five built-ins.

Acceptance criteria (issue #360)

  • Rubric file schema defined and validated.
  • prompt grader resolves rubric: by name (built-in) or path (local).
  • Starter set of 5 rubrics under internal/graders/data/rubrics/ with golden examples.
  • Each shipped rubric has a test that runs goldens and asserts expected scores.
  • No breaking change to existing prompt grader usage.
  • Docs in site/ listing rubrics with usage examples.

Non-goals (per issue)

Judge calibration, multi-judge aggregation, and safety/adversarial rubrics are out of scope here — left to follow-ups (#365).

Adds a reusable rubric library so eval authors can reference a versioned
LLM-as-judge rubric by name (`rubric: groundedness`) or by path
(`rubric: ./my-rubric.md`) instead of hand-writing the same judge
prompt across every skill.

- Rubric file schema: markdown body + YAML frontmatter (name, version,
  scale, description, optional goldens). Validated by ParseRubric +
  Rubric.Validate.
- Built-in rubrics shipped under internal/graders/data/rubrics/ and
  resolved via go:embed: groundedness, helpfulness,
  instruction-following, refusal-correctness, tool-use-appropriateness.
  Each ships with at least one passing and one failing golden.
- PromptGraderParameters gains an optional Rubric field. The existing
  inline `prompt:` form is unchanged. If both are set, the inline
  prompt wins but the rubric metadata still travels through.
- At Grade time, when a rubric is bound and a candidate Output is
  present, the rubric body is rendered with the task input + candidate
  output injected so independent-mode judges have something concrete to
  evaluate. continue_session flows are unaffected.
- TestRubricGoldens_OracleJudge drives every shipped rubric through the
  prompt grader with a mocked LLM that always returns the golden's
  expected outcome — fast, deterministic, free.
- Docs: new site/reference/rubrics page listing built-ins, schema,
  rendering flow (mermaid), and how to write your own. Linked from the
  Validators & Graders guide and the sidebar.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 10:58

Copilot AI (Contributor) left a comment

Pull request overview

Adds a first-class rubric preset library to the existing prompt (LLM-as-judge) grader, enabling eval authors to reference versioned rubric markdown by built-in name or local path, with rubric metadata carried into grader results for attribution/reporting.

Changes:

  • Introduces rubric parsing/validation + built-in rubric embedding/lookup (internal/graders/rubric.go) and extends PromptGraderParameters with rubric:.
  • Wires the prompt grader to resolve rubrics at construction time and render the judge prompt with appended candidate output (internal/graders/prompt_grader.go).
  • Adds shipped rubric files + golden-driven unit tests, and updates docs/README to document the new rubric: option and built-in set.
| File | Description |
| --- | --- |
| site/src/content/docs/reference/rubrics.mdx | New end-user reference for rubric schema, built-ins, and rendering flow. |
| site/src/content/docs/guides/graders.mdx | Documents rubric: option for the prompt grader and usage examples. |
| site/astro.config.mjs | Adds “Rubric Library” to the docs sidebar. |
| README.md | Lists the built-in rubrics in the prompt grader row. |
| internal/models/grader_params.go | Adds Rubric field to PromptGraderParameters. |
| internal/graders/rubric.go | Implements rubric parsing, validation, built-in embedding, file loading, and rendering. |
| internal/graders/rubric_test.go | Adds coverage for resolving/validating/rendering rubrics and golden-based contract tests. |
| internal/graders/prompt_grader.go | Resolves rubric: and injects candidate output into the judge prompt + attaches rubric metadata in details. |
| internal/graders/data/rubrics/groundedness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/helpfulness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/instruction-following.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/refusal-correctness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/tool-use-appropriateness.md | Built-in rubric definition + goldens. |

Review details

  • Files reviewed: 13/13 changed files
  • Comments generated: 4
  • Review effort level: Low

Comment thread internal/graders/rubric.go
Comment thread internal/graders/prompt_grader.go
Comment thread internal/graders/prompt_grader.go
Comment thread internal/graders/rubric_test.go
Addresses 4 reviewer comments on PR #381:

1. LoadRubricFile: only expand leading "~/" (or bare "~") to home;
   leave "~name" literal. Avoids portable-shell ambiguity and the
   missing-separator concern on "~/..." paths.

2. renderJudgePrompt: skip Output injection when ContinueSession is
   true. In continue_session mode the judge resumes the agent's live
   session and reads conversation directly, so injecting Output is
   redundant (and potentially misleading if Output is a stale snapshot).

3. gradePairwise: attach rubric metadata to Details so the dashboard
   can attribute pairwise verdicts to the configured rubric, matching
   the independent grading path.

4. Oracle-judge golden test: assert that the expected tool was
   actually invoked. Loop-and-break previously silently succeeded if
   the tool wasn't present in req.Tools.
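
The tilde rule in item 1 can be sketched as follows. `expandHome` is a hypothetical helper, not the body of `LoadRubricFile`; the home directory is passed in to keep the example deterministic, where real code would more likely call `os.UserHomeDir`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// expandHome expands only a bare "~" or a leading "~/".
// "~name" forms are returned untouched, as is every other path.
func expandHome(path, home string) string {
	switch {
	case path == "~":
		return home
	case strings.HasPrefix(path, "~/"):
		// filepath.Join inserts the separator, so "~/x" never becomes home+"x".
		return filepath.Join(home, path[2:])
	default:
		return path
	}
}

func main() {
	home := "/home/dev" // illustrative value, not read from the environment
	for _, p := range []string{"~", "~/rubrics/tone.md", "~alice/tone.md", "./tone.md"} {
		fmt.Printf("%-20s -> %s\n", p, expandHome(p, home))
	}
}
```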

Adds covering tests for each fix.
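
The loop-and-break pitfall from item 4 is easy to reproduce in any mock that searches a tool list. A hedged sketch of the fixed shape, where the `tool` type and `invokeNamed` are hypothetical and the tool names are assumed expansions of the `set_waza_grade_pass/fail` shorthand in the schema:

```go
package main

import "fmt"

// tool is a hypothetical stand-in for an entry in req.Tools.
type tool struct {
	Name   string
	Invoke func() error
}

// invokeNamed calls the first tool with the given name. It returns an error
// when no tool matches, so a missing tool cannot pass silently the way a
// bare loop-and-break would.
func invokeNamed(tools []tool, name string) error {
	for _, t := range tools {
		if t.Name == name {
			return t.Invoke()
		}
	}
	return fmt.Errorf("tool %q was not offered to the judge", name)
}

func main() {
	graded := ""
	tools := []tool{
		{Name: "set_waza_grade_pass", Invoke: func() error { graded = "pass"; return nil }},
	}

	err := invokeNamed(tools, "set_waza_grade_pass")
	fmt.Println(err, graded)

	err = invokeNamed(tools, "set_waza_grade_fail")
	fmt.Println(err)
}
```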

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@spboyer spboyer merged commit b38ca9b into main Jun 28, 2026
10 checks passed
@spboyer spboyer deleted the spboyer-issue-360-rubric-library branch June 28, 2026 11:28

Development

Successfully merging this pull request may close these issues.

feat: First-class LLM-as-judge grader with reusable rubrics

3 participants