feat: rubric preset library (closes #360) #381
Merged
Adds a reusable rubric library so eval authors can reference a versioned LLM-as-judge rubric by name (`rubric: groundedness`) or by path (`rubric: ./my-rubric.md`) instead of hand-writing the same judge prompt across every skill.

- Rubric file schema: markdown body + YAML frontmatter (name, version, scale, description, optional goldens). Validated by ParseRubric + Rubric.Validate.
- Built-in rubrics shipped under internal/graders/data/rubrics/ and resolved via go:embed: groundedness, helpfulness, instruction-following, refusal-correctness, tool-use-appropriateness. Each ships with at least one passing and one failing golden.
- PromptGraderParameters gains an optional Rubric field. The existing inline `prompt:` form is unchanged. If both are set, the inline prompt wins but the rubric metadata still travels through.
- At Grade time, when a rubric is bound and a candidate Output is present, the rubric body is rendered with the task input + candidate output injected so independent-mode judges have something concrete to evaluate. continue_session flows are unaffected.
- TestRubricGoldens_OracleJudge drives every shipped rubric through the prompt grader with a mocked LLM that always returns the golden's expected outcome: fast, deterministic, free.
- Docs: new site/reference/rubrics page listing built-ins, schema, rendering flow (mermaid), and how to write your own. Linked from the Validators & Graders guide and the sidebar.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
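To make the schema bullet concrete, here is a minimal sketch of the kind of check `Rubric.Validate` is described as performing (semver version, known scale, required name/description/body, per the PR description). The struct layout, the `knownScales` values, and the error wording are assumptions for illustration; the real code in internal/graders/rubric.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Rubric mirrors the frontmatter fields named in the commit message.
// Hypothetical layout: the real struct may carry more fields (e.g. goldens).
type Rubric struct {
	Name        string
	Version     string
	Scale       string
	Description string
	Body        string
}

// semverCore matches MAJOR.MINOR.PATCH only. Pre-release and build
// suffixes are deliberately out of scope for this sketch.
var semverCore = regexp.MustCompile(`^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$`)

// knownScales is a made-up set; the PR does not list the accepted values.
var knownScales = map[string]bool{"pass_fail": true, "1-5": true}

// Validate reports the first problem it finds, or nil if the rubric is usable.
func (r Rubric) Validate() error {
	switch {
	case strings.TrimSpace(r.Name) == "":
		return errors.New("rubric: name is required")
	case strings.TrimSpace(r.Description) == "":
		return errors.New("rubric: description is required")
	case strings.TrimSpace(r.Body) == "":
		return errors.New("rubric: body is required")
	case !semverCore.MatchString(r.Version):
		return fmt.Errorf("rubric: version %q is not MAJOR.MINOR.PATCH", r.Version)
	case !knownScales[r.Scale]:
		return fmt.Errorf("rubric: unknown scale %q", r.Scale)
	}
	return nil
}

func main() {
	r := Rubric{
		Name:        "groundedness",
		Version:     "1.0", // missing patch component, so Validate rejects it
		Scale:       "pass_fail",
		Description: "Checks claims against the provided context.",
		Body:        "Judge whether every claim is supported.",
	}
	fmt.Println(r.Validate())
}
```

Failing at construction time rather than at grade time would keep a malformed rubric from silently producing judge verdicts.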
Pull request overview
Adds a first-class rubric preset library to the existing prompt (LLM-as-judge) grader, enabling eval authors to reference versioned rubric markdown by built-in name or local path, with rubric metadata carried into grader results for attribution/reporting.
Changes:
- Introduces rubric parsing/validation + built-in rubric embedding/lookup (`internal/graders/rubric.go`) and extends `PromptGraderParameters` with `rubric:`.
- Wires the `prompt` grader to resolve rubrics at construction time and render the judge prompt with appended candidate output (`internal/graders/prompt_grader.go`).
- Adds shipped rubric files + golden-driven unit tests, and updates docs/README to document the new `rubric:` option and built-in set.
Show a summary per file
| File | Description |
|---|---|
| site/src/content/docs/reference/rubrics.mdx | New end-user reference for rubric schema, built-ins, and rendering flow. |
| site/src/content/docs/guides/graders.mdx | Documents rubric: option for the prompt grader and usage examples. |
| site/astro.config.mjs | Adds “Rubric Library” to the docs sidebar. |
| README.md | Lists the built-in rubrics in the prompt grader row. |
| internal/models/grader_params.go | Adds Rubric field to PromptGraderParameters. |
| internal/graders/rubric.go | Implements rubric parsing, validation, built-in embedding, file loading, and rendering. |
| internal/graders/rubric_test.go | Adds coverage for resolving/validating/rendering rubrics and golden-based contract tests. |
| internal/graders/prompt_grader.go | Resolves rubric: and injects candidate output into the judge prompt + attaches rubric metadata in details. |
| internal/graders/data/rubrics/groundedness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/helpfulness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/instruction-following.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/refusal-correctness.md | Built-in rubric definition + goldens. |
| internal/graders/data/rubrics/tool-use-appropriateness.md | Built-in rubric definition + goldens. |
Review details
- Files reviewed: 13/13 changed files
- Comments generated: 4
- Review effort level: Low
Addresses 4 reviewer comments on PR #381:

1. LoadRubricFile: only expand leading "~/" (or bare "~") to home; leave "~name" literal. Avoids portable-shell ambiguity and the missing-separator concern on "~/..." paths.
2. renderJudgePrompt: skip Output injection when ContinueSession is true. In continue_session mode the judge resumes the agent's live session and reads conversation directly, so injecting Output is redundant (and potentially misleading if Output is a stale snapshot).
3. gradePairwise: attach rubric metadata to Details so the dashboard can attribute pairwise verdicts to the configured rubric, matching the independent grading path.
4. Oracle-judge golden test: assert that the expected tool was actually invoked. Loop-and-break previously silently succeeded if the tool wasn't present in req.Tools.

Adds covering tests for each fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
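The tilde rule from the first fix is easy to get subtly wrong, so here is a small sketch of it. `expandHome` is a hypothetical helper, not the actual `LoadRubricFile` code, and it takes the home directory as a parameter (real code would likely obtain it from `os.UserHomeDir`) so the behaviour is easy to check.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// expandHome expands a bare "~" or a leading "~/" to home and leaves every
// other path, including "~name", untouched. Hypothetical helper for
// illustration only.
func expandHome(path, home string) string {
	if path == "~" {
		return home
	}
	if strings.HasPrefix(path, "~/") {
		// Join inserts the separator, so "~/x" can never become "<home>x".
		return filepath.Join(home, path[2:])
	}
	return path
}

func main() {
	home := "/home/dev" // assumed value for the demo
	for _, p := range []string{"~", "~/rubrics/custom.md", "~alice/rubric.md", "./my-rubric.md"} {
		fmt.Printf("%-22s -> %s\n", p, expandHome(p, home))
	}
}
```

Leaving `~name` literal avoids guessing at another user's home directory, which shells resolve differently across platforms.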
Closes #360.
Adds a reusable rubric preset library so eval authors can reference a versioned LLM-as-judge rubric by name (`rubric: groundedness`) or by path (`rubric: ./my-rubric.md`) instead of hand-writing the same judge prompt across every skill. No breaking changes: the inline `prompt:` form still works exactly as before.

What's in the box
Five built-in rubrics
Shipped under `internal/graders/data/rubrics/` and embedded via `go:embed`:

- `groundedness`
- `helpfulness`
- `instruction-following`
- `refusal-correctness`
- `tool-use-appropriateness`

Each rubric ships with goldens (at least one passing and one failing example) so the rubric body itself is regression-tested.
Schema
Markdown body + YAML frontmatter:
Validated by `ParseRubric` + `Rubric.Validate` (semver version, known scale, required name/description/body).

Wiring

- `models.PromptGraderParameters` gains an optional `Rubric string` field.
- `NewPromptGrader` resolves the rubric at construction time (built-in by name, or file by path). The rubric body becomes the seed prompt unless an inline `prompt:` is also supplied (inline wins).
- At grade time, when a rubric is bound and a candidate `Output` is present, the rubric body is rendered with the task input + candidate output appended under `## Candidate output`. `continue_session` flows are untouched.
- Rubric metadata (`name`, `version`, `scale`, `source`) travels in `GraderResults.Details["rubric"]` so dashboards/reports can attribute the verdict.

Tests

`TestRubricGoldens_OracleJudge` (and friends) walks every shipped rubric through the prompt grader with a mocked LLM that returns the golden's expected outcome: fast, deterministic, and free.

Full Go suite + lint both clean locally.
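The oracle-judge idea described under Tests can be illustrated with a small stand-alone sketch. The `Golden`, `Judge`, `grade`, and `checkGoldens` names are hypothetical and much simpler than the real grader and its tool-calling LLM interface; the point is only the shape of the technique: a judge double that returns the golden's expected verdict, plus a check that it was actually called.

```go
package main

import (
	"errors"
	"fmt"
)

// Golden is a hypothetical shape for one golden example in a rubric file.
type Golden struct {
	Name     string
	Output   string
	WantPass bool
}

// Judge is a hypothetical seam between the grader and the LLM.
type Judge interface {
	Verdict(prompt string) (bool, error)
}

// oracleJudge ignores the prompt and returns the expected outcome, so the
// test exercises the wiring rather than model quality.
type oracleJudge struct {
	want  bool
	calls int
}

func (o *oracleJudge) Verdict(string) (bool, error) {
	o.calls++
	return o.want, nil
}

// grade stands in for the prompt grader: build a prompt, ask the judge.
func grade(j Judge, rubricBody, output string) (bool, error) {
	if output == "" {
		return false, errors.New("no candidate output")
	}
	return j.Verdict(rubricBody + "\n\n## Candidate output\n\n" + output)
}

// checkGoldens drives every golden through grade with an oracle judge.
func checkGoldens(rubricBody string, goldens []Golden) error {
	for _, g := range goldens {
		j := &oracleJudge{want: g.WantPass}
		got, err := grade(j, rubricBody, g.Output)
		if err != nil {
			return fmt.Errorf("%s: %w", g.Name, err)
		}
		// Guard against a test that "passes" without the judge being exercised.
		if j.calls != 1 {
			return fmt.Errorf("%s: judge called %d times, want 1", g.Name, j.calls)
		}
		if got != g.WantPass {
			return fmt.Errorf("%s: got pass=%v, want %v", g.Name, got, g.WantPass)
		}
	}
	return nil
}

func main() {
	goldens := []Golden{
		{Name: "supported claim", Output: "Paris is the capital of France.", WantPass: true},
		{Name: "unsupported claim", Output: "The context says Lyon is the capital.", WantPass: false},
	}
	fmt.Println(checkGoldens("Judge groundedness.", goldens))
}
```

Because the oracle never inspects the prompt, a test like this says nothing about whether the rubric wording steers a real model correctly; it only confirms that rubric loading, rendering, and result plumbing hold together.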
Docs
- `site/src/content/docs/reference/rubrics.mdx`: built-in list, full schema, rendering flow (mermaid), how to author your own. Wired into the Starlight sidebar.
- `site/.../guides/graders.mdx` `prompt` section: new `rubric:` config row and short usage examples.

Acceptance criteria (issue #360)

- `prompt` grader resolves `rubric:` by name (built-in) or path (local).
- `internal/graders/data/rubrics/` with golden examples.
- `prompt` grader usage.
- `site/` listing rubrics with usage examples.

Non-goals (per issue)

Judge calibration, multi-judge aggregation, and safety/adversarial rubrics are out of scope here and left to follow-ups (#365).