feat: First-class LLM-as-judge grader with reusable rubrics #360

Description

@spboyer

Problem

Waza already has a prompt grader (LLM-as-judge) in internal/graders/prompt_grader.go. What's missing is a rubric library so authors don't hand-write the same rubric prompts across every skill:

  • Every author re-invents "helpfulness," "groundedness," "instruction-following," "refusal-correctness" from scratch.
  • Rubrics in the wild are inconsistent in scale (1–5 vs. pass/fail vs. 0–1), wording, and bias.
  • No way to reference a rubric by name (rubric: groundedness) the way you reference a grader type.

Proposal

Ship a rubric preset library that the existing prompt grader can resolve by name:

graders:
  - type: prompt
    rubric: groundedness   # resolves to a versioned rubric file
  - type: prompt
    rubric: ./rubrics/my-custom.md   # local file

  • Rubrics live as plain markdown/YAML files with a defined frontmatter shape (scale, scoring guide, examples).
  • Ship a starter set: groundedness, helpfulness, instruction-following, refusal-correctness, tool-use-appropriateness.
  • Rubrics are versioned (semver in frontmatter).
  • Each rubric has golden examples bundled so the rubric itself can be unit-tested.
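
One way the name-or-path distinction in the `rubric:` field could be decided is sketched below. This is only an illustration under stated assumptions: `classifyRubricRef`, `RubricRefKind`, `RefBuiltin`, and `RefLocal` are hypothetical names, not part of the existing prompt grader, and the rule used here (a path prefix or a file extension means "local file") is an assumed heuristic rather than a decided design.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// RubricRefKind says where a rubric reference would be loaded from.
// Hypothetical type, introduced only for this sketch.
type RubricRefKind int

const (
	RefBuiltin RubricRefKind = iota // shipped rubric, looked up by name
	RefLocal                        // file on disk, looked up by path
)

// classifyRubricRef guesses whether a `rubric:` value is a built-in name
// or a local path. Assumed heuristic: anything that starts with "./" or
// "../", is an absolute path, or carries a file extension is a local file;
// everything else is treated as a built-in rubric name.
func classifyRubricRef(ref string) RubricRefKind {
	if strings.HasPrefix(ref, "./") || strings.HasPrefix(ref, "../") ||
		filepath.IsAbs(ref) || filepath.Ext(ref) != "" {
		return RefLocal
	}
	return RefBuiltin
}

func main() {
	// The two reference styles from the proposal's YAML example.
	for _, ref := range []string{"groundedness", "./rubrics/my-custom.md"} {
		fmt.Printf("%q -> kind %d\n", ref, classifyRubricRef(ref))
	}
}
```

A consequence of this heuristic is that built-in rubric names could never contain a dot, which may or may not be acceptable once versioned names are considered.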

Why this matters for agentic-first

Agentic skills need to be judged on behavior (did it refuse correctly? did it ground its answer in the right source?), not just output text. A shared rubric vocabulary makes evals comparable across skills and across teams.

Acceptance criteria

  • Rubric file schema defined and validated (frontmatter + body).
  • prompt grader resolves rubric: by name (built-in) or path (local).
  • Starter set of 5 rubrics shipped under internal/graders/data/rubrics/ with golden examples.
  • Each shipped rubric has a test that runs it against goldens and asserts expected scores.
  • No breaking change to existing prompt grader usage (inline criteria still works).
  • Docs in site/ listing rubrics with usage examples.
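
The golden-example criterion above could be checked roughly as follows. This is a minimal sketch, not the grader's actual API: `Golden`, `Judge`, `stubJudge`, and `checkGoldens` are hypothetical names, the 1–5 scale and the tolerance are assumptions, and the sample goldens are invented for illustration. The judge is stubbed so the check runs without a live model call.

```go
package main

import (
	"fmt"
	"math"
)

// Golden is one bundled example: an input/output pair plus the score
// the rubric is expected to assign. Hypothetical shape for this sketch.
type Golden struct {
	Name     string
	Input    string
	Output   string
	Expected float64
}

// Judge abstracts the model call so goldens can be checked offline.
type Judge interface {
	Score(rubric, input, output string) (float64, error)
}

// stubJudge returns canned scores keyed by output text.
// It stands in for a real LLM call in this sketch.
type stubJudge map[string]float64

func (s stubJudge) Score(_, _, output string) (float64, error) {
	v, ok := s[output]
	if !ok {
		return 0, fmt.Errorf("no canned score for %q", output)
	}
	return v, nil
}

// checkGoldens returns one message per golden whose score errors out
// or differs from the expected score by more than tol.
func checkGoldens(j Judge, rubric string, goldens []Golden, tol float64) []string {
	var failures []string
	for _, g := range goldens {
		got, err := j.Score(rubric, g.Input, g.Output)
		if err != nil {
			failures = append(failures, fmt.Sprintf("%s: %v", g.Name, err))
			continue
		}
		if math.Abs(got-g.Expected) > tol {
			failures = append(failures,
				fmt.Sprintf("%s: got %.2f, want %.2f", g.Name, got, g.Expected))
		}
	}
	return failures
}

func main() {
	// Invented goldens on an assumed 1-5 scale.
	goldens := []Golden{
		{Name: "cites-source", Input: "q", Output: "grounded answer", Expected: 5},
		{Name: "invents-fact", Input: "q", Output: "ungrounded answer", Expected: 1},
	}
	judge := stubJudge{"grounded answer": 5, "ungrounded answer": 1}
	failures := checkGoldens(judge, "groundedness", goldens, 0.5)
	fmt.Println(len(failures), "golden failures")
}
```

In a real test the stub would be replaced by the prompt grader's model call, and a tolerance (rather than exact equality) leaves room for judge nondeterminism.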

Non-goals (filed separately)

Related
