
feat: add waza quality command — LLM-as-Judge skill quality scoring #98

Description

@spboyer

Summary

Add a new waza quality command that uses the Copilot SDK to evaluate skill content quality across multiple dimensions using an LLM-as-Judge pattern. This complements the existing waza check (deterministic/heuristic compliance) and waza run (runtime output evaluation) by filling the gap: "are the skill's instructions well-engineered?"

Why this matters

waza check validates structural compliance (frontmatter, token budget, spec conformance) but cannot assess whether instructions are clear, examples are realistic, or safety guardrails are adequate. These content quality dimensions require semantic understanding, which is the kind of judgment an LLM judge can provide.

| Tool | What it answers |
| --- | --- |
| waza check | Is the skill structurally valid? |
| waza run | Does the skill produce correct output? |
| waza quality (new) | Are the skill's instructions well-engineered? |

Proposed Command

# Evaluate a single skill
waza quality skills/my-skill

# Evaluate all skills in workspace
waza quality

# JSON output for CI integration
waza quality --format json

# Specific model override
waza quality --model gpt-4.1

Evaluation Dimensions (6 core)

Each dimension is scored 1-5 by the LLM judge. The criteria should be defined as markdown files (easy to customize/extend) and embedded into the binary.

1. Instruction Clarity

Evaluates whether the skill's SKILL.md is clear, specific, and unambiguous enough for consistent LLM behavior.

| Score | Criteria |
| --- | --- |
| 5 | Numbered workflow phases or clearly structured steps. Explicit scope boundaries. Defined output format with example. MUST/MUST NOT rules. |
| 4 | Clear role definition. Specific do/don't rules. Structured sections. Output format mentioned but not demonstrated. |
| 3 | Goal is stated but lacks phases, scope boundaries, or output format. Results vary between runs. |
| 2 | Instructions <10 lines or vague language. Agent guesses at behavior. |
| 1 | No SKILL.md or only a title with no instructions. |

2. Behavioral Completeness

Evaluates whether the skill handles edge cases, errors, and exit conditions — not just the happy path.

| Score | Criteria |
| --- | --- |
| 5 | Documents >=3 specific failure scenarios with remediation. Cleanup/exit protocol. Retry limits. Handles empty/missing inputs. |
| 4 | Handles >=2 common error cases. Has an exit condition. |
| 3 | Happy path works. Error handling is implicit. |
| 2 | No error handling instructions. Agent gets stuck on missing files or failed commands. |
| 1 | No consideration of failure. Agent may loop indefinitely. |

3. Example Quality

Evaluates whether the skill includes concrete examples that anchor LLM behavior.

| Score | Criteria |
| --- | --- |
| 5 | >=2 realistic walkthroughs (input -> actions -> output). Edge case examples. Expected output format shown verbatim. |
| 4 | >=1 complete walkthrough. Output format demonstrated. Common scenarios covered. |
| 3 | Examples exist but are fragments, with no complete input-to-output flow. |
| 2 | Only trivial examples that don't match actual task complexity. |
| 1 | No examples at all. |

4. Safety & Guardrails

Evaluates whether the skill prevents harmful or unintended actions.

| Score | Criteria |
| --- | --- |
| 5 | Explicit MUST NOT list. Destructive actions require user confirmation. Scope limits defined. Credential/PII exposure prevented. Prompt injection defenses for skills processing external content. |
| 4 | >=2 explicit prohibitions. Destructive actions guarded. Basic data/instruction separation. |
| 3 | Some scope constraints exist but coverage is incomplete. |
| 2 | No prohibitions or scope boundaries. Safety relies on inherent limitations only. |
| 1 | Could expose credentials, delete files, or modify system state without guardrails. |

Proportionality: Scale expectations to risk. A read-only skill needs less guardrail scaffolding than one that modifies files or calls external APIs. Prompt injection criteria apply only when the skill processes untrusted external content.

5. User Experience

Evaluates whether purpose is obvious and the skill provides feedback during operation.

| Score | Criteria |
| --- | --- |
| 5 | Trigger phrases documented. Zero-configuration. Progress updates during multi-step ops. Structured, actionable output. |
| 4 | Clear entry point. <=1 setup step. Formatted, useful output. |
| 3 | Usable from context, but trigger phrases undocumented. Output functional but unformatted. |
| 2 | Must read source to understand invocation. Raw dump output. |
| 1 | Cannot determine how to invoke without contacting the author. |

6. Robustness & Edge Cases

Evaluates reliability under adverse or unexpected conditions.

| Score | Criteria |
| --- | --- |
| 5 | Handles adversarial inputs proportional to risk. Loop guards. Timeouts/limits. External deps documented with fallback. |
| 4 | Most edge cases covered. Retry limits. Handles missing deps. |
| 3 | Works for expected inputs. Some resilience but untested on extremes. |
| 2 | Fragile. Fails on unexpected inputs. No retry limits. |
| 1 | Breaks on non-trivial inputs. No robustness consideration. |

Script Quality is intentionally omitted — waza skills are prompt-only SKILL.md files. If scripts are added in the future, a 7th dimension can be introduced.


Implementation Notes

Use the Copilot SDK (not Azure OpenAI)

The existing prompt_grader.go already demonstrates the exact pattern needed:

  • Create a Copilot SDK session via copilot.NewClient + client.CreateSession
  • Define structured tool calls for the LLM to invoke (similar to set_waza_grade_pass / set_waza_grade_fail)
  • Send the evaluation prompt with the skill's full content
  • Collect structured responses

This avoids separate API billing: the command uses the developer's existing Copilot license, so no Azure OpenAI keys or endpoints are needed.

Suggested tool-call interface for the judge

Tool: set_dimension_score
Parameters:
  dimension: string   (e.g. "instruction_clarity")
  score: int           (1-5)
  rationale: string    (1-2 sentences)

Tool: set_quality_summary
Parameters:
  summary: string      (2-3 sentence overall assessment)
  improvements: array  (top 3 actionable suggestions)
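
The two tool payloads above could be decoded and validated roughly as follows. This is a sketch only: it does not call the Copilot SDK, and the type names, the `parseDimensionScore` helper, and every dimension identifier other than `instruction_clarity` are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// DimensionScore mirrors the parameters of the proposed set_dimension_score tool.
type DimensionScore struct {
	Dimension string `json:"dimension"`
	Score     int    `json:"score"`
	Rationale string `json:"rationale"`
}

// QualitySummary mirrors the parameters of the proposed set_quality_summary tool.
type QualitySummary struct {
	Summary      string   `json:"summary"`
	Improvements []string `json:"improvements"`
}

// knownDimensions lists accepted identifiers. Only "instruction_clarity"
// appears in the proposal; the other five names are assumed.
var knownDimensions = map[string]bool{
	"instruction_clarity":     true,
	"behavioral_completeness": true,
	"example_quality":         true,
	"safety_guardrails":       true,
	"user_experience":         true,
	"robustness":              true,
}

// Validate rejects unknown dimensions, out-of-range scores, and empty rationales.
func (d DimensionScore) Validate() error {
	if !knownDimensions[d.Dimension] {
		return fmt.Errorf("unknown dimension %q", d.Dimension)
	}
	if d.Score < 1 || d.Score > 5 {
		return fmt.Errorf("score %d is outside 1-5", d.Score)
	}
	if strings.TrimSpace(d.Rationale) == "" {
		return errors.New("rationale is required")
	}
	return nil
}

// Validate requires a summary and at most three improvement suggestions.
func (s QualitySummary) Validate() error {
	if strings.TrimSpace(s.Summary) == "" {
		return errors.New("summary is required")
	}
	if len(s.Improvements) > 3 {
		return fmt.Errorf("got %d improvements, want at most 3", len(s.Improvements))
	}
	return nil
}

// parseDimensionScore decodes the JSON arguments of one tool call and validates them.
func parseDimensionScore(raw []byte) (DimensionScore, error) {
	var d DimensionScore
	if err := json.Unmarshal(raw, &d); err != nil {
		return d, fmt.Errorf("decoding arguments: %w", err)
	}
	return d, d.Validate()
}

func main() {
	raw := []byte(`{"dimension":"instruction_clarity","score":4,"rationale":"Clear phases."}`)
	d, err := parseDimensionScore(raw)
	fmt.Println(d, err)
}
```

Validating at the tool-call boundary means a malformed judge response fails loudly instead of skewing the average.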

Criteria as embedded markdown

Store criteria definitions in internal/checks/quality/criteria/*.md and embed via //go:embed. This makes them:

  • Versionable with the binary
  • Customizable (teams can override with local files)
  • Readable as standalone documentation
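
One way the override lookup might work is to check a local file system first and fall back to the embedded one. In the sketch below, `fstest.MapFS` values stand in for the real `embed.FS` (populated via `//go:embed criteria/*.md`) and for an `os.DirFS` over a workspace directory, so the example runs without files on disk. The `loadCriterion` name and the `criteria/<name>.md` layout are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// builtinCriteria stands in for an embed.FS populated with
// //go:embed criteria/*.md in the real package (assumed layout).
var builtinCriteria fs.FS = fstest.MapFS{
	"criteria/instruction_clarity.md": {Data: []byte("# Instruction Clarity (built-in)\n")},
	"criteria/example_quality.md":     {Data: []byte("# Example Quality (built-in)\n")},
}

// localOverrides stands in for os.DirFS over a team's override directory.
var localOverrides fs.FS = fstest.MapFS{
	"criteria/example_quality.md": {Data: []byte("# Example Quality (team override)\n")},
}

// loadCriterion returns the override file when it exists and the built-in
// definition otherwise. A nil override skips the local lookup entirely.
func loadCriterion(builtin, override fs.FS, name string) (string, error) {
	path := "criteria/" + name + ".md"
	if override != nil {
		data, err := fs.ReadFile(override, path)
		if err == nil {
			return string(data), nil
		}
		// Only a missing file falls through; other errors are reported.
		if !errors.Is(err, fs.ErrNotExist) {
			return "", fmt.Errorf("reading override %s: %w", path, err)
		}
	}
	data, err := fs.ReadFile(builtin, path)
	if err != nil {
		return "", fmt.Errorf("reading built-in %s: %w", path, err)
	}
	return string(data), nil
}

func main() {
	for _, name := range []string{"instruction_clarity", "example_quality"} {
		text, err := loadCriterion(builtinCriteria, localOverrides, name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Print(text)
	}
}
```

Taking `fs.FS` for both sources keeps the loader testable without touching the disk, which fits the unit-test requirement for criteria loading.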

System prompt considerations

Include these evaluator bias controls in the system prompt:

  • Anti-verbosity bias: Score based on precision and structure, not length
  • Anti-position bias: Evaluate each dimension independently
  • Proportionality: Scale expectations to the skill's scope and complexity
  • Evidence-based: Every score must cite specific content from the skill

Output format

Follow the existing readiness report pattern:

🤖 Quality Assessment (LLM Judge)
  ✅ Instruction Clarity        4/5  Clear phases with explicit MUST/MUST NOT rules
  ✅ Behavioral Completeness    4/5  Handles 3 failure scenarios with remediation
  ⚪ Example Quality            3/5  Examples exist but no end-to-end walkthrough
  ✅ Safety & Guardrails        4/5  Explicit prohibitions and scope limits
  🌟 User Experience            5/5  Zero-config with progress feedback
  ✅ Robustness                 4/5  Loop guards and dependency fallbacks

  📊 Average: 4.0/5.0 — PASS (threshold: 3.0)

  🔧 Top improvements:
     1. Add end-to-end walkthrough showing input -> tool calls -> output
     2. Document timeout behavior for external API calls
     3. Add prompt injection defense for user-supplied file content
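
A per-dimension line in the sample above could be rendered with a small formatter like this sketch. The icons for scores 3 to 5 are taken from the sample; the icon for scores 1 and 2 and the 27-column name width are assumptions, and `renderLine` is a hypothetical helper name.

```go
package main

import "fmt"

// iconFor maps a 1-5 score to a status icon. The 3, 4, and 5 icons match the
// sample report; the icon for 1 and 2 is assumed, since the sample shows none.
func iconFor(score int) string {
	switch {
	case score >= 5:
		return "🌟"
	case score == 4:
		return "✅"
	case score == 3:
		return "⚪"
	default:
		return "❌"
	}
}

// renderLine formats one dimension row: icon, left-aligned name padded to
// 27 columns, the score out of 5, then the judge's rationale.
func renderLine(name string, score int, rationale string) string {
	return fmt.Sprintf("  %s %-27s%d/5  %s", iconFor(score), name, score, rationale)
}

func main() {
	fmt.Println("🤖 Quality Assessment (LLM Judge)")
	fmt.Println(renderLine("Instruction Clarity", 4, "Clear phases with explicit MUST/MUST NOT rules"))
	fmt.Println(renderLine("Example Quality", 3, "Examples exist but no end-to-end walkthrough"))
	fmt.Println(renderLine("User Experience", 5, "Zero-config with progress feedback"))
}
```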

Score caching

Use a content hash (e.g., SHA of SKILL.md + references/) to skip re-evaluation when content hasn't changed. Store cached scores in .waza/quality-scores.json in the workspace.
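
The cache key could be computed along these lines. The sketch assumes SHA-256 and a cache shaped as a map from skill path to hash; neither is specified above, and the function names are hypothetical. Paths are sorted and length-prefixed so the digest is deterministic and file boundaries cannot be confused.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io/fs"
	"sort"
	"strings"
)

// hashFiles returns a deterministic SHA-256 digest over path/content pairs.
func hashFiles(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // map iteration order is random, so sort first

	h := sha256.New()
	for _, p := range paths {
		// Length prefixes keep "a"+"bc" distinct from "ab"+"c".
		fmt.Fprintf(h, "%d:%s\n%d:", len(p), p, len(files[p]))
		h.Write(files[p])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// collectSkillFiles gathers SKILL.md and everything under references/
// from a skill directory exposed as an fs.FS (for example via os.DirFS).
func collectSkillFiles(skill fs.FS) (map[string][]byte, error) {
	files := map[string][]byte{}
	err := fs.WalkDir(skill, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		if p != "SKILL.md" && !strings.HasPrefix(p, "references/") {
			return nil
		}
		data, err := fs.ReadFile(skill, p)
		if err != nil {
			return err
		}
		files[p] = data
		return nil
	})
	return files, err
}

// needsEvaluation reports whether the cached hash for a skill is missing or stale.
func needsEvaluation(cache map[string]string, skill, hash string) bool {
	return cache[skill] != hash
}

func main() {
	files := map[string][]byte{
		"SKILL.md":            []byte("# My skill\n"),
		"references/guide.md": []byte("details\n"),
	}
	hash := hashFiles(files)
	cache := map[string]string{} // would be loaded from .waza/quality-scores.json
	fmt.Println(hash, needsEvaluation(cache, "skills/my-skill", hash))
}
```

A cache keyed only on skill content would not notice a change to the criteria files or the judge model, so a real implementation may want to fold those into the key as well.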

Passing threshold

Default: average >= 3.0/5.0. Make configurable via flag or config file.
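
The pass/fail rule amounts to a simple mean compared against the threshold. A minimal sketch, with hypothetical function names, using the six scores from the sample report:

```go
package main

import (
	"errors"
	"fmt"
)

// defaultThreshold is the proposed default passing average.
const defaultThreshold = 3.0

// average returns the mean of the dimension scores, rejecting empty input
// and any score outside the 1-5 range.
func average(scores []int) (float64, error) {
	if len(scores) == 0 {
		return 0, errors.New("no scores to aggregate")
	}
	sum := 0
	for _, s := range scores {
		if s < 1 || s > 5 {
			return 0, fmt.Errorf("score %d is outside 1-5", s)
		}
		sum += s
	}
	return float64(sum) / float64(len(scores)), nil
}

// passes applies the rule "average >= threshold".
func passes(avg, threshold float64) bool {
	return avg >= threshold
}

func main() {
	avg, err := average([]int{4, 4, 3, 4, 5, 4})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Average: %.1f/5.0 pass=%v (threshold: %.1f)\n", avg, passes(avg, defaultThreshold), defaultThreshold)
}
```

An average can hide a single very low dimension (for example a 1 in Safety & Guardrails offset by 5s elsewhere), so a per-dimension floor may be worth considering alongside the average.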


Integration with existing commands

  • waza check --deep could be an alias that runs check + quality together
  • waza quality as a standalone command for focused quality evaluation
  • JSON output (--format json) for CI pipelines and downstream tooling
  • Results could feed into waza compare for tracking quality across skill versions
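
For the CI use case, the JSON report might look something like the sketch below. The schema is not defined in this proposal: every field name here, the `encodeReport` and `exitCode` helpers, and the convention of exiting non-zero on a failing score are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DimensionResult is one scored dimension in the report (assumed shape).
type DimensionResult struct {
	Dimension string `json:"dimension"`
	Score     int    `json:"score"`
	Rationale string `json:"rationale"`
}

// Report is a hypothetical payload for --format json.
type Report struct {
	Skill      string            `json:"skill"`
	Dimensions []DimensionResult `json:"dimensions"`
	Average    float64           `json:"average"`
	Threshold  float64           `json:"threshold"`
	Passed     bool              `json:"passed"`
}

// encodeReport serializes the report as compact JSON.
func encodeReport(r Report) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// exitCode maps the result to a process exit status so a CI step can gate on it.
func exitCode(r Report) int {
	if r.Passed {
		return 0
	}
	return 1
}

func main() {
	r := Report{
		Skill: "skills/my-skill",
		Dimensions: []DimensionResult{
			{Dimension: "instruction_clarity", Score: 4, Rationale: "Clear phases."},
		},
		Average:   4.0,
		Threshold: 3.0,
		Passed:    true,
	}
	out, err := encodeReport(r)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
	fmt.Println("exit code:", exitCode(r))
}
```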

Acceptance Criteria

  • waza quality skills/my-skill evaluates a skill on 6 dimensions via Copilot SDK
  • Scores are 1-5 per dimension with rationale text
  • Terminal output follows readiness report style (emoji + scores + rationale)
  • JSON output available via --format json
  • Criteria are embedded markdown files, overridable with local files
  • Content hash caching skips unchanged skills
  • Passing threshold configurable (default 3.0)
  • Unit tests for criteria loading, score aggregation, and output formatting
