feat: add waza quality command — LLM-as-Judge skill quality scoring#218

Merged: spboyer merged 3 commits into main from squad/98-waza-quality on Apr 22, 2026

@spboyer (Member) commented Apr 22, 2026

Summary

Closes #98

Adds waza quality <skill-path> — an LLM-as-Judge command that evaluates skill content quality across five dimensions, each scored 1–5:

| Dimension | What it measures |
| --- | --- |
| clarity | Instruction clarity, structure, step ordering |
| completeness | Edge case coverage, detail level |
| trigger_precision | USE FOR / DO NOT USE FOR quality |
| scope_coverage | Boundary definition, capability explicitness |
| anti_patterns | Avoidance of vague/conflicting instructions |
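
For reviewers who want a feel for the rubric validation, here is a minimal sketch of how the five dimensions and the 1–5 range could be checked. The names `Dimension`, `DefaultDimensions`, and `ValidateScores` are hypothetical and are not taken from internal/quality/rubric.go; only the dimension names and the score range come from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// Dimension is a hypothetical rubric entry; the real type may differ.
type Dimension struct {
	Name        string
	Description string
}

// Score bounds as stated in the PR description (each dimension scored 1-5).
const (
	MinScore = 1
	MaxScore = 5
)

// DefaultDimensions mirrors the five dimensions listed in the PR.
var DefaultDimensions = []Dimension{
	{Name: "clarity", Description: "Instruction clarity, structure, step ordering"},
	{Name: "completeness", Description: "Edge case coverage, detail level"},
	{Name: "trigger_precision", Description: "USE FOR / DO NOT USE FOR quality"},
	{Name: "scope_coverage", Description: "Boundary definition, capability explicitness"},
	{Name: "anti_patterns", Description: "Avoidance of vague/conflicting instructions"},
}

// ValidateScores reports missing dimensions, out-of-range scores, and
// unknown dimension names. An empty result means the scores are valid.
func ValidateScores(scores map[string]int) []string {
	var issues []string
	known := make(map[string]bool, len(DefaultDimensions))
	for _, d := range DefaultDimensions {
		known[d.Name] = true
		s, ok := scores[d.Name]
		switch {
		case !ok:
			issues = append(issues, fmt.Sprintf("missing dimension %q", d.Name))
		case s < MinScore || s > MaxScore:
			issues = append(issues, fmt.Sprintf("dimension %q: score %d outside %d-%d", d.Name, s, MinScore, MaxScore))
		}
	}
	// Collect unknown names separately and sort them so output is deterministic.
	var unknown []string
	for name := range scores {
		if !known[name] {
			unknown = append(unknown, name)
		}
	}
	sort.Strings(unknown)
	for _, name := range unknown {
		issues = append(issues, fmt.Sprintf("unknown dimension %q", name))
	}
	return issues
}

func main() {
	scores := map[string]int{"clarity": 4, "completeness": 6, "trigger_precision": 3, "scope_coverage": 5}
	for _, issue := range ValidateScores(scores) {
		fmt.Println(issue)
	}
}
```

Returning a list of issues instead of failing on the first one fits the "display with a warning" behavior described under Design decisions, though the actual implementation may structure this differently.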

New files

  • internal/quality/rubric.go — Dimension definitions, validation
  • internal/quality/judge.go — Prompt construction, copilot SDK execution, JSON response parsing
  • internal/quality/report.go — Table (with visual score bars) and JSON formatters
  • cmd/waza/cmd_quality.go — CLI command registration
  • cmd/waza/cmd_quality_test.go — 8 command-level tests
  • internal/quality/*_test.go — 20 unit tests
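
As an illustration of the "visual score bars" mentioned for report.go, a bar for a 1–5 score could be rendered roughly like this. The function name `scoreBar` and the block glyphs are assumptions for the sketch; the actual formatter may use different characters and widths.

```go
package main

import (
	"fmt"
	"strings"
)

// maxScore is the top of the 1-5 scale used by the rubric.
const maxScore = 5

// scoreBar renders a fixed-width bar: filled cells for the score,
// light cells for the remainder. Out-of-range input is clamped.
// The glyphs are illustrative, not necessarily what report.go prints.
func scoreBar(score int) string {
	if score < 0 {
		score = 0
	}
	if score > maxScore {
		score = maxScore
	}
	return strings.Repeat("█", score) + strings.Repeat("░", maxScore-score)
}

func main() {
	rows := []struct {
		name  string
		score int
	}{
		{"clarity", 4},
		{"completeness", 3},
		{"trigger_precision", 5},
	}
	for _, r := range rows {
		// Left-align the dimension name so the bars line up in a column.
		fmt.Printf("%-18s %s %d/%d\n", r.name, scoreBar(r.score), r.score, maxScore)
	}
}
```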

CLI usage

waza quality skills/my-skill                    # table output
waza quality skills/my-skill --format json      # JSON for CI
waza quality skills/my-skill --model gpt-4o     # specific judge model

Flags

| Flag | Default | Description |
| --- | --- | --- |
| --model | project default | Model to use as judge |
| --format | table | Output: table or json |
| --rubric | | Custom rubric file (reserved, errors for now) |
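
The flag rules above amount to a small validation step. A sketch of that step, independent of whichever CLI framework cmd_quality.go uses, might look like the following. `Options` and `validateOptions` are hypothetical names, and the error wording is invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Options holds already-parsed flag values. The struct is hypothetical;
// the real command may keep these values elsewhere.
type Options struct {
	Model  string // empty means "use the project default"
	Format string // "table" or "json"
	Rubric string // reserved: any non-empty value is rejected for now
}

// validateOptions enforces the rules from the flags table:
// --format must be table or json, and --rubric is not supported yet.
func validateOptions(o Options) error {
	switch o.Format {
	case "table", "json":
		// supported output formats
	default:
		return fmt.Errorf("unsupported --format %q: want table or json", o.Format)
	}
	if o.Rubric != "" {
		return errors.New("--rubric is reserved and not supported yet")
	}
	return nil
}

func main() {
	for _, o := range []Options{
		{Format: "table"},
		{Format: "yaml"},
		{Format: "json", Rubric: "custom.yaml"},
	} {
		if err := validateOptions(o); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ok:", o.Format)
	}
}
```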

Design decisions

  • Judge prompt requests structured JSON: {dimensions: [{name, score, feedback}], overall_score, summary}
  • Copilot SDK mocked in all tests — no real LLM calls
  • Auth failures produce a clear `copilot login` message (same pattern as `waza models`)
  • Partial responses with validation issues still display with a warning
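
To make the response handling concrete, here is a hedged sketch of decoding the structured JSON and collecting warnings instead of failing on a partial response. The JSON field names come from the bullet above; the Go type names, the choice of `int` for `score` and `float64` for `overall_score`, and the warning text are assumptions, not the contents of judge.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expectedDimensions is the number of rubric dimensions in this PR.
const expectedDimensions = 5

// DimensionScore and JudgeResult follow the JSON shape from the PR
// description. The Go field types are assumptions.
type DimensionScore struct {
	Name     string `json:"name"`
	Score    int    `json:"score"`
	Feedback string `json:"feedback"`
}

type JudgeResult struct {
	Dimensions   []DimensionScore `json:"dimensions"`
	OverallScore float64          `json:"overall_score"`
	Summary      string           `json:"summary"`
}

// ParseJudgeResponse decodes the judge's JSON. Malformed JSON is an error;
// a decodable but incomplete or out-of-range response yields warnings so
// the caller can still display what was returned.
func ParseJudgeResponse(raw string) (JudgeResult, []string, error) {
	var res JudgeResult
	if err := json.Unmarshal([]byte(raw), &res); err != nil {
		return JudgeResult{}, nil, fmt.Errorf("parse judge response: %w", err)
	}
	var warnings []string
	if n := len(res.Dimensions); n != expectedDimensions {
		warnings = append(warnings, fmt.Sprintf("expected %d dimensions, got %d", expectedDimensions, n))
	}
	for _, d := range res.Dimensions {
		if d.Score < 1 || d.Score > 5 {
			warnings = append(warnings, fmt.Sprintf("dimension %q: score %d outside 1-5", d.Name, d.Score))
		}
	}
	return res, warnings, nil
}

func main() {
	raw := `{"dimensions":[{"name":"clarity","score":4,"feedback":"Steps are well ordered."}],"overall_score":4,"summary":"Partial response."}`
	res, warnings, err := ParseJudgeResponse(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
	fmt.Printf("overall %.1f: %s\n", res.OverallScore, res.Summary)
}
```

Separating hard errors (undecodable JSON) from soft warnings (missing dimensions, out-of-range scores) is one way to get the "still display with a warning" behavior; a non-integer `score` would surface as a decode error in this sketch.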

Tests

  • go test ./... — all pass
  • go vet ./... — clean
  • Site builds: cd site && npm run build — ✅

Docs updated

  • README.md — new command section
  • site/src/content/docs/reference/cli.mdx — full CLI reference entry

…98

Add `waza quality <skill-path>` command that uses an LLM to evaluate
skill content quality across five dimensions:

- clarity: instruction clarity and structure
- completeness: coverage of edge cases and detail level
- trigger_precision: USE FOR / DO NOT USE FOR definition quality
- scope_coverage: boundary clarity and capability explicitness
- anti_patterns: avoidance of vague/conflicting instructions

Implementation:
- internal/quality/rubric.go: dimension definitions and validation
- internal/quality/judge.go: prompt construction, LLM execution, response parsing
- internal/quality/report.go: table and JSON output formatting
- cmd/waza/cmd_quality.go: CLI command with --model, --format, --rubric flags
- 28 tests covering rubric validation, judge execution, response parsing,
  report formatting, auth errors, and edge cases
- Documentation in README.md and site CLI reference

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot enabled auto-merge (squash) April 22, 2026 18:14
Copilot AI and others added 2 commits April 22, 2026 14:16
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Apply gofmt formatting to scope_reduction.go (struct alignment, blank line)
- Apply gofmt formatting to scope_reduction_test.go (struct alignment)
- Fix errcheck: use comma-ok form for all type assertions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
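
For context on the errcheck fix in the commit above: the comma-ok form of a type assertion returns a boolean instead of panicking when the dynamic type does not match. A minimal illustration follows; `asString` is a hypothetical helper, not code from this PR.

```go
package main

import "fmt"

// asString uses the comma-ok form so a mismatched type becomes an
// error value rather than a runtime panic.
func asString(v any) (string, error) {
	s, ok := v.(string)
	if !ok {
		return "", fmt.Errorf("expected string, got %T", v)
	}
	return s, nil
}

func main() {
	for _, v := range []any{"clarity", 42} {
		s, err := asString(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("value:", s)
	}
}
```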
@spboyer spboyer merged commit c473b41 into main Apr 22, 2026
6 checks passed
@spboyer spboyer deleted the squad/98-waza-quality branch April 22, 2026 19:50