Summary
Add a new waza quality command that uses the Copilot SDK to evaluate skill content quality across six dimensions with an LLM-as-Judge pattern. This complements the existing waza check (deterministic/heuristic compliance) and waza run (runtime output evaluation) by answering a question neither of them covers: "are the skill's instructions well-engineered?"
Why this matters
waza check validates structural compliance (frontmatter, token budget, spec conformance) but cannot assess whether instructions are clear, examples are realistic, or safety guardrails are adequate. These are content quality dimensions that require semantic understanding — exactly what an LLM judge excels at.
| Tool | What it answers |
|---|---|
| waza check | Is the skill structurally valid? |
| waza run | Does the skill produce correct output? |
| waza quality (new) | Are the skill's instructions well-engineered? |
Proposed Command
# Evaluate a single skill
waza quality skills/my-skill
# Evaluate all skills in workspace
waza quality
# JSON output for CI integration
waza quality --format json
# Specific model override
waza quality --model gpt-4.1
Evaluation Dimensions (6 core)
Each dimension is scored 1-5 by the LLM judge. The criteria should be defined as markdown files (easy to customize/extend) and embedded into the binary.
1. Instruction Clarity
Evaluates whether the skill's SKILL.md is clear, specific, and unambiguous enough for consistent LLM behavior.
| Score | Criteria |
|---|---|
| 5 | Numbered workflow phases or clearly structured steps. Explicit scope boundaries. Defined output format with example. MUST/MUST NOT rules. |
| 4 | Clear role definition. Specific do/don't rules. Structured sections. Output format mentioned but not demonstrated. |
| 3 | Goal is stated but lacks phases, scope boundaries, or output format. Results vary between runs. |
| 2 | Instructions <10 lines or vague language. Agent guesses at behavior. |
| 1 | No SKILL.md or only a title with no instructions. |
2. Behavioral Completeness
Evaluates whether the skill handles edge cases, errors, and exit conditions — not just the happy path.
| Score | Criteria |
|---|---|
| 5 | Documents >=3 specific failure scenarios with remediation. Cleanup/exit protocol. Retry limits. Handles empty/missing inputs. |
| 4 | Handles >=2 common error cases. Has an exit condition. |
| 3 | Happy path works. Error handling is implicit. |
| 2 | No error handling instructions. Agent gets stuck on missing files or failed commands. |
| 1 | No consideration of failure. Agent may loop indefinitely. |
3. Example Quality
Evaluates whether the skill includes concrete examples that anchor LLM behavior.
| Score | Criteria |
|---|---|
| 5 | >=2 realistic walkthroughs (input -> actions -> output). Edge case examples. Expected output format shown verbatim. |
| 4 | >=1 complete walkthrough. Output format demonstrated. Common scenarios covered. |
| 3 | Examples exist but are fragments, with no complete input-to-output flow. |
| 2 | Only trivial examples that don't match actual task complexity. |
| 1 | No examples at all. |
4. Safety & Guardrails
Evaluates whether the skill prevents harmful or unintended actions.
| Score | Criteria |
|---|---|
| 5 | Explicit MUST NOT list. Destructive actions require user confirmation. Scope limits defined. Credential/PII exposure prevented. Prompt injection defenses for skills processing external content. |
| 4 | >=2 explicit prohibitions. Destructive actions guarded. Basic data/instruction separation. |
| 3 | Some scope constraints exist but coverage is incomplete. |
| 2 | No prohibitions or scope boundaries. Safety relies on inherent limitations only. |
| 1 | Could expose credentials, delete files, or modify system state without guardrails. |
Proportionality: Scale expectations to risk. A read-only skill needs less guardrail scaffolding than one that modifies files or calls external APIs. Prompt injection criteria apply only when the skill processes untrusted external content.
5. User Experience
Evaluates whether purpose is obvious and the skill provides feedback during operation.
| Score | Criteria |
|---|---|
| 5 | Trigger phrases documented. Zero-configuration. Progress updates during multi-step ops. Structured, actionable output. |
| 4 | Clear entry point. <=1 setup step. Formatted, useful output. |
| 3 | Usable from context, but trigger phrases undocumented. Output functional but unformatted. |
| 2 | Must read source to understand invocation. Raw dump output. |
| 1 | Cannot determine how to invoke without contacting the author. |
6. Robustness & Edge Cases
Evaluates reliability under adverse or unexpected conditions.
| Score | Criteria |
|---|---|
| 5 | Handles adversarial inputs proportional to risk. Loop guards. Timeouts/limits. External deps documented with fallback. |
| 4 | Most edge cases covered. Retry limits. Handles missing deps. |
| 3 | Works for expected inputs. Some resilience but untested on extremes. |
| 2 | Fragile. Fails on unexpected inputs. No retry limits. |
| 1 | Breaks on non-trivial inputs. No robustness consideration. |
Script Quality is intentionally omitted — waza skills are prompt-only SKILL.md files. If scripts are added in the future, a 7th dimension can be introduced.
Implementation Notes
Use the Copilot SDK (not Azure OpenAI)
The existing prompt_grader.go already demonstrates the exact pattern needed:
- Create a Copilot SDK session via copilot.NewClient + client.CreateSession
- Define structured tool calls for the LLM to invoke (similar to set_waza_grade_pass / set_waza_grade_fail)
- Send the evaluation prompt with the skill's full content
- Collect structured responses
This means zero incremental API cost — it uses the developer's existing Copilot license. No Azure OpenAI keys or endpoints needed.
Suggested tool-call interface for the judge
Tool: set_dimension_score
Parameters:
  dimension: string (e.g. "instruction_clarity")
  score: int (1-5)
  rationale: string (1-2 sentences)

Tool: set_quality_summary
Parameters:
  summary: string (2-3 sentence overall assessment)
  improvements: array (top 3 actionable suggestions)
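Whatever the judge returns through set_dimension_score should probably be validated before it is trusted, since an LLM can emit an unknown dimension name or a score outside 1-5. The sketch below shows one way to decode and check the tool-call arguments. It assumes the arguments arrive as raw JSON. Only "instruction_clarity" is named in this proposal, so the other five identifiers, the DimensionScore type, and ParseDimensionScore are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// DimensionScore mirrors the parameters of the proposed set_dimension_score tool.
type DimensionScore struct {
	Dimension string `json:"dimension"`
	Score     int    `json:"score"`
	Rationale string `json:"rationale"`
}

// knownDimensions lists the six dimension identifiers. Only
// "instruction_clarity" appears in the proposal; the rest are assumed names.
var knownDimensions = map[string]bool{
	"instruction_clarity":     true,
	"behavioral_completeness": true,
	"example_quality":         true,
	"safety_guardrails":       true,
	"user_experience":         true,
	"robustness":              true,
}

// ParseDimensionScore decodes the raw JSON arguments of one tool call and
// rejects unknown dimensions, out-of-range scores, and empty rationales.
func ParseDimensionScore(raw []byte) (DimensionScore, error) {
	var ds DimensionScore
	if err := json.Unmarshal(raw, &ds); err != nil {
		return DimensionScore{}, fmt.Errorf("decode arguments: %w", err)
	}
	if !knownDimensions[ds.Dimension] {
		return DimensionScore{}, fmt.Errorf("unknown dimension %q", ds.Dimension)
	}
	if ds.Score < 1 || ds.Score > 5 {
		return DimensionScore{}, fmt.Errorf("score %d is outside 1-5", ds.Score)
	}
	if ds.Rationale == "" {
		return DimensionScore{}, errors.New("rationale is required")
	}
	return ds, nil
}

func main() {
	args := []byte(`{"dimension":"instruction_clarity","score":4,"rationale":"Clear phases."}`)
	ds, err := ParseDimensionScore(args)
	fmt.Println(ds, err)

	// An out-of-range score is rejected rather than silently clamped.
	_, err = ParseDimensionScore([]byte(`{"dimension":"robustness","score":7,"rationale":"x"}`))
	fmt.Println(err)
}
```

Rejecting bad calls instead of clamping them keeps the evidence-based requirement meaningful: a score only counts when it arrives with a valid dimension and a rationale.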
Criteria as embedded markdown
Store criteria definitions in internal/checks/quality/criteria/*.md and embed via //go:embed. This makes them:
- Versionable with the binary
- Customizable (teams can override with local files)
- Readable as standalone documentation
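The "embedded default, local override" behavior described in the list above can be sketched against the fs.FS interface, which embed.FS satisfies. In this example an in-memory fstest.MapFS stands in for both the embedded files and a team's local override directory so the snippet runs without any files on disk. LoadCriterion, the variable names, and the file names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// In the real binary the defaults would come from an embed directive such as:
//
//	//go:embed criteria/*.md
//	var embeddedCriteria embed.FS
//
// embed.FS implements fs.FS, so the loader below accepts either source.

// embedded and override are in-memory stand-ins for illustration only.
var embedded = fstest.MapFS{
	"criteria/instruction_clarity.md": {Data: []byte("default clarity rubric")},
	"criteria/example_quality.md":     {Data: []byte("default example rubric")},
}

var override = fstest.MapFS{
	"criteria/example_quality.md": {Data: []byte("team example rubric")},
}

// LoadCriterion returns the override file when one exists and falls back to
// the embedded default otherwise. Any override error other than "not found"
// is reported instead of being masked by the fallback.
func LoadCriterion(defaults, local fs.FS, name string) ([]byte, error) {
	if local != nil {
		data, err := fs.ReadFile(local, name)
		if err == nil {
			return data, nil
		}
		if !errors.Is(err, fs.ErrNotExist) {
			return nil, fmt.Errorf("read override %s: %w", name, err)
		}
	}
	return fs.ReadFile(defaults, name)
}

func main() {
	for _, name := range []string{
		"criteria/instruction_clarity.md",
		"criteria/example_quality.md",
	} {
		data, err := LoadCriterion(embedded, override, name)
		fmt.Printf("%s -> %q (err=%v)\n", name, data, err)
	}
}
```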
System prompt considerations
Include these evaluator bias controls in the system prompt:
- Anti-verbosity bias: Score based on precision and structure, not length
- Anti-position bias: Evaluate each dimension independently
- Proportionality: Scale expectations to the skill's scope and complexity
- Evidence-based: Every score must cite specific content from the skill
Output format
Follow the existing readiness report pattern:
🤖 Quality Assessment (LLM Judge)
✅ Instruction Clarity 4/5 Clear phases with explicit MUST/MUST NOT rules
✅ Behavioral Completeness 4/5 Handles 3 failure scenarios with remediation
⚪ Example Quality 3/5 Examples exist but no end-to-end walkthrough
✅ Safety & Guardrails 4/5 Explicit prohibitions and scope limits
🌟 User Experience 5/5 Zero-config with progress feedback
✅ Robustness 4/5 Loop guards and dependency fallbacks
📊 Average: 4.0/5.0 — PASS (threshold: 3.0)
🔧 Top improvements:
1. Add end-to-end walkthrough showing input -> tool calls -> output
2. Document timeout behavior for external API calls
3. Add prompt injection defense for user-supplied file content
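The per-dimension rows of a report like the sample above could be produced by a small formatter. This is a minimal sketch: the Row type and RenderRows are hypothetical, the markers for scores 3 to 5 are taken from the sample, and the marker for scores below 3 is an assumption because the sample shows no such row.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Row is one scored dimension in the report.
type Row struct {
	Name      string
	Score     int
	Rationale string
}

// icon maps a score to a status marker. Markers for 3-5 follow the sample
// report; the marker for scores below 3 is an assumption.
func icon(score int) string {
	switch {
	case score >= 5:
		return "🌟"
	case score == 4:
		return "✅"
	case score == 3:
		return "⚪"
	default:
		return "❌"
	}
}

// RenderRows formats one line per dimension, left-aligning names to the
// width of the longest one so the score column lines up.
func RenderRows(rows []Row) string {
	width := 0
	for _, r := range rows {
		if n := utf8.RuneCountInString(r.Name); n > width {
			width = n
		}
	}
	var b strings.Builder
	for _, r := range rows {
		fmt.Fprintf(&b, "%s %-*s %d/5 %s\n", icon(r.Score), width, r.Name, r.Score, r.Rationale)
	}
	return b.String()
}

func main() {
	fmt.Print(RenderRows([]Row{
		{"Instruction Clarity", 4, "Clear phases with explicit MUST/MUST NOT rules"},
		{"Example Quality", 3, "Examples exist but no end-to-end walkthrough"},
		{"User Experience", 5, "Zero-config with progress feedback"},
	}))
}
```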
Score caching
Use a content hash (e.g., SHA of SKILL.md + references/) to skip re-evaluation when content hasn't changed. Store cached scores in .waza/quality-scores.json in the workspace.
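One way to derive the content hash is sketched below. It assumes SHA-256 (the proposal only says "SHA") and hashes each file's path and length along with its bytes so that renames and moved content also invalidate the cache. ContentHash, CacheEntry, and Lookup are hypothetical names, and the cache record shape is an illustration rather than a defined format for .waza/quality-scores.json.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io/fs"
	"strings"
	"testing/fstest"
)

// demoSkill stands in for a skill directory on disk; os.DirFS(path) would
// supply the same fs.FS interface for a real one.
var demoSkill = fstest.MapFS{
	"SKILL.md":          {Data: []byte("# My skill\nDo the thing.")},
	"references/api.md": {Data: []byte("API notes")},
	"notes.txt":         {Data: []byte("not part of the hash")},
}

// ContentHash digests SKILL.md and every file under references/.
// fs.WalkDir visits entries in lexical order, so the result is stable.
func ContentHash(skill fs.FS) (string, error) {
	h := sha256.New()
	err := fs.WalkDir(skill, ".", func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		if path != "SKILL.md" && !strings.HasPrefix(path, "references/") {
			return nil // other files do not affect the judge's input
		}
		data, err := fs.ReadFile(skill, path)
		if err != nil {
			return err
		}
		// Frame each file with its path and length to avoid ambiguity
		// between adjacent files.
		fmt.Fprintf(h, "%s\x00%d\x00", path, len(data))
		h.Write(data)
		return nil
	})
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// CacheEntry is a hypothetical record shape for the score cache.
type CacheEntry struct {
	Hash   string         `json:"hash"`
	Scores map[string]int `json:"scores"`
}

// Lookup returns cached scores only when the stored hash matches.
func Lookup(cache map[string]CacheEntry, skillPath, hash string) (map[string]int, bool) {
	entry, ok := cache[skillPath]
	if !ok || entry.Hash != hash {
		return nil, false
	}
	return entry.Scores, true
}

func main() {
	hash, err := ContentHash(demoSkill)
	if err != nil {
		panic(err)
	}
	cache := map[string]CacheEntry{
		"skills/my-skill": {Hash: hash, Scores: map[string]int{"instruction_clarity": 4}},
	}
	_, hit := Lookup(cache, "skills/my-skill", hash)
	fmt.Println("unchanged content, cache hit:", hit)

	demoSkill["SKILL.md"].Data = []byte("# My skill\nDo the other thing.")
	changed, _ := ContentHash(demoSkill)
	_, hit = Lookup(cache, "skills/my-skill", changed)
	fmt.Println("edited SKILL.md, cache hit:", hit)
}
```

If the judge model or the criteria files change, cached scores are stale even though the skill is not, so a fuller key would likely fold those in as well.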
Passing threshold
Default: average >= 3.0/5.0. Make configurable via flag or config file.
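The pass/fail rule reduces to a mean over the six scores compared against the threshold. The sketch below uses the scores from the sample report; Average, Passes, and DefaultThreshold are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// DefaultThreshold is the proposed default passing average.
const DefaultThreshold = 3.0

// Average returns the arithmetic mean of the dimension scores.
func Average(scores []int) (float64, error) {
	if len(scores) == 0 {
		return 0, errors.New("no scores to average")
	}
	sum := 0
	for _, s := range scores {
		sum += s
	}
	return float64(sum) / float64(len(scores)), nil
}

// Passes reports whether the average meets the threshold (inclusive).
func Passes(scores []int, threshold float64) (bool, error) {
	avg, err := Average(scores)
	if err != nil {
		return false, err
	}
	return avg >= threshold, nil
}

func main() {
	scores := []int{4, 4, 3, 4, 5, 4} // scores from the sample report
	avg, _ := Average(scores)
	ok, _ := Passes(scores, DefaultThreshold)
	fmt.Printf("Average: %.1f/5.0, pass=%t\n", avg, ok) // Average: 4.0/5.0, pass=true
}
```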
Integration with existing commands
- waza check --deep could be an alias that runs check + quality together
- waza quality as a standalone command for focused quality evaluation
- JSON output (--format json) for CI pipelines and downstream tooling
- Results could feed into waza compare for tracking quality across skill versions
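For CI use, the JSON output needs a stable shape and the command needs a meaningful exit status. The proposal does not define a schema, so every field name below is an assumption, as are the Report type, EncodeReport, and the convention that a failing report maps to exit code 1.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DimensionResult is one judged dimension in the JSON report.
type DimensionResult struct {
	Dimension string `json:"dimension"`
	Score     int    `json:"score"`
	Rationale string `json:"rationale"`
}

// Report is a hypothetical shape for `waza quality --format json`.
type Report struct {
	Skill        string            `json:"skill"`
	Dimensions   []DimensionResult `json:"dimensions"`
	Average      float64           `json:"average"`
	Threshold    float64           `json:"threshold"`
	Passed       bool              `json:"passed"`
	Improvements []string          `json:"improvements"`
}

// EncodeReport serializes the report as compact JSON.
func EncodeReport(r Report) (string, error) {
	out, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// ExitCode maps the verdict to a process exit status so a pipeline step
// can fail without parsing the JSON. The 0/1 convention is an assumption.
func ExitCode(r Report) int {
	if r.Passed {
		return 0
	}
	return 1
}

func main() {
	r := Report{
		Skill: "skills/my-skill",
		Dimensions: []DimensionResult{
			{Dimension: "example_quality", Score: 3, Rationale: "Examples exist but no end-to-end walkthrough"},
		},
		Average:      3.0,
		Threshold:    3.0,
		Passed:       true,
		Improvements: []string{"Add end-to-end walkthrough"},
	}
	out, err := EncodeReport(r)
	fmt.Println(out, err)
	fmt.Println("exit code:", ExitCode(r))
}
```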
Acceptance Criteria
- waza quality skills/my-skill evaluates a skill on 6 dimensions via Copilot SDK
- --format json