feat: add waza quality command — LLM-as-Judge skill quality scoring (#98)

## Summary

Add a new `waza quality` command that uses the Copilot SDK to evaluate skill **content quality** across six dimensions using an LLM-as-Judge pattern. It complements the existing `waza check` (deterministic/heuristic compliance) and `waza run` (runtime output evaluation) by answering a question neither of them covers: **"are the skill's instructions well-engineered?"**

### Why this matters

`waza check` validates structural compliance (frontmatter, token budget, spec conformance) but cannot assess whether instructions are *clear*, examples are *realistic*, or safety guardrails are *adequate*. These are content quality dimensions that require semantic understanding — exactly what an LLM judge excels at.

| Tool | What it answers |
|------|----------------|
| `waza check` | Is the skill structurally valid? |
| `waza run` | Does the skill produce correct output? |
| `waza quality` **(new)** | Are the skill's instructions well-engineered? |

---

## Proposed Command

```bash
# Evaluate a single skill
waza quality skills/my-skill

# Evaluate all skills in workspace
waza quality

# JSON output for CI integration
waza quality --format json

# Specific model override
waza quality --model gpt-4.1
```

---

## Evaluation Dimensions (6 core)

Each dimension is scored **1-5** by the LLM judge. The criteria should be defined as markdown files (easy to customize/extend) and embedded into the binary.

### 1. Instruction Clarity

Evaluates whether the skill's SKILL.md is clear, specific, and unambiguous enough for consistent LLM behavior.

| Score | Criteria |
|-------|----------|
| **5** | Numbered workflow phases or clearly structured steps. Explicit scope boundaries. Defined output format with example. MUST/MUST NOT rules. |
| **4** | Clear role definition. Specific do/don't rules. Structured sections. Output format mentioned but not demonstrated. |
| **3** | Goal is stated but lacks phases, scope boundaries, or output format. Results vary between runs. |
| **2** | Instructions <10 lines or vague language. Agent guesses at behavior. |
| **1** | No SKILL.md or only a title with no instructions. |

### 2. Behavioral Completeness

Evaluates whether the skill handles edge cases, errors, and exit conditions — not just the happy path.

| Score | Criteria |
|-------|----------|
| **5** | Documents >=3 specific failure scenarios with remediation. Cleanup/exit protocol. Retry limits. Handles empty/missing inputs. |
| **4** | Handles >=2 common error cases. Has an exit condition. |
| **3** | Happy path works. Error handling is implicit. |
| **2** | No error handling instructions. Agent gets stuck on missing files or failed commands. |
| **1** | No consideration of failure. Agent may loop indefinitely. |

### 3. Example Quality

Evaluates whether the skill includes concrete examples that anchor LLM behavior.

| Score | Criteria |
|-------|----------|
| **5** | >=2 realistic walkthroughs (input -> actions -> output). Edge case examples. Expected output format shown verbatim. |
| **4** | >=1 complete walkthrough. Output format demonstrated. Common scenarios covered. |
| **3** | Examples exist but are fragments — no complete input-to-output flow. |
| **2** | Only trivial examples that don't match actual task complexity. |
| **1** | No examples at all. |

### 4. Safety & Guardrails

Evaluates whether the skill prevents harmful or unintended actions.

| Score | Criteria |
|-------|----------|
| **5** | Explicit MUST NOT list. Destructive actions require user confirmation. Scope limits defined. Credential/PII exposure prevented. Prompt injection defenses for skills processing external content. |
| **4** | >=2 explicit prohibitions. Destructive actions guarded. Basic data/instruction separation. |
| **3** | Some scope constraints exist but coverage is incomplete. |
| **2** | No prohibitions or scope boundaries. Safety relies on inherent limitations only. |
| **1** | Could expose credentials, delete files, or modify system state without guardrails. |

> **Proportionality:** Scale expectations to risk. A read-only skill needs less guardrail scaffolding than one that modifies files or calls external APIs. Prompt injection criteria apply only when the skill processes untrusted external content.

### 5. User Experience

Evaluates whether purpose is obvious and the skill provides feedback during operation.

| Score | Criteria |
|-------|----------|
| **5** | Trigger phrases documented. Zero-configuration. Progress updates during multi-step ops. Structured, actionable output. |
| **4** | Clear entry point. <=1 setup step. Formatted, useful output. |
| **3** | Usable from context, but trigger phrases undocumented. Output functional but unformatted. |
| **2** | Must read source to understand invocation. Raw dump output. |
| **1** | Cannot determine how to invoke without contacting the author. |

### 6. Robustness & Edge Cases

Evaluates reliability under adverse or unexpected conditions.

| Score | Criteria |
|-------|----------|
| **5** | Handles adversarial inputs proportional to risk. Loop guards. Timeouts/limits. External deps documented with fallback. |
| **4** | Most edge cases covered. Retry limits. Handles missing deps. |
| **3** | Works for expected inputs. Some resilience but untested on extremes. |
| **2** | Fragile. Fails on unexpected inputs. No retry limits. |
| **1** | Breaks on non-trivial inputs. No robustness consideration. |

> **Script Quality** is intentionally omitted — waza skills are prompt-only SKILL.md files. If scripts are added in the future, a 7th dimension can be introduced.

---

## Implementation Notes

### Use the Copilot SDK (not Azure OpenAI)

The existing `prompt_grader.go` already demonstrates the exact pattern needed:

- Create a Copilot SDK session via `copilot.NewClient` + `client.CreateSession`
- Define structured tool calls for the LLM to invoke (similar to `set_waza_grade_pass` / `set_waza_grade_fail`)
- Send the evaluation prompt with the skill's full content
- Collect structured responses

This means **zero incremental API cost** — it uses the developer's existing Copilot license. No Azure OpenAI keys or endpoints needed.

### Suggested tool-call interface for the judge

```
Tool: set_dimension_score
Parameters:
  dimension: string   (e.g. "instruction_clarity")
  score: int           (1-5)
  rationale: string    (1-2 sentences)

Tool: set_quality_summary
Parameters:
  summary: string      (2-3 sentence overall assessment)
  improvements: array of strings  (top 3 actionable suggestions)
```
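On the Go side, the two tools above could map to plain structs that are validated before a score is accepted. This is only a sketch: the struct names, `parseDimensionScore`, and every dimension identifier other than `instruction_clarity` are hypothetical, and how the raw arguments arrive from the Copilot SDK is not shown.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// DimensionScore mirrors the proposed set_dimension_score parameters.
type DimensionScore struct {
	Dimension string `json:"dimension"`
	Score     int    `json:"score"`
	Rationale string `json:"rationale"`
}

// QualitySummary mirrors the proposed set_quality_summary parameters.
type QualitySummary struct {
	Summary      string   `json:"summary"`
	Improvements []string `json:"improvements"`
}

// knownDimensions holds assumed identifiers for the six dimensions.
// Only "instruction_clarity" appears in the proposal; the rest are guesses.
var knownDimensions = map[string]bool{
	"instruction_clarity":     true,
	"behavioral_completeness": true,
	"example_quality":         true,
	"safety_guardrails":       true,
	"user_experience":         true,
	"robustness":              true,
}

// parseDimensionScore decodes raw tool-call arguments and rejects
// unknown dimensions, out-of-range scores, and empty rationales.
func parseDimensionScore(raw []byte) (DimensionScore, error) {
	var ds DimensionScore
	if err := json.Unmarshal(raw, &ds); err != nil {
		return DimensionScore{}, fmt.Errorf("decode set_dimension_score args: %w", err)
	}
	if !knownDimensions[ds.Dimension] {
		return DimensionScore{}, fmt.Errorf("unknown dimension %q", ds.Dimension)
	}
	if ds.Score < 1 || ds.Score > 5 {
		return DimensionScore{}, fmt.Errorf("score %d out of range 1-5", ds.Score)
	}
	if ds.Rationale == "" {
		return DimensionScore{}, errors.New("rationale is required")
	}
	return ds, nil
}

func main() {
	raw := []byte(`{"dimension":"instruction_clarity","score":4,"rationale":"Clear phases."}`)
	ds, err := parseDimensionScore(raw)
	fmt.Println(ds, err)
}
```

Validating at the tool-call boundary means a malformed judge response surfaces as an error instead of silently skewing the average.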

### Criteria as embedded markdown

Store criteria definitions in `internal/checks/quality/criteria/*.md` and embed via `//go:embed`. This makes them:
- Versionable with the binary
- Customizable (teams can override with local files)
- Readable as standalone documentation

### System prompt considerations

Include these evaluator bias controls in the system prompt:
- **Anti-verbosity bias:** Score based on precision and structure, not length
- **Anti-position bias:** Evaluate each dimension independently
- **Proportionality:** Scale expectations to the skill's scope and complexity
- **Evidence-based:** Every score must cite specific content from the skill

### Output format

Follow the existing readiness report pattern:

```
🤖 Quality Assessment (LLM Judge)
  ✅ Instruction Clarity        4/5  Clear phases with explicit MUST/MUST NOT rules
  ✅ Behavioral Completeness    4/5  Handles 3 failure scenarios with remediation
  ⚪ Example Quality            3/5  Examples exist but no end-to-end walkthrough
  ✅ Safety & Guardrails        4/5  Explicit prohibitions and scope limits
  🌟 User Experience            5/5  Zero-config with progress feedback
  ✅ Robustness                 4/5  Loop guards and dependency fallbacks

  📊 Average: 4.0/5.0 — PASS (threshold: 3.0)

  🔧 Top improvements:
     1. Add end-to-end walkthrough showing input -> tool calls -> output
     2. Document timeout behavior for external API calls
     3. Add prompt injection defense for user-supplied file content
```

### Score caching

Use a content hash (e.g., SHA-256 over SKILL.md and the files under references/) to skip re-evaluation when content hasn't changed. Store cached scores in `.waza/quality-scores.json` in the workspace.

### Passing threshold

Default: average >= 3.0/5.0 across the six dimensions. Make it configurable via a flag or config file.
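The aggregation itself is small. The sketch below assumes an unweighted mean over the dimension scores; `evaluate` and `defaultThreshold` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultThreshold is the proposed default passing average.
const defaultThreshold = 3.0

// evaluate returns the unweighted mean of scores and whether it meets
// the threshold. Scores outside 1-5 are rejected rather than averaged.
func evaluate(scores []int, threshold float64) (avg float64, pass bool, err error) {
	if len(scores) == 0 {
		return 0, false, errors.New("no scores to aggregate")
	}
	sum := 0
	for _, s := range scores {
		if s < 1 || s > 5 {
			return 0, false, fmt.Errorf("score %d out of range 1-5", s)
		}
		sum += s
	}
	avg = float64(sum) / float64(len(scores))
	return avg, avg >= threshold, nil
}

func main() {
	// The six scores from the sample report: 24 / 6 = 4.0.
	avg, pass, _ := evaluate([]int{4, 4, 3, 4, 5, 4}, defaultThreshold)
	fmt.Printf("Average: %.1f/5.0, pass=%t\n", avg, pass) // prints "Average: 4.0/5.0, pass=true"
}
```

A plain average can hide a single very low dimension (for example, Safety & Guardrails at 1 alongside five scores of 4 still averages 3.5). Whether a per-dimension floor is also needed is left open here.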

---

## Integration with existing commands

- `waza check --deep` could be an alias that runs `check` + `quality` together
- `waza quality` as a standalone command for focused quality evaluation
- JSON output (`--format json`) for CI pipelines and downstream tooling
- Results could feed into `waza compare` for tracking quality across skill versions

---

## Acceptance Criteria

- [ ] `waza quality skills/my-skill` evaluates a skill on 6 dimensions via Copilot SDK
- [ ] Scores are 1-5 per dimension with rationale text
- [ ] Terminal output follows readiness report style (emoji + scores + rationale)
- [ ] JSON output available via `--format json`
- [ ] Criteria are embedded markdown files, overridable with local files
- [ ] Content hash caching skips unchanged skills
- [ ] Passing threshold configurable (default 3.0)
- [ ] Unit tests for criteria loading, score aggregation, and output formatting

