Labels: bug, compliance
Problem
When restructuring a Copilot skill for waza High compliance (~500 token limit), the
compression process caused silent scope loss: the skill's entry point (SKILL.md) narrowed
from three equally supported workflows to one, and no waza tooling flagged the regression.
What happened
A skill originally supported three peer commands:
- review — score and improve a single document
- batch — review multiple URLs from a file
- evaluate — test how well Copilot accomplishes a task end-to-end
Each had its own trigger phrases, examples, scoring rubric, and output structure in the
SKILL.md.
To meet the ~500 token budget for waza High compliance, the skill was restructured using
progressive disclosure (compact SKILL.md + detailed reference files). The reference files
preserved all content, but the SKILL.md entry point was compressed to only cover review —
dropping batch and evaluate from:
- The description field (routing/trigger)
- The USE FOR trigger phrases
- The workflow steps
- The examples
- The scoring summary
An agent reading the compliant SKILL.md would never discover that batch and evaluate exist —
they're only in reference files the agent has no reason to load.
Why this matters
- Trigger accuracy drops — Users asking "evaluate how Copilot handles this task" or "batch
review these URLs" won't trigger the skill because those phrases aren't in the description
- Workflow completeness drops — An agent that does trigger the skill will only know how to
review, missing 2 of 3 core capabilities
- Scoring rubric invisible — The evaluate scoring (6-category, 100-point) and review scoring
(7-category, 100-point) are both absent from the entry point
- The waza spec check reported 9/9 compliance — it validated structure and token count but
couldn't detect that the skill's functional scope was reduced
Suggestion
The waza spec should include a check (or guidance) for scope preservation during compression:
- Compare the description and USE FOR trigger phrases between the pre-compliance and
post-compliance versions
- Flag if major capabilities (commands, workflows, modes) present in the original are absent
from the compressed version
- Alternatively, provide guidance that progressive disclosure should preserve scope in the
entry point and only move detail to references — not entire capability categories
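The presence check suggested above can be sketched in a few lines. This is a minimal illustration, not part of waza: the function name `missing_capabilities`, the capability list, and the sample SKILL.md text are all hypothetical, and a real check would likely need to parse the description and USE FOR fields rather than scan raw text.

```python
import re


def missing_capabilities(capabilities, compressed_text):
    """Return capabilities that never appear as whole words in the compressed entry point."""
    lowered = compressed_text.lower()
    return [
        cap
        for cap in capabilities
        if not re.search(rf"\b{re.escape(cap.lower())}\b", lowered)
    ]


# Hypothetical inputs modeled on the case described in this issue.
capabilities = ["review", "batch", "evaluate"]
compressed_skill_md = """\
description: Score and improve a single document.
USE FOR: review this doc, score this article
"""

missing = missing_capabilities(capabilities, compressed_skill_md)
print(missing)  # ['batch', 'evaluate']
```

A plain keyword match like this only catches capabilities that vanish entirely from the entry point. It would not notice a command that is still named but no longer has trigger phrases or workflow steps.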
Reproduction
Any multi-command skill compressed to ~500 tokens is at risk. The pattern is:
- Original skill covers N commands/modes equally
- Token compression prioritizes the most common command
- Less common commands get moved entirely to references
- Waza spec check passes (structure + tokens are correct)
- Agent behavior regresses (scope silently narrowed)
What waza has today:
- Intent resolution rubric (intent_resolution.yaml) — scores 1-5 how well an agent satisfies
user intent
- Trigger graders — positive/negative mode to test if a skill correctly fires (or doesn't) on
specific prompts
- Task adherence graders — verify the agent stays within scope
- Custom prompt graders — LLM-as-judge for arbitrary criteria
What it does NOT do automatically:
- No before/after scope diff. Waza doesn't compare a pre-compliance SKILL.md against a
post-compliance one to detect capability loss. It validates the file it's given — if all three
modes are missing from the triggers, the trigger tests will still pass (they test what's in
the file, not what's missing).
- No intent regression detection. There's no built-in "run the same eval suite against old
and new SKILL.md and flag score drops." You'd have to set this up manually with baseline runs.
The gap: the existing tooling could catch this, but only if the eval suite included trigger
tasks for all three modes and the suite was run both before and after compression. The waza
spec compliance check (Spec 9/9, Tokens 500/500) validates structure only, not behavioral
equivalence.
Stronger suggestion: waza should require an eval baseline comparison when compressing a
skill. Run the existing evals against the old SKILL.md, then against the new one, and flag
any score regression. The primitives exist; the workflow doesn't connect them.
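The before/after comparison described above could look roughly like the sketch below. It assumes each eval run has already been reduced to a mapping of task name to score. The task names, the scores, and the `score_regressions` helper are invented for illustration and do not reflect waza's actual output format.

```python
def score_regressions(baseline, candidate, tolerance=0.0):
    """Map each regressed task to (old, new).

    A task regresses when its new score is below the old score by more than
    `tolerance`, or when it is missing from the candidate run (new is None).
    """
    regressions = {}
    for task, old in baseline.items():
        new = candidate.get(task)
        if new is None or new < old - tolerance:
            regressions[task] = (old, new)
    return regressions


# Hypothetical per-task scores on a 1-5 scale, before and after compression.
before = {"trigger_review": 5, "trigger_batch": 5, "trigger_evaluate": 4}
after = {"trigger_review": 5, "trigger_batch": 1, "trigger_evaluate": 1}

regressions = score_regressions(before, after)
for task, (old, new) in regressions.items():
    print(f"REGRESSION {task}: {old} -> {new}")
```

This only helps if the baseline suite already contains tasks for every mode. A suite that only ever tested review would show no regression here either.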
Separately, the waza compression agent fabricated content when it lacked sufficient source
material, rather than omitting what it did not know.
cc @diberry — can provide the specific skill and before/after diffs from the internal repo.