Waza compliance token limit forces scope reduction that silently degrades skill quality #183

Description

@diberry

Labels: bug, compliance

Problem

When restructuring a Copilot skill for waza High compliance (~500 token limit), the
compression process caused silent scope loss — the skill's entry point (SKILL.md) narrowed
from 3 equally-supported workflows to 1, without any waza tooling flagging the regression.

What happened

A skill originally supported three peer commands:

  • review — score and improve a single document
  • batch — review multiple URLs from a file
  • evaluate — test how well Copilot accomplishes a task end-to-end

Each had its own trigger phrases, examples, scoring rubric, and output structure in the
SKILL.md.

To meet the ~500 token budget for waza High compliance, the skill was restructured using
progressive disclosure (compact SKILL.md + detailed reference files). The reference files
preserved all content, but the SKILL.md entry point was compressed to only cover review —
dropping batch and evaluate from:

  • The description field (routing/trigger)
  • The USE FOR trigger phrases
  • The workflow steps
  • The examples
  • The scoring summary

An agent reading the compliant SKILL.md would never discover that batch and evaluate exist —
they're only in reference files the agent has no reason to load.

Why this matters

  1. Trigger accuracy drops — Users asking "evaluate how Copilot handles this task" or "batch
    review these URLs" won't trigger the skill because those phrases aren't in the description
  2. Workflow completeness drops — An agent that does trigger the skill will only know how to
    review, missing 2 of 3 core capabilities
  3. Scoring rubric invisible — The evaluate scoring (6-category, 100-point) and review scoring
    (7-category, 100-point) are both absent from the entry point
  4. The waza spec check reported 9/9 compliance — it validated structure and token count but
    couldn't detect that the skill's functional scope was reduced

Suggestion

The waza spec should include a check (or guidance) for scope preservation during compression:

  • Compare the description and USE FOR trigger phrases between the pre-compliance and
    post-compliance versions
  • Flag if major capabilities (commands, workflows, modes) present in the original are absent
    from the compressed version
  • Alternatively, provide guidance that progressive disclosure should preserve scope in the
    entry point and only move detail to references — not entire capability categories
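
As a rough illustration of the first two bullets, the sketch below checks which named capabilities are mentioned in the original entry point but no longer appear in the compressed one. It is not part of waza: the function names, the sample SKILL.md text, and the hand-supplied capability list are all assumptions made for the example, and whole-word matching is only a heuristic.

```python
import re


def mentions(text: str, capability: str) -> bool:
    """Return True if `capability` appears in `text` as a whole word (case-insensitive)."""
    return re.search(rf"\b{re.escape(capability)}\b", text, re.IGNORECASE) is not None


def missing_capabilities(original: str, compressed: str, capabilities: list[str]) -> list[str]:
    """List capabilities mentioned in the original entry point but absent from the compressed one."""
    return [c for c in capabilities if mentions(original, c) and not mentions(compressed, c)]


# Hypothetical entry-point text; a real check would read the two SKILL.md versions from disk.
original = (
    "description: Review a document, batch review URLs from a file, "
    "or evaluate a Copilot task.\n"
    "USE FOR: review this doc, batch review these URLs, "
    "evaluate how Copilot handles this task"
)
compressed = (
    "description: Score and improve a single document.\n"
    "USE FOR: review this doc, score this article"
)

lost = missing_capabilities(original, compressed, ["review", "batch", "evaluate"])
print(lost)
```

A non-empty result would be the signal to fail or warn during a compliance run. The capability list has to come from somewhere (the skill author, or the command names in the original file), which is a design question this sketch leaves open.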

Reproduction

Any multi-command skill compressed to ~500 tokens is at risk. The pattern is:

  1. Original skill covers N commands/modes equally
  2. Token compression prioritizes the most common command
  3. Less common commands get moved entirely to references
  4. Waza spec check passes (structure + tokens are correct)
  5. Agent behavior regresses (scope silently narrowed)

What waza has today:

  • Intent resolution rubric (intent_resolution.yaml) — scores 1-5 how well an agent satisfies
    user intent
  • Trigger graders — positive/negative mode to test if a skill correctly fires (or doesn't) on
    specific prompts
  • Task adherence graders — verify the agent stays within scope
  • Custom prompt graders — LLM-as-judge for arbitrary criteria

What it does NOT do automatically:

  • No before/after scope diff. Waza doesn't compare a pre-compliance SKILL.md against a
    post-compliance one to detect capability loss. It validates only the file it's given: if a
    mode is missing from the triggers, the trigger tests still pass, because they test what's
    in the file, not what's missing.
  • No intent regression detection. There's no built-in "run the same eval suite against old
    and new SKILL.md and flag score drops." You'd have to set this up manually with baseline runs.

The gap: the existing tooling could catch this, but only if the eval suite included trigger
tasks for all three modes and the suite was run both before and after compression. The waza
spec compliance check (Spec 9/9, Tokens 500/500) validates only structure, not behavioral
equivalence.

Revised suggestion: waza should require an eval baseline comparison when compressing a skill.
Run the existing evals on the old SKILL.md, then on the new one, and flag any score
regression. The primitives exist; the workflow doesn't connect them.
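
A minimal sketch of that baseline comparison, assuming per-task eval scores have already been collected for both versions of SKILL.md. How waza would produce those scores is not shown, and the function name, task names, and score values here are hypothetical placeholders rather than real results.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, Optional[float]]]:
    """Return tasks whose score dropped by more than `tolerance`, or that vanished entirely."""
    regressions: dict[str, tuple[float, Optional[float]]] = {}
    for task, old_score in baseline.items():
        new_score = candidate.get(task)
        # A task missing from the candidate run counts as a regression, not a pass.
        if new_score is None or new_score < old_score - tolerance:
            regressions[task] = (old_score, new_score)
    return regressions


# Placeholder scores on a 1-5 scale, invented purely to exercise the function.
baseline = {"trigger_review": 5.0, "trigger_batch": 5.0, "trigger_evaluate": 4.0}
candidate = {"trigger_review": 5.0, "trigger_batch": 1.0, "trigger_evaluate": 1.0}

regressions = find_regressions(baseline, candidate)
for task, (old, new) in regressions.items():
    print(f"REGRESSION {task}: {old} -> {new}")
```

Treating a missing task as a regression matters here: the failure described in this issue is exactly the case where coverage disappears rather than visibly fails.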

Separately, the waza compression agent fabricated content when it didn't have enough source
material, rather than omitting what it didn't know.

cc @diberry — can provide the specific skill and before/after diffs from the internal repo.
