Add a forbidden_skills field to the skill_invocation grader, and relax the requirement that required_skills be non-empty. This would let evaluators express "skill X must not be invoked here, but other skills are fine" — the natural shape for negative-trigger tasks. Today this can only be approximated with behavior + forbidden_tools: [skill], which over-forbids by rejecting every skill invocation regardless of name.
Motivation: negative trigger tasks
We run "trigger-precision" evals: prompts paired with an expectation about whether a particular skill should be invoked. For each skill we have:
- Positive tasks: the prompt should activate skill S. Expressed today with:
    - type: skill_invocation
      name: S-invoked
      config:
        required_skills: [S]
        mode: any_order
        allow_extra: true
- Negative tasks: the prompt should not activate skill S. The accurate question is "skill S was not invoked"; we don't actually care whether the agent reached for some other (unrelated) skill, since that might still be the right thing to do for the prompt.
The closest thing today is the behavior grader:
    - type: behavior
      name: no-skill-invoked
      config:
        forbidden_tools: [skill]
That works only because:
- behavior looks at tool names, not arguments, so forbidden_tools: [skill] forbids all skill invocations.
- The eval CWD currently exposes only one discoverable skill, so "any skill" and "this skill" are equivalent.
Both are accidents of our current setup, not a faithful expression of the test. As soon as additional skills become discoverable (our own repo grows, or config.skill_directories adds external sets), the grader starts producing spurious failures on negative tasks: the agent legitimately invokes an unrelated skill and our negative-trigger task fails anyway.
Proposal
Add forbidden_skills and make required_skills optional (default []):
    - type: skill_invocation
      name: S-not-invoked
      config:
        forbidden_skills: [S]
        allow_extra: true
Reading: "skill S must not appear in runs[].skill_invocations; any other skill invocations are fine; no invocation at all is also fine."
Semantics with existing fields
| required_skills | forbidden_skills | allow_extra | Meaning |
| --- | --- | --- | --- |
| [A, B] | [] | true | (today) A and B must fire; others are fine. |
| [A, B] | [] | false | (today) A and B must fire; no others. |
| [] | [X] | true | (new) X must not fire; others (including none) are fine. ← negative-trigger case |
| [A] | [X] | true | (new) A must fire, X must not, others are fine. ← multi-skill routing tests |
| [A] | [X] | false | (new) A must fire, X must not, nothing else may fire either. |
| [] | [X] | false | Arguably meaningless: allow_extra: false with empty required_skills already implies "no skill may fire", which subsumes the prohibition on X. Either reject this combination with a validation error, or treat it as equivalent to [] / [] / false (no skills allowed at all). |
| [] | [] | false | Edge case worth specifying: could mean "no skill may fire" (most useful for full-suite hygiene), or could be rejected as under-specified. |
Suggested validation: require at least one of required_skills or forbidden_skills to be non-empty; otherwise the grader has nothing to check.
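The table and the validation rule above can be sketched as a small pass/fail function. This is only an illustration in Python, not Waza's implementation: `check_skill_invocation` and its parameters are hypothetical names, `invoked` stands for the skill names collected from runs[].skill_invocations, mode is ignored, and the sketch assumes the "reject" option when both lists are empty.

```python
def check_skill_invocation(invoked, required_skills=(), forbidden_skills=(), allow_extra=True):
    """Hypothetical pass/fail reading of the proposed semantics (mode/order ignored)."""
    required = set(required_skills)
    forbidden = set(forbidden_skills)
    seen = set(invoked)

    # Suggested validation: with both lists empty the grader has nothing to check.
    if not required and not forbidden:
        raise ValueError("set at least one of required_skills or forbidden_skills")

    if not required <= seen:
        return False  # a required skill did not fire
    if forbidden & seen:
        return False  # a forbidden skill fired
    if not allow_extra and seen - required:
        return False  # an extra skill fired while extras are disallowed
    return True


# Negative-trigger case: X is forbidden, an unrelated skill Y is fine.
print(check_skill_invocation(["Y"], forbidden_skills=["X"]))
print(check_skill_invocation(["X"], forbidden_skills=["X"]))
```

Under these assumptions the [] / [X] / false row is not rejected; it simply behaves as "no skill may fire", which is the second of the two options listed in the table.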
Scoring
A minimal interpretation:
- Each entry in forbidden_skills is one check; it passes iff that skill is absent from runs[].skill_invocations.
- Combined with existing scoring (F1 over required_skills, optional allow_extra penalty), the composite score stays either a passed_checks / total_checks-style ratio or a weighted average, whichever fits Waza's current shape best.
The mode field could remain meaningful only when required_skills is non-empty; when only forbidden_skills is set, mode is ignored (or required to be omitted).
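One way to read the minimal interpretation is as a flat check count. The sketch below is a simplification under stated assumptions: it counts one check per required skill instead of reproducing the existing F1 computation, it leaves out the allow_extra penalty and mode, and `score_checks` is a hypothetical name rather than anything in Waza.

```python
def score_checks(invoked, required_skills=(), forbidden_skills=()):
    """Hypothetical passed_checks / total_checks score for one run."""
    seen = set(invoked)
    # One check per required skill: it must have fired.
    checks = [skill in seen for skill in required_skills]
    # One check per forbidden skill: it must be absent.
    checks += [skill not in seen for skill in forbidden_skills]
    if not checks:
        raise ValueError("nothing to check")
    return sum(checks) / len(checks)


# A fired as required, but the forbidden X fired too: 1 of 2 checks pass.
print(score_checks(["A", "X"], required_skills=["A"], forbidden_skills=["X"]))
```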
Why this is better than alternatives we considered
- A second behavior grader entry per task - not viable: behavior is tool-name-scoped and can't filter by skill name.
- An LLM prompt grader - works in theory but adds judge cost and non-determinism to a tier whose whole point is being cheap and fast.
- A custom program grader - works (we'd parse runs[].skill_invocations from JSON) but is boilerplate every adopter would re-invent. The semantics belong in the built-in grader.
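For a sense of the boilerplate the custom program grader option implies, here is a rough Python sketch. The JSON shape is an assumption: it takes a top-level runs list whose skill_invocations entries are either plain skill names or objects with a name key. `skill_was_invoked` is a hypothetical helper, and how a program grader actually receives this JSON is not covered here.

```python
import json


def skill_was_invoked(results_json, skill):
    """Return True if `skill` appears in any run's skill_invocations (assumed shape)."""
    data = json.loads(results_json)
    for run in data.get("runs", []):
        for invocation in run.get("skill_invocations", []):
            # Assumption: an entry is either a bare name or an object with a "name" key.
            name = invocation.get("name") if isinstance(invocation, dict) else invocation
            if name == skill:
                return True
    return False


sample = '{"runs": [{"skill_invocations": [{"name": "other-skill"}]}]}'
# The negative-trigger task passes when S was not invoked.
print(not skill_was_invoked(sample, "S"))
```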
Environment
- Waza 0.31.0
- executor: copilot-sdk