Feature request: `forbidden_skills` for `skill_invocation` grader #286

Add a `forbidden_skills` field to the `skill_invocation` grader, and relax the requirement that `required_skills` be non-empty. This would let evaluators express **"skill X must not be invoked here, but other skills are fine"**, which is the natural shape for negative-trigger tasks. Today this can only be approximated with a `behavior` grader configured with `forbidden_tools: [skill]`, which over-forbids: it rejects every skill invocation regardless of name.

## Motivation: negative trigger tasks

We run "trigger-precision" evals: prompts paired with an expectation about whether a particular skill should be invoked. For each skill we have:

- **Positive tasks** — the prompt should activate skill `S`. Expressed today with:
  ```yaml
  - type: skill_invocation
    name: S-invoked
    config:
      required_skills: [S]
      mode: any_order
      allow_extra: true
  ```
- **Negative tasks** — the prompt should *not* activate skill `S`. The accurate question is "skill `S` was not invoked"; we don't actually care whether the agent reached for some other (unrelated) skill, since that might still be the right thing to do for the prompt.

The closest thing today is the `behavior` grader:

```yaml
- type: behavior
  name: no-skill-invoked
  config:
    forbidden_tools: [skill]
```

That works only because:

1. `behavior` looks at tool names, not arguments — so `forbidden_tools: [skill]` forbids *all* skill invocations.
2. The eval CWD currently exposes only one discoverable skill, so "any skill" and "this skill" are equivalent.

Both are accidents of our current setup, not a faithful expression of the test. As soon as additional skills become discoverable (our own repo grows, or `config.skill_directories` adds external sets), the grader starts producing spurious failures on negative tasks: the agent legitimately invokes an unrelated skill, `forbidden_tools: [skill]` trips, and the negative-trigger task fails even though skill `S` was never invoked.

## Proposal

Add `forbidden_skills` and make `required_skills` optional (default `[]`):

```yaml
- type: skill_invocation
  name: S-not-invoked
  config:
    forbidden_skills: [S]
    allow_extra: true
```

Reading: "skill `S` must not appear in `runs[].skill_invocations`; any other skill invocations are fine; no invocation at all is also fine."

### Semantics with existing fields

| `required_skills` | `forbidden_skills` | `allow_extra` | Meaning |
|---|---|---|---|
| `[A, B]` | `[]` | `true` | (today) A and B must fire; others are fine. |
| `[A, B]` | `[]` | `false` | (today) A and B must fire; no others. |
| `[]` | `[X]` | `true` | **(new)** X must not fire; others (including none) are fine. ← negative-trigger case |
| `[A]` | `[X]` | `true` | **(new)** A must fire, X must not, others are fine. ← multi-skill routing tests |
| `[A]` | `[X]` | `false` | **(new)** A must fire, X must not, nothing else may fire either. |
| `[]` | `[X]` | `false` | **Arguably redundant**: `allow_extra: false` with empty `required_skills` already implies "no skill may fire", which subsumes the prohibition on X. Either reject this combination with a validation error, or treat it as equivalent to `[]` / `[]` / `false` (no skills allowed at all). |
| `[]` | `[]` | `false` | Edge case worth specifying — could mean "no skill may fire" (most useful for full-suite hygiene), or could be rejected as under-specified. |

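Suggested validation: require at least one of `required_skills` or `forbidden_skills` to be non-empty; otherwise the grader has nothing to check. Note that this rule as stated would also reject the `[]` / `[]` / `false` row; if that row should mean "no skill may fire", it needs an explicit exception, since `allow_extra: false` alone would then give the grader something to check.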
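
To make the suggested validation concrete, here is a minimal Python sketch. The function name `validate_config` and the error strings are hypothetical and are not taken from Waza's code base. It assumes the stricter reading of the table, in which `[]` / `[X]` / `false` is rejected rather than silently normalised; the alternative of treating it as "no skills allowed" would simply drop the second check.

```python
def validate_config(required_skills: list[str],
                    forbidden_skills: list[str],
                    allow_extra: bool) -> list[str]:
    """Return validation errors for a proposed skill_invocation config.

    Hypothetical helper for illustration only; an empty list means
    the configuration is accepted.
    """
    errors = []

    # Rule from the proposal: the grader needs something to check.
    if not required_skills and not forbidden_skills:
        errors.append(
            "at least one of required_skills or forbidden_skills must be non-empty"
        )

    # Assumption: reject the redundant combination instead of normalising it.
    if not required_skills and forbidden_skills and not allow_extra:
        errors.append(
            "allow_extra: false with empty required_skills already forbids "
            "every skill; forbidden_skills is redundant"
        )

    return errors
```

With this sketch, the negative-trigger row (`[]` / `[X]` / `true`) and the routing rows (`[A]` / `[X]` / either value) validate cleanly, while both rows with empty `required_skills` and `allow_extra: false` produce exactly one error each.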

### Scoring

A minimal interpretation:

- Each entry in `forbidden_skills` is one check; it passes iff that skill is absent from `runs[].skill_invocations`.
- Combined with the existing scoring (F1 over `required_skills`, optional `allow_extra` penalty), the composite score could stay a `passed_checks / total_checks` ratio or become a weighted average, whichever fits Waza's current shape best.

The `mode` field would stay meaningful only when `required_skills` is non-empty; when only `forbidden_skills` is set, `mode` would be ignored (or required to be omitted).
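
The minimal interpretation above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Waza's implementation: it uses a plain `passed_checks / total_checks` ratio in place of the existing F1 computation, treats `allow_extra: false` as a single extra check, and ignores `mode` (ordering) entirely. The function name and the check labels are hypothetical.

```python
def score_skill_invocation(invoked, required_skills=(), forbidden_skills=(),
                           allow_extra=True):
    """Score one task as passed_checks / total_checks.

    `invoked` is the list of skill names observed across the runs.
    Returns (score, checks), where checks maps a label to True/False.
    """
    invoked_set = set(invoked)
    checks = {}

    # One check per required skill: it must have been invoked.
    for skill in required_skills:
        checks[f"required:{skill}"] = skill in invoked_set

    # One check per forbidden skill: it must be absent.
    for skill in forbidden_skills:
        checks[f"forbidden:{skill}"] = skill not in invoked_set

    # Simplification: model allow_extra: false as one pass/fail check.
    if not allow_extra:
        extras = invoked_set - set(required_skills)
        checks["no_extra"] = not extras

    # With nothing to check the result is vacuously 1.0; validation
    # would normally reject such a config before scoring.
    score = sum(checks.values()) / len(checks) if checks else 1.0
    return score, checks
```

For the negative-trigger case, `score_skill_invocation(["other"], forbidden_skills=["S"])` yields a score of `1.0`, and `score_skill_invocation(["S"], forbidden_skills=["S"])` yields `0.0`. In a routing test where `A` is required and `X` is forbidden, observing both gives `0.5` with `allow_extra=True`.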

## Why this is better than alternatives we considered

- **A second `behavior` grader entry per task** - not viable: `behavior` is tool-name-scoped and can't filter by skill name.
- **An LLM `prompt` grader** - works in theory but adds judge cost and non-determinism to a tier whose whole point is being cheap and fast.
- **A custom `program` grader** - works (we'd parse `runs[].skill_invocations` from JSON) but is boilerplate every adopter would re-invent. The semantics belong in the built-in grader.
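
For reference, the boilerplate behind the custom `program` grader alternative would look roughly like the Python sketch below. Only the `runs[].skill_invocations` path comes from this issue; everything else is an assumption, including the shape of each invocation entry (handled here as either a bare skill name or an object with a `name` key) and both function names. How such a script is wired into Waza's `program` grader (input channel, exit codes) is deliberately not shown.

```python
import json


def invoked_skill_names(results: dict) -> set[str]:
    """Collect skill names from runs[].skill_invocations."""
    names = set()
    for run in results.get("runs", []):
        for inv in run.get("skill_invocations", []):
            # Assumption: an entry is a bare name or an object with "name".
            name = inv if isinstance(inv, str) else inv.get("name")
            if name:
                names.add(name)
    return names


def forbidden_skills_absent(results_json: str, forbidden: list[str]) -> bool:
    """True iff none of the forbidden skills was invoked in any run."""
    invoked = invoked_skill_names(json.loads(results_json))
    return invoked.isdisjoint(forbidden)
```

Every adopter would have to maintain some variant of this parsing, and keep it in sync with the results schema, which is the argument for putting the semantics in the built-in grader.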

## Environment

- Waza 0.31.0
- `executor: copilot-sdk`
