feat: Adversarial / safety evaluators (prompt injection, jailbreak, scope-bypass)

## Problem

Waza evals don't have a built-in way to test how a skill behaves under adversarial input. Authors don't think about prompt injection, scope-bypass, or jailbreak attempts until production. There's no test pack they can drop in to get a baseline.

## Proposal

Ship two narrowly-scoped adversarial test packs as a first cut. Both must run **offline, deterministic, in CI** — no live attacks against real services, no destructive tool calls.

### Pack 1: Prompt injection in fixtures

Adversarial strings hidden in input data (file contents, search results, tool responses) that try to override system instructions.

- Pack ships as a directory of `tasks/*.yaml` referencing fixtures with injected payloads.
- Expected outcome: the skill should *not* follow the injected instruction.
- Outcome is asserted via existing graders (`behavior`, `tool_constraint`, `text`).

### Pack 2: Scope-bypass

Inputs that try to make the skill act outside its declared `USE FOR` scope (e.g., a "code review" skill being asked to send email).

- Expected outcome: refusal or routing-away response.

### Schema

```yaml
adversarial:
  packs: [prompt-injection, scope-bypass]
  on_unsafe_outcome: fail   # or: warn
```

## Why this matters for agentic-first

Agents have *tools*. A skill that follows injected instructions or operates out of scope can do real damage. A 2-pack starter set raises the floor without overpromising a full red-team suite.

## Acceptance criteria

- [ ] Two packs shipped under `internal/adversarial/data/` with documented payloads and expected outcomes.
- [ ] All packs run offline; destructive/exfil-shaped scenarios route through mock tools only (see #363).
- [ ] Pack schema defines: payload, expected safe/unsafe outcome, tool-call predicates.
- [ ] CLI: `waza adversarial --packs=prompt-injection,scope-bypass`.
- [ ] Integrates with `waza gate` so unsafe outcomes can fail CI (see #364).
- [ ] Docs in `site/` explaining what each pack tests and how to extend.

## Non-goals (filed separately)

- Generic toxicity/bias rubrics — see #360 (LLM-judge rubrics).
- Live attack vectors against real MCP servers — out of scope; mocks only.
- Continuous attack pack updates / vulnerability database — defer.

## Related

- Tool assertions: #366
- MCP mocks: #363
- CI gating: #364
- Roadmap: #66


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Adversarial / safety evaluators (prompt injection, jailbreak, scope-bypass) #365

Problem

Proposal

Pack 1: Prompt injection in fixtures

Pack 2: Scope-bypass

Schema

Why this matters for agentic-first

Acceptance criteria

Non-goals (filed separately)

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

feat: Adversarial / safety evaluators (prompt injection, jailbreak, scope-bypass) #365

Description

Problem

Proposal

Pack 1: Prompt injection in fixtures

Pack 2: Scope-bypass

Schema

Why this matters for agentic-first

Acceptance criteria

Non-goals (filed separately)

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions