Problem
Waza evals don't have a built-in way to test how a skill behaves under adversarial input. Authors don't think about prompt injection, scope-bypass, or jailbreak attempts until production. There's no test pack they can drop in to get a baseline.
Proposal
Ship two narrowly-scoped adversarial test packs as a first cut. Both must run offline, deterministic, in CI — no live attacks against real services, no destructive tool calls.
Pack 1: Prompt injection in fixtures
Adversarial strings hidden in input data (file contents, search results, tool responses) that try to override system instructions.
- Pack ships as a directory of
tasks/*.yaml referencing fixtures with injected payloads.
- Expected outcome: the skill should not follow the injected instruction.
- Outcome is asserted via existing graders (
behavior, tool_constraint, text).
Pack 2: Scope-bypass
Inputs that try to make the skill act outside its declared USE FOR scope (e.g., a "code review" skill being asked to send email).
- Expected outcome: refusal or routing-away response.
Schema
adversarial:
packs: [prompt-injection, scope-bypass]
on_unsafe_outcome: fail # or: warn
Why this matters for agentic-first
Agents have tools. A skill that follows injected instructions or operates out of scope can do real damage. A 2-pack starter set raises the floor without overpromising a full red-team suite.
Acceptance criteria
Non-goals (filed separately)
Related
Problem
Waza evals don't have a built-in way to test how a skill behaves under adversarial input. Authors don't think about prompt injection, scope-bypass, or jailbreak attempts until production. There's no test pack they can drop in to get a baseline.
Proposal
Ship two narrowly-scoped adversarial test packs as a first cut. Both must run offline, deterministic, in CI — no live attacks against real services, no destructive tool calls.
Pack 1: Prompt injection in fixtures
Adversarial strings hidden in input data (file contents, search results, tool responses) that try to override system instructions.
tasks/*.yamlreferencing fixtures with injected payloads.behavior,tool_constraint,text).Pack 2: Scope-bypass
Inputs that try to make the skill act outside its declared
USE FORscope (e.g., a "code review" skill being asked to send email).Schema
Why this matters for agentic-first
Agents have tools. A skill that follows injected instructions or operates out of scope can do real damage. A 2-pack starter set raises the floor without overpromising a full red-team suite.
Acceptance criteria
internal/adversarial/data/with documented payloads and expected outcomes.waza adversarial --packs=prompt-injection,scope-bypass.waza gateso unsafe outcomes can fail CI (see feat: Regression gates — baseline comparison with thresholds and statistical confidence #364).site/explaining what each pack tests and how to extend.Non-goals (filed separately)
Related