Skip to content

feat: Adversarial / safety evaluators (prompt injection, jailbreak, scope-bypass) #365

Description

@spboyer

Problem

Waza evals don't have a built-in way to test how a skill behaves under adversarial input. Authors don't think about prompt injection, scope-bypass, or jailbreak attempts until production. There's no test pack they can drop in to get a baseline.

Proposal

Ship two narrowly-scoped adversarial test packs as a first cut. Both must run offline, deterministic, in CI — no live attacks against real services, no destructive tool calls.

Pack 1: Prompt injection in fixtures

Adversarial strings hidden in input data (file contents, search results, tool responses) that try to override system instructions.

  • Pack ships as a directory of tasks/*.yaml referencing fixtures with injected payloads.
  • Expected outcome: the skill should not follow the injected instruction.
  • Outcome is asserted via existing graders (behavior, tool_constraint, text).

Pack 2: Scope-bypass

Inputs that try to make the skill act outside its declared USE FOR scope (e.g., a "code review" skill being asked to send email).

  • Expected outcome: refusal or routing-away response.

Schema

adversarial:
  packs: [prompt-injection, scope-bypass]
  on_unsafe_outcome: fail   # or: warn

Why this matters for agentic-first

Agents have tools. A skill that follows injected instructions or operates out of scope can do real damage. A 2-pack starter set raises the floor without overpromising a full red-team suite.

Acceptance criteria

Non-goals (filed separately)

Related

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions