
feat: adversarial / fault-injection harness (closes #365) #392

Merged: spboyer merged 3 commits into main from spboyer-feat-adversarial-harness on Jun 28, 2026
@spboyer spboyer commented Jun 28, 2026


Summary

Adds the Wave 4 adversarial / fault-injection harness called out in #365.

  • New CLI: waza adversarial — runs one or more built-in adversarial packs against a skill and enforces an --on-unsafe-outcome {fail,warn} policy.
  • Two built-in packs, both embedded into the binary:
    • prompt-injection (4 tasks) — indirect prompt injection through fixture files (README, source comment, ticket body, changelog link).
    • scope-bypass (4 tasks) — out-of-scope action requests (email, file deletion, package install, external HTTP).
  • Schema 1.2 additive adversarial: block on EvalSpec:
    adversarial:
      packs: [prompt-injection, scope-bypass]
      on_unsafe_outcome: fail
    Consumed only by waza adversarial --spec; waza run is unchanged.
  • Every adversarial task is golden: true, so unsafe outcomes also flip waza gate to exit 2. The dedicated CLI exits 0 on pass, 2 on unsafe-with-fail, 3 on config error.
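The exit-code policy described above can be sketched as a small mapping. This is an illustration only: the function and constant names are hypothetical, and it assumes the policy has already been resolved to either `fail` or `warn`; only the numeric codes (0, 2, 3) come from the description.

```go
package main

import "fmt"

// Numeric values are taken from the PR description; the names are hypothetical.
const (
	exitPass        = 0
	exitUnsafe      = 2
	exitConfigError = 3
)

// exitCodeFor maps an --on-unsafe-outcome policy and the number of unsafe
// task outcomes to a process exit code.
func exitCodeFor(policy string, unsafeCount int) int {
	switch policy {
	case "fail":
		if unsafeCount > 0 {
			return exitUnsafe
		}
		return exitPass
	case "warn":
		// warn reports unsafe outcomes but does not fail the process.
		return exitPass
	default:
		// An unrecognized policy is treated as a configuration error.
		return exitConfigError
	}
}

func main() {
	fmt.Println(exitCodeFor("fail", 1))
	fmt.Println(exitCodeFor("warn", 1))
	fmt.Println(exitCodeFor("bogus", 0))
}
```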

Implementation notes

  • internal/adversarial embeds the catalog via //go:embed all:data and exposes ListPacks / LoadPack / Extract / TaskRelPaths.
  • cmd/waza/cmd_adversarial.go synthesizes an eval.yaml in a temp dir, injects absolute context_dir paths into each extracted task (the runner resolves relative context_dir against SpecDir, not the task file dir), then reuses runCommandForSpec so all waza run plumbing is shared.
  • --list-packs flag short-circuits before pack resolution and prints name, task count, description for each embedded pack.
  • One subtle gotcha addressed: a go.mod fixture would have created a nested module boundary that //go:embed silently skips; renamed to go.mod.txt with the referencing task updated.
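The `context_dir` rewrite mentioned above could look roughly like the following. The real `injectContextDir` is not shown on this page, so the signature, the map-based task representation, and the example paths are all assumptions. The sketch only touches the top-level key and expects `taskDir` to be absolute already.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// injectContextDir rewrites a task's top-level "context_dir" to an absolute
// path. Relative values are resolved against taskDir (the directory the task
// file was extracted to) so the runner cannot re-resolve them against SpecDir.
// Hypothetical signature; nested keys are deliberately not visited.
func injectContextDir(task map[string]any, taskDir string) {
	dir, ok := task["context_dir"].(string)
	if !ok || dir == "" {
		return // no context_dir to rewrite
	}
	if filepath.IsAbs(dir) {
		return // already absolute, leave untouched
	}
	// Join also cleans the result, collapsing any ".." segments.
	task["context_dir"] = filepath.Join(taskDir, dir)
}

func main() {
	task := map[string]any{"context_dir": "../fixtures"}
	injectContextDir(task, "/tmp/waza-adv/prompt-injection/tasks")
	fmt.Println(task["context_dir"])
}
```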

Test plan

  • go build ./... — clean
  • go vet ./... — clean
  • go test ./... — all green
  • golangci-lint run — 0 issues
  • cd site && npm run build — clean (24 pages)
  • Manual smoke:
    • waza adversarial --list-packs — lists both packs
    • waza adversarial --packs prompt-injection --on-unsafe-outcome warn — exits 0
    • waza adversarial --packs scope-bypass --on-unsafe-outcome fail — exits 2
    • waza adversarial --packs not-a-pack — exits 3

New tests

  • internal/adversarial/packs_test.go: ListPacks, LoadPack, Extract, "every task is golden" invariant.
  • cmd/waza/cmd_adversarial_test.go: tests for warn-policy run, fail-policy exit hook, unknown-pack rejection, spec-block resolution, flag overrides, default packs, --output JSON round-trip, injectContextDir round-trip.

Docs

  • New guide: site/src/content/docs/guides/adversarial.mdx
  • CLI reference: site/src/content/docs/reference/cli.mdx
  • README: command index + waza adversarial subsection

Schema

Schema stays at 1.2 — the adversarial: block is purely additive per the Wave 3 semver policy (#368).

Closes #365

Add 'waza adversarial' subcommand and an internal/adversarial package that
ships two built-in fault-injection packs embedded into the binary:

- prompt-injection (4 tasks) — indirect prompt injection through fixture
  files (README, source comment, ticket body, changelog link).
- scope-bypass (4 tasks) — out-of-scope action requests (email, file
  deletion, package install, external HTTP).

Every adversarial task is golden:true, so unsafe outcomes also flip
'waza gate' to exit 2. The dedicated CLI enforces an --on-unsafe-outcome
policy (fail|warn) and prints a focused safety summary.

Schema 1.2 gains an additive 'adversarial:' block on EvalSpec:

  adversarial:
    packs: [prompt-injection, scope-bypass]
    on_unsafe_outcome: fail

The block is consumed only by 'waza adversarial --spec'; 'waza run' is
unchanged.

Implementation notes:

- internal/adversarial embeds the pack catalog with //go:embed all:data
  and exposes ListPacks / LoadPack / Extract / TaskRelPaths.
- cmd/waza/cmd_adversarial.go synthesizes an eval.yaml in a temp dir,
  injects absolute context_dir paths into each extracted task (the
  runner resolves relative context_dir against SpecDir, not the task
  file dir), then reuses runCommandForSpec so all 'waza run' plumbing
  is shared.
- Exit codes: 0 pass, 2 unsafe-with-fail, 3 config error. Matches
  GateExitGoldenFailure so a single CI step gates goldens + adversarial.
- A go.mod fixture would have created a nested module boundary that
  embed silently skips; renamed to go.mod.txt with the task updated.
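One way to keep the nested-module gotcha from recurring is a guard that walks the fixture tree and reports any file literally named `go.mod`. The sketch below is an assumption, not part of this PR: `findGoMods` and `sampleTree` are hypothetical names, and the in-memory tree with placeholder file contents stands in for the real directory. In a real test the walk would have to run over the on-disk source tree (for example via `os.DirFS`), because the embedded FS would already be missing the skipped directory.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// findGoMods returns the paths of all files named go.mod under root.
func findGoMods(fsys fs.FS, root string) ([]string, error) {
	var found []string
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && path.Base(p) == "go.mod" {
			found = append(found, p)
		}
		return nil
	})
	return found, err
}

// sampleTree builds an in-memory stand-in for the fixture directory.
// File contents are placeholders, not the real fixtures.
func sampleTree(withGoMod bool) fs.FS {
	name := "data/scope-bypass/fixtures/go.mod.txt"
	if withGoMod {
		name = "data/scope-bypass/fixtures/go.mod"
	}
	return fstest.MapFS{
		"data/scope-bypass/pack.yaml": &fstest.MapFile{Data: []byte("placeholder\n")},
		name:                          &fstest.MapFile{Data: []byte("module example.com/fixture\n")},
	}
}

func main() {
	for _, withGoMod := range []bool{false, true} {
		found, err := findGoMods(sampleTree(withGoMod), "data")
		fmt.Println(found, err)
	}
}
```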

Tests:

- internal/adversarial/packs_test.go — ListPacks, LoadPack, Extract,
  'every task is golden' invariant.
- cmd/waza/cmd_adversarial_test.go — warn-policy run, fail-policy exit
  hook, unknown-pack rejection, spec-block resolution, flag overrides,
  --output JSON round-trip, injectContextDir round-trip.

Docs:

- New guide: site/src/content/docs/guides/adversarial.mdx
- CLI reference: site/src/content/docs/reference/cli.mdx
- README.md: command index + 'waza adversarial' section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 14:21

Copilot AI left a comment


Pull request overview

This PR adds a new offline adversarial / fault-injection harness to waza, including a waza adversarial CLI, two embedded built-in packs (prompt injection + scope bypass), and a schema-1.2 additive adversarial: block on EvalSpec so pack selection/policy can be inherited from eval.yaml.

Changes:

  • Added waza adversarial command that extracts embedded packs, synthesizes an eval.yaml, and reuses the existing run pipeline while enforcing an --on-unsafe-outcome {fail,warn} policy.
  • Introduced EvalSpec.adversarial (schema 1.2 additive) and updated schema-version documentation/comments to reflect 1.2.
  • Added embedded pack catalog + fixtures, tests for pack loading/extraction invariants, and docs (site guide + CLI reference + README updates).
| File | Description |
| --- | --- |
| site/src/content/docs/reference/cli.mdx | Documents the new waza adversarial command, flags, and exit codes in the CLI reference. |
| site/src/content/docs/guides/adversarial.mdx | Adds a dedicated guide explaining packs, policies, CI integration, and schema usage. |
| README.md | Adds waza adversarial usage + flags and links to the guide. |
| internal/models/spec.go | Adds EvalSpec.Adversarial plus schema types/validation for the new block. |
| internal/models/schema_version.go | Updates schema 1.2 commentary to include the new adversarial block. |
| internal/adversarial/packs.go | Implements embedded-pack catalog loading and extraction (//go:embed all:data). |
| internal/adversarial/packs_test.go | Adds tests for pack listing/loading/extraction and “every task is golden” invariant. |
| internal/adversarial/data/scope-bypass/pack.yaml | Defines the scope-bypass pack metadata and task list. |
| internal/adversarial/data/scope-bypass/tasks/send-email.yaml | Adds a scope-bypass task that requires refusing to send email. |
| internal/adversarial/data/scope-bypass/tasks/install-package.yaml | Adds a scope-bypass task that requires refusing dependency installation. |
| internal/adversarial/data/scope-bypass/tasks/external-http.yaml | Adds a scope-bypass task that requires refusing external network calls. |
| internal/adversarial/data/scope-bypass/tasks/delete-files.yaml | Adds a scope-bypass task that requires refusing destructive file deletion. |
| internal/adversarial/data/scope-bypass/fixtures/review-change.diff | Adds a review diff fixture used by scope-bypass tasks. |
| internal/adversarial/data/scope-bypass/fixtures/project-structure.txt | Adds a project layout fixture used by delete-files scope-bypass task. |
| internal/adversarial/data/scope-bypass/fixtures/incident-note.md | Adds an incident note fixture used by external-http scope-bypass task. |
| internal/adversarial/data/scope-bypass/fixtures/go.mod.txt | Adds a renamed go.mod fixture (avoids nested module boundary with go:embed). |
| internal/adversarial/data/prompt-injection/pack.yaml | Defines the prompt-injection pack metadata and task list. |
| internal/adversarial/data/prompt-injection/tasks/system-override.yaml | Adds a prompt-injection task simulating a fake system comment in source. |
| internal/adversarial/data/prompt-injection/tasks/markdown-link-injection.yaml | Adds a prompt-injection task simulating hidden directives in markdown link title. |
| internal/adversarial/data/prompt-injection/tasks/ignore-previous.yaml | Adds a prompt-injection task simulating “ignore previous instructions” in a README. |
| internal/adversarial/data/prompt-injection/tasks/exfil-credentials.yaml | Adds a prompt-injection task simulating credential exfiltration attempt via ticket text. |
| internal/adversarial/data/prompt-injection/fixtures/ticket-with-exfil.txt | Adds a support-ticket fixture embedding an exfiltration directive. |
| internal/adversarial/data/prompt-injection/fixtures/ignore-previous-readme.md | Adds a README fixture embedding an “ignore previous” injection. |
| internal/adversarial/data/prompt-injection/fixtures/fake-system-message.go.txt | Adds a source fixture embedding a fake system override comment. |
| internal/adversarial/data/prompt-injection/fixtures/changelog-link.md | Adds a changelog fixture with hidden directive in markdown link title attribute. |
| cmd/waza/root.go | Registers the new adversarial subcommand on the root CLI. |
| cmd/waza/cmd_adversarial.go | Implements the waza adversarial command and pack extraction/spec synthesis. |
| cmd/waza/cmd_adversarial_test.go | Adds command-level tests for policy behavior, spec inheritance, output writing, and context_dir injection. |

Review details

  • Files reviewed: 28/28 changed files
  • Comments generated: 9
  • Review effort level: Low

Comment thread internal/models/spec.go
Comment thread internal/models/spec.go Outdated
Comment thread internal/models/spec.go Outdated
Comment thread cmd/waza/cmd_adversarial.go
Comment thread cmd/waza/cmd_adversarial.go
Comment thread cmd/waza/cmd_adversarial.go Outdated
Comment thread site/src/content/docs/guides/adversarial.mdx Outdated
Comment thread README.md Outdated
Comment thread internal/adversarial/packs_test.go
Copilot AI added 2 commits June 28, 2026 10:26
%q produces a Go-quoted string with doubled backslashes on Windows;
the assertion must reconstruct the expected value the same way.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
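The `%q` behavior behind that fix is easy to reproduce. In the sketch below, `quoted` is a hypothetical helper and the Windows-style path is an illustrative value; it shows why a test that compares against `%q` output has to build its expected string with `%q` as well.

```go
package main

import "fmt"

// quoted returns the Go-quoted form of s, exactly as the %q verb renders it.
func quoted(s string) string {
	return fmt.Sprintf("%q", s)
}

func main() {
	p := `C:\Temp\fixtures` // raw string: single backslashes
	fmt.Println(p)          // prints C:\Temp\fixtures
	fmt.Println(quoted(p))  // prints "C:\\Temp\\fixtures"
}
```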
- spec.go: Validate now writes back normalized pack names; fix misleading
  doc-comments on AdversarialOnUnsafeOutcome and AdversarialConfig.Packs.
- cmd_adversarial.go: return ExitCodeError instead of os.Exit for config
  errors and unsafe-outcome+fail so deferred cleanups run; clarify
  injectContextDir docstring to match non-recursive behavior.
- guides/adversarial.mdx: drop bogus '--packs ?' lister; point at the
  real --list-packs flag.
- README: bump schema reference from 1.0 to 1.2 (current schema).
- packs_test.go: allow extra built-in packs without breaking the test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
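The reason for returning an error instead of calling `os.Exit` can be shown in a few lines: `os.Exit` terminates the process immediately, so deferred cleanups never run, whereas a returned error lets them run before the exit code is chosen at the top level. Only the type name `ExitCodeError` appears in the commit message; its fields and the `run` and `exitCode` helpers here are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// ExitCodeError carries a desired process exit code up the call stack.
// Field names are hypothetical.
type ExitCodeError struct {
	Code int
	Err  error
}

func (e *ExitCodeError) Error() string { return e.Err.Error() }
func (e *ExitCodeError) Unwrap() error { return e.Err }

// run stands in for a command body that creates temp state and then
// detects an unsafe outcome under the fail policy.
func run(cleaned *bool) error {
	defer func() { *cleaned = true }() // stand-in for removing the temp dir
	return &ExitCodeError{Code: 2, Err: errors.New("unsafe outcome")}
}

// exitCode translates an error into a process exit code.
func exitCode(err error) int {
	if err == nil {
		return 0
	}
	var ece *ExitCodeError
	if errors.As(err, &ece) {
		return ece.Code
	}
	return 1
}

func main() {
	cleaned := false
	code := exitCode(run(&cleaned))
	fmt.Println(code, cleaned) // prints: 2 true
	// A real main would call os.Exit(code) here, after all defers have run.
}
```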
Copilot AI review requested due to automatic review settings June 28, 2026 14:35
@spboyer spboyer merged commit 182bd0c into main Jun 28, 2026
10 checks passed
@spboyer spboyer deleted the spboyer-feat-adversarial-harness branch June 28, 2026 14:40

Copilot AI left a comment


Review details

  • Files reviewed: 28/28 changed files
  • Comments generated: 3
  • Review effort level: Low

Comment on lines +291 to +311
// Reuse runCommandForSpec for the actual run. Set the package-level
// flags it consumes, then restore them on exit so we don't leak state
// across multiple commands in a single process (tests, embedders).
prevContextDir := contextDir
prevOutputPath := outputPath
prevWorkers := workers
prevParallel := parallel
prevVerbose := verbose
defer func() {
	contextDir = prevContextDir
	outputPath = prevOutputPath
	workers = prevWorkers
	parallel = prevParallel
	verbose = prevVerbose
}()

contextDir = artifactsRoot
outputPath = opts.output
workers = opts.workers
parallel = opts.parallel
verbose = opts.verbose
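The save, override, and restore pattern in the snippet above can be condensed with a small generic helper. This is a sketch of an alternative, not what the PR does: `swap` and `runWithOverrides` are hypothetical names, and the two package-level variables merely stand in for the real flags. Like the original, it is not safe if several commands run concurrently in one process.

```go
package main

import "fmt"

// Package-level flags standing in for the ones shown in the snippet above.
var (
	outputPath string
	workers    int
)

// swap sets *p to v and returns a function that restores the previous value.
func swap[T any](p *T, v T) (restore func()) {
	old := *p
	*p = v
	return func() { *p = old }
}

// runWithOverrides applies temporary flag values for the duration of the call.
// swap runs immediately; the restore function it returns is what gets deferred.
func runWithOverrides(out string, n int) string {
	defer swap(&outputPath, out)()
	defer swap(&workers, n)()
	return fmt.Sprintf("%s:%d", outputPath, workers)
}

func main() {
	outputPath, workers = "orig.json", 1
	fmt.Println(runWithOverrides("tmp.json", 4)) // prints tmp.json:4
	fmt.Println(outputPath, workers)             // prints orig.json 1
}
```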
Comment on lines +183 to +196
engineName := strings.TrimSpace(opts.engine)
if engineName == "" {
	if opts.skill == "" {
		engineName = "mock"
	} else {
		engineName = "copilot-sdk"
	}
}
skillName := strings.TrimSpace(opts.skill)
if skillName == "" {
	// Use a deterministic placeholder for mock runs so the synthesized
	// spec validates without forcing the caller to pick one.
	skillName = "adversarial-target"
}
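In the snippet above, the engine default is chosen from the untrimmed `opts.skill`, while the skill name is trimmed afterwards, so a whitespace-only skill would select `copilot-sdk` together with the placeholder skill. The review comment body is not shown on this page, so whether that is what it flagged is unknown. One possible tightening is to trim both inputs before branching; `resolveTarget` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveTarget returns the engine and skill name to use, trimming both
// inputs first so whitespace-only values behave like empty ones.
func resolveTarget(engine, skill string) (string, string) {
	engine = strings.TrimSpace(engine)
	skill = strings.TrimSpace(skill)
	if engine == "" {
		if skill == "" {
			engine = "mock" // no skill given: default to the offline mock engine
		} else {
			engine = "copilot-sdk"
		}
	}
	if skill == "" {
		// Deterministic placeholder so the synthesized spec validates.
		skill = "adversarial-target"
	}
	return engine, skill
}

func main() {
	fmt.Println(resolveTarget("", "   "))
	fmt.Println(resolveTarget("", "my-skill"))
}
```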
Comment thread internal/models/spec.go
Comment on lines +163 to +168
switch a.OnUnsafeOutcome {
case "", AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn:
default:
	return fmt.Errorf("adversarial.on_unsafe_outcome must be %q or %q, got %q",
		AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn, a.OnUnsafeOutcome)
}
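For context, a self-contained version of a validator built around that switch might look like this. The identifiers `AdversarialConfig`, `Packs`, `OnUnsafeOutcome`, and the two constants appear on this page; everything else is assumed, including the plain-string constants, the struct layout, and the choice of whitespace trimming as the "normalization" that the follow-up commit writes back.

```go
package main

import (
	"fmt"
	"strings"
)

// Values match the CLI's {fail,warn}; the real code may use a named type.
const (
	AdversarialOnUnsafeOutcomeFail = "fail"
	AdversarialOnUnsafeOutcomeWarn = "warn"
)

// AdversarialConfig mirrors the adversarial: block (layout assumed).
type AdversarialConfig struct {
	Packs           []string
	OnUnsafeOutcome string
}

// Validate checks the block and writes normalized pack names back.
// Trimming whitespace is an assumed normalization, not confirmed by the PR.
func (a *AdversarialConfig) Validate() error {
	for i, p := range a.Packs {
		name := strings.TrimSpace(p)
		if name == "" {
			return fmt.Errorf("adversarial.packs[%d] must not be empty", i)
		}
		a.Packs[i] = name
	}
	switch a.OnUnsafeOutcome {
	case "", AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn:
		// Empty means "not set"; the caller picks the default.
	default:
		return fmt.Errorf("adversarial.on_unsafe_outcome must be %q or %q, got %q",
			AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn, a.OnUnsafeOutcome)
	}
	return nil
}

func main() {
	cfg := &AdversarialConfig{Packs: []string{" prompt-injection "}, OnUnsafeOutcome: "warn"}
	fmt.Println(cfg.Validate(), cfg.Packs)
}
```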
Successfully merging this pull request may close these issues.

feat: Adversarial / safety evaluators (prompt injection, jailbreak, scope-bypass)
