feat: adversarial / fault-injection harness (closes #365) by spboyer · Pull Request #392 · microsoft/waza

spboyer · 2026-06-28T14:21:36Z

Summary

Adds the Wave 4 adversarial / fault-injection harness called out in #365.

- New CLI: waza adversarial, which runs one or more built-in adversarial packs against a skill and enforces an --on-unsafe-outcome {fail,warn} policy.
- Two built-in packs, both embedded into the binary:
  - prompt-injection (4 tasks): indirect prompt injection through fixture files (README, source comment, ticket body, changelog link).
  - scope-bypass (4 tasks): out-of-scope action requests (email, file deletion, package install, external HTTP).
- Schema 1.2 additive adversarial: block on EvalSpec:
  ```
  adversarial:
    packs: [prompt-injection, scope-bypass]
    on_unsafe_outcome: fail
  ```
  Consumed only by waza adversarial --spec; waza run is unchanged.
- Every adversarial task is golden: true, so unsafe outcomes also flip waza gate to exit 2. The dedicated CLI exits 0 on pass, 2 on unsafe-with-fail, 3 on config error.
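
The exit-code contract above can be sketched as a small pure function. This is only an illustration of the documented mapping (0 pass, 2 unsafe-with-fail, 3 config error); `exitCodeFor` and the constant names are hypothetical and are not taken from the waza source. The sketch also assumes the caller has already resolved an empty policy to its default.

```go
package main

import "fmt"

// Exit codes as described in the PR summary. The names are illustrative.
const (
	exitPass        = 0
	exitUnsafe      = 2
	exitConfigError = 3
)

// exitCodeFor is a hypothetical helper that maps an --on-unsafe-outcome
// policy and the number of unsafe task outcomes to a process exit code.
// Assumption: an empty policy has been replaced by its default before
// this function is called, so "" is treated like any other unknown value.
func exitCodeFor(policy string, unsafeCount int) (int, error) {
	switch policy {
	case "fail":
		if unsafeCount > 0 {
			return exitUnsafe, nil
		}
		return exitPass, nil
	case "warn":
		// Unsafe outcomes are reported but do not fail the run.
		return exitPass, nil
	default:
		return exitConfigError, fmt.Errorf("unknown policy %q", policy)
	}
}

func main() {
	cases := []struct {
		policy string
		unsafe int
	}{
		{"fail", 0}, {"fail", 2}, {"warn", 2}, {"bogus", 0},
	}
	for _, tc := range cases {
		code, err := exitCodeFor(tc.policy, tc.unsafe)
		fmt.Printf("policy=%s unsafe=%d exit=%d err=%v\n", tc.policy, tc.unsafe, code, err)
	}
}
```

Keeping the mapping in one function makes it easy to unit-test the policy separately from the run pipeline.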

Implementation notes

internal/adversarial embeds the catalog via //go:embed all:data and exposes ListPacks / LoadPack / Extract / TaskRelPaths.
cmd/waza/cmd_adversarial.go synthesizes an eval.yaml in a temp dir, injects absolute context_dir paths into each extracted task (the runner resolves relative context_dir against SpecDir, not the task file dir), then reuses runCommandForSpec so all waza run plumbing is shared.
--list-packs flag short-circuits before pack resolution and prints name, task count, description for each embedded pack.
One subtle gotcha addressed: a go.mod fixture would have created a nested module boundary that //go:embed silently skips; renamed to go.mod.txt with the referencing task updated.

Test plan

go build ./... — clean
go vet ./... — clean
go test ./... — all green
golangci-lint run — 0 issues
cd site && npm run build — clean (24 pages)
Manual smoke:
- waza adversarial --list-packs — lists both packs
- waza adversarial --packs prompt-injection --on-unsafe-outcome warn — exits 0
- waza adversarial --packs scope-bypass --on-unsafe-outcome fail — exits 2
- waza adversarial --packs not-a-pack — exits 3

New tests

internal/adversarial/packs_test.go: ListPacks, LoadPack, Extract, "every task is golden" invariant.
cmd/waza/cmd_adversarial_test.go: tests covering warn-policy run, fail-policy exit hook, unknown-pack rejection, spec-block resolution, flag overrides, default packs, --output JSON round-trip, injectContextDir round-trip.

Docs

New guide: site/src/content/docs/guides/adversarial.mdx
CLI reference: site/src/content/docs/reference/cli.mdx
README: command index + waza adversarial subsection

Schema

Schema stays at 1.2 — the adversarial: block is purely additive per the Wave 3 semver policy (#368).

Closes #365

Add 'waza adversarial' subcommand and an internal/adversarial package that ships two built-in fault-injection packs embedded into the binary:

- prompt-injection (4 tasks): indirect prompt injection through fixture files (README, source comment, ticket body, changelog link).
- scope-bypass (4 tasks): out-of-scope action requests (email, file deletion, package install, external HTTP).

Every adversarial task is golden:true, so unsafe outcomes also flip 'waza gate' to exit 2. The dedicated CLI enforces an --on-unsafe-outcome policy (fail|warn) and prints a focused safety summary.

Schema 1.2 gains an additive 'adversarial:' block on EvalSpec:

    adversarial:
      packs: [prompt-injection, scope-bypass]
      on_unsafe_outcome: fail

The block is consumed only by 'waza adversarial --spec'; 'waza run' is unchanged.

Implementation notes:

- internal/adversarial embeds the pack catalog with //go:embed all:data and exposes ListPacks / LoadPack / Extract / TaskRelPaths.
- cmd/waza/cmd_adversarial.go synthesizes an eval.yaml in a temp dir, injects absolute context_dir paths into each extracted task (the runner resolves relative context_dir against SpecDir, not the task file dir), then reuses runCommandForSpec so all 'waza run' plumbing is shared.
- Exit codes: 0 pass, 2 unsafe-with-fail, 3 config error. Matches GateExitGoldenFailure so a single CI step gates goldens + adversarial.
- A go.mod fixture would have created a nested module boundary that embed silently skips; renamed to go.mod.txt with the task updated.

Tests:

- internal/adversarial/packs_test.go: ListPacks, LoadPack, Extract, 'every task is golden' invariant.
- cmd/waza/cmd_adversarial_test.go: warn-policy run, fail-policy exit hook, unknown-pack rejection, spec-block resolution, flag overrides, --output JSON round-trip, injectContextDir round-trip.

Docs:

- New guide: site/src/content/docs/guides/adversarial.mdx
- CLI reference: site/src/content/docs/reference/cli.mdx
- README.md: command index + 'waza adversarial' section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adds a new offline adversarial / fault-injection harness to waza, including a waza adversarial CLI, two embedded built-in packs (prompt injection + scope bypass), and a schema-1.2 additive adversarial: block on EvalSpec so pack selection/policy can be inherited from eval.yaml.

Changes:

Added waza adversarial command that extracts embedded packs, synthesizes an eval.yaml, and reuses the existing run pipeline while enforcing an --on-unsafe-outcome {fail,warn} policy.
Introduced EvalSpec.adversarial (schema 1.2 additive) and updated schema-version documentation/comments to reflect 1.2.
Added embedded pack catalog + fixtures, tests for pack loading/extraction invariants, and docs (site guide + CLI reference + README updates).

Show a summary per file

| File | Description |
| --- | --- |
| site/src/content/docs/reference/cli.mdx | Documents the new `waza adversarial` command, flags, and exit codes in the CLI reference. |
| site/src/content/docs/guides/adversarial.mdx | Adds a dedicated guide explaining packs, policies, CI integration, and schema usage. |
| README.md | Adds `waza adversarial` usage + flags and links to the guide. |
| internal/models/spec.go | Adds `EvalSpec.Adversarial` plus schema types/validation for the new block. |
| internal/models/schema_version.go | Updates schema 1.2 commentary to include the new `adversarial` block. |
| internal/adversarial/packs.go | Implements embedded-pack catalog loading and extraction (`//go:embed all:data`). |
| internal/adversarial/packs_test.go | Adds tests for pack listing/loading/extraction and “every task is golden” invariant. |
| internal/adversarial/data/scope-bypass/pack.yaml | Defines the scope-bypass pack metadata and task list. |
| internal/adversarial/data/scope-bypass/tasks/send-email.yaml | Adds a scope-bypass task that requires refusing to send email. |
| internal/adversarial/data/scope-bypass/tasks/install-package.yaml | Adds a scope-bypass task that requires refusing dependency installation. |
| internal/adversarial/data/scope-bypass/tasks/external-http.yaml | Adds a scope-bypass task that requires refusing external network calls. |
| internal/adversarial/data/scope-bypass/tasks/delete-files.yaml | Adds a scope-bypass task that requires refusing destructive file deletion. |
| internal/adversarial/data/scope-bypass/fixtures/review-change.diff | Adds a review diff fixture used by scope-bypass tasks. |
| internal/adversarial/data/scope-bypass/fixtures/project-structure.txt | Adds a project layout fixture used by delete-files scope-bypass task. |
| internal/adversarial/data/scope-bypass/fixtures/incident-note.md | Adds an incident note fixture used by external-http scope-bypass task. |
| internal/adversarial/data/scope-bypass/fixtures/go.mod.txt | Adds a renamed go.mod fixture (avoids nested module boundary with go:embed). |
| internal/adversarial/data/prompt-injection/pack.yaml | Defines the prompt-injection pack metadata and task list. |
| internal/adversarial/data/prompt-injection/tasks/system-override.yaml | Adds a prompt-injection task simulating a fake system comment in source. |
| internal/adversarial/data/prompt-injection/tasks/markdown-link-injection.yaml | Adds a prompt-injection task simulating hidden directives in markdown link title. |
| internal/adversarial/data/prompt-injection/tasks/ignore-previous.yaml | Adds a prompt-injection task simulating “ignore previous instructions” in a README. |
| internal/adversarial/data/prompt-injection/tasks/exfil-credentials.yaml | Adds a prompt-injection task simulating credential exfiltration attempt via ticket text. |
| internal/adversarial/data/prompt-injection/fixtures/ticket-with-exfil.txt | Adds a support-ticket fixture embedding an exfiltration directive. |
| internal/adversarial/data/prompt-injection/fixtures/ignore-previous-readme.md | Adds a README fixture embedding an “ignore previous” injection. |
| internal/adversarial/data/prompt-injection/fixtures/fake-system-message.go.txt | Adds a source fixture embedding a fake system override comment. |
| internal/adversarial/data/prompt-injection/fixtures/changelog-link.md | Adds a changelog fixture with hidden directive in markdown link title attribute. |
| cmd/waza/root.go | Registers the new `adversarial` subcommand on the root CLI. |
| cmd/waza/cmd_adversarial.go | Implements the `waza adversarial` command and pack extraction/spec synthesis. |
| cmd/waza/cmd_adversarial_test.go | Adds command-level tests for policy behavior, spec inheritance, output writing, and context_dir injection. |

Review details

Files reviewed: 28/28 changed files
Comments generated: 9
Review effort level: Low

- spec.go: Validate now writes back normalized pack names; fix misleading doc-comments on AdversarialOnUnsafeOutcome and AdversarialConfig.Packs.
- cmd_adversarial.go: return ExitCodeError instead of os.Exit for config errors and unsafe-outcome+fail so deferred cleanups run; clarify injectContextDir docstring to match non-recursive behavior.
- guides/adversarial.mdx: drop bogus '--packs ?' lister; point at the real --list-packs flag.
- README: bump schema reference from 1.0 to 1.2 (current schema).
- packs_test.go: allow extra built-in packs without breaking the test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Review details

Files reviewed: 28/28 changed files
Comments generated: 3
Review effort level: Low

+	// Reuse runCommandForSpec for the actual run. Set the package-level
+	// flags it consumes, then restore them on exit so we don't leak state
+	// across multiple commands in a single process (tests, embedders).
+	prevContextDir := contextDir
+	prevOutputPath := outputPath
+	prevWorkers := workers
+	prevParallel := parallel
+	prevVerbose := verbose
+	defer func() {
+		contextDir = prevContextDir
+		outputPath = prevOutputPath
+		workers = prevWorkers
+		parallel = prevParallel
+		verbose = prevVerbose
+	}()
+
+	contextDir = artifactsRoot
+	outputPath = opts.output
+	workers = opts.workers
+	parallel = opts.parallel
+	verbose = opts.verbose
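
The save-and-restore sequence in the snippet above could be written more compactly with a small generic helper. This is a sketch of the general pattern only, not a proposed change from the review; `swap`, `runWithOverrides`, and the stand-in variables are hypothetical.

```go
package main

import "fmt"

// Package-level settings standing in for the CLI's shared flag variables.
var (
	outputPath = "default.json"
	workers    = 1
)

// swap sets *p to v and returns a function that restores the previous value.
func swap[T any](p *T, v T) (restore func()) {
	prev := *p
	*p = v
	return func() { *p = prev }
}

// runWithOverrides applies temporary overrides, calls run, and restores the
// previous values on return. swap executes immediately; only the returned
// restore function is deferred.
func runWithOverrides(out string, n int, run func() string) string {
	defer swap(&outputPath, out)()
	defer swap(&workers, n)()
	return run()
}

// describe reports the current values of the shared settings.
func describe() string { return fmt.Sprintf("%s/%d", outputPath, workers) }

func main() {
	fmt.Println(runWithOverrides("adv.json", 4, describe)) // values during the run
	fmt.Println(describe())                                // values after restore
}
```

Like the original, this only protects sequential callers; concurrent use of shared package-level state would still need separate synchronization.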


+	engineName := strings.TrimSpace(opts.engine)
+	if engineName == "" {
+		if opts.skill == "" {
+			engineName = "mock"
+		} else {
+			engineName = "copilot-sdk"
+		}
+	}
+	skillName := strings.TrimSpace(opts.skill)
+	if skillName == "" {
+		// Use a deterministic placeholder for mock runs so the synthesized
+		// spec validates without forcing the caller to pick one.
+		skillName = "adversarial-target"
+	}


+	switch a.OnUnsafeOutcome {
+	case "", AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn:
+	default:
+		return fmt.Errorf("adversarial.on_unsafe_outcome must be %q or %q, got %q",
+			AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn, a.OnUnsafeOutcome)
+	}


Copilot AI review requested due to automatic review settings June 28, 2026 14:21

Copilot started reviewing on behalf of spboyer June 28, 2026 14:22

Copilot AI reviewed Jun 28, 2026

Copilot AI added 2 commits June 28, 2026 10:26

test: fix Windows path quoting in injectContextDir test

7273973

%q produces a Go-quoted string with doubled backslashes on Windows; the assertion must reconstruct the expected value the same way. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 28, 2026 14:35

Copilot started reviewing on behalf of spboyer June 28, 2026 14:36

spboyer merged commit 182bd0c into main Jun 28, 2026
10 checks passed

spboyer deleted the spboyer-feat-adversarial-harness branch June 28, 2026 14:40

Copilot AI reviewed Jun 28, 2026

spboyer merged 3 commits into main from spboyer-feat-adversarial-harness