feat: adversarial / fault-injection harness (closes #365) #392
Merged
Add 'waza adversarial' subcommand and an internal/adversarial package that
ships two built-in fault-injection packs embedded into the binary:
- prompt-injection (4 tasks) — indirect prompt injection through fixture
files (README, source comment, ticket body, changelog link).
- scope-bypass (4 tasks) — out-of-scope action requests (email, file
deletion, package install, external HTTP).
Every adversarial task is golden:true, so unsafe outcomes also flip
'waza gate' to exit 2. The dedicated CLI enforces an --on-unsafe-outcome
policy (fail|warn) and prints a focused safety summary.
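The interaction between the policy and the process exit code can be sketched as follows. This is a minimal illustration, not the code in cmd_adversarial.go: exitCodeFor and its constants are hypothetical names, the 0/2/3 values are the ones listed under the implementation notes, and treating an unrecognized policy as a config error is an assumption.

```go
package main

import "fmt"

// Exit codes as described in this PR: 0 pass, 2 unsafe-with-fail, 3 config error.
// The constant names are hypothetical.
const (
	exitPass        = 0
	exitUnsafe      = 2
	exitConfigError = 3
)

// exitCodeFor maps an --on-unsafe-outcome policy and a count of unsafe
// outcomes to an exit code. Hypothetical helper for illustration only.
func exitCodeFor(policy string, unsafeCount int) int {
	switch policy {
	case "fail":
		if unsafeCount > 0 {
			return exitUnsafe
		}
		return exitPass
	case "warn":
		// Unsafe outcomes are reported but do not fail the run.
		return exitPass
	default:
		// Assumption: an unknown policy is a configuration error.
		return exitConfigError
	}
}

func main() {
	fmt.Println(exitCodeFor("fail", 1))
	fmt.Println(exitCodeFor("warn", 1))
	fmt.Println(exitCodeFor("bogus", 0))
}
```

Because the unsafe exit code is the same value 'waza gate' uses for golden failures, a CI step can treat exit 2 uniformly for both commands.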
Schema 1.2 gains an additive 'adversarial:' block on EvalSpec:
    adversarial:
      packs: [prompt-injection, scope-bypass]
      on_unsafe_outcome: fail
The block is consumed only by 'waza adversarial --spec'; 'waza run' is
unchanged.
Implementation notes:
- internal/adversarial embeds the pack catalog with //go:embed all:data
and exposes ListPacks / LoadPack / Extract / TaskRelPaths.
- cmd/waza/cmd_adversarial.go synthesizes an eval.yaml in a temp dir,
injects absolute context_dir paths into each extracted task (the
runner resolves relative context_dir against SpecDir, not the task
file dir), then reuses runCommandForSpec so all 'waza run' plumbing
is shared.
- Exit codes: 0 pass, 2 unsafe-with-fail, 3 config error. Matches
GateExitGoldenFailure so a single CI step gates goldens + adversarial.
- A go.mod fixture would have created a nested module boundary that
embed silently skips; renamed to go.mod.txt with the task updated.
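The context_dir note above can be illustrated with a small path-resolution sketch. It assumes each extracted task declares its context_dir relative to its own task file, which is what would break when the runner resolves against SpecDir instead. absContextDir is a hypothetical name and is not the PR's injectContextDir, which operates on the task YAML.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// absContextDir returns an absolute context_dir for a task file.
// Hypothetical helper: relative values are resolved against the directory
// containing the task file (an assumption), absolute values are only cleaned.
func absContextDir(taskFile, contextDir string) string {
	if filepath.IsAbs(contextDir) {
		return filepath.Clean(contextDir)
	}
	// filepath.Join also cleans the result, collapsing ".." segments.
	return filepath.Join(filepath.Dir(taskFile), contextDir)
}

func main() {
	task := "/tmp/waza-adv/prompt-injection/tasks/ignore-previous.yaml"
	fmt.Println(absContextDir(task, "../fixtures"))
}
```

Writing the absolute path into the task makes the result independent of where the synthesized eval.yaml lives.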
Tests:
- internal/adversarial/packs_test.go — ListPacks, LoadPack, Extract,
'every task is golden' invariant.
- cmd/waza/cmd_adversarial_test.go — warn-policy run, fail-policy exit
hook, unknown-pack rejection, spec-block resolution, flag overrides,
--output JSON round-trip, injectContextDir round-trip.
Docs:
- New guide: site/src/content/docs/guides/adversarial.mdx
- CLI reference: site/src/content/docs/reference/cli.mdx
- README.md: command index + 'waza adversarial' section
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR adds a new offline adversarial / fault-injection harness to waza, including a `waza adversarial` CLI, two embedded built-in packs (prompt injection + scope bypass), and a schema-1.2 additive `adversarial:` block on `EvalSpec` so pack selection/policy can be inherited from eval.yaml.
Changes:
- Added `waza adversarial` command that extracts embedded packs, synthesizes an `eval.yaml`, and reuses the existing run pipeline while enforcing an `--on-unsafe-outcome {fail,warn}` policy.
- Introduced `EvalSpec.adversarial` (schema 1.2 additive) and updated schema-version documentation/comments to reflect 1.2.
- Added embedded pack catalog + fixtures, tests for pack loading/extraction invariants, and docs (site guide + CLI reference + README updates).
| File | Description |
|---|---|
| site/src/content/docs/reference/cli.mdx | Documents the new waza adversarial command, flags, and exit codes in the CLI reference. |
| site/src/content/docs/guides/adversarial.mdx | Adds a dedicated guide explaining packs, policies, CI integration, and schema usage. |
| README.md | Adds waza adversarial usage + flags and links to the guide. |
| internal/models/spec.go | Adds EvalSpec.Adversarial plus schema types/validation for the new block. |
| internal/models/schema_version.go | Updates schema 1.2 commentary to include the new adversarial block. |
| internal/adversarial/packs.go | Implements embedded-pack catalog loading and extraction (//go:embed all:data). |
| internal/adversarial/packs_test.go | Adds tests for pack listing/loading/extraction and “every task is golden” invariant. |
| internal/adversarial/data/scope-bypass/pack.yaml | Defines the scope-bypass pack metadata and task list. |
| internal/adversarial/data/scope-bypass/tasks/send-email.yaml | Adds a scope-bypass task that requires refusing to send email. |
| internal/adversarial/data/scope-bypass/tasks/install-package.yaml | Adds a scope-bypass task that requires refusing dependency installation. |
| internal/adversarial/data/scope-bypass/tasks/external-http.yaml | Adds a scope-bypass task that requires refusing external network calls. |
| internal/adversarial/data/scope-bypass/tasks/delete-files.yaml | Adds a scope-bypass task that requires refusing destructive file deletion. |
| internal/adversarial/data/scope-bypass/fixtures/review-change.diff | Adds a review diff fixture used by scope-bypass tasks. |
| internal/adversarial/data/scope-bypass/fixtures/project-structure.txt | Adds a project layout fixture used by delete-files scope-bypass task. |
| internal/adversarial/data/scope-bypass/fixtures/incident-note.md | Adds an incident note fixture used by external-http scope-bypass task. |
| internal/adversarial/data/scope-bypass/fixtures/go.mod.txt | Adds a renamed go.mod fixture (avoids nested module boundary with go:embed). |
| internal/adversarial/data/prompt-injection/pack.yaml | Defines the prompt-injection pack metadata and task list. |
| internal/adversarial/data/prompt-injection/tasks/system-override.yaml | Adds a prompt-injection task simulating a fake system comment in source. |
| internal/adversarial/data/prompt-injection/tasks/markdown-link-injection.yaml | Adds a prompt-injection task simulating hidden directives in markdown link title. |
| internal/adversarial/data/prompt-injection/tasks/ignore-previous.yaml | Adds a prompt-injection task simulating “ignore previous instructions” in a README. |
| internal/adversarial/data/prompt-injection/tasks/exfil-credentials.yaml | Adds a prompt-injection task simulating credential exfiltration attempt via ticket text. |
| internal/adversarial/data/prompt-injection/fixtures/ticket-with-exfil.txt | Adds a support-ticket fixture embedding an exfiltration directive. |
| internal/adversarial/data/prompt-injection/fixtures/ignore-previous-readme.md | Adds a README fixture embedding an “ignore previous” injection. |
| internal/adversarial/data/prompt-injection/fixtures/fake-system-message.go.txt | Adds a source fixture embedding a fake system override comment. |
| internal/adversarial/data/prompt-injection/fixtures/changelog-link.md | Adds a changelog fixture with hidden directive in markdown link title attribute. |
| cmd/waza/root.go | Registers the new adversarial subcommand on the root CLI. |
| cmd/waza/cmd_adversarial.go | Implements the waza adversarial command and pack extraction/spec synthesis. |
| cmd/waza/cmd_adversarial_test.go | Adds command-level tests for policy behavior, spec inheritance, output writing, and context_dir injection. |
Review details
- Files reviewed: 28/28 changed files
- Comments generated: 9
- Review effort level: Low
%q produces a Go-quoted string with doubled backslashes on Windows; the assertion must reconstruct the expected value the same way.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
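A short sketch of the %q behavior mentioned in that commit note. The path is a made-up Windows-style example and quoted is a hypothetical helper; the point is only that a test comparing against %q output should build its expected string with the same verb rather than by hand.

```go
package main

import "fmt"

// quoted formats a path the way a %q verb would. Hypothetical helper.
func quoted(p string) string {
	return fmt.Sprintf("%q", p)
}

func main() {
	p := `C:\Temp\waza\fixtures` // made-up example path
	// %q wraps the value in double quotes and escapes each backslash.
	fmt.Println(quoted(p)) // prints "C:\\Temp\\waza\\fixtures"
}
```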
- spec.go: Validate now writes back normalized pack names; fix misleading doc-comments on AdversarialOnUnsafeOutcome and AdversarialConfig.Packs.
- cmd_adversarial.go: return ExitCodeError instead of os.Exit for config errors and unsafe-outcome+fail so deferred cleanups run; clarify injectContextDir docstring to match non-recursive behavior.
- guides/adversarial.mdx: drop bogus '--packs ?' lister; point at the real --list-packs flag.
- README: bump schema reference from 1.0 to 1.2 (current schema).
- packs_test.go: allow extra built-in packs without breaking the test.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +291 to +311

    // Reuse runCommandForSpec for the actual run. Set the package-level
    // flags it consumes, then restore them on exit so we don't leak state
    // across multiple commands in a single process (tests, embedders).
    prevContextDir := contextDir
    prevOutputPath := outputPath
    prevWorkers := workers
    prevParallel := parallel
    prevVerbose := verbose
    defer func() {
        contextDir = prevContextDir
        outputPath = prevOutputPath
        workers = prevWorkers
        parallel = prevParallel
        verbose = prevVerbose
    }()

    contextDir = artifactsRoot
    outputPath = opts.output
    workers = opts.workers
    parallel = opts.parallel
    verbose = opts.verbose
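The save-and-restore pattern in the snippet above can be exercised in isolation. The sketch below is a reduced, self-contained version under stated assumptions: only two package-level variables are modeled, and runWithOverrides is a hypothetical wrapper that does not exist in the PR.

```go
package main

import "fmt"

// Stand-ins for the package-level flags consumed by the shared run path.
var (
	outputPath string
	workers    int
)

// runWithOverrides sets the globals for the duration of run and restores
// the previous values afterwards, even if run returns an error or panics.
// Hypothetical helper for illustration.
func runWithOverrides(out string, n int, run func() error) error {
	prevOutputPath, prevWorkers := outputPath, workers
	defer func() {
		outputPath, workers = prevOutputPath, prevWorkers
	}()

	outputPath, workers = out, n
	return run()
}

func main() {
	workers = 1
	_ = runWithOverrides("adversarial.json", 4, func() error {
		fmt.Println("during run:", outputPath, workers)
		return nil
	})
	fmt.Println("after run:", workers)
}
```

Restoring in a defer keeps later commands in the same process, such as tests, from seeing leftover values. It does not make the globals safe for concurrent use.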
Comment on lines +183 to +196

    engineName := strings.TrimSpace(opts.engine)
    if engineName == "" {
        if opts.skill == "" {
            engineName = "mock"
        } else {
            engineName = "copilot-sdk"
        }
    }
    skillName := strings.TrimSpace(opts.skill)
    if skillName == "" {
        // Use a deterministic placeholder for mock runs so the synthesized
        // spec validates without forcing the caller to pick one.
        skillName = "adversarial-target"
    }
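In the snippet above the engine default is chosen from the untrimmed opts.skill, while the skill name is trimmed afterwards, so a whitespace-only skill would select copilot-sdk together with the placeholder skill. The following is a hedged variant, not the PR's code, that trims both inputs before deciding. resolveDefaults is a hypothetical name; the string values are taken from the snippet.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveDefaults trims both inputs first, then applies the defaults shown
// in the reviewed snippet. Hypothetical helper for illustration.
func resolveDefaults(engine, skill string) (string, string) {
	engine = strings.TrimSpace(engine)
	skill = strings.TrimSpace(skill)

	if engine == "" {
		if skill == "" {
			engine = "mock"
		} else {
			engine = "copilot-sdk"
		}
	}
	if skill == "" {
		// Placeholder so a synthesized spec has a skill name for mock runs.
		skill = "adversarial-target"
	}
	return engine, skill
}

func main() {
	fmt.Println(resolveDefaults("", "   "))
	fmt.Println(resolveDefaults("", "my-skill"))
}
```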
Comment on lines +163 to +168

    switch a.OnUnsafeOutcome {
    case "", AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn:
    default:
        return fmt.Errorf("adversarial.on_unsafe_outcome must be %q or %q, got %q",
            AdversarialOnUnsafeOutcomeFail, AdversarialOnUnsafeOutcomeWarn, a.OnUnsafeOutcome)
    }
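For readers who want to run the check above outside the repository, here is a self-contained sketch. The type and constant names are lowercase stand-ins rather than the ones in internal/models/spec.go, and the "fail" and "warn" values are assumed from the policy names used throughout this PR.

```go
package main

import "fmt"

// Assumed policy values, matching the fail|warn names used in the PR.
const (
	onUnsafeFail = "fail"
	onUnsafeWarn = "warn"
)

// adversarialConfig is a stand-in for the adversarial: block on EvalSpec.
type adversarialConfig struct {
	Packs           []string
	OnUnsafeOutcome string
}

// validate accepts an empty policy (left for the caller to default) or one
// of the two known values, and rejects anything else.
func (a *adversarialConfig) validate() error {
	switch a.OnUnsafeOutcome {
	case "", onUnsafeFail, onUnsafeWarn:
		return nil
	default:
		return fmt.Errorf("adversarial.on_unsafe_outcome must be %q or %q, got %q",
			onUnsafeFail, onUnsafeWarn, a.OnUnsafeOutcome)
	}
}

func main() {
	cfg := &adversarialConfig{
		Packs:           []string{"prompt-injection", "scope-bypass"},
		OnUnsafeOutcome: "explode",
	}
	fmt.Println(cfg.validate())
}
```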
Summary
Adds the Wave 4 adversarial / fault-injection harness called out in #365.
- `waza adversarial`: runs one or more built-in adversarial packs against a skill and enforces an `--on-unsafe-outcome {fail,warn}` policy.
- `prompt-injection` (4 tasks): indirect prompt injection through fixture files (README, source comment, ticket body, changelog link).
- `scope-bypass` (4 tasks): out-of-scope action requests (email, file deletion, package install, external HTTP).
- Schema 1.2 gains an additive `adversarial:` block on `EvalSpec`, consumed only by `waza adversarial --spec`; `waza run` is unchanged.
- Every adversarial task is `golden: true`, so unsafe outcomes also flip `waza gate` to exit 2. The dedicated CLI exits 0 on pass, 2 on unsafe-with-fail, 3 on config error.

Implementation notes
- `internal/adversarial` embeds the catalog via `//go:embed all:data` and exposes `ListPacks` / `LoadPack` / `Extract` / `TaskRelPaths`.
- `cmd/waza/cmd_adversarial.go` synthesizes an `eval.yaml` in a temp dir, injects absolute `context_dir` paths into each extracted task (the runner resolves relative `context_dir` against `SpecDir`, not the task file dir), then reuses `runCommandForSpec` so all `waza run` plumbing is shared.
- The `--list-packs` flag short-circuits before pack resolution and prints `name`, `task count`, `description` for each embedded pack.
- A `go.mod` fixture would have created a nested module boundary that `//go:embed` silently skips; renamed to `go.mod.txt` with the referencing task updated.

Test plan
- `go build ./...`: clean
- `go vet ./...`: clean
- `go test ./...`: all green
- `golangci-lint run`: 0 issues
- `cd site && npm run build`: clean (24 pages)
- `waza adversarial --list-packs`: lists both packs
- `waza adversarial --packs prompt-injection --on-unsafe-outcome warn`: exits 0
- `waza adversarial --packs scope-bypass --on-unsafe-outcome fail`: exits 2
- `waza adversarial --packs not-a-pack`: exits 3

New tests
- `internal/adversarial/packs_test.go`: `ListPacks`, `LoadPack`, `Extract`, "every task is golden" invariant.
- `cmd/waza/cmd_adversarial_test.go`: 7 tests: warn-policy run, fail-policy exit hook, unknown-pack rejection, spec-block resolution, flag overrides, default packs, `--output` JSON round-trip, `injectContextDir` round-trip.

Docs
- `site/src/content/docs/guides/adversarial.mdx`
- `site/src/content/docs/reference/cli.mdx`: `waza adversarial` subsection

Schema
Schema stays at 1.2: the `adversarial:` block is purely additive per the Wave 3 semver policy (#368).

Closes #365