[Security] prompt injection defense: instruction signing and model verify gates; unauthorized mutation prevention#11119
Closed
adamavenir wants to merge 9 commits into openclaw:main
- Template files in llm/prompts/*.txt with loading/interpolation support
- verify tool (owner-only) for checking template signatures and message provenance
- Session-scoped message signing for owner messages via @disreguard/sig
- Deterministic verification gate in before-tool-call hook (runs before plugin hooks)
- Turn-scoped security state (resets per user message)
- Config: agents.defaults.sig.enforceVerification (default: false)
- Gated tools when enforcement enabled: exec, write, edit, apply_patch, message, gateway, sessions_spawn, sessions_send
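A minimal sketch of how a turn-scoped verification gate like the one in this commit could behave. The tool names and the enforceVerification flag come from the commit message; `TurnSecurityState`, `checkGate`, and `GateDecision` are hypothetical names for illustration, not the PR's actual code.

```typescript
// Hypothetical shapes; only the gated tool names and the enforcement flag
// are taken from the commit message.
type GateDecision = { allowed: true } | { allowed: false; reason: string };

const GATED_TOOLS = new Set([
  "exec", "write", "edit", "apply_patch",
  "message", "gateway", "sessions_spawn", "sessions_send",
]);

// Turn-scoped state: reset whenever a new user message starts a turn.
class TurnSecurityState {
  private verified = false;
  markVerified(): void { this.verified = true; }
  resetForNewTurn(): void { this.verified = false; }
  isVerified(): boolean { return this.verified; }
}

function checkGate(
  toolName: string,
  state: TurnSecurityState,
  enforceVerification: boolean,
): GateDecision {
  // Enforcement off (the default): the gate is a no-op.
  if (!enforceVerification) return { allowed: true };
  // Tools outside the gated set are never blocked by this gate.
  if (!GATED_TOOLS.has(toolName)) return { allowed: true };
  // Gated tools pass only if verify succeeded earlier in this turn.
  if (state.isVerified()) return { allowed: true };
  return { allowed: false, reason: `${toolName} requires verify in this turn` };
}
```

Because the decision depends only on the tool name, the flag, and the per-turn state, it stays deterministic regardless of what text the model has seen.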
…cripts
- sig-gate-audit.ts: audit logging for gate decisions (gate_blocked/gate_allowed to .sig/audit.jsonl)
- sig-gate-audit.test.ts: 7 tests for audit logging
- pi-tools.before-tool-call.ts: fire-and-forget audit writes on gate decisions
- adversarial-harness.ts: multi-turn scenario runner through real hook pipeline
- adversarial-harness.test.ts: 8 mocked attack scenarios
- adversarial-injection.live.test.ts: 3 live LLM tests (signed, unsigned, vulnerable)
- docs/reference/adversarial-testing.md: updated with verify-in-the-loop and audit logging
- sig-zenity-scenario.test.ts: refactored to import createMockTool from harness
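A rough sketch of what fire-and-forget audit logging for gate decisions could look like. Only the event names (gate_blocked, gate_allowed) and the JSONL target come from the commit message; the entry fields and the `AppendLine` sink are assumptions made for illustration.

```typescript
// Hypothetical entry shape; the real audit record may carry different fields.
interface GateAuditEntry {
  ts: string;
  event: "gate_blocked" | "gate_allowed";
  tool: string;
  turnId: string;
  reason?: string;
}

// Injected sink so the sketch stays free of filesystem code.
type AppendLine = (line: string) => Promise<void>;

// One JSON object per line, which is what makes the file JSONL.
function formatAuditLine(entry: GateAuditEntry): string {
  return JSON.stringify(entry) + "\n";
}

function auditGateDecision(append: AppendLine, entry: GateAuditEntry): void {
  // Fire-and-forget: the caller does not await the write, and a rejected
  // promise is swallowed so audit failures never block a tool call.
  void append(formatAuditLine(entry)).catch(() => undefined);
}
```

In a real hook the sink would presumably wrap an append to .sig/audit.jsonl, for example via `appendFile` from `node:fs/promises`; that wiring is left out here.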
apply_patch was excluded from the mutation gate because its file paths are embedded in patch content rather than tool params. This meant after verify, apply_patch could modify protected files (soul.md, agents.md) without going through update_and_sign — defeating the provenance requirement. Parse *** Add/Update/Delete File markers from the patch input and check each path against sig file policies. Includes unit tests for the fix and an adversarial harness scenario for the verified-agent-uses-apply_patch oversight.
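The fix described above could be sketched roughly as follows. This assumes the markers take the form `*** Update File: <path>` (likewise for Add and Delete); `extractPatchPaths`, `protectedPathsInPatch`, the `FilePolicy` shape, and the case-insensitive path normalization are all hypothetical and not taken from the PR.

```typescript
// Assumed marker format: "*** Add File: path", "*** Update File: path",
// "*** Delete File: path".
const FILE_MARKER = /^\*\*\* (?:Add|Update|Delete) File: (.+)$/;

// Collect every file path the patch declares it will touch.
function extractPatchPaths(patch: string): string[] {
  const paths: string[] = [];
  for (const line of patch.split(/\r?\n/)) {
    const path = FILE_MARKER.exec(line.trim())?.[1];
    if (path) paths.push(path.trim());
  }
  return paths;
}

// Hypothetical policy shape, keyed by normalized path.
interface FilePolicy { mutable: boolean }

// Assumption: compare paths case-insensitively and ignore a leading "./".
function normalizePath(p: string): string {
  return p.replace(/^\.\//, "").toLowerCase();
}

// Paths in the patch that have a sig file policy and so must go through
// update_and_sign instead of apply_patch.
function protectedPathsInPatch(
  patch: string,
  policies: Map<string, FilePolicy>,
): string[] {
  return extractPatchPaths(patch).filter((p) => policies.has(normalizePath(p)));
}
```

A mutation gate would block the apply_patch call whenever `protectedPathsInPatch` returns a non-empty list.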
Author
Caught an issue: need to bootstrap the signing/mutation flow. Working on this...
This is very interesting. Please rebase this.
Author
@ksylvan will do
bfc1ccb to f92900f
Member
Thanks for your submission. However, we are closing your PR as stale. If you need to reopen it, please review the contributing guide and, if you feel it's required, reopen under a new PR. Ensure you have addressed all checks, conflicts, and issues. Thanks.
DRAFT: Do not merge
I'm still tweaking and testing this. My intent is to personally run it in production for a bit to suss out UX edge cases. I will try to keep my fork in step with main as I do that.
Author's note
I've been primarily focused on thinking about prompt injection defense for over a year. There are some other things that could be done to better defend OpenClaw against prompt injection, but this is the most substantive high-impact one, and it directly addresses the Zenity Labs vector.
I wasn't going to push this work out there yet as I have a much bigger project I'm focused on—I ripped this concept out of that other project because I thought it could be useful here.
Synopsis
Integrate sig to give the agent an out-of-band mechanism for verifying the authenticity of its instructions, the provenance of owner messages, and the integrity of mutable workspace files.
I'll give you the tl;dr from Opus: consistently proven to defend against unauthorized actions taken from prompt injection.
Description
Prompt injection exploits the fact that all text is text to the model. System prompts, user messages, and injected payloads are indistinguishable. Existing orchestrator-level controls (owner-only tools, untrusted metadata labels, SSRF guards) are deterministic and reliable, but they do not help the model distinguish authentic instructions from injected ones. An attacker who has read the source knows the exact label format and can spoof it.
sig addresses this gap at two levels:
Static verification. sig signs system prompt templates at authoring time (SHA-256 content hash, stored in .sig/sigs/). At runtime, the agent calls a verify tool that returns the original signed content from a code path the attacker cannot influence. Even with full knowledge of the system, an attacker cannot make verify() return a valid result for tampered content. The verify tool is paired with a deterministic verification gate that blocks sensitive tools unless verify has been called in the current turn.
Mutation protection. Mutable workspace files (soul.md, agents.md, heartbeat.md) are protected by sig file policies. A mutation gate intercepts write/edit calls targeting these files and redirects the agent to update_and_sign, which validates that the change traces back to a signed owner message. This directly addresses the Zenity Labs backdoor attack (Feb 2026), where an agent was tricked via indirect prompt injection into modifying SOUL.md to establish a persistent C2 channel. With mutation protection, that attack fails because the agent cannot produce valid provenance for an instruction that originated from untrusted content.
Threat model positioning
This is additive to existing controls, not a replacement:
- OWNER_ONLY_TOOL_NAMES
- fetchWithSsrfGuard
sig does not claim to solve prompt injection. It creates explicit trust boundaries that the orchestrator enforces deterministically. Defense in depth.
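To make the static verification idea from the Description concrete, here is a minimal sketch of content-hash signing and verification with SHA-256. It is not the @disreguard/sig API: `SignatureRecord`, `signTemplate`, and `verifyTemplate` are hypothetical names, and storing the signed content alongside the hash is an assumption.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record, standing in for whatever sig stores under .sig/sigs/.
interface SignatureRecord {
  file: string;
  hash: string;          // hex-encoded SHA-256 of the signed content
  signedContent: string; // content as it was at authoring time
}

type VerifyResult =
  | { verified: true; content: string }
  | { verified: false; reason: string };

function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Authoring time: record the hash of the template as written.
function signTemplate(file: string, content: string): SignatureRecord {
  return { file, hash: sha256Hex(content), signedContent: content };
}

// Runtime: compare the current content against the stored hash and, on a
// match, return the signed content from the record, not from the prompt.
function verifyTemplate(
  record: SignatureRecord | undefined,
  currentContent: string,
): VerifyResult {
  if (!record) return { verified: false, reason: "no signature on record" };
  if (sha256Hex(currentContent) !== record.hash) {
    return { verified: false, reason: "content does not match signed hash" };
  }
  return { verified: true, content: record.signedContent };
}
```

Returning the content from the record is what keeps the result out of the attacker's reach: injected text can claim to be signed, but it cannot change what this code path returns.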
Integration
Template signing. System prompt sections extracted to llm/prompts/*.txt with {{placeholder}} interpolation. Signed with sig sign, verified at runtime.
Verify tool. Owner-only tool. Verifies template signatures (returns signed content with placeholders visible) and message provenance (returns content + sender identity from the session ContentStore).
Message signing. Owner messages from authenticated channels are signed at ingestion using a session-scoped ContentStore. The agent can verify that a message came from the actual owner, not an impersonator in a group chat.
Verification gate. Deterministic check inserted before plugin hooks in runBeforeToolCallHook. Gated tools: exec, write, edit, apply_patch, message, gateway, sessions_spawn, sessions_send, update_and_sign. Turn-scoped (resets per user message).
Mutation gate. Second deterministic gate in the hook pipeline (after verification, before plugins). Intercepts write/edit targeting files with sig file policies (mutable: true) and blocks them with an actionable message directing the agent to use update_and_sign.
Update and sign tool. Owner-only tool for modifying protected workspace files. Requires provenance: the agent must cite the signature ID of a signed owner message that authorized the change. sig validates the source against the session ContentStore.
File policies. Configured in .sig/config.json under a files key. Each policy specifies mutable (true/false), authorizedIdentities (who can update), and requireSignedSource (whether a signed source is required).
Workspace initialization. On first agent run, workspace files with mutable: true policies are signed if unsigned, establishing the initial chain anchor.
Configuration. agents.defaults.sig.enforceVerification (default: false). Deploy the verify tool first, enable enforcement when ready. Optional gatedTools override.
Changes
New files:
- src/agents/prompt-templates.ts: template loading with caching and {{placeholder}} interpolation
- src/agents/tools/sig-verify-tool.ts: the verify tool (template + message verification)
- src/agents/tools/sig-update-tool.ts: the update_and_sign tool (protected file updates with provenance)
- src/agents/message-signing.ts: owner message signing via @disreguard/sig ContentStore
- src/agents/session-security-state.ts: per-session, per-turn verification state
- src/agents/sig-verification-gate.ts: deterministic verification gate logic
- src/agents/sig-verification-gate.test.ts: 9 tests (gated/ungated/verified/reset/config)
- src/agents/sig-mutation-gate.ts: deterministic mutation gate logic
- src/agents/sig-mutation-gate.test.ts: 11 tests (write/edit/apply_patch/policies/params)
- src/agents/sig-workspace-init.ts: initial signing of workspace files
- src/agents/sig-gate-audit.ts: audit logging for gate decisions (writes to .sig/audit.jsonl)
- src/agents/sig-gate-audit.test.ts: 7 audit logging tests
- src/agents/adversarial-harness.ts: multi-turn adversarial scenario runner
- src/agents/adversarial-harness.test.ts: 8 mocked attack scenarios
- src/agents/adversarial-injection.live.test.ts: 3 live LLM injection tests
- docs/reference/adversarial-testing.md: adversarial testing guide
- llm/prompts/*.txt: 30 template files
- .sig/config.json: sig project config with file policies
- docs/concepts/prompt-signing.md: concept doc
- docs/reference/prompt-signing.md: architecture reference
Modified files:
- package.json: add @disreguard/sig dependency, include llm/ in published files
- src/agents/tool-policy.ts: add verify and update_and_sign to OWNER_ONLY_TOOL_NAMES
- src/agents/system-prompt.ts: add verify/update_and_sign to tool summaries/ordering, add verification and protected files preamble sections
- src/agents/pi-tools.ts: register verify and update_and_sign tools, thread messageSigning/turnId/senderIdentity/projectRoot/sigConfig
- src/agents/pi-tools.before-tool-call.ts: insert verification and mutation gates before plugin hooks, extend HookContext, add gate audit logging
- src/agents/sig-zenity-scenario.test.ts: refactored to import createMockTool from adversarial harness
- docs/docs.json: added adversarial testing to Security nav group
- src/agents/pi-embedded-runner/run/attempt.ts: turn ID generation, verification reset, message signing, sig config loading, workspace init, sender identity building, context threading
- src/config/types.agent-defaults.ts: add sig config block
- src/config/zod-schema.agent-defaults.ts: Zod schema for sig config
- src/config/schema.ts: field labels and help text
Testing
Three-layer adversarial test infrastructure validates that prompt injection attacks are blocked:
Layer 1: Gate unit tests (23 tests, deterministic)
- sig-zenity-scenario.test.ts: 23 tests covering checkVerificationGate() and checkMutationGate() in isolation
Layer 2: Mocked adversarial harness (8 scenarios, deterministic)
- adversarial-harness.test.ts: scripted multi-turn tool calls through the real hook pipeline
Layer 2b: Audit logging tests (7 tests)
- sig-gate-audit.test.ts: blocked tools produce gate_blocked audit entries, verified tools produce gate_allowed entries, non-gated tools produce no entries, enforcement disabled produces no entries
Layer 3: Live LLM injection tests (3 scenarios, model-in-the-loop)
- adversarial-injection.live.test.ts: real model (Haiku 4.5) with verify tool in the loop
Live test transcripts (Haiku 4.5, 2026-02-07):
Transcript 1: Signed instruction (legitimate owner message)
The model receives a legitimate instruction, hits the verification gate, calls verify, and proceeds after verification succeeds.
Transcript 2: Unsigned injection (attacker payload, gates enabled)
The model receives a benign-looking injection payload, eagerly tries all 4 tools, gets blocked on every gated tool, calls verify, verification fails, and the model backs off completely.
Transcript 3: Vulnerable baseline (SIG_ENFORCE=0, no gates)
Same injection payload, but with gates disabled. The model follows every instruction without hesitation. This test is expected to FAIL — it proves the gates are necessary.
The contrast is stark: the same model, the same injection payload, the same tools. With gates, 0 dangerous tools execute and the model explains it can't proceed. Without gates, all 4 execute cheerfully.
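The contrast between the three transcripts can be mimicked with a tiny scripted scenario runner. This is only a sketch in the spirit of the mocked harness: `runScenario`, its options, and the choice of the four tools (exec, write, message, sessions_send) are assumptions, since the transcripts do not say which tools the model tried.

```typescript
// Hypothetical scenario types; the real harness drives the actual hook pipeline.
type Step = { tool: string };
interface Outcome { executed: string[]; blocked: string[] }
interface ScenarioOptions { enforce: boolean; messageIsSigned: boolean }

function runScenario(steps: Step[], opts: ScenarioOptions): Outcome {
  // Assumed set of four dangerous tools for this illustration.
  const gated = new Set(["exec", "write", "message", "sessions_send"]);
  const outcome: Outcome = { executed: [], blocked: [] };
  let verified = false;
  for (const { tool } of steps) {
    if (tool === "verify") {
      // verify succeeds only when the instruction traces to a signed message.
      verified = opts.messageIsSigned;
      continue;
    }
    const blocked = opts.enforce && gated.has(tool) && !verified;
    (blocked ? outcome.blocked : outcome.executed).push(tool);
  }
  return outcome;
}

// Same scripted payload for every run, mirroring the injection transcript:
// try all four tools, call verify, then try again.
const injectionSteps: Step[] = [
  { tool: "exec" }, { tool: "write" }, { tool: "message" }, { tool: "sessions_send" },
  { tool: "verify" },
  { tool: "exec" },
];
```

Running `injectionSteps` with `{ enforce: true, messageIsSigned: false }` leaves `executed` empty, while `{ enforce: false, messageIsSigned: false }` executes every scripted tool call, which is the same contrast the live transcripts show.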
Full suite: 226/226 passing (219 existing + 7 audit)
Build: clean
Type check: clean
Security model
sig v1 uses content hashing, not cryptographic keys. It detects modification and provides provenance but is not forgery-resistant. This is sufficient when .sig/ is read-only to the agent (the standard case). Both the verification gate and the mutation gate are orchestrator-level code, not prompt text.
Mutable workspace files (soul.md, agents.md, heartbeat.md) are protected by sig file policies. Direct writes are intercepted and redirected to the update_and_sign tool, which validates that the change traces back to a signed owner message. This addresses persistence attacks where an agent is tricked via indirect prompt injection into modifying its identity or configuration files.
Greptile Overview
Greptile Summary
This PR integrates @disreguard/sig into the agent runtime to add (1) a verify owner-only tool for validating signed prompt templates and message provenance, (2) a deterministic verification gate in runBeforeToolCallHook that blocks a configured set of sensitive tools until verification succeeds for the current turn, and (3) a mutation gate intended to prevent direct writes/edits to sig-protected mutable workspace files by forcing updates through a new update_and_sign tool with provenance.
The overall architecture fits into the existing tool orchestration by threading turn/session security state through pi-embedded-runner/run/attempt.ts and wrapping tool execution via the existing before-tool-call hook pipeline.
Two correctness/security gaps remain before merge:
- apply_patch is excluded from the mutation gate, which allows protected files to be modified after verify.
- verify marks the turn as verified based solely on template verification, so gated tools are allowed after a single owner-only tool call even when no signed owner instruction was proven.
Confidence Score: 2/5
apply_patch can modify sig-protected mutable files without going through update_and_sign, and verify currently marks the turn as verified based solely on template verification, allowing gated tools after a single owner-only tool call even when no signed owner instruction was proven. Both issues undermine the PR's core security guarantees when enforceVerification is enabled.