feat(gateway): support modular guardrails extensions for securing against indirect prompt injections and other agentic threats #6095
Conversation
Unify the guardrails system with the existing plugin hook infrastructure:
- Add before_request and after_response hooks to plugin types
- Extend before_tool_call/after_tool_call with richer context and modify capabilities (after_tool_call now returns results)
- Wire up all four hook stages in the agent runner and tool adapter
- Move Gray Swan implementation to extensions/grayswan/ plugin
- Remove guardrail registry and apply functions

The Gray Swan guardrail now follows the standard plugin pattern, registering handlers via api.on() for each stage. Configuration remains unchanged (guardrails.grayswan in openclaw.json). This enables third-party guardrail plugins using the same hook API.
…guration and move config responsibility to each plugin. Extract shared guardrail-utils
Add a shared base class/factory that simplifies guardrail extension implementations by handling common hook registration boilerplate.

Extensions now implement a simple interface:
- evaluate(ctx, config, api) -> GuardrailEvaluation
- formatViolationMessage(evaluation, location) -> string
- onRegister(api, config) [optional]

The factory handles:
- All 4 hook registrations (before_request, before_tool_call, after_tool_call, after_response)
- Stage config resolution (enabled, mode, blockMode, includeHistory)
- Error handling with failOpen support
- Monitor mode (log-only)
- Content extraction per stage

Refactored extensions:
- llamaguard: 634 → 373 lines (-41%)
- grayswan: 590 → 394 lines (-33%)
- gpt-oss-safeguard: 550 → 311 lines (-43%)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
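The factory pattern described in this commit can be illustrated with a small sketch. This is a hedged, simplified reconstruction, not the actual `createGuardrailPlugin` from the PR: the real implementation is presumably async and richer, and the `HookCtx`/`HookResult` shapes here are assumptions. It only shows how monitor mode and `failOpen` can be centralized so extensions implement just `evaluate` and `formatViolationMessage`:

```typescript
// Hedged sketch of the guardrail factory: names and signatures are assumptions.
type GuardrailEvaluation = { flagged: boolean; reason?: string };
type StageConfig = { enabled: boolean; mode: "block" | "monitor"; failOpen: boolean };
type HookCtx = { stage: string; content: string };
type HookResult = { block?: boolean; message?: string };
type PluginApi = { on(stage: string, handler: (ctx: HookCtx) => HookResult): void };

interface GuardrailSpec {
  name: string;
  evaluate(ctx: HookCtx, config: StageConfig): GuardrailEvaluation;
  formatViolationMessage(evaluation: GuardrailEvaluation, location: string): string;
}

const STAGES = ["before_request", "before_tool_call", "after_tool_call", "after_response"];

// One handler per stage; monitor mode and failOpen are handled centrally so
// individual extensions only implement evaluate/formatViolationMessage.
function createGuardrailPlugin(spec: GuardrailSpec, config: StageConfig) {
  return {
    register(api: PluginApi): void {
      for (const stage of STAGES) {
        api.on(stage, (ctx) => {
          if (!config.enabled) return {};
          try {
            const evaluation = spec.evaluate(ctx, config);
            if (!evaluation.flagged) return {};
            const message = spec.formatViolationMessage(evaluation, stage);
            if (config.mode === "monitor") {
              console.warn(`[${spec.name}] ${message}`); // log-only, never blocks
              return {};
            }
            return { block: true, message };
          } catch (err) {
            // failOpen: on evaluator errors, let traffic through instead of blocking.
            return config.failOpen ? {} : { block: true, message: `${spec.name} failed: ${err}` };
          }
        });
      }
    },
  };
}
```

The point of the factory is exactly the line-count reduction reported above: each extension shrinks to its evaluation logic while the stage wiring stays in one place.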
Add two new execution-safety guardrails using the createGuardrailPlugin framework:

command-safety-guard:
- Blocks destructive commands (rm -rf, dd, mkfs, fork bombs)
- Prevents credential exfiltration (cat ~/.ssh/id_*, base64 | curl)
- Detects privilege escalation attempts (sudo passwd, visudo)
- Configurable: extra patterns, allow patterns, disable rules

security-audit:
- Restricts access to sensitive files (SSH keys, API tokens, shell configs)
- Covers cloud credentials (AWS, GCloud, Azure)
- Covers package manager auth (npm, PyPI)
- Operation-aware: some rules block read, others block write
- Configurable: extra paths, allow paths, disable rules

Both plugins support monitor mode for logging without blocking.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
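The rule-matching core of a command-safety guard like this can be sketched in a few lines. The rule IDs, regexes, and option names below are illustrative only (a toy subset, not the actual patterns shipped in `extensions/command-safety-guard`), but they show the extra-rules / allow-patterns / disable-rules configurability the commit describes:

```typescript
// Illustrative sketch only; the real rule set lives in extensions/command-safety-guard.
type Rule = { id: string; pattern: RegExp; reason: string };

const defaultRules: Rule[] = [
  { id: "destructive-rm", pattern: /\brm\s+(-[a-z]*r[a-z]*f|-[a-z]*f[a-z]*r)\b/i, reason: "recursive force delete" },
  { id: "fork-bomb", pattern: /:\(\)\s*\{\s*:\|:&\s*\}\s*;?\s*:/, reason: "fork bomb" },
  { id: "ssh-key-read", pattern: /cat\s+~\/\.ssh\/id_/, reason: "credential exfiltration" },
  { id: "exfil-pipe", pattern: /base64\b.*\|\s*curl\b/, reason: "encode-and-upload pipeline" },
];

// Allow patterns win over block rules; disabled rule ids are skipped entirely.
function checkCommand(
  command: string,
  opts: { extraRules?: Rule[]; allowPatterns?: RegExp[]; disabledRules?: string[] } = {},
): { verdict: "allow" | "block"; rule?: Rule } {
  if (opts.allowPatterns?.some((p) => p.test(command))) return { verdict: "allow" };
  const rules = [...defaultRules, ...(opts.extraRules ?? [])]
    .filter((r) => !opts.disabledRules?.includes(r.id));
  const hit = rules.find((r) => r.pattern.test(command));
  return hit ? { verdict: "block", rule: hit } : { verdict: "allow" };
}
```

Because matching is purely structural (regexes over the command string), this kind of validator cannot itself be prompt-injected, which becomes relevant in the context-isolation discussion later in this thread.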
… and descriptions. Update package.json files to include plugin descriptions and extension paths for OpenClaw integration.
…d-security-audit-extension-metadata
Add missing patterns from PR 6569. Add README.md files.
…security-extensions Fix implementations to use the correct tool names.
I noticed it was mentioned this work supplants #11119 and other work. I can't speak for the other work but it doesn't supersede #11119. #11119 is an instruction integrity verification layer. It would be complementary to this work. For the tl;dr, #11119 works like this:
Thanks for the clarification. I guess by "supplants" we meant "can be better implemented with". But yes, I agree you also introduce new mechanisms. Updated the description to reflect this.
Updated again to resolve the conflicts with upstream.
This looks excellent. Thank you for explicitly calling out RFC #9030 as a use case this PR supports. One specific question regarding the richer context mentioned in the description:
For the Blind Critic pattern (RFC #9030), the security guarantee relies on the validator being context-isolated. It must evaluate the Does the
I want to ensure I can build the Blind Critic extension on this hook without accidentally leaking the malicious context into the safety check.
This is a critical observation. The context leakage risk you're highlighting is exactly why I designed the security plugin (from #8093) around deterministic pattern matching rather than LLM-based validation. In my implementation, the
These methods evaluate the raw tool call payload string without needing—or being influenced by—the conversation history. Even if the message context contains a sophisticated prompt injection that convinced the model to "trust" a malicious input, the validator still flags it based purely on the payload's structural characteristics.

That said, for developers building LLM-based guardrails (like a "Safety Monitor" agent that reasons about tool calls), your concern is spot-on. If that guardrail receives the full message history, it inherits the same prompt injection vulnerability as the primary model.

I'd strongly support your suggestion #2. A configuration option like:

```ts
before_tool_call: {
  includeMessages: false, // Only send tool call + args, not conversation history
  includeSystemPrompt: false
}
```

...would provide two key benefits:

For my plugin, I don't technically need

Great catch on this—it's easy to miss how "richer context" can become a security liability rather than a feature.
@Reapor-Yurnero echoing the feedback above — there is strong consensus here on the need for context isolation. As discussed with @Shiva-destroyer (who implemented the ThreatScorer pattern), passing the full message history to a guardrail by default creates a context leakage vulnerability where the guardrail itself can be prompt-injected.

To solve this, could we please add the granular configuration options to the

```ts
// Proposed Interface Update
type GuardrailOptions = {
  // Allows plugins to opt out of receiving the "Hazardous" context.
  // Default can be true (backward compatible), but security plugins will set these to false.
  includeMessages?: boolean;
  includeSystemPrompt?: boolean;
}
```

This would allow us to build the Blind Critic pattern (RFC #9030) and other deterministic security plugins safely on top of your work.
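To make the proposal concrete: here is a hypothetical illustration of how a runner could honor these options when assembling the context handed to a guardrail. The `FullCtx`/`buildGuardrailContext` names and shapes are assumptions for this sketch, not code from the PR:

```typescript
// Hypothetical context-stripping helper; names and shapes are assumptions.
type Message = { role: string; content: string };
type FullCtx = { toolCall: { name: string; args: string }; messages: Message[]; systemPrompt: string };
type GuardrailOptions = { includeMessages?: boolean; includeSystemPrompt?: boolean };

function buildGuardrailContext(ctx: FullCtx, opts: GuardrailOptions) {
  return {
    // The tool call itself is always passed: it is what the guardrail evaluates.
    toolCall: ctx.toolCall,
    // Defaults stay true for backward compatibility; security plugins opt out.
    messages: (opts.includeMessages ?? true) ? ctx.messages : [],
    systemPrompt: (opts.includeSystemPrompt ?? true) ? ctx.systemPrompt : "",
  };
}
```

With both flags set to false, a Blind Critic style validator only ever sees the tool call payload, so an injection living in the conversation history cannot reach it.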
…rface

# Conflicts:
#	src/agents/pi-embedded-runner/run.ts
#	src/agents/pi-embedded-runner/run/attempt.ts
#	src/agents/pi-tool-definition-adapter.ts
#	src/plugins/hooks.ts
#	src/plugins/types.ts
Force-pushed from bfc1ccb to f92900f
Force-pushed from c2f81d0 to a5c1013
@nwinter, thanks for the link. In the context of this discussion, I put together a small, illustrative PR to explore one specific idea: treating the execution plan as a first-class artifact and single source of truth for execution, and a way to enhance determinism and security: https://github.com/openclaw/openclaw/pull/21648

This is not meant as a complete solution; the goal is just to sanity-check the direction in the context of the Gray Swan–based system by making things concrete in code:
Happy to iterate, trim, or discard if this doesn’t align with where the group is headed. Appreciate all the work going into this PR.
@Reapor-Yurnero Just a quick follow-up on my note above regarding the

@nayname I really like the direction of PR #21648. Treating the execution plan as a first-class artifact makes the "Blind Critic" pattern much cleaner to implement.

(To be clear upfront: I am not suggesting expanding the scope of #6095 or #21648 to implement all of the below right now! Both PRs are excellent as-is. I just want to map out how these foundational hooks perfectly set us up for the broader security roadmap.)

When you combine the core wiring from #6095 with the plan artifacts from #21648, we get the exact infrastructure needed to build out a complete defense matrix. Patterns enabled immediately by this foundation:
The Remaining Roadmap Failsafe:
Really fantastic collaborative work by everyone here. The combination of context-isolated guardrails and strict plan governance feels like the exact right blueprint for robust agentic security.
This is the most important security feature OpenClaw can ship. The combination of tool access + external data processing + autonomous operation makes prompt injection uniquely dangerous for agents vs chatbots.

I wrote a deep-dive for the community covering the current threat landscape, the CaMeL research, and practical defenses people can implement today while waiting for this to land: "Prompt Injection Is Coming for Your OpenClaw Agent — Here's How to Stop It".

Key points from the research:
Would love to see this prioritized. The 49 reactions speak for themselves.
Your link does not work. |
Please make this a third-party plugin that you maintain yourself in your own repo. Docs: https://docs.openclaw.ai/plugin. Feel free to open a PR afterward to add it to our community plugins page: https://docs.openclaw.ai/plugins/community
Modular Guardrail / Validators / Interceptors via Plugin Hooks
Summary
Introduces a configurable pre- and post-message guardrail plugin system that monitors all LLM traffic, so users can incorporate their guardrail of choice to block indirect prompt injection attacks and other policy violations. The initial selections are the open-weight gpt-oss-safeguard and the cloud-based Gray Swan Cygnal, but any guardrail model can be configured similarly as a plugin. In addition to these model-based guardrails, rule-based validators and monitors are supported through the same plugin interface. Updates documentation, tests, and the onboarding workflow to make configuration easy.
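For orientation, configuration might look something like the fragment below in openclaw.json. The `guardrails.grayswan` key is the one named in this PR's commits; the nested options (`enabled`, `mode`, `failOpen`, `includeHistory`) are taken from the factory description elsewhere in this thread and may not match the final schema exactly:

```json
{
  "guardrails": {
    "grayswan": {
      "enabled": true,
      "mode": "block",
      "failOpen": false,
      "stages": {
        "before_tool_call": { "includeHistory": false }
      }
    }
  }
}
```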
Why
OpenClaw is an agent with deep access to tools, files, networks, and external accounts. That makes prompt‑level attacks (especially indirect prompt injection / IPI) uniquely dangerous: a single malicious message or web page can steer an agent into data exfiltration, unsafe tool use, or policy bypass. The broader community has been paying increasing attention to these risks as more systems move from chatbots to tool‑enabled agents (see the relevant PRs/issues below).
Critically, OpenClaw needs defense‑in‑depth that can:
More importantly, all these should be fully customizable according to each user’s needs and desired policies.
This PR adds the minimal core hooks required for this protection and shows four diverse model‑based and non‑model guardrails via plugins.
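To give a feel for the hook surface, here is a hedged sketch of what a third-party guardrail plugin could look like on top of these stages. The `api.on()` registration style comes from the commit messages above, but the ctx/result shapes and the `exec` policy are assumptions for illustration, not the exact types from `src/plugins/types.ts`:

```typescript
// Hedged sketch of a third-party guardrail plugin; shapes are assumptions.
type ToolCallCtx = { toolName: string; params: Record<string, unknown> };
type HookResult = { block?: boolean; message?: string };

interface PluginApi {
  on(
    stage: "before_request" | "before_tool_call" | "after_tool_call" | "after_response",
    handler: (ctx: ToolCallCtx) => HookResult,
  ): void;
}

// Toy policy: block exec commands that try to reach the network via curl.
function register(api: PluginApi): void {
  api.on("before_tool_call", (ctx) => {
    const command = String(ctx.params["command"] ?? "");
    if (ctx.toolName === "exec" && command.includes("curl")) {
      return { block: true, message: "network egress via exec is not allowed" };
    }
    return {}; // empty result lets the tool call proceed
  });
}
```

The same registration shape would apply to `before_request`/`after_response` handlers that inspect raw model traffic instead of tool calls.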
Example effects:
What this PR does
- `extensions/grayswan-cygnal-guardrail` (API-based model guardrail)
- `extensions/gpt-oss-safeguard` (open‑weight model guardrail)
- `extensions/command-safety-guard` (rule-based command validator for `exec`)
- `extensions/security-audit` (rule-based tool-call audit/monitoring)

(The latter two were proposed by @pauloportella in feat: interceptor pipeline for tool, message, and params events #6569)
Core changes (kept minimal yet essential)
- New hook stages `before_request`, `after_response`: `src/plugins/types.ts`, `src/plugins/hooks.ts`
- Runner wiring: `src/agents/pi-embedded-runner/run/attempt.ts`, `src/agents/pi-embedded-runner/run.ts`
- Tool-call wiring: `src/agents/pi-tool-definition-adapter.ts`, `src/agents/pi-embedded-runner/tool-split.ts`, `src/agents/pi-tools.before-tool-call.ts`
- Shared guardrail utilities: `src/plugins/guardrails-utils.ts`, `src/plugins/guardrails-utils.test.ts`, `src/plugin-sdk/index.ts`
- Docs: `docs/gateway/guardrails.md`

Rationale for maintainers
This PR keeps core changes narrowly scoped to hook types and wiring; most of the guardrail logic lives in extensions. The result is a flexible guardrail surface with minimal risk to existing behavior. Also, happy to decompose the extensions to subPRs etc. if needed. Putting here more for demonstration purposes.
Testing
- `pnpm lint`
- `pnpm format`
- `pnpm test`
- `pnpm build`

AI assistance
`pnpm lint`, `pnpm build`

Issues that this would close
- `before_tool_call` plugin hook in tool execution pipeline #5943

Ongoing PRs that this would replace / complement

- `verify` gates; unauthorized mutation prevention #11119: as pointed out by its author, it also introduces new mechanisms, but in general it can be more easily extended on top of this plugin-based hook structure.

PRs that depend on this one
Greptile Overview
Greptile Summary
This PR wires a modular “guardrails” plugin system into the agent lifecycle by adding new hook stages (`before_request`, `after_response`) and expanding the existing tool hooks to support inspection, mutation, and blocking (including returning synthetic tool results). The embedded runner now executes these hooks around model calls, and the tool definition adapter invokes `before_tool_call`/`after_tool_call` with richer context (messages/system prompt). New guardrail utilities and example extensions demonstrate model-based and rule-based guardrails.

Key review focus areas were the correctness of hook result merging and the stability of event contracts (IDs/context) across call sites, since plugins will depend heavily on these semantics.
Confidence Score: 3/5