[Feature] Content-based prompt injection scanning on tool output

## Problem

OpenClaw wraps external/untrusted content in XML tags with a SECURITY NOTICE header before presenting it to the LLM. This is a good first layer, but `THREAT-MODEL-ATLAS.md` explicitly acknowledges its limitation:

> **T-EXEC-002: Indirect Prompt Injection**
> - Current Mitigations: Content wrapping with XML tags and security notice
> - Residual Risk: **High** — LLM may ignore wrapper instructions
> - Recommendations: **Implement content sanitization, separate execution contexts**

The wrapping approach relies on the LLM respecting the boundary markers. Sophisticated injection attacks can instruct the model to ignore previous context, override system prompts, or manipulate tool behavior, and the LLM may comply despite the wrapper.

## Evidence

- OpenClaw's own threat model rates this as High residual risk
- The recommendation in the threat model ("implement content sanitization") is not yet implemented natively
- MCP tool output, web_fetch results, and email ingestion are all attack surfaces

## Workaround

I built `cc-taint-check.py`, a PostToolUse hook (~150 lines of Python) that:

1. Intercepts MCP tool output before the LLM processes it
2. Scans content against a catalogue of injection patterns:
   - Direct instruction injection ("ignore previous instructions", "you are now", "system prompt override")
   - Role manipulation ("as an AI without restrictions", "pretend you are")
   - Encoding-based evasion (base64 encoded instructions, Unicode homoglyphs)
   - Tool/function injection ("call function", "execute command")
3. Blocks on high-severity matches and only logs low-severity ones
4. Normalizes Unicode before scanning to catch homoglyph-based evasion
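
For illustration, the scan steps above could look roughly like the sketch below. This is not the actual `cc-taint-check.py`; the pattern list, severity labels, and the names `PATTERNS`, `normalize`, `decoded_segments`, `scan`, and `decide` are all assumptions made for the example.

```python
import base64
import binascii
import re
import unicodedata

# Hypothetical catalogue: (name, severity, compiled regex). A real one would be larger.
PATTERNS = [
    ("ignore_previous", "high", re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.I)),
    ("system_override", "high", re.compile(r"system\s+prompt\s+override", re.I)),
    ("role_switch", "low", re.compile(r"\byou\s+are\s+now\b", re.I)),
    ("pretend", "low", re.compile(r"\bpretend\s+you\s+are\b", re.I)),
    ("tool_call", "low", re.compile(r"\b(call\s+function|execute\s+command)\b", re.I)),
]

# Runs of base64 alphabet characters long enough to hide a short instruction.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def normalize(text: str) -> str:
    # NFKC folds compatibility forms (fullwidth letters, ligatures) into plain ones.
    # It does NOT map cross-script lookalikes (e.g. Cyrillic "а" to Latin "a");
    # that would need a separate confusables table.
    folded = unicodedata.normalize("NFKC", text)
    # Drop invisible format characters (category Cf), such as zero-width spaces.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


def decoded_segments(text: str):
    # Yield the decoded text of every substring that is valid base64.
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
        except (binascii.Error, ValueError):
            continue
        yield raw.decode("utf-8", errors="ignore")


def scan(text: str) -> list[tuple[str, str]]:
    # Check the normalized text plus any base64 payloads found inside it.
    cleaned = normalize(text)
    candidates = [cleaned, *decoded_segments(cleaned)]
    return [
        (name, severity)
        for name, severity, rx in PATTERNS
        if any(rx.search(candidate) for candidate in candidates)
    ]


def decide(findings: list[tuple[str, str]]) -> str:
    # Block on any high-severity match, log low-severity ones, otherwise allow.
    if any(severity == "high" for _, severity in findings):
        return "block"
    return "log" if findings else "allow"
```

A hook would call `decide(scan(tool_output))` and refuse to forward the output when the result is `"block"`. Regex matching like this is easy to evade with paraphrasing, so it complements the wrapper rather than replacing it.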

## Proposed Solution

Add a configurable content scanning layer to the tool output pipeline:

1. **Scan tool output** from external sources (web_fetch, MCP tools, email, webhooks) before presenting to the LLM
2. **Pattern catalogue** of known injection techniques (maintainable, updatable)
3. **Configurable response:** log-only, warn user, or block the content
4. **Unicode normalization** before scanning

```json
{
  "security": {
    "contentScanning": {
      "enabled": true,
      "sources": ["web_fetch", "mcp", "email", "webhooks"],
      "action": "warn",
      "customPatterns": []
    }
  }
}
```

This directly implements the recommendation from the existing threat model.
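
As a rough sketch of how the `action` setting might be applied, the function below maps scan findings to an outcome. Only `"warn"` appears in the config above; the values `"log"` and `"block"`, the function name `apply_action`, and the `(content, message)` return shape are assumptions for the example, not an existing OpenClaw API.

```python
import json
import logging

logger = logging.getLogger("content_scanning")

# Same shape as the proposed config above.
CONFIG = json.loads("""
{
  "security": {
    "contentScanning": {
      "enabled": true,
      "sources": ["web_fetch", "mcp", "email", "webhooks"],
      "action": "warn",
      "customPatterns": []
    }
  }
}
""")


def apply_action(config: dict, source: str, content: str, findings: list[str]):
    """Return (content_for_llm, user_message); content is None when blocked."""
    scanning = config.get("security", {}).get("contentScanning", {})
    in_scope = scanning.get("enabled", False) and source in scanning.get("sources", [])
    if not in_scope or not findings:
        return content, None

    summary = ", ".join(findings)
    # Every in-scope finding is logged, whatever the configured action.
    logger.warning("injection patterns in %s output: %s", source, summary)

    action = scanning.get("action", "log")  # assumed default: log-only
    if action == "block":
        return None, f"Blocked {source} output ({summary})"
    if action == "warn":
        return content, f"Possible prompt injection in {source} output ({summary})"
    return content, None  # log-only: pass the content through unchanged
```

With the config shown, `apply_action(CONFIG, "web_fetch", page_text, ["ignore_previous"])` returns the original text together with a warning message, while output from a source not listed in `sources` passes through untouched.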

## Impact

High. Addresses a self-identified High-risk gap in the threat model. As LLMs increasingly process untrusted external content, content-based scanning becomes essential alongside wrapper-based containment.

## Environment

- OpenClaw 2026.4.10 (npm, macOS)

