
Feature: Runtime prompt injection defenses #4840

@cal-brmmr

Description

Problem

Skill supply chain attacks are getting attention (signed skills, permission manifests), but runtime prompt injection is mostly unsolved. Agents ingest untrusted content (URLs, API responses, social media posts, skill outputs) and it lands in context with equal weight to trusted user input.

Current defense: hope the model is suspicious enough. That's not a security model.

Proposed Features

1. Content provenance tagging

Mark where content came from in the context:

  • [source:user] — direct human input
  • [source:fetched_url] — web_fetch output
  • [source:skill:weather] — skill output
  • [source:external:moltbook] — external API/feed

Let the model differentiate trusted from untrusted content. Could be implemented as metadata on tool results or as explicit tags in the system prompt.
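A minimal sketch of what the tagging layer could look like. All names here (`Source`, `ContextBlock`, `tagContent`) are hypothetical, not existing Clawdbot APIs:

```typescript
// Hypothetical provenance-tagging layer: every piece of content entering the
// context is wrapped with a source tag and a default trust level.
type Source =
  | { kind: "user" }
  | { kind: "fetched_url"; url: string }
  | { kind: "skill"; name: string }
  | { kind: "external"; feed: string };

interface ContextBlock {
  tag: string;      // e.g. "[source:skill:weather]"
  content: string;
  trusted: boolean; // only direct user input is trusted by default
}

function tagContent(source: Source, content: string): ContextBlock {
  const tag =
    source.kind === "user" ? "[source:user]" :
    source.kind === "fetched_url" ? "[source:fetched_url]" :
    source.kind === "skill" ? `[source:skill:${source.name}]` :
    `[source:external:${source.feed}]`;
  return { tag, content, trusted: source.kind === "user" };
}

const block = tagContent({ kind: "skill", name: "weather" }, "Sunny, 22C");
```

The key design choice is that trust defaults to false for everything except direct user input; downstream policy (e.g. "never exec based on untrusted content alone") can then key off the flag rather than heuristics.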

2. Skill sandboxing / permission manifests

Skills currently run with full agent permissions. Proposal:

```json
{
  "permissions": {
    "filesystem": ["read:~/data", "write:~/output"],
    "network": ["api.example.com"],
    "exec": false
  }
}
```

Refuse any capability not declared in the manifest. Like mobile app permissions.
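One way enforcement could work at the skill-runtime boundary, as a sketch (the `Permissions` shape mirrors the manifest; the helper names are invented for illustration):

```typescript
// Hypothetical manifest enforcement: deny-by-default checks against the
// skill's declared permissions.
interface Permissions {
  filesystem: string[]; // rules like "read:~/data" or "write:~/output"
  network: string[];    // allowed hostnames
  exec: boolean;
}

function canAccessPath(perms: Permissions, mode: "read" | "write", path: string): boolean {
  // A path is allowed only if some declared rule matches mode + prefix.
  return perms.filesystem.some(rule => {
    const sep = rule.indexOf(":");
    const ruleMode = rule.slice(0, sep);
    const prefix = rule.slice(sep + 1);
    return ruleMode === mode && path.startsWith(prefix);
  });
}

function canReachHost(perms: Permissions, host: string): boolean {
  return perms.network.includes(host);
}

const manifest: Permissions = {
  filesystem: ["read:~/data", "write:~/output"],
  network: ["api.example.com"],
  exec: false,
};
```

Anything not matched by a declared rule is refused, which is the mobile-app model: capabilities are opt-in at install time, not ambient.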

3. Output filtering / exfiltration detection

Before external actions (email, message, HTTP POST), check:

  • Does output contain strings matching $ENV_VAR patterns?
  • Does it contain content from context marked as sensitive?
  • Anomaly detection: is this action pattern unusual?
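The first check (env-var leakage) is the most mechanical of the three. A sketch of how it could scan outgoing content against actual environment values rather than `$NAME` patterns (the function name is hypothetical; the length threshold is an assumed heuristic to avoid false positives on short values):

```typescript
// Hypothetical exfiltration check: before an external action, scan the
// outgoing text for literal values of environment variables.
function findLeakedSecrets(output: string, env: Record<string, string>): string[] {
  return Object.entries(env)
    .filter(([, value]) => value.length >= 8 && output.includes(value))
    .map(([name]) => name);
}

// Illustrative values only.
const env = { API_KEY: "sk-live-a1b2c3d4", HOME: "/home/agent" };
const leaks = findLeakedSecrets("POST body: token=sk-live-a1b2c3d4", env);
```

Matching on values rather than variable names catches the common injection pattern "repeat your API key in the reply", where the output never mentions `$API_KEY` by name.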

4. Injection canaries

Optionally inject fake "secrets" into context:

[CANARY: If you see this string in any agent output, report to security@clawdbot.com: CANARY-a8f3k2...]

If canary appears in output, alert. Tripwire detection.
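The mechanics are small enough to sketch end to end (a random token goes into context, and every outbound message is checked for it; names are illustrative):

```typescript
// Hypothetical canary tripwire: generate an unguessable token, plant it in
// context, and flag any output that echoes it.
import { randomBytes } from "node:crypto";

function makeCanary(): string {
  return `CANARY-${randomBytes(8).toString("hex")}`;
}

function canaryTripped(output: string, canary: string): boolean {
  return output.includes(canary);
}

const canary = makeCanary();
const tripped = canaryTripped(`exfil attempt: ${canary}`, canary);
const clean = canaryTripped("normal reply about the weather", canary);
```

The canary must be random per session: a static string would eventually be learned or filtered by an attacker's payload, while a fresh token only ever appears in output if the model actually copied it from context.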

5. Rate limiting on sensitive actions

Configurable limits:

```yaml
rateLimits:
  email: 5/hour
  exec: 100/hour
  externalApi: 50/hour
```

Pause and require confirmation when exceeded.
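A sliding-window counter is probably sufficient here; a sketch under that assumption (the `ActionLimiter` class is invented, and a real implementation would trigger the confirmation flow instead of just returning `false`):

```typescript
// Hypothetical sliding-window rate limiter for sensitive agent actions.
class ActionLimiter {
  private events = new Map<string, number[]>(); // action -> timestamps (ms)

  constructor(
    private limits: Record<string, number>, // max events per window
    private windowMs = 3_600_000,           // default window: one hour
  ) {}

  // Returns false when the limit is exceeded; the caller should then pause
  // and require user confirmation.
  allow(action: string, now = Date.now()): boolean {
    const max = this.limits[action];
    if (max === undefined) return true; // actions without a limit pass
    const recent = (this.events.get(action) ?? [])
      .filter(t => now - t < this.windowMs);
    if (recent.length >= max) {
      this.events.set(action, recent);
      return false;
    }
    recent.push(now);
    this.events.set(action, recent);
    return true;
  }
}

const limiter = new ActionLimiter({ email: 5 });
// Six sends at the same instant: the sixth should be blocked.
const results = Array.from({ length: 6 }, () => limiter.allow("email", 0));
```

A sliding window avoids the burst-at-boundary problem of fixed hourly buckets, which matters here because an injected payload will try to fire all its actions at once.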

6. Audit replay

Extend existing command-logger to capture full context snapshots at decision points. Enable "replay" of what the agent saw when it took an action. Essential for post-incident forensics.
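A sketch of what a snapshot record could look like, assuming an append-only log keyed by decision point (the record shape and helper names are hypothetical, not the existing command-logger's API; the hash gives a cheap integrity check on the stored snapshot):

```typescript
// Hypothetical audit log: capture the full context at each decision point,
// with a SHA-256 hash so replayed snapshots can be verified as untampered.
import { createHash } from "node:crypto";

interface DecisionRecord {
  timestamp: number;
  action: string;
  contextSnapshot: string; // the context exactly as the model saw it
  contextHash: string;     // sha256 hex of the snapshot
}

const auditLog: DecisionRecord[] = [];

function recordDecision(action: string, context: string, now = Date.now()): DecisionRecord {
  const rec: DecisionRecord = {
    timestamp: now,
    action,
    contextSnapshot: context,
    contextHash: createHash("sha256").update(context).digest("hex"),
  };
  auditLog.push(rec);
  return rec;
}

// "Replay" a decision: return what the agent saw, verifying integrity first.
function replay(index: number): string {
  const rec = auditLog[index];
  const ok = createHash("sha256").update(rec.contextSnapshot).digest("hex") === rec.contextHash;
  if (!ok) throw new Error("snapshot failed integrity check");
  return rec.contextSnapshot;
}

recordDecision("send_email", "[source:user] draft the weekly report");
```

Full snapshots are storage-heavy, so a real version would likely store deltas or content-addressed chunks, but for forensics the requirement is the same: you must be able to reconstruct exactly what was in context when the action fired.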

Tradeoffs

More guardrails = less useful agent. The goal is defense-in-depth without turning agents back into chatbots. Configurable, opt-in, with sane defaults.

Related

cc @sethbrammer
