
Feature: Runtime prompt injection defenses #4840

@cal-brmmr

Description

Problem

Skill supply chain attacks are getting attention (signed skills, permission manifests), but runtime prompt injection is mostly unsolved. Agents ingest untrusted content (URLs, API responses, social media posts, skill outputs) and it lands in context with equal weight to trusted user input.

Current defense: hope the model is suspicious enough. That's not a security model.

Proposed Features

1. Content provenance tagging

Mark where content came from in the context:

  • [source:user] — direct human input
  • [source:fetched_url] — web_fetch output
  • [source:skill:weather] — skill output
  • [source:external:moltbook] — external API/feed

Let the model differentiate trusted from untrusted content. Could be implemented as metadata on tool results or as explicit tags in the system prompt.
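A minimal sketch of what the tagging layer could look like. All names here (`Source`, `ContextBlock`, `tagContent`) are hypothetical, not existing Clawdbot APIs:

```typescript
// Hypothetical provenance-tagging layer: every piece of content entering the
// context is wrapped with a source tag and a default trust level.
type Source =
  | { kind: "user" }
  | { kind: "fetched_url"; url: string }
  | { kind: "skill"; name: string }
  | { kind: "external"; feed: string };

interface ContextBlock {
  tag: string;      // e.g. "[source:skill:weather]"
  content: string;
  trusted: boolean; // only direct user input is trusted by default
}

function tagContent(source: Source, content: string): ContextBlock {
  const tag =
    source.kind === "user" ? "[source:user]" :
    source.kind === "fetched_url" ? "[source:fetched_url]" :
    source.kind === "skill" ? `[source:skill:${source.name}]` :
    `[source:external:${source.feed}]`;
  return { tag, content, trusted: source.kind === "user" };
}

const block = tagContent({ kind: "skill", name: "weather" }, "Sunny, 22C");
```

The key design choice is that trust defaults to false for everything except direct user input; downstream policy (e.g. "never exec based on untrusted content alone") can then key off the flag rather than heuristics.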

2. Skill sandboxing / permission manifests

Skills currently run with full agent permissions. Proposal:

```json
{
  "permissions": {
    "filesystem": ["read:~/data", "write:~/output"],
    "network": ["api.example.com"],
    "exec": false
  }
}
```

Refuse any capability not declared in the manifest. Like mobile app permissions.
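One way enforcement could work at the skill-runtime boundary, as a sketch (the `Permissions` shape mirrors the manifest; the helper names are invented for illustration):

```typescript
// Hypothetical manifest enforcement: deny-by-default checks against the
// skill's declared permissions.
interface Permissions {
  filesystem: string[]; // rules like "read:~/data" or "write:~/output"
  network: string[];    // allowed hostnames
  exec: boolean;
}

function canAccessPath(perms: Permissions, mode: "read" | "write", path: string): boolean {
  // A path is allowed only if some declared rule matches mode + prefix.
  return perms.filesystem.some(rule => {
    const sep = rule.indexOf(":");
    const ruleMode = rule.slice(0, sep);
    const prefix = rule.slice(sep + 1);
    return ruleMode === mode && path.startsWith(prefix);
  });
}

function canReachHost(perms: Permissions, host: string): boolean {
  return perms.network.includes(host);
}

const manifest: Permissions = {
  filesystem: ["read:~/data", "write:~/output"],
  network: ["api.example.com"],
  exec: false,
};
```

Anything not matched by a declared rule is refused, which is the mobile-app model: capabilities are opt-in at install time, not ambient.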

3. Output filtering / exfiltration detection

Before external actions (email, message, HTTP POST), check:

  • Does output contain strings matching $ENV_VAR patterns?
  • Does it contain content from context marked as sensitive?
  • Anomaly detection: is this action pattern unusual?
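The first check (env-var leakage) is the most mechanical of the three. A sketch of how it could scan outgoing content against actual environment values rather than `$NAME` patterns (the function name is hypothetical; the length threshold is an assumed heuristic to avoid false positives on short values):

```typescript
// Hypothetical exfiltration check: before an external action, scan the
// outgoing text for literal values of environment variables.
function findLeakedSecrets(output: string, env: Record<string, string>): string[] {
  return Object.entries(env)
    .filter(([, value]) => value.length >= 8 && output.includes(value))
    .map(([name]) => name);
}

// Illustrative values only.
const env = { API_KEY: "sk-live-a1b2c3d4", HOME: "/home/agent" };
const leaks = findLeakedSecrets("POST body: token=sk-live-a1b2c3d4", env);
```

Matching on values rather than variable names catches the common injection pattern "repeat your API key in the reply", where the output never mentions `$API_KEY` by name.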

4. Injection canaries

Optionally inject fake "secrets" into context:

[CANARY: If you see this string in any agent output, report to security@clawdbot.com: CANARY-a8f3k2...]

If canary appears in output, alert. Tripwire detection.
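The mechanics are small enough to sketch end to end (a random token goes into context, and every outbound message is checked for it; names are illustrative):

```typescript
// Hypothetical canary tripwire: generate an unguessable token, plant it in
// context, and flag any output that echoes it.
import { randomBytes } from "node:crypto";

function makeCanary(): string {
  return `CANARY-${randomBytes(8).toString("hex")}`;
}

function canaryTripped(output: string, canary: string): boolean {
  return output.includes(canary);
}

const canary = makeCanary();
const tripped = canaryTripped(`exfil attempt: ${canary}`, canary);
const clean = canaryTripped("normal reply about the weather", canary);
```

The canary must be random per session: a static string would eventually be learned or filtered by an attacker's payload, while a fresh token only ever appears in output if the model actually copied it from context.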

5. Rate limiting on sensitive actions

Configurable limits:

```yaml
rateLimits:
  email: 5/hour
  exec: 100/hour
  externalApi: 50/hour
```

Pause and require confirmation when exceeded.
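A sliding-window counter is probably sufficient here; a sketch under that assumption (the `ActionLimiter` class is invented, and a real implementation would trigger the confirmation flow instead of just returning `false`):

```typescript
// Hypothetical sliding-window rate limiter for sensitive agent actions.
class ActionLimiter {
  private events = new Map<string, number[]>(); // action -> timestamps (ms)

  constructor(
    private limits: Record<string, number>, // max events per window
    private windowMs = 3_600_000,           // default window: one hour
  ) {}

  // Returns false when the limit is exceeded; the caller should then pause
  // and require user confirmation.
  allow(action: string, now = Date.now()): boolean {
    const max = this.limits[action];
    if (max === undefined) return true; // actions without a limit pass
    const recent = (this.events.get(action) ?? [])
      .filter(t => now - t < this.windowMs);
    if (recent.length >= max) {
      this.events.set(action, recent);
      return false;
    }
    recent.push(now);
    this.events.set(action, recent);
    return true;
  }
}

const limiter = new ActionLimiter({ email: 5 });
// Six sends at the same instant: the sixth should be blocked.
const results = Array.from({ length: 6 }, () => limiter.allow("email", 0));
```

A sliding window avoids the burst-at-boundary problem of fixed hourly buckets, which matters here because an injected payload will try to fire all its actions at once.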

6. Audit replay

Extend existing command-logger to capture full context snapshots at decision points. Enable "replay" of what the agent saw when it took an action. Essential for post-incident forensics.
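A sketch of what a snapshot record could look like, assuming an append-only log keyed by decision point (the record shape and helper names are hypothetical, not the existing command-logger's API; the hash gives a cheap integrity check on the stored snapshot):

```typescript
// Hypothetical audit log: capture the full context at each decision point,
// with a SHA-256 hash so replayed snapshots can be verified as untampered.
import { createHash } from "node:crypto";

interface DecisionRecord {
  timestamp: number;
  action: string;
  contextSnapshot: string; // the context exactly as the model saw it
  contextHash: string;     // sha256 hex of the snapshot
}

const auditLog: DecisionRecord[] = [];

function recordDecision(action: string, context: string, now = Date.now()): DecisionRecord {
  const rec: DecisionRecord = {
    timestamp: now,
    action,
    contextSnapshot: context,
    contextHash: createHash("sha256").update(context).digest("hex"),
  };
  auditLog.push(rec);
  return rec;
}

// "Replay" a decision: return what the agent saw, verifying integrity first.
function replay(index: number): string {
  const rec = auditLog[index];
  const ok = createHash("sha256").update(rec.contextSnapshot).digest("hex") === rec.contextHash;
  if (!ok) throw new Error("snapshot failed integrity check");
  return rec.contextSnapshot;
}

recordDecision("send_email", "[source:user] draft the weekly report");
```

Full snapshots are storage-heavy, so a real version would likely store deltas or content-addressed chunks, but for forensics the requirement is the same: you must be able to reconstruct exactly what was in context when the action fired.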

Tradeoffs

More guardrails = less useful agent. The goal is defense-in-depth without turning agents back into chatbots. Configurable, opt-in, with sane defaults.

Related

cc @sethbrammer
