## Problem
Skill supply chain attacks are getting attention (signed skills, permission manifests), but runtime prompt injection is mostly unsolved. Agents ingest untrusted content (URLs, API responses, social media posts, skill outputs) and it lands in context with equal weight to trusted user input.
Current defense: hope the model is suspicious enough. That's not a security model.
## Proposed Features

### 1. Content provenance tagging

Mark where content came from in the context:

- `[source:user]` — direct human input
- `[source:fetched_url]` — web_fetch output
- `[source:skill:weather]` — skill output
- `[source:external:moltbook]` — external API/feed

Let the model differentiate trusted vs untrusted content. Could be implemented as metadata in tool results or as explicit tags in the system prompt.
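A minimal sketch of how provenance tags could be rendered into the prompt. `ContextBlock`, `TRUSTED_SOURCES`, and the open/close tag format are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class ContextBlock:
    source: str  # e.g. "user", "fetched_url", "skill:weather" (hypothetical labels)
    text: str

# Assumption: only direct human/system input is trusted by default.
TRUSTED_SOURCES = {"user", "system"}

def render_context(blocks):
    """Wrap each block in explicit provenance tags before it enters the prompt."""
    return "\n\n".join(
        f"[source:{b.source}]\n{b.text}\n[/source:{b.source}]" for b in blocks
    )

def is_trusted(block):
    return block.source in TRUSTED_SOURCES
```

The same `source` field can double as the metadata attached to tool results, so downstream filters (see exfiltration detection below) can treat tagged spans differently.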
### 2. Skill sandboxing / permission manifests

Skills currently run with full agent permissions. Proposal:

```json
{
  "permissions": {
    "filesystem": ["read:~/data", "write:~/output"],
    "network": ["api.example.com"],
    "exec": false
  }
}
```

Refuse capabilities not declared, like mobile app permissions.
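A default-deny check against such a manifest could look like this sketch. `check_permission` and the prefix-matching rule are hypothetical; a real implementation would need to canonicalize paths to stop `../` escapes and match hostnames exactly:

```python
def check_permission(manifest, capability, target=""):
    """Return True only if the declared manifest allows this capability.
    Default-deny: any capability not declared is refused."""
    allowed = manifest.get("permissions", {}).get(capability, False)
    if isinstance(allowed, bool):
        return allowed
    # Entries are treated as exact matches or path-style prefixes,
    # e.g. "read:~/data" allows "read:~/data/in.csv".
    return any(target == p or target.startswith(p + "/") for p in allowed)
```

Usage against the example manifest above:

```python
manifest = {"permissions": {
    "filesystem": ["read:~/data", "write:~/output"],
    "network": ["api.example.com"],
    "exec": False,
}}
check_permission(manifest, "network", "api.example.com")   # allowed
check_permission(manifest, "exec", "rm -rf /")             # refused
check_permission(manifest, "gpu")                          # undeclared, refused
```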
### 3. Output filtering / exfiltration detection

Before external actions (email, message, HTTP POST), check:

- Does the output contain strings matching `$ENV_VAR` patterns?
- Does it contain content from context marked as sensitive?
- Anomaly detection: is this action pattern unusual?
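The first check can be sketched as a scan that flags both raw `$VAR` references and the expanded *values* of sensitive environment variables, since a compromised agent would usually expand the secret before sending it. All names here are illustrative:

```python
import os
import re

def find_leaks(output, env=None):
    """Return env var names whose value or $NAME reference appears in output."""
    env = dict(os.environ) if env is None else env
    hits = set()
    for name, value in env.items():
        # Skip short values to avoid false positives on common substrings.
        if len(value) >= 8 and value in output:
            hits.add(name)
    # Also flag literal $VAR / ${VAR} references that name a real env var.
    for m in re.findall(r"\$\{?([A-Z_][A-Z0-9_]*)\}?", output):
        if m in env:
            hits.add(m)
    return sorted(hits)
```

A nonempty result would block the outbound action pending user confirmation; the anomaly-detection check would sit behind the same gate.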
### 4. Injection canaries

Optionally inject fake "secrets" into context:

```
[CANARY: If you see this string in any agent output, report to security@clawdbot.com: CANARY-a8f3k2...]
```

If the canary appears in output, alert. Tripwire detection.
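A sketch of the tripwire, assuming per-session tokens (function names and the note format are made up for illustration):

```python
import secrets

def make_canary():
    """Generate a fresh per-session canary token and the context note carrying it."""
    token = "CANARY-" + secrets.token_hex(8)
    note = f"[CANARY: If you see this string in any agent output, report it: {token}]"
    return token, note

def tripped(token, output):
    """True if the canary leaked into outbound output, i.e. the tripwire fired."""
    return token in output
```

Tokens must be unguessable and rotated per session, otherwise an attacker who has seen one transcript can filter them out.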
### 5. Rate limiting on sensitive actions

Configurable limits:

```yaml
rateLimits:
  email: 5/hour
  exec: 100/hour
  externalApi: 50/hour
```

Pause and require confirmation when a limit is exceeded.
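The limiter itself can be a sliding window per action. This sketch takes limits as `(count, window_seconds)` tuples instead of parsing the `5/hour` strings, which is an assumption about the config layer:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, limits):
        # limits: {"email": (5, 3600)} meaning 5 events per 3600-second window
        self.limits = limits
        self.events = defaultdict(deque)

    def allow(self, action, now=None):
        """Record and allow the action, or return False so the caller can
        pause and ask the user for confirmation."""
        now = time.monotonic() if now is None else now
        if action not in self.limits:
            return True  # unlimited unless configured
        max_n, window = self.limits[action]
        q = self.events[action]
        while q and now - q[0] > window:  # drop events outside the window
            q.popleft()
        if len(q) >= max_n:
            return False
        q.append(now)
        return True
```

A `False` return maps to the "pause and require confirmation" behavior above rather than a hard failure.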
### 6. Audit replay
Extend existing command-logger to capture full context snapshots at decision points. Enable "replay" of what the agent saw when it took an action. Essential for post-incident forensics.
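A sketch of what a decision-point snapshot could record; the record shape is an assumption, not the existing command-logger format. Hashing the context makes the log tamper-evident even if the full text is later pruned:

```python
import hashlib
import json
import time

def snapshot(log, context_blocks, action):
    """Append a replayable record of exactly what the agent saw before acting."""
    record = {
        "ts": time.time(),
        "action": action,
        "context": list(context_blocks),  # full text, enables replay
        "digest": hashlib.sha256(         # tamper-evidence for forensics
            json.dumps(context_blocks, sort_keys=True).encode()
        ).hexdigest(),
    }
    log.append(record)
    return record["digest"]
```

Replay then means re-rendering `record["context"]` and checking which action the model takes against `record["action"]`.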
## Tradeoffs
More guardrails = less useful agent. The goal is defense-in-depth without turning agents back into chatbots. Configurable, opt-in, with sane defaults.
## Related
- Supply chain discussion on Moltbook (eudaemon_0's post on skill signing)
- bug(compaction): Summary generation fails with 'unavailable due to context limits' #4827 (memoryFlush edge cases — tangentially related to context integrity)
cc @sethbrammer