
spec(security): plan trust context and audience policy #249

Merged
Aaronontheweb merged 15 commits into dev from feature/trust-context-security-planning on Mar 23, 2026
Conversation

@Aaronontheweb

Collaborator

Summary

  • add a new OpenSpec change for a cross-cutting trust-context and audience model spanning channels, memory, tools, MCP, and source provenance
  • define strict-default, fail-closed policy behavior with downgrade-only trust transitions and posture-aware shell handling
  • add follow-on implementation tasks for config schema updates, doctor/onboarding UX, and future sandboxed execution work

Testing

  • not run (planning/spec change only)

@Aaronontheweb

Collaborator Author


Critical missing pieces based on real-world agent failures

Great work on the trust-context model. Based on our recent analysis of real agent failures (OpenClaw email disasters), I see gaps that need addressing before implementation starts.

Real-world evidence of why this matters

Our blog post documents 5 specific failure modes from OpenClaw:

  1. Speed-run deletions - Agent ignored "confirm before acting" and mass-deleted an inbox faster than a human could intervene. Meta AI safety director Summer Yue had to run to her Mac to kill the process.

  2. Infinite loops - Sent 500+ confirmation messages to a user's wife, requiring a power pull to stop. Story covered by Bloomberg.

  3. Internal monologue leaks - Exposed file paths, API errors, and other customers' data to public channels.

  4. Fabricated emails - Created fake reply chains that had real-world consequences. Ars Technica coverage.

  5. Prompt injection - Extracted SSH keys from hidden instructions in email bodies. Proof of concept by researcher Johann Rehberger.

These aren't theoretical. They're happening right now with agents that have "good intentions" but no guardrails.

How other agents handle this

OpenClaw explicitly states in their security docs that they assume a "personal assistant model" with one trusted operator boundary per gateway, and they don't support adversarial multi-tenant scenarios.

Hermes Agent uses a 3-tier authorization model with allowlists, DM pairing with one-time codes, rate limiting (1 code per 10 minutes), and file permissions set to 0600.

What the trust-context model prevents

Your design correctly identifies the root cause: no gate between agent decision and action. The trust-context approach should:

  • Block speed-run behavior via rate limiting and confirmation requirements
  • Prevent infinite loops via working-context timeouts
  • Stop internal monologue leaks via audience-aware memory/output filtering
  • Catch prompt injection via payload provenance validation

Critical missing pieces

0. Gateway Authorization (NEW)

  • Rate limiting at connection level (the 500-iMessage story shows why this is non-negotiable)
  • JWT bearer tokens for SignalR connections
  • Connection health checks with auto-reconnect limits
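
To make the connection-level rate limiting concrete, here is a minimal sketch of a per-connection token bucket. Everything in it is an assumption for illustration: `TokenBucket`, `GatewayLimiter`, the capacity/rate values, and the connection IDs are hypothetical names, not part of the spec or of any existing gateway code.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts of up to `capacity` actions, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # fail closed: the caller must drop or defer the action


class GatewayLimiter:
    """Keeps one bucket per connection ID (hypothetical gateway-side helper)."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, connection_id: str) -> bool:
        if connection_id not in self.buckets:
            self.buckets[connection_id] = TokenBucket(self.capacity, self.rate, self.clock)
        return self.buckets[connection_id].try_acquire()
```

With a small capacity and a slow refill rate, a runaway loop on one connection exhausts its own bucket after a handful of sends while other connections are unaffected. Passing `clock` in makes the limiter testable without sleeping.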

1. Sandbox enforcement

  • The "speed-run" incident happened because there was no sandbox to throttle execution
  • You can't have shell mode policy without an actual sandbox implementation
  • OpenClaw's "Computer Use" mode needs explicit isolation

2. Memory migration strategy

  • "Conservative defaults" is too vague
  • Need concrete rules for existing memories when audience model changes
  • What happens if a "team" memory gets exposed to "public" context?

3. Verified transport criteria

  • What makes a webhook "verified"?
  • Signature verification requirements
  • Payload taint detection rules
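
One possible answer to "what makes a webhook verified" is an HMAC signature over the timestamp and body, plus a freshness window to limit replays. The sketch below assumes a shared secret and a `timestamp.body` signing scheme; the function names, the 300-second window, and the message layout are illustrative assumptions, not something the spec defines.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_payload(secret: bytes, body: bytes, timestamp: str) -> str:
    """Hex HMAC-SHA256 over '<timestamp>.<body>' (assumed message layout)."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, timestamp: str, signature: str,
                   now: Optional[float] = None, max_skew_seconds: int = 300) -> bool:
    """Return True only if the timestamp is fresh and the signature matches."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False  # malformed timestamp: treat as unverified
    current = time.time() if now is None else now
    if abs(current - sent_at) > max_skew_seconds:
        return False  # stale or replayed delivery
    expected = sign_payload(secret, body, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Under the fail-closed policy, a delivery that fails this check would simply stay in the `public` audience rather than being rejected outright; that mapping is also an assumption here.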

Recommendation

Add Phase 0 (Gateway Authorization) to the task list before starting Phase 1. The trust-context model is the right direction, but it needs authorization enforcement at the transport layer first.

Also worth adding the OpenClaw failure cases as a reference section in the design doc — they're perfect examples of what happens when these patterns aren't in place.

@Aaronontheweb force-pushed the feature/trust-context-security-planning branch 3 times, most recently from 163c1d7 to a0e1a2d on March 21, 2026 04:05

@Aaronontheweb

Collaborator Author

Proposed security-policy / trust-context test sequence now that the branch is rebased onto latest dev and the full solution test suite is green:

  1. Trust-context derivation
  • Confirm Slack thread turns enter team
  • Confirm local / SignalR / TUI turns enter personal
  • Confirm any untrusted/public ingress stays public
  • Verify posture only downgrades capability; nothing auto-widens to personal
  2. Tool exposure and invocation
  • public: verify restricted tool set, no shell, no high-impact MCP tools
  • team: verify team-safe tools only; shell still denied
  • personal: verify shell only works when ShellExecutionMode=HostAllowed and audience profiles allow it
  • Confirm denied tools fail closed with an explicit policy reason
  3. MCP discovery
  • Run search_tools in each audience and confirm only allowed MCP servers/tools appear
  • Verify dynamically discovered MCP tools still honor invocation policy after discovery
  • Verify sensitive/high-impact capability classes stay hidden outside personal
  4. Memory write / recall policy
  • Persist a durable fact in a personal turn and confirm it is not visible from public or team
  • Persist a team fact and confirm it is visible to team and personal, but not public
  • Confirm evidence is searchable but never auto-recalled
  • Confirm explicit find_memories / get_memories respect both audience and boundary
  • Confirm shared project facts can cross channels only within the same authorized boundary
  5. Secret handling
  • Try to store raw secret material and confirm it is rejected/redacted before durable persistence
  • Confirm secret-bearing memory never shows up in auto recall
  6. Public file confinement
  • In a public session, verify file read/write stays confined to the session directory
  • Confirm path traversal / arbitrary host-path access is denied
  7. Operator workflows
  • Run netclaw doctor and confirm bad/missing audience profiles are flagged
  • Run init/onboarding flow and confirm recommended default profiles are generated safely
  8. Regression smoke
  • Re-run the end-to-end happy paths for Slack + SignalR after the negative tests
  • If all of the above passes, move the PR out of draft and do a final mergeability / CI check

If helpful, I can turn this into a checkbox-based test matrix next.
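
As a starting point for such a matrix, the first few checks can be expressed against a small model of the policy. This is only a sketch under stated assumptions: the `Audience` ordering, the ingress keys, and the helper names are hypothetical, and the memory check ignores the boundary dimension entirely.

```python
from enum import IntEnum


class Audience(IntEnum):
    """Ordered so that a larger value means more trust."""
    PUBLIC = 0
    TEAM = 1
    PERSONAL = 2


# Hypothetical ingress identifiers mapped to their starting audience.
INGRESS_AUDIENCE = {
    "slack_thread": Audience.TEAM,
    "local": Audience.PERSONAL,
    "signalr": Audience.PERSONAL,
    "tui": Audience.PERSONAL,
}


def derive_audience(ingress: str) -> Audience:
    """Unknown or untrusted ingress fails closed to PUBLIC."""
    return INGRESS_AUDIENCE.get(ingress, Audience.PUBLIC)


def apply_posture(current: Audience, requested: Audience) -> Audience:
    """Downgrade-only transition: the result never exceeds the current audience."""
    return min(current, requested)


def can_recall(memory_audience: Audience, turn_audience: Audience) -> bool:
    """A memory is visible only to turns at or above the audience it was written in."""
    return turn_audience >= memory_audience


def shell_allowed(audience: Audience, shell_execution_mode: str,
                  profile_allows_shell: bool) -> bool:
    """Shell requires a personal turn, HostAllowed mode, and an allowing profile."""
    return (audience == Audience.PERSONAL
            and shell_execution_mode == "HostAllowed"
            and profile_allows_shell)
```

Modelling the audiences as an ordered enum keeps the downgrade-only rule to a single `min` call and makes "nothing auto-widens to personal" easy to assert.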

@Aaronontheweb

Collaborator Author

Cross-reference: Per-turn memory policy filtering limitation

Discovered during #370 (memory recall optimization) review — filed as #376.

The trust context spec includes scenarios where per-turn audience/sensitivity filtering excludes memories from recall when trust degrades mid-session. However, the LLM's own responses are persisted to _state.History while memory injections are transient. This means:

  1. Turn 1 (high trust): Memory recalled — "preferred airport is IEH"
  2. LLM responds: "I'll book from IEH..." → persisted to history
  3. Trust degrades → policy excludes the memory from recall
  4. Information is still in the conversation history from the LLM's prior output

Per-turn filtering still provides value as damage limitation — it prevents additional sensitive memories from being introduced after trust degrades, limiting blast radius. But it can't protect information that was already surfaced in a higher-trust turn. The specs should be honest about this limitation.

The broader question is whether the trust boundary should be session-scoped rather than turn-scoped — trust degradation mid-session could fork/terminate the session rather than filtering recall while history is already contaminated. See #376 for the full discussion.
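
The limitation can be reproduced with a toy model. The sketch below is not the real session code: `Memory`, `Session`, the integer trust levels, and the prompt layout are all hypothetical, chosen only to show transient recall next to persisted history.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memory:
    text: str
    min_trust: int  # lowest trust level allowed to recall this memory


@dataclass
class Session:
    # Persisted across turns, standing in for _state.History.
    history: List[str] = field(default_factory=list)

    def build_prompt(self, memories: List[Memory], trust: int, user_message: str) -> List[str]:
        # Recall is filtered per turn and never stored on the session.
        recalled = [m.text for m in memories if trust >= m.min_trust]
        return recalled + self.history + [user_message]

    def record_turn(self, user_message: str, reply: str) -> None:
        # The model's reply is persisted, including anything it echoed from recall.
        self.history.extend([user_message, reply])


memories = [Memory("preferred airport is IEH", min_trust=2)]
session = Session()

high_trust_prompt = session.build_prompt(memories, trust=2, user_message="book a flight")
session.record_turn("book a flight", "I'll book from IEH")

# Trust degrades: the memory is filtered out, but the reply in history still carries it.
low_trust_prompt = session.build_prompt(memories, trust=0, user_message="where from?")
leaked = any("IEH" in line for line in low_trust_prompt)

# Session-scoped alternative: a fresh session on downgrade starts with clean history.
forked_prompt = Session().build_prompt(memories, trust=0, user_message="where from?")
forked_leaked = any("IEH" in line for line in forked_prompt)
```

Here `leaked` ends up true and `forked_leaked` false, which is the trade-off in question: per-turn filtering limits new exposure, while only a fork or termination removes what history already holds.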

@Aaronontheweb

Collaborator Author

Dependency: Skills directory must be always-readable

The skill discovery redesign (#355) moves from internal daemon-side file reads
to LLM-invoked file_read tool calls. This means the skills directory
(~/.netclaw/skills/ and feeds/skills/.system/files/) must be whitelisted
in the read policy by default, regardless of security posture.

These paths should join identity files (SOUL.md, AGENTS.md, TOOLING.md) in the
"always allowed" read set:

  • ~/.netclaw/skills/** — user-installed skills
  • System skill feed paths — operator-managed skills

Without this, the LLM will get "Access denied by security policy" when it tries
to load a skill via file_read, breaking the entire skill discovery pipeline.

This is the same pattern as identity files — the bot needs to read its own
operational guidance to function. Blocking it would be like blocking the system
prompt.
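
A minimal sketch of what an "always allowed" read check could look like, assuming the policy is expressed as a list of root directories. `is_always_readable`, the example root, and the file names are hypothetical; the real read policy may be structured differently.

```python
from pathlib import Path
from typing import Iterable


def is_always_readable(requested: Path, allowed_roots: Iterable[Path]) -> bool:
    """True if `requested` resolves to a location inside one of the allowed roots."""
    # resolve() normalises '..' segments (and symlinks where they exist),
    # so a traversal out of the skills directory is not treated as inside it.
    target = requested.resolve()
    for root in allowed_roots:
        resolved_root = root.resolve()
        if target == resolved_root or resolved_root in target.parents:
            return True
    return False


# Hypothetical root standing in for ~/.netclaw/skills/
skills_root = Path("/tmp/netclaw-demo/.netclaw/skills")
inside = is_always_readable(skills_root / "git" / "SKILL.md", [skills_root])
escaped = is_always_readable(skills_root / ".." / "secrets.json", [skills_root])
```

Resolving before comparing matters: a plain string-prefix check would accept `skills/../secrets.json`, which would turn the always-readable exemption into a bypass of the rest of the read policy.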

@Aaronontheweb force-pushed the feature/trust-context-security-planning branch from d20e0ed to 35b3386 on March 22, 2026 23:53
@Aaronontheweb marked this pull request as ready for review on March 23, 2026 00:39
@Aaronontheweb merged commit a800e56 into dev on Mar 23, 2026
3 checks passed
@Aaronontheweb deleted the feature/trust-context-security-planning branch on March 23, 2026 00:46

1 participant