research(security): SafeAgent — runtime governed tool mediation with context-aware decision core over session trajectory #3570

@bug-ops

Description

SafeAgent (arXiv:2604.17562, April 2026) proposes a runtime protection architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories — not a single-turn filter.

Core Design

SafeAgent separates execution governance from semantic risk reasoning through two coordinated components:

  1. Runtime controller: mediates every action in the agent loop, intercepting tool calls before and after execution; it can retry, redirect, or halt a call based on policy arbitration
  2. Context-aware decision core: operates over persistent session state (the full trajectory, not just the current turn) and evaluates cumulative risk through a chain of operators: risk encoding → utility-cost evaluation → consequence modeling → policy arbitration → state synchronization
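
The split between the two components can be sketched roughly as follows. This is a minimal illustration, not SafeAgent's implementation: the type names (`Verdict`, `ToolCall`, `SessionState`, `DecisionCore`), the `mediate` function, the keyword heuristic, and all numeric weights are assumptions made for the example, and the retry and redirect outcomes are omitted for brevity.

```rust
/// Outcomes the controller can act on (illustrative; retry/redirect omitted).
#[derive(Debug, Clone, PartialEq)]
enum Verdict {
    Allow,
    Halt,
}

/// A tool call as seen by the controller (hypothetical shape).
#[derive(Debug, Clone)]
struct ToolCall {
    tool: String,
    args: String,
}

/// Persistent session state: the full trajectory plus a running risk total.
#[derive(Debug, Default)]
struct SessionState {
    trajectory: Vec<ToolCall>,
    cumulative_risk: f64,
}

struct DecisionCore {
    halt_threshold: f64,
}

impl DecisionCore {
    /// Risk encoding: a toy keyword heuristic standing in for a real scorer.
    fn encode_risk(&self, text: &str) -> f64 {
        let lowered = text.to_lowercase();
        let mut risk = 0.0;
        if lowered.contains("secret") || lowered.contains("token") {
            risk += 0.4;
        }
        if lowered.contains("http://") || lowered.contains("https://") {
            risk += 0.3;
        }
        risk
    }

    /// Runs the operators in order and returns the arbitration result.
    fn decide(&self, state: &mut SessionState, call: &ToolCall) -> Verdict {
        // Risk encoding.
        let step_risk = self.encode_risk(&call.args);
        // Utility-cost evaluation (made-up utilities per tool).
        let utility = if call.tool == "read_file" { 0.2 } else { 0.1 };
        // Consequence modeling: what the session risk becomes if the call runs.
        let projected = state.cumulative_risk + step_risk;
        // Policy arbitration.
        let verdict = if projected - utility >= self.halt_threshold {
            Verdict::Halt
        } else {
            Verdict::Allow
        };
        // State synchronization: the attempt is recorded either way.
        state.cumulative_risk = projected;
        state.trajectory.push(call.clone());
        verdict
    }
}

/// Runtime controller: checks before execution, then folds the tool output
/// back into the session state and checks again.
fn mediate<F>(
    core: &DecisionCore,
    state: &mut SessionState,
    call: ToolCall,
    execute: F,
) -> Result<String, String>
where
    F: FnOnce(&ToolCall) -> String,
{
    if core.decide(state, &call) == Verdict::Halt {
        return Err(format!("halted before executing {}", call.tool));
    }
    let output = execute(&call);
    // Tool output can carry injected instructions or sensitive data too.
    state.cumulative_risk += core.encode_risk(&output);
    if state.cumulative_risk >= core.halt_threshold {
        return Err(format!("halted after executing {}", call.tool));
    }
    Ok(output)
}
```

With a `halt_threshold` of 0.8, an `http_post` call whose arguments contain a URL and the word "token" passes on a fresh session, but is halted when an earlier turn in the same session already returned token-bearing output, because the decision is taken over the accumulated state rather than the single call.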

The key insight: multi-step prompt injection propagates through tool interactions and accumulated context. Input-output filtering per turn is insufficient because the attack may span multiple turns before the exploit fires. SafeAgent's stateful core evaluates risk over the full trajectory window.
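
To make the failure mode concrete, here is a small sketch of a staged attack in which no single turn looks like an exploit. The three-signal taxonomy (`untrusted_instruction`, `sensitive_read`, `outbound_send`) and the function names are assumptions chosen for illustration; they are not taken from the paper or from Zeph.

```rust
/// Signals a single turn may exhibit (hypothetical taxonomy).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Signals {
    untrusted_instruction: bool,
    sensitive_read: bool,
    outbound_send: bool,
}

impl Signals {
    /// The exploit is complete once all three signals are present.
    fn is_exploit(&self) -> bool {
        self.untrusted_instruction && self.sensitive_read && self.outbound_send
    }

    /// Union of the signals seen so far with those of another turn.
    fn merge(&self, other: &Signals) -> Signals {
        Signals {
            untrusted_instruction: self.untrusted_instruction || other.untrusted_instruction,
            sensitive_read: self.sensitive_read || other.sensitive_read,
            outbound_send: self.outbound_send || other.outbound_send,
        }
    }
}

/// Stateless filter: inspects each turn in isolation.
fn per_turn_detects(turns: &[Signals]) -> bool {
    turns.iter().any(|turn| turn.is_exploit())
}

/// Stateful check: index of the turn at which the accumulated signals
/// complete the exploit, if any.
fn trajectory_detects_at(turns: &[Signals]) -> Option<usize> {
    let mut seen = Signals::default();
    for (index, turn) in turns.iter().enumerate() {
        seen = seen.merge(turn);
        if seen.is_exploit() {
            return Some(index);
        }
    }
    None
}

/// Three-turn attack: injected instruction, then a sensitive read,
/// then an outbound request. Each turn carries exactly one signal.
fn staged_attack() -> Vec<Signals> {
    vec![
        Signals { untrusted_instruction: true, ..Signals::default() },
        Signals { sensitive_read: true, ..Signals::default() },
        Signals { outbound_send: true, ..Signals::default() },
    ]
}
```

For `staged_attack()`, `per_turn_detects` returns `false` while `trajectory_detects_at` returns `Some(2)`: the exploit only becomes visible at the third turn, and only to a check that remembers the first two.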

Performance

Experiments on Agent Security Bench (ASB) and InjecAgent show consistent improvement over baseline and text-level guardrail methods while maintaining competitive benign-task performance.

Relevance to Zeph

Zeph's current security layers (ContentSanitizer, ExfiltrationGuard, PolicyGate) operate per-turn. A multi-step prompt injection that gradually builds exfiltration context across turns is not explicitly detected.

Proposed Zeph mapping:

  • Per-turn input filter = ContentSanitizer (existing)
  • Runtime controller = PolicyGate (partial — intercepts tool execution, but stateless)
  • Context-aware decision core = gap: no component accumulates risk signals across turns or models consequence propagation over the trajectory

A lightweight TrajectorySentinel in zeph-core would cover the core gap without a full SafeAgent implementation: it would score cumulative risk over the last N turns (using a small model or a heuristic) and could emit a RiskAlert to the PolicyGate.
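
One possible shape for such a component, as a rough sketch only. `TrajectorySentinel` and `RiskAlert` are the names proposed above, but their fields, the `observe` method, the sliding-window sum, and the threshold semantics are all assumptions; the per-turn score is taken as an input and would come from a heuristic or small model that is not shown here.

```rust
use std::collections::VecDeque;

/// Alert handed to the policy layer (fields are assumptions for illustration).
#[derive(Debug, Clone, PartialEq)]
struct RiskAlert {
    cumulative_risk: f64,
    turns_considered: usize,
}

/// Keeps the risk scores of the last `window` turns and alerts when their
/// sum reaches `threshold`.
struct TrajectorySentinel {
    window: usize,
    threshold: f64,
    scores: VecDeque<f64>,
}

impl TrajectorySentinel {
    fn new(window: usize, threshold: f64) -> Self {
        // A window of zero would never evict, so clamp it to at least one turn.
        let window = window.max(1);
        Self {
            window,
            threshold,
            scores: VecDeque::with_capacity(window),
        }
    }

    /// Records one turn's risk score and returns an alert if the windowed
    /// total reaches the threshold.
    fn observe(&mut self, turn_risk: f64) -> Option<RiskAlert> {
        if self.scores.len() == self.window {
            // Drop the oldest turn so only the last `window` turns count.
            self.scores.pop_front();
        }
        self.scores.push_back(turn_risk);
        let cumulative: f64 = self.scores.iter().sum();
        if cumulative >= self.threshold {
            Some(RiskAlert {
                cumulative_risk: cumulative,
                turns_considered: self.scores.len(),
            })
        } else {
            None
        }
    }
}
```

With `TrajectorySentinel::new(3, 0.75)`, three consecutive turns scored 0.25 each stay below the threshold individually, but the third call to `observe` returns a `RiskAlert`. A plain windowed sum is the simplest choice; a decayed sum or a learned scorer would slot into `observe` without changing the interface to the PolicyGate.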

Labels

P3: Research, medium-high complexity
research: Research-driven improvement
