Description
SafeAgent (arXiv:2604.17562, April 2026) proposes a runtime protection architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories — not a single-turn filter.
Core Design
SafeAgent separates execution governance from semantic risk reasoning through two coordinated components:
- Runtime controller — mediates all actions around the agent loop; intercepts tool calls before and after execution; can retry, redirect, or halt based on policy arbitration
- Context-aware decision core — operates over persistent session state (full trajectory, not just current turn); encodes risk over cumulative context using operators: risk encoding → utility-cost evaluation → consequence modeling → policy arbitration → state synchronization
The key insight: multi-step prompt injection propagates through tool interactions and accumulated context. Input-output filtering per turn is insufficient because the attack may span multiple turns before the exploit fires. SafeAgent's stateful core evaluates risk over the full trajectory window.
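The split described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: every name here (`ToolCall`, `Verdict`, `SessionState`, `DecisionCore`, `RuntimeController`, `ToyCore`), the tool names, and the risk weights are assumptions made for the example. The controller only intercepts and enforces; the core holds the risk logic and reads the whole session state rather than the current turn.

```rust
/// One tool invocation as seen by the controller (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ToolCall {
    pub tool: String,
    pub args: String,
}

/// Actions the controller can take after policy arbitration.
#[derive(Debug, Clone, PartialEq)]
pub enum Verdict {
    Allow,
    Retry,
    Redirect(String),
    Halt,
}

/// Persistent session state: the full trajectory plus accumulated risk.
#[derive(Debug, Default)]
pub struct SessionState {
    pub trajectory: Vec<ToolCall>,
    pub cumulative_risk: f64,
}

/// Semantic risk reasoning, kept apart from execution governance.
pub trait DecisionCore {
    /// Risk encoding for one executed call and its output.
    fn encode_risk(&self, call: &ToolCall, output: &str) -> f64;
    /// Policy arbitration over the session so far and the pending call.
    fn decide(&self, state: &SessionState, call: &ToolCall) -> Verdict;
}

/// Execution governance: wraps the agent loop and owns the session state.
pub struct RuntimeController<C: DecisionCore> {
    core: C,
    state: SessionState,
}

impl<C: DecisionCore> RuntimeController<C> {
    pub fn new(core: C) -> Self {
        Self { core, state: SessionState::default() }
    }

    /// Pre-execution interception: consult the core before the tool runs.
    pub fn before_call(&self, call: &ToolCall) -> Verdict {
        self.core.decide(&self.state, call)
    }

    /// Post-execution interception: fold the result into session state.
    pub fn after_call(&mut self, call: ToolCall, output: &str) {
        self.state.cumulative_risk += self.core.encode_risk(&call, output);
        self.state.trajectory.push(call);
    }
}

/// Toy heuristic core; the weights and tool names are invented for illustration.
pub struct ToyCore {
    pub halt_at: f64,
}

impl DecisionCore for ToyCore {
    fn encode_risk(&self, call: &ToolCall, output: &str) -> f64 {
        let mut risk = 0.0;
        if call.tool == "read_file" && call.args.contains(".env") {
            risk += 0.5; // sensitive data entered the context
        }
        if output.contains("ignore previous instructions") {
            risk += 0.5; // injected instruction seen in tool output
        }
        risk
    }

    fn decide(&self, state: &SessionState, call: &ToolCall) -> Verdict {
        // An outbound call is blocked only when earlier turns built up enough risk.
        if call.tool == "http_post" && state.cumulative_risk >= self.halt_at {
            Verdict::Halt
        } else {
            Verdict::Allow
        }
    }
}
```

In this sketch an `http_post` call is allowed on a clean session, but the same call is halted once an earlier turn has surfaced an injected instruction and another has read a secrets file. A filter that inspects only the `http_post` turn would see nothing unusual.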
Performance
Experiments on Agent Security Bench (ASB) and InjecAgent show consistent improvement over baseline and text-level guardrail methods while maintaining competitive benign-task performance.
Relevance to Zeph
Zeph's current security layers (ContentSanitizer, ExfiltrationGuard, PolicyGate) operate per-turn. A multi-step prompt injection that gradually builds exfiltration context across turns is not explicitly detected.
Proposed Zeph mapping:
- ContentSanitizer = current per-turn input filter (existing)
- Runtime controller = PolicyGate (partial — intercepts tool execution, but stateless)
- Context-aware decision core = gap: no component accumulates risk signals across turns or models consequence propagation over the trajectory
A lightweight TrajectorySentinel in zeph-core that scores cumulative risk over the last N turns (using a small model or heuristic) and can emit a RiskAlert to the PolicyGate would cover the core gap without a full SafeAgent implementation.
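A rough sketch of what such a sentinel could look like, assuming per-turn risk scores in the range 0.0 to 1.0 are produced elsewhere (by a small model or a heuristic). The fields of `RiskAlert`, the constructor parameters, and the summing rule are all assumptions for illustration; the proposal only names the types, and nothing here reflects existing zeph-core APIs.

```rust
use std::collections::VecDeque;

/// Hypothetical alert payload handed to the PolicyGate.
#[derive(Debug, Clone, PartialEq)]
pub struct RiskAlert {
    pub cumulative_score: f64,
    pub turns_considered: usize,
}

/// Keeps the risk scores of the last N turns and alerts on their sum.
pub struct TrajectorySentinel {
    window: VecDeque<f64>,
    max_turns: usize,
    threshold: f64,
}

impl TrajectorySentinel {
    /// `max_turns` is the N in "last N turns"; it is clamped to at least 1.
    pub fn new(max_turns: usize, threshold: f64) -> Self {
        let max_turns = max_turns.max(1);
        Self {
            window: VecDeque::with_capacity(max_turns),
            max_turns,
            threshold,
        }
    }

    /// Records one turn's score and returns an alert when the windowed
    /// sum reaches the threshold.
    pub fn record_turn(&mut self, turn_score: f64) -> Option<RiskAlert> {
        if self.window.len() == self.max_turns {
            self.window.pop_front(); // the oldest turn ages out
        }
        self.window.push_back(turn_score);

        let cumulative: f64 = self.window.iter().sum();
        if cumulative >= self.threshold {
            Some(RiskAlert {
                cumulative_score: cumulative,
                turns_considered: self.window.len(),
            })
        } else {
            None
        }
    }
}
```

With a window of 3 and a threshold of 1.0, turn scores of 0.25, 0.25, and 0.5 raise an alert on the third turn even though no single turn would trip a per-turn limit of, say, 0.6. A plain sum is the simplest possible aggregate; a decayed or weighted sum would be a natural alternative if older turns should count for less.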
References
- SafeAgent, arXiv:2604.17562 (April 2026)