research(security): SafeAgent — runtime governed tool mediation with context-aware decision core over session trajectory

## Description

**SafeAgent** (arXiv:2604.17562, April 2026) proposes a runtime protection architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories — not a single-turn filter.

## Core Design

SafeAgent separates execution governance from semantic risk reasoning through two coordinated components:

1. **Runtime controller** — mediates all actions around the agent loop; intercepts tool calls before and after execution; can retry, redirect, or halt based on policy arbitration
2. **Context-aware decision core** — operates over persistent session state (full trajectory, not just current turn); encodes risk over cumulative context using operators: risk encoding → utility-cost evaluation → consequence modeling → policy arbitration → state synchronization
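The split between the two components can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the names `Verdict`, `SessionState`, `DecisionCore`, `RuntimeController`, and the "limit on sensitive calls" policy are all assumptions made for the example. Retry and redirect are omitted for brevity, and the five operators are collapsed into two methods, `arbitrate` and `synchronize`.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    HALT = "halt"


@dataclass
class SessionState:
    # Persistent record of every executed tool call in the session,
    # not just the current turn.
    trajectory: list = field(default_factory=list)


class DecisionCore:
    """Hypothetical stand-in for the context-aware decision core."""

    def __init__(self, sensitive_tools, max_sensitive_calls=1):
        self.sensitive_tools = set(sensitive_tools)
        self.max_sensitive_calls = max_sensitive_calls

    def arbitrate(self, state, tool, args):
        # Toy policy: the decision depends on what already happened in the
        # session, not only on the call being requested right now.
        prior = sum(1 for rec in state.trajectory
                    if rec["tool"] in self.sensitive_tools)
        if tool in self.sensitive_tools and prior >= self.max_sensitive_calls:
            return Verdict.HALT
        return Verdict.ALLOW

    def synchronize(self, state, tool, args, result):
        # State synchronization: fold the executed call back into the session.
        state.trajectory.append({"tool": tool, "args": args, "result": result})


class RuntimeController:
    """Mediates every tool call, before and after execution."""

    def __init__(self, core, tools):
        self.core = core
        self.tools = tools
        self.state = SessionState()

    def call(self, tool, **args):
        verdict = self.core.arbitrate(self.state, tool, args)  # before execution
        if verdict is Verdict.HALT:
            return {"status": "halted", "result": None}
        result = self.tools[tool](**args)
        self.core.synchronize(self.state, tool, args, result)  # after execution
        return {"status": "ok", "result": result}
```

The point of the design is that the controller itself holds no policy logic. It only enforces whatever the core decides, so the risk model could be swapped without touching the mediation path.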

The key insight: multi-step prompt injection propagates through tool interactions and accumulated context. Input-output filtering per turn is insufficient because the attack may span multiple turns before the exploit fires. SafeAgent's stateful core evaluates risk over the full trajectory window.
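A toy example may make the gap concrete. In the sketch below, a stateless policy approves each tool call on its own merits, while a stateful check blocks the final call because of what the session has already accumulated. The tool names (`fetch_url`, `read_secrets`, `send_email`) and the two-flag taint model are assumptions for illustration only. They are far simpler than the risk encoding the paper describes.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool categories for the example.
UNTRUSTED_SOURCES = {"fetch_url"}      # may bring injected text into context
SENSITIVE_SOURCES = {"read_secrets"}   # bring private data into context
EXTERNAL_SINKS = {"send_email"}        # can move data out of the session
ALLOWED_TOOLS = UNTRUSTED_SOURCES | SENSITIVE_SOURCES | EXTERNAL_SINKS


def stateless_allows(tool: str) -> bool:
    """Per-turn policy: looks only at the current call."""
    return tool in ALLOWED_TOOLS


@dataclass
class TrajectoryState:
    saw_untrusted: bool = False
    holds_sensitive: bool = False

    def allows(self, tool: str) -> bool:
        """Check the call against accumulated context, then record it."""
        if tool in EXTERNAL_SINKS and self.saw_untrusted and self.holds_sensitive:
            return False
        self.saw_untrusted |= tool in UNTRUSTED_SOURCES
        self.holds_sensitive |= tool in SENSITIVE_SOURCES
        return True


def first_blocked_step(trajectory) -> Optional[int]:
    """Return the index of the first call the stateful check rejects, if any."""
    state = TrajectoryState()
    for i, tool in enumerate(trajectory):
        if not state.allows(tool):
            return i
    return None


attack = ["fetch_url", "read_secrets", "send_email"]
print(all(stateless_allows(t) for t in attack))  # every step passes per-turn review
print(first_blocked_step(attack))                # the stateful check stops the last step
```

Each of the three calls is individually permitted. Only the combination, visible from session state, marks the last one as risky.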

## Performance

The paper reports that experiments on Agent Security Bench (ASB) and InjecAgent show consistent improvement over baseline and text-level guardrail methods while maintaining competitive benign-task performance.

## Relevance to Zeph

Zeph's current security layers (ContentSanitizer, ExfiltrationGuard, PolicyGate) each operate on a single turn. A multi-step prompt injection that gradually builds exfiltration context across turns is therefore not explicitly detected by any of them.

**Proposed Zeph mapping:**
- Per-turn input filter = `ContentSanitizer` (existing)
- Runtime controller = PolicyGate (partial: it intercepts tool execution but is stateless)
- Context-aware decision core = **gap**: no component accumulates risk signals across turns or models consequence propagation over the trajectory

A lightweight `TrajectorySentinel` in `zeph-core` that scores cumulative risk over the last N turns (using a small model or heuristic) and can emit a `RiskAlert` to the PolicyGate would cover the core gap without a full SafeAgent implementation.
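As a rough sketch of the heuristic variant, the sentinel could keep a bounded window of per-turn scores and raise an alert when their sum crosses a threshold. The example is written in Python purely for brevity and is not Zeph code. The constructor parameters, the `observe` method, the fields of `RiskAlert`, and the keyword scorer are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RiskAlert:
    cumulative_score: float
    turns_considered: int


class TrajectorySentinel:
    """Scores cumulative risk over the last N turns (hypothetical design)."""

    def __init__(self, score_turn: Callable[[str], float],
                 window: int = 5, threshold: float = 1.0):
        self.score_turn = score_turn
        self.threshold = threshold
        # deque(maxlen=N) silently drops the oldest score once N are stored.
        self.scores = deque(maxlen=window)

    def observe(self, turn_text: str) -> Optional[RiskAlert]:
        self.scores.append(self.score_turn(turn_text))
        total = sum(self.scores)
        if total >= self.threshold:
            return RiskAlert(total, len(self.scores))
        return None


def keyword_score(text: str) -> float:
    """Toy scorer (assumption): 0.4 per suspicious keyword found in the turn."""
    keywords = ("password", "api key", "upload", "external")
    lowered = text.lower()
    return 0.4 * sum(kw in lowered for kw in keywords)


sentinel = TrajectorySentinel(keyword_score, window=5, threshold=1.0)
for turn in ["list the api key files",
             "what is the password format",
             "upload the report"]:
    alert = sentinel.observe(turn)
    if alert is not None:
        print("RiskAlert after", alert.turns_considered, "turns")
```

No single turn in the loop reaches the threshold on its own, so a per-turn check with the same scorer would stay silent. The window size bounds memory use and lets old signals age out, at the cost of missing attacks that are spread over more than N turns.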

## References
- SafeAgent: https://arxiv.org/abs/2604.17562
- Agent Security Bench: https://github.com/agiresearch/ASB
