Summary
A comprehensive benchmark + multi-layer defense framework for prompt injection in RAG-enabled agents. Reduces attack success from 73.2% to 8.7% across 847 adversarial test cases in 5 attack categories.
Source: arXiv 2511.15759 — Securing AI Agents Against Prompt Injection Attacks: A Comprehensive Benchmark and Defense Framework
Badrinath Ramakrishnan, Akshaya Balaji. Published 2025-11-19.
Key Results
- 847 adversarial test cases, 5 categories: direct injection, context manipulation, instruction override, data exfiltration, cross-context contamination
- Defense = content filtering + prompt architecture improvements + response verification (post-LLM check)
- 89.4% attack mitigation, 94.3% legitimate functionality preserved
- Evaluated across 7 LLMs — model-specific vulnerability profiles identified
Applicability to Zeph
Zeph already has ContentSanitizer + ExfiltrationGuard (epic #1195) covering content filtering and exfiltration.
Gap: The response verification layer is missing — no post-LLM check that the agent's output wasn't compromised by injected instructions.
Integration point: AgentLoop::turn() after LLM response, before tool execution dispatch.
- Scan LLM response for injected-instruction patterns (overrides of
autonomy_level, unauthorized memory writes, unexpected exfiltration paths)
- Cross-reference with known injection patterns from
ContentSanitizer::injection_patterns()
- If flagged: escalate to WARN, optionally block tool execution (configurable)
Complements: #1651 (PromptArmor — pre-screen at input), this adds post-LLM response verification.
Implementation Sketch
ResponseVerifier struct in zeph-core::security
verify_response(response: &str, injection_context: &InjectionContext) -> VerificationResult
- Config:
[security.response_verification] enabled = true, block_on_detection = false
- TUI: show SEC panel alert when response verification fires
Summary
A comprehensive benchmark + multi-layer defense framework for prompt injection in RAG-enabled agents. Reduces attack success from 73.2% to 8.7% across 847 adversarial test cases in 5 attack categories.
Source: arXiv 2511.15759 — Securing AI Agents Against Prompt Injection Attacks: A Comprehensive Benchmark and Defense Framework
Badrinath Ramakrishnan, Akshaya Balaji. Published 2025-11-19.
Key Results
Applicability to Zeph
Zeph already has
ContentSanitizer+ExfiltrationGuard(epic #1195) covering content filtering and exfiltration.Gap: The response verification layer is missing — no post-LLM check that the agent's output wasn't compromised by injected instructions.
Integration point:
AgentLoop::turn()after LLM response, before tool execution dispatch.autonomy_level, unauthorized memory writes, unexpected exfiltration paths)ContentSanitizer::injection_patterns()Complements: #1651 (PromptArmor — pre-screen at input), this adds post-LLM response verification.
Implementation Sketch
ResponseVerifierstruct inzeph-core::securityverify_response(response: &str, injection_context: &InjectionContext) -> VerificationResult[security.response_verification] enabled = true, block_on_detection = false