Research Finding
PromptArmor's prompt injection defense approach uses a small, fast classifier LLM to screen incoming prompts before they reach the main agent LLM. The classifier is fine-tuned specifically to detect injection patterns and runs at sub-50ms latency on a 3B-param model.
This is distinct from #1630 (TrustBench pre-execution action verification) — that operates after tool call formulation. This guard operates at the input boundary, before any LLM inference.
Applicability
Zeph's ContentSanitizer applies regex-based sanitization. A lightweight LLM-based classifier would catch semantic injection patterns that regex cannot:
- "Ignore all previous instructions..."
- Multi-language injection (regex-based defenses often miss non-English variants)
- Base64-encoded injection payloads
- Indirect injection via tool results (web scrape returns adversarial content)
Two insertion points:
- User input boundary — in
CliChannel/AcpSession before the prompt enters the agent loop
- Tool result boundary — in
CompositeExecutor after tool execution, before results enter context (indirect injection)
Design Sketch
[security.guardrail]
enabled = false
provider = "ollama"
model = "llama-guard-3:1b"
timeout_ms = 500
action = "block" # or "warn"
struct GuardrailFilter {
provider: Arc<dyn LlmProvider>,
action: GuardrailAction,
}
impl GuardrailFilter {
async fn check(&self, content: &str) -> GuardrailVerdict;
}
Uses existing LlmProvider trait — no new HTTP client needed.
Source
Research session 2026-03-13. PromptArmor injection defense (promptarmor.ai, see also arXiv:2312.14197).
Priority
Medium — opt-in hardening for high-security deployments. The regex ContentSanitizer remains the default.
Related
#1630 (TrustBench pre-execution verification — different layer)
#1195 (Untrusted Content Isolation epic)
Research Finding
PromptArmor's prompt injection defense approach uses a small, fast classifier LLM to screen incoming prompts before they reach the main agent LLM. The classifier is fine-tuned specifically to detect injection patterns and runs at sub-50ms latency on a 3B-param model.
This is distinct from
#1630(TrustBench pre-execution action verification) — that operates after tool call formulation. This guard operates at the input boundary, before any LLM inference.Applicability
Zeph's
ContentSanitizerapplies regex-based sanitization. A lightweight LLM-based classifier would catch semantic injection patterns that regex cannot:Two insertion points:
CliChannel/AcpSessionbefore the prompt enters the agent loopCompositeExecutorafter tool execution, before results enter context (indirect injection)Design Sketch
Uses existing
LlmProvidertrait — no new HTTP client needed.Source
Research session 2026-03-13. PromptArmor injection defense (promptarmor.ai, see also arXiv:2312.14197).
Priority
Medium — opt-in hardening for high-security deployments. The regex
ContentSanitizerremains the default.Related
#1630(TrustBench pre-execution verification — different layer)#1195(Untrusted Content Isolation epic)