Source
"MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents" (arXiv:2502.05174, February 2025, updated June 2025)
https://arxiv.org/abs/2502.05174
Summary
Evaluates DeBERTa-based classifiers as injection detectors in LLM agent pipelines. Key findings:
Off-the-shelf DeBERTa checkpoints (e.g., mDeBERTa-v3-base-prompt-injection-v2) achieve a near-zero attack success rate on some attack patterns, which indicates high detection recall
However, they exhibit high false positive rates on benign tool outputs, misclassifying legitimate content as malicious
Root cause: DeBERTa was pre-trained on phishing/spam patterns, not on adversarial agent-specific prompt injection patterns
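MELON alternative: masked re-execution. The agent is re-run with the user prompt masked, and the tool calls from the masked run are compared with those from the original run; calls that appear in both runs are attributed to instructions injected through tool output rather than to the user. The paper presents this as a provable defense.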
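The comparison step of masked re-execution can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ToolCall` type and both function names are hypothetical, and exact equality on name and arguments is a simplifying assumption standing in for whatever similarity measure a real implementation would use.

```rust
/// A tool call emitted by the agent.
/// Hypothetical shape for illustration; not taken from MELON or Zeph.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToolCall {
    pub name: String,
    /// Canonicalized arguments, e.g. serialized JSON.
    pub args: String,
}

impl ToolCall {
    pub fn new(name: &str, args: &str) -> Self {
        Self { name: name.to_string(), args: args.to_string() }
    }
}

/// Returns the calls from the original run that also appear in the masked run.
/// A call that survives masking of the user task was probably driven by the
/// tool output rather than by the user.
/// Assumption: exact equality is used as the similarity test.
pub fn shared_tool_calls(original: &[ToolCall], masked: &[ToolCall]) -> Vec<ToolCall> {
    original
        .iter()
        .filter(|call| masked.contains(*call))
        .cloned()
        .collect()
}

/// Flags the step as suspicious when any call is shared between the two runs.
pub fn injection_suspected(original: &[ToolCall], masked: &[ToolCall]) -> bool {
    !shared_tool_calls(original, masked).is_empty()
}
```

A caller would collect the tool calls from the normal run and from the masked run, then treat a `true` result from `injection_suspected` as a reason to withhold or review the shared calls.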
Applicability to Zeph
Directly relevant to issue #2185 (Candle classifier implementation) and #2190 (classifier integration tests). Critical design constraints:
Off-the-shelf DeBERTa is NOT a drop-in hard guardrail — high FPR would block legitimate agent operations
Safe deployment path: use as soft signal / first-pass filter only, not as a hard block
For a hard guardrail, either:
Fine-tune DeBERTa on agent-specific injection examples (few-shot is sufficient per MELON paper)
Implement MELON-style masked re-execution as a secondary verification layer
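The soft-signal versus hard-block constraint above could be expressed roughly as in the sketch below. All names here (`ClassifierMode`, `Verdict`, `decide`) and the score threshold parameter are assumptions made for illustration; they are not an existing Zeph or Candle API.

```rust
/// How the classifier result is enforced.
/// Hypothetical type; soft signal is the default so that a high false
/// positive rate cannot block legitimate agent operations.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ClassifierMode {
    #[default]
    SoftSignal,
    HardBlock,
}

/// Outcome of evaluating one piece of tool output.
#[derive(Debug, Clone, PartialEq)]
pub enum Verdict {
    /// Score is below the threshold; pass the content through.
    Allow,
    /// Suspicious, but passed through with a flag for attention.
    Flag { score: f32 },
    /// Suspicious and rejected.
    Block { score: f32 },
}

/// Maps a classifier score to a verdict under the configured mode.
/// Assumption: `score` is an injection probability in [0, 1].
pub fn decide(mode: ClassifierMode, score: f32, threshold: f32) -> Verdict {
    if score < threshold {
        return Verdict::Allow;
    }
    match mode {
        ClassifierMode::SoftSignal => Verdict::Flag { score },
        ClassifierMode::HardBlock => Verdict::Block { score },
    }
}
```

Keeping the mode separate from the score means the same classifier output can later feed a stricter policy, for example after fine-tuning or once a secondary verification layer is in place, without changing the detector itself.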
Implementation Recommendation
When implementing the DeBERTa classifier (#2185):
soft_signal mode (default) vs hard_block mode config option
soft_signal: flag suspicious content for attention, do not block
Priority
P2 — informs design decisions for #2185 and #2190 before implementation.
Related Issues