research(security): MELON paper — DeBERTa injection detectors have high FPR; use as soft signal only (arXiv:2502.05174)

## Source

"MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents" (arXiv:2502.05174, February 2025, updated June 2025)
https://arxiv.org/abs/2502.05174

## Summary

Evaluates DeBERTa-based classifiers as injection detectors in LLM agent pipelines. Key findings:

- Off-the-shelf DeBERTa checkpoints (e.g., `mDeBERTa-v3-base-prompt-injection-v2`) reduce attack success rate to near zero on some attack patterns, i.e., detection recall is high
- However, they exhibit **high false positive rates (FPR)** on benign tool outputs, misclassifying legitimate content as malicious
- Likely root cause: these detector checkpoints were trained on phishing/spam-style patterns, not on adversarial, agent-specific prompt injection patterns
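- MELON alternative: **masked re-execution**. The agent step is re-run with the user prompt masked, and the tool calls from the masked run are compared with those from the original run. Calls that still appear without the user task are treated as driven by injected content. The authors present this as a provable defense.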
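
The comparison step of masked re-execution can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ToolCall` type and both function names are hypothetical, and exact equality stands in for whatever similarity measure a real implementation would use to match tool calls.

```rust
/// Hypothetical representation of a tool call emitted by the agent.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToolCall {
    pub name: String,
    pub args: String,
}

/// Returns the tool calls from the original run that also appear in the
/// masked run (the run where the user task was replaced by a neutral prompt).
/// Assumption: exact equality is used as the matching criterion.
pub fn calls_surviving_masking<'a>(
    original: &'a [ToolCall],
    masked: &[ToolCall],
) -> Vec<&'a ToolCall> {
    original.iter().filter(|c| masked.contains(*c)).collect()
}

/// A call that survives masking cannot have been motivated by the user task,
/// so it is treated as a sign that tool output is steering the agent.
pub fn injection_suspected(original: &[ToolCall], masked: &[ToolCall]) -> bool {
    !calls_surviving_masking(original, masked).is_empty()
}
```

If the masked run issues no tool calls in common with the original run, the step is treated as clean under this sketch.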

## Applicability to Zeph

Directly relevant to issue #2185 (Candle classifier implementation) and #2190 (classifier integration tests). Critical design constraints:

1. **Off-the-shelf DeBERTa is NOT a drop-in hard guardrail** — high FPR would block legitimate agent operations
2. Safe deployment path: use as **soft signal / first-pass filter only**, not as a hard block
3. For a hard guardrail, either:
   - Fine-tune DeBERTa on agent-specific injection examples (few-shot is sufficient per MELON paper)
   - Implement MELON-style masked re-execution as a secondary verification layer
4. Integration tests (#2190) must include FPR measurement on benign tool outputs, not just attack detection
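
For constraint 4, a test helper along these lines could report FPR alongside recall. It is a sketch under assumptions: `Label`, `Confusion`, and the `(label, flagged)` input shape are hypothetical and not taken from the Zeph codebase.

```rust
/// Ground-truth label for a tool output in the evaluation set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Label {
    Benign,
    Injection,
}

/// Confusion-matrix counts for a binary injection detector.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Confusion {
    pub tp: usize,
    pub fp: usize,
    pub tn: usize,
    pub fn_: usize,
}

impl Confusion {
    /// Builds counts from (ground truth, detector flagged?) pairs.
    pub fn from_results(results: &[(Label, bool)]) -> Self {
        let mut c = Confusion::default();
        for &(label, flagged) in results {
            match (label, flagged) {
                (Label::Injection, true) => c.tp += 1,
                (Label::Injection, false) => c.fn_ += 1,
                (Label::Benign, true) => c.fp += 1,
                (Label::Benign, false) => c.tn += 1,
            }
        }
        c
    }

    /// FPR = FP / (FP + TN); `None` when there are no benign samples.
    pub fn false_positive_rate(&self) -> Option<f64> {
        let benign = self.fp + self.tn;
        (benign > 0).then(|| self.fp as f64 / benign as f64)
    }

    /// Recall = TP / (TP + FN); `None` when there are no injection samples.
    pub fn recall(&self) -> Option<f64> {
        let attacks = self.tp + self.fn_;
        (attacks > 0).then(|| self.tp as f64 / attacks as f64)
    }
}
```

An integration test could then assert an upper bound on `false_positive_rate()` over a benign tool-output corpus in addition to a lower bound on `recall()`; the actual bounds would need to be chosen for the project.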

## Implementation Recommendation

When implementing the DeBERTa classifier (#2185):
- Add a config option that selects between `soft_signal` mode (default) and `hard_block` mode
- Default to `soft_signal`: flag suspicious content for review without blocking it
- Document FPR risk prominently in the config comments
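
One possible shape for this config is sketched below. All names here (`ClassifierMode`, `ClassifierConfig`, `Verdict`, `decide`) and the `0.5` threshold are illustrative assumptions, not existing Zeph APIs or tuned values.

```rust
/// How a detector hit is acted upon.
/// WARNING: off-the-shelf DeBERTa detectors are reported to have a high
/// false positive rate on benign tool outputs, so `HardBlock` may block
/// legitimate agent operations.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ClassifierMode {
    /// Flag suspicious content but let it through (the default).
    #[default]
    SoftSignal,
    /// Reject content the classifier considers an injection.
    HardBlock,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ClassifierConfig {
    pub mode: ClassifierMode,
    /// Score at or above which content counts as suspicious (placeholder value).
    pub threshold: f32,
}

impl Default for ClassifierConfig {
    fn default() -> Self {
        Self { mode: ClassifierMode::default(), threshold: 0.5 }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Flag,
    Block,
}

/// Maps a classifier score to an action according to the configured mode.
pub fn decide(config: &ClassifierConfig, injection_score: f32) -> Verdict {
    if injection_score < config.threshold {
        return Verdict::Allow;
    }
    match config.mode {
        ClassifierMode::SoftSignal => Verdict::Flag,
        ClassifierMode::HardBlock => Verdict::Block,
    }
}
```

With the default config, a high score yields `Verdict::Flag` rather than `Verdict::Block`, so switching to a hard guardrail is always an explicit opt-in.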

## Priority

P2 — informs design decisions for #2185 and #2190 before implementation.

## Related Issues

- #2185 (feat): Candle-backed lightweight classifiers
- #2190 (test): Integration tests for Candle-backed classifier models
- #2178 (research): MCP protocol-level security vulnerabilities

