Summary
arXiv:2602.13597 — submitted February 2026. "AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks".
Introduces a three-class DeBERTa-v3-base classifier (misaligned-instruction / aligned-instruction / no-instruction) using LLM attention maps as features. Substantially reduces false positives on benign tool outputs that contain instruction-like text (e.g., grammar suggestions, API return messages).
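The three-class output described above could be consumed by a sanitizer roughly as follows. This is a minimal sketch in Rust, not the paper's method or existing Zeph code: the type names (`InstructionClass`, `ClassScores`, `SanitizerAction`), the `decide` function, and the `quarantine_threshold` parameter are all hypothetical, and the classifier itself is assumed to exist elsewhere and to return one score per class.

```rust
/// Three-way label mirroring the classes named in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstructionClass {
    NoInstruction,
    AlignedInstruction,
    MisalignedInstruction,
}

/// Hypothetical actions a sanitizer could take on a tool output.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SanitizerAction {
    /// Forward the tool output unchanged.
    PassThrough,
    /// Forward the output, but mark it as containing instruction-like text.
    Annotate,
    /// Withhold the output from the model context.
    Quarantine,
}

/// Hypothetical classifier output: one score per class (assumed to sum to about 1).
#[derive(Debug, Clone, Copy)]
pub struct ClassScores {
    pub no_instruction: f32,
    pub aligned: f32,
    pub misaligned: f32,
}

impl ClassScores {
    /// Returns the highest-scoring class. Ties resolve toward the more
    /// suspicious class (misaligned, then aligned), which is the conservative choice.
    pub fn top_class(&self) -> InstructionClass {
        if self.misaligned >= self.aligned && self.misaligned >= self.no_instruction {
            InstructionClass::MisalignedInstruction
        } else if self.aligned >= self.no_instruction {
            InstructionClass::AlignedInstruction
        } else {
            InstructionClass::NoInstruction
        }
    }
}

/// Maps classifier scores to an action. Only a misaligned instruction whose
/// score reaches `quarantine_threshold` is withheld; a benign tool output that
/// merely looks like an instruction is annotated rather than blocked.
pub fn decide(scores: ClassScores, quarantine_threshold: f32) -> SanitizerAction {
    match scores.top_class() {
        InstructionClass::NoInstruction => SanitizerAction::PassThrough,
        InstructionClass::AlignedInstruction => SanitizerAction::Annotate,
        InstructionClass::MisalignedInstruction => {
            if scores.misaligned >= quarantine_threshold {
                SanitizerAction::Quarantine
            } else {
                SanitizerAction::Annotate
            }
        }
    }
}
```

Keeping the aligned-instruction class separate from the misaligned one is what allows instruction-like but benign outputs (grammar suggestions, API return messages) to pass through with an annotation instead of being counted as attacks.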
Applicability to Zeph
HIGH: Directly extends #2193 (MELON, arXiv:2502.05174). MELON measured FPR on static benign corpora; AlignSentinel specifically targets the false-positive failure mode where benign tool return values (exactly what Zeph's zeph-tools surfaces into context) look like instructions.
The alignment-awareness approach and published FPR breakdown are immediately usable to:
Implementation Sketch
zeph-sanitizer: distinguish misaligned vs. aligned instructions in tool output
References