Skip to content

feat: add 13 IPI Arena detection rules and LLM prompt enhancement#78

Merged
garagon merged 1 commit intomainfrom
feat/ipi-arena-rules
Mar 22, 2026
Merged

feat: add 13 IPI Arena detection rules and LLM prompt enhancement#78
garagon merged 1 commit intomainfrom
feat/ipi-arena-rules

Conversation

@garagon
Copy link
Copy Markdown
Contributor

@garagon garagon commented Mar 22, 2026

Summary

  • Add 13 detection rules (IPI-001 to IPI-013) based on attack strategies from the IPI Arena paper (arXiv:2603.15714, Gray Swan AI + OpenAI + Anthropic + Meta + UK AISI + NIST)
  • Enhance LLM analysis prompt with indirect prompt injection threat types (concealment, function hijack, IPI-specific calibration examples)
  • Add ipi-bench tool for reproducible benchmarking against the IPI Arena dataset

Detection Rules (rules/ipi-arena.yaml)

Rule Severity Attack Strategy
IPI-001 critical Fake chain-of-thought injection
IPI-002 high Reasoning suppression directive
IPI-003 high Reward and punishment manipulation
IPI-004 critical Fabricated authority XML tags
IPI-005 high Simulated environment framing (Holodeck cluster)
IPI-006 critical Protocol override with formal authority
IPI-007 high Visual ruleset injection
IPI-008 high Alignment mode switching
IPI-009 critical Concealment instruction
IPI-010 high Scripted output directive
IPI-011 medium Fake social proof and authority claims
IPI-012 high Fake error or system state manipulation
IPI-013 high Impersonated user confirmation

Benchmark Results (95 public IPI Arena attacks)

Layer Detected Rate
Deterministic (230 rules) 81/95 85.3%
LLM (Claude Sonnet 4.6) 14/14 remaining 100%
Combined 95/95 100%
False positives 0 0%

Test Plan

  • make build passes
  • make test passes (19 packages, 0 failures, race detector on)
  • make lint passes (0 issues)
  • make vet passes
  • IPI Arena benchmark: 85.3% deterministic, 100% combined
  • 0 false positives on benign test cases
  • 14 IPI benchmark cases added to internal/llm/benchmark_test.go

Ref: arXiv:2603.15714

Add 13 detection rules (IPI-001 to IPI-013) based on attack strategies
identified in "How Vulnerable Are AI Agents to Indirect Prompt
Injections?" (arXiv:2603.15714, Gray Swan AI et al., March 2026).

Rules cover: fake chain-of-thought injection, reasoning suppression,
reward/punishment manipulation, fabricated authority tags, simulated
environment framing, protocol override, visual ruleset injection,
alignment mode switching, concealment instructions, scripted output
directives, fake social proof, fake error states, and impersonated
user confirmations.

Benchmarked against 95 public IPI Arena attack strings:
- Deterministic layer: 85.3% detection (81/95)
- LLM layer (Claude Sonnet 4.6): catches remaining 14/14
- Combined pipeline: 95/95 (100%), 0 false positives

Also adds ipi-bench tool for reproducible benchmarking against the
IPI Arena dataset, and enhances the LLM analysis prompt with
indirect prompt injection threat types (concealment, function hijack).

Ref: arXiv:2603.15714
@garagon garagon merged commit f13e193 into main Mar 22, 2026
1 check passed
@garagon garagon deleted the feat/ipi-arena-rules branch March 22, 2026 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant