feat: add 13 IPI Arena detection rules and LLM prompt enhancement by garagon · Pull Request #78 · oktsec/oktsec

garagon · 2026-03-22T18:09:18Z

Summary

Add 13 detection rules (IPI-001 to IPI-013) based on attack strategies from the IPI Arena paper (arXiv:2603.15714, Gray Swan AI + OpenAI + Anthropic + Meta + UK AISI + NIST)
Enhance LLM analysis prompt with indirect prompt injection threat types (concealment, function hijack, IPI-specific calibration examples)
Add ipi-bench tool for reproducible benchmarking against the IPI Arena dataset

Detection Rules (rules/ipi-arena.yaml)

Rule	Severity	Attack Strategy
IPI-001	critical	Fake chain-of-thought injection
IPI-002	high	Reasoning suppression directive
IPI-003	high	Reward and punishment manipulation
IPI-004	critical	Fabricated authority XML tags
IPI-005	high	Simulated environment framing (Holodeck cluster)
IPI-006	critical	Protocol override with formal authority
IPI-007	high	Visual ruleset injection
IPI-008	high	Alignment mode switching
IPI-009	critical	Concealment instruction
IPI-010	high	Scripted output directive
IPI-011	medium	Fake social proof and authority claims
IPI-012	high	Fake error or system state manipulation
IPI-013	high	Impersonated user confirmation

Benchmark Results (95 public IPI Arena attacks)

Layer	Detected	Rate
Deterministic (230 rules)	81/95	85.3%
LLM (Claude Sonnet 4.6)	14/14 remaining	100%
Combined	95/95	100%
False positives	0	0%

Test Plan

make build passes
make test passes (19 packages, 0 failures, race detector on)
make lint passes (0 issues)
make vet passes
IPI Arena benchmark: 85.3% deterministic, 100% combined
0 false positives on benign test cases
14 IPI benchmark cases added to internal/llm/benchmark_test.go

Ref: arXiv:2603.15714

Add 13 detection rules (IPI-001 to IPI-013) based on attack strategies identified in "How Vulnerable Are AI Agents to Indirect Prompt Injections?" (arXiv:2603.15714, Gray Swan AI et al., March 2026). Rules cover: fake chain-of-thought injection, reasoning suppression, reward/punishment manipulation, fabricated authority tags, simulated environment framing, protocol override, visual ruleset injection, alignment mode switching, concealment instructions, scripted output directives, fake social proof, fake error states, and impersonated user confirmations. Benchmarked against 95 public IPI Arena attack strings: - Deterministic layer: 85.3% detection (81/95) - LLM layer (Claude Sonnet 4.6): catches remaining 14/14 - Combined pipeline: 95/95 (100%), 0 false positives Also adds ipi-bench tool for reproducible benchmarking against the IPI Arena dataset, and enhances the LLM analysis prompt with indirect prompt injection threat types (concealment, function hijack). Ref: arXiv:2603.15714

garagon merged commit f13e193 into main Mar 22, 2026
1 check passed

garagon deleted the feat/ipi-arena-rules branch March 22, 2026 18:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add 13 IPI Arena detection rules and LLM prompt enhancement#78

feat: add 13 IPI Arena detection rules and LLM prompt enhancement#78
garagon merged 1 commit intomainfrom
feat/ipi-arena-rules

garagon commented Mar 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garagon commented Mar 22, 2026

Summary

Detection Rules (rules/ipi-arena.yaml)

Benchmark Results (95 public IPI Arena attacks)

Test Plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant