Skip to content

fix(security): add advanced multi-turn attack detection#5924

Open
dan-redcupit wants to merge 1 commit intoopenclaw:mainfrom
dan-redcupit:fix/advanced-attack-detection
Open

fix(security): add advanced multi-turn attack detection#5924
dan-redcupit wants to merge 1 commit intoopenclaw:mainfrom
dan-redcupit:fix/advanced-attack-detection

Conversation

@dan-redcupit
Copy link

@dan-redcupit dan-redcupit commented Feb 1, 2026

Summary

Adds stateful detection for sophisticated multi-turn prompt injection attacks.

Part 3 of 3 from Operation CLAW FORTRESS security hardening (split from #5863 for easier review).

New Files

File Purpose
src/security/injection-detection.ts Attack detection logic
src/security/injection-detection.test.ts Comprehensive tests

Attack Types Detected

Type Description
many_shot 3+ examples in message building a pattern
crescendo Progressive trust-building across turns
persona_hijack DAN, roleplay, developer mode injection
cot_hijack Chain-of-thought manipulation
authority_spoof Fake [ADMIN], [SYSTEM] markers
false_memory Fabricated prior agreements
indirect Hidden in code/HTML comments

API

```typescript
// Quick check for obvious attacks
isLikelyAttack(content: string): boolean

// Full analysis with confidence scoring
detectAdvancedInjection(ctx: {
currentMessage: string;
recentHistory?: string[];
}): InjectionDetectionResult
```

ZeroLeaks Findings Addressed

  • Many-shot priming (3.2, 3.9)
  • Crescendo attacks (3.3, 3.10)
  • Persona injection (3.6, 4.1)
  • Authority spoofing (4.1)

Test Plan

  • Unit tests for all attack types
  • Multi-turn conversation tests
  • Regression tests with ZeroLeaks payloads

🔒 Generated with Claude Code

Greptile Overview

Greptile Summary

Adds a new src/security/injection-detection.ts module that detects several prompt-injection patterns (single-message and multi-turn), producing a detected flag, attackTypes, confidence, and human-readable details. Adds src/security/injection-detection.test.ts with unit/regression tests covering each attack type plus multi-turn scenarios, including a small suite of "ZeroLeaks" payload regressions.

This fits into the repo’s broader security hardening by providing a standalone classifier that callers can use either as a fast-path (isLikelyAttack) or a richer analysis (detectAdvancedInjection) that can incorporate recent conversation history.

Confidence Score: 3/5

  • Mostly safe to merge, but there is a real determinism bug risk in regex matching that could cause inconsistent detection results.
  • Core logic is straightforward and well-tested, but hasMatch relies on RegExp.test and some pattern sets include global regexes; this can lead to stateful lastIndex behavior and flaky/non-deterministic detections depending on call order.
  • src/security/injection-detection.ts

(5/5) You can turn off certain types of comments like style here!

Context used:

  • Context from dashboard - CLAUDE.md (source)
  • Context from dashboard - AGENTS.md (source)

New security module for stateful detection of sophisticated attacks:

- src/security/injection-detection.ts: detects many-shot, crescendo,
  persona hijack, CoT hijack, authority spoof, false memory, and
  indirect injection attacks
- src/security/injection-detection.test.ts: comprehensive tests with
  ZeroLeaks regression payloads

Features:
- Single message attack detection
- Multi-turn conversation analysis
- Confidence scoring based on attack severity
- Quick-check function for obvious attacks

Attack types detected:
- Many-shot priming (3+ examples building pattern)
- Crescendo (progressive trust-building)
- Persona hijack (DAN, roleplay injection)
- Chain-of-thought hijack
- Authority spoofing ([ADMIN], [SYSTEM])
- False memory (fabricated prior agreements)
- Indirect injection (hidden in code/HTML comments)

Part of Operation CLAW FORTRESS security hardening (ZeroLeaks remediation).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +133 to +135
function hasMatch(text: string, patterns: RegExp[]): boolean {
return patterns.some((p) => p.test(text));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P0] hasMatch uses RegExp.test on patterns that include the g flag (e.g., MANY_SHOT_INDICATORS), which mutates lastIndex and can cause subsequent test() calls to flip between true/false across messages. This makes detection non-deterministic depending on call order and prior inputs.

This matters when any ..._PATTERNS array accidentally includes a global regex (now or in future), and it already affects any future reuse of MANY_SHOT_INDICATORS via hasMatch.

A safer approach is to avoid /g in patterns used with test, or to reset lastIndex before testing, or to use text.match()/RegExp.exec without shared regex instances.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/injection-detection.ts
Line: 133:135

Comment:
[P0] `hasMatch` uses `RegExp.test` on patterns that include the `g` flag (e.g., `MANY_SHOT_INDICATORS`), which mutates `lastIndex` and can cause subsequent `test()` calls to flip between true/false across messages. This makes detection non-deterministic depending on call order and prior inputs.

This matters when any `..._PATTERNS` array accidentally includes a global regex (now or in future), and it already affects any future reuse of `MANY_SHOT_INDICATORS` via `hasMatch`.

A safer approach is to avoid `/g` in patterns used with `test`, or to reset `lastIndex` before testing, or to use `text.match()`/`RegExp.exec` without shared regex instances.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +122 to +125
const matches = text.match(pattern);
if (matches) {
count += matches.length;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] countMatches uses text.match(pattern) with patterns that have /g; this allocates arrays of all matches for every pattern. Since this code may run on user-provided content, it could become unnecessarily expensive for large messages/histories.

Consider using pattern.exec in a loop (or matchAll) to count without building a full array, especially for the multi-turn combinedText case.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/injection-detection.ts
Line: 122:125

Comment:
[P2] `countMatches` uses `text.match(pattern)` with patterns that have `/g`; this allocates arrays of all matches for every pattern. Since this code may run on user-provided content, it could become unnecessarily expensive for large messages/histories.

Consider using `pattern.exec` in a loop (or `matchAll`) to count without building a full array, especially for the multi-turn `combinedText` case.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +72 to +75
/\[(SYSTEM|ADMIN|ROOT|SUDO|OVERRIDE|DEVELOPER)(\s+\w+)?\]/i,
/\[ADMIN\s+OVERRIDE\]/i,
/\[SYSTEM\s+OVERRIDE\]/i,
/ADMIN\s+(COMMAND|OVERRIDE|ACCESS):/i,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P3] AUTHORITY_SPOOF_PATTERNS has overlapping entries: the first pattern /(SYSTEM|ADMIN|...)/ already matches [ADMIN OVERRIDE] and [SYSTEM OVERRIDE], making the two explicit override patterns redundant. Not a functional bug, but it increases maintenance cost and makes it harder to reason about coverage.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/injection-detection.ts
Line: 72:75

Comment:
[P3] `AUTHORITY_SPOOF_PATTERNS` has overlapping entries: the first pattern `/(SYSTEM|ADMIN|...)/` already matches `[ADMIN OVERRIDE]` and `[SYSTEM OVERRIDE]`, making the two explicit override patterns redundant. Not a functional bug, but it increases maintenance cost and makes it harder to reason about coverage.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +18 to +26
export type AttackType =
| "many_shot" // 3+ examples in message or across recent turns
| "crescendo" // Progressive deepening across turns
| "persona_hijack" // Roleplay/persona injection
| "cot_hijack" // Chain-of-thought manipulation
| "authority_spoof" // Fake system/admin messages
| "false_memory" // Fabricated prior agreements
| "indirect" // Hidden in documents/code/HTML
| "encoding_bypass"; // Obfuscation attempts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P3] AttackType includes "encoding_bypass", but there is no detection logic that can emit it. If this is intentional future work, consider omitting it until implemented (or add at least a stub detector + test) to avoid misleading downstream consumers.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/injection-detection.ts
Line: 18:26

Comment:
[P3] `AttackType` includes `"encoding_bypass"`, but there is no detection logic that can emit it. If this is intentional future work, consider omitting it until implemented (or add at least a stub detector + test) to avoid misleading downstream consumers.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@openclaw-barnacle
Copy link

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added the stale Marked as stale due to inactivity label Feb 15, 2026
@openclaw-barnacle openclaw-barnacle bot removed the stale Marked as stale due to inactivity label Feb 16, 2026
@mudrii

This comment was marked as spam.

@openclaw-barnacle
Copy link

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added the stale Marked as stale due to inactivity label Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale Marked as stale due to inactivity

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants