feat(security): add plugin output scanner for prompt injection detection#10559
feat(security): add plugin output scanner for prompt injection detection#10559DukeDeSouth wants to merge 3 commits intoopenclaw:mainfrom
Conversation
src/security/output-scanner.ts
Outdated
| for (const rule of RULES) { | ||
| const match = rule.pattern.exec(scanText); | ||
| if (!match) continue; |
There was a problem hiding this comment.
Only first match scanned
scanPluginOutput uses rule.pattern.exec(scanText) once per rule, so each rule can contribute at most one finding even if the output contains multiple occurrences of the same pattern. This makes findings incomplete and can also mis-order results (later matches of an earlier rule will never be reported). Consider iterating over all matches per rule (e.g., reset lastIndex and loop) or using a global regex strategy so every occurrence is collected.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/output-scanner.ts
Line: 257:259
Comment:
**Only first match scanned**
`scanPluginOutput` uses `rule.pattern.exec(scanText)` once per rule, so each rule can contribute at most one finding even if the output contains multiple occurrences of the same pattern. This makes `findings` incomplete and can also mis-order results (later matches of an earlier rule will never be reported). Consider iterating over all matches per rule (e.g., reset `lastIndex` and loop) or using a global regex strategy so every occurrence is collected.
How can I resolve this? If you propose a fix, please make it concise.| for (const rule of RULES) { | ||
| const match = rule.pattern.exec(scanText); | ||
| if (!match) continue; | ||
|
|
||
| // Skip matches inside code blocks | ||
| if (ignoreCodeBlocks && isInsideCodeBlock(match.index, codeSpans)) continue; | ||
|
|
There was a problem hiding this comment.
Global regex state leak
If any rule regex is later changed to include the g flag, RegExp.exec() becomes stateful via lastIndex, and repeated calls across different inputs can silently skip matches unless lastIndex is reset. Since RULES is a module-level constant reused across calls, this can introduce intermittent false negatives. Safer pattern is to either avoid exec() on shared regex instances or explicitly set rule.pattern.lastIndex = 0 before matching.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/output-scanner.ts
Line: 257:263
Comment:
**Global regex state leak**
If any rule regex is later changed to include the `g` flag, `RegExp.exec()` becomes stateful via `lastIndex`, and repeated calls across different inputs can silently skip matches unless `lastIndex` is reset. Since `RULES` is a module-level constant reused across calls, this can introduce intermittent false negatives. Safer pattern is to either avoid `exec()` on shared regex instances or explicitly set `rule.pattern.lastIndex = 0` before matching.
How can I resolve this? If you propose a fix, please make it concise.| const scanText = text.length > maxChars ? text.slice(0, maxChars) : text; | ||
|
|
||
| // Fast path: no suspicious keywords → clean | ||
| if (!hasAnyKeyword(scanText)) { | ||
| return { clean: true, findings: [], maxSeverity: undefined, scannedLength: scanText.length }; | ||
| } |
There was a problem hiding this comment.
Unbounded scan with NaN
maxChars is accepted as a number without validation. If a caller passes NaN, the comparison text.length > maxChars is always false, so the scan runs over the full (potentially huge) plugin output and defeats the "bounded scan time" guarantee. Consider normalizing maxChars to a finite integer (or falling back to the default) before slicing.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/output-scanner.ts
Line: 247:252
Comment:
**Unbounded scan with NaN**
`maxChars` is accepted as a number without validation. If a caller passes `NaN`, the comparison `text.length > maxChars` is always false, so the scan runs over the full (potentially huge) plugin output and defeats the "bounded scan time" guarantee. Consider normalizing `maxChars` to a finite integer (or falling back to the default) before slicing.
How can I resolve this? If you propose a fix, please make it concise.|
Addressing all 3 Greptile comments:
|
bfc1ccb to
f92900f
Compare
|
This pull request has been automatically marked as stale due to inactivity. |
Plugins return untrusted text that gets fed back into the LLM context. If a plugin response contains injection patterns, the model can be manipulated into ignoring its guidelines. `skill-scanner.ts` already scans plugin **code** for dangerous APIs, and `external-content.ts` wraps untrusted input with boundary markers. This PR adds `output-scanner.ts` — scanning the **text returned by plugins** for 15 OWASP LLM01-aligned prompt injection patterns: 5 critical (instruction override, role hijack, guideline disregard) 5 high (prompt extraction, hidden markers, data exfil, tool invocation) 3 medium (zero-width chars, ANSI escapes, base64 payloads) 2 low (jailbreak keywords, persona override) Features: - Prefilter with fast keyword check — skips regex when text is safe - Code block gating — ignores matches inside fenced blocks (docs/examples) - Configurable maxChars (default 64 KB) for bounded scan time - Structured findings with ruleId, severity, evidence, position - `hasInjection()` boolean guard for pipeline use Includes 35 vitest tests covering all severity levels, code block gating, edge cases, and false positive scenarios. Co-authored-by: Cursor <cursoragent@cursor.com>
…axChars Three fixes addressing review feedback: 1. Iterate all occurrences per rule instead of only reporting the first match. Uses a fresh global-flag regex copy per rule to collect every occurrence in the scanned text. 2. Reset `rule.pattern.lastIndex = 0` before matching to prevent state leaks from shared regex instances if flags are later changed to include `g`. 3. Validate `maxChars` against NaN/non-finite values — falls back to the default (64 KB) to guarantee bounded scan time. Addresses review feedback from greptile-apps. Co-authored-by: Cursor <cursoragent@cursor.com>
- Add MAX_MATCHES_PER_RULE (1000) to prevent DoS via crafted input with thousands of pattern repetitions - Protect against infinite loop on zero-width regex matches by advancing lastIndex when match.index === globalPattern.lastIndex - Add 4 hardening tests: - Multiple matches of same pattern are all collected - NaN/0/negative/Infinity maxChars falls back to default - No regex state leaks between consecutive scans - Match count capped at 1000 per rule Addresses Hive expert analysis recommendations. Co-authored-by: Cursor <cursoragent@cursor.com>
e8e12be to
61f27c4
Compare
|
Still actively maintained — rebased onto current main. All three Greptile points were already handled in the implementation:
|
|
did you actually try this with latest-gen models? they aren't as easily fooled. I question if that really helps at all. |
|
This pull request has been automatically marked as stale due to inactivity. |
Human View
Summary
Plugins return untrusted text that gets fed back into the LLM context. If a plugin response contains injection patterns, the model can be manipulated into ignoring its guidelines.
OpenClaw already has:
skill-scanner.ts— scans plugin code for dangerous APIs (eval, exec, etc.)external-content.ts— wraps untrusted input with boundary markers +detectSuspiciousPatterns()This PR adds the missing piece:
output-scanner.ts— scanning the text returned by plugins for prompt injection patterns before it enters the LLM context.15 OWASP LLM01-aligned patterns
[SYSTEM],<|im_start|>), data exfil, tool invocationKey features
```fenced blocks (reduces false positives on docs/examples){ ruleId, name, severity, evidence, position }hasInjection(text): boolean guard for pipeline uselistScanRules(): introspection for documentation/admin UIUsage
What this does NOT change
skill-scanner.ts,external-content.ts)Test plan
output-scanner.test.tsignoreCodeBlocks: falseoptionmaxCharstruncationhasInjection()helperlistScanRules()introspectionAI View (DCCE Protocol v1.0)
Metadata
AI Contribution Summary
Verification Steps Performed
Human Review Guidance
skill-scanner.ts,external-content.ts,output-scanner.tsMade with M7 Cursor
Greptile Overview
Greptile Summary
src/security/output-scanner.tsmodule that scans untrusted plugin-returned text for OWASP LLM01-aligned prompt injection patterns, with optional code-block gating and a max-length cap.scanPluginOutput) plus convenience helpers (hasInjection,listScanRules).src/security/output-scanner.test.ts) covering rule detection, options, and a few edge cases.skill-scannerfor plugin code andexternal-contentfor boundary-wrapping).Confidence Score: 3/5
scanPluginOutputcurrently collects only a single match per rule, relies on shared RegExp objects (futureg-flag changes could cause stateful false negatives), and themaxCharscap can be bypassed by passingNaN. These are fixable but should be addressed before relying on the scanner for security gating.(4/5) You can add custom instructions or style guidelines for the agent here!