Skip to content

feat(security): add plugin output scanner for prompt injection detection#10559

Open
DukeDeSouth wants to merge 3 commits intoopenclaw:mainfrom
DukeDeSouth:feat/plugin-output-scanner
Open

feat(security): add plugin output scanner for prompt injection detection#10559
DukeDeSouth wants to merge 3 commits intoopenclaw:mainfrom
DukeDeSouth:feat/plugin-output-scanner

Conversation

@DukeDeSouth
Copy link
Contributor

@DukeDeSouth DukeDeSouth commented Feb 6, 2026

Human View

Summary

Plugins return untrusted text that gets fed back into the LLM context. If a plugin response contains injection patterns, the model can be manipulated into ignoring its guidelines.

OpenClaw already has:

  • skill-scanner.ts — scans plugin code for dangerous APIs (eval, exec, etc.)
  • external-content.ts — wraps untrusted input with boundary markers + detectSuspiciousPatterns()

This PR adds the missing piece: output-scanner.ts — scanning the text returned by plugins for prompt injection patterns before it enters the LLM context.

15 OWASP LLM01-aligned patterns

Severity Count Examples
Critical 5 Instruction override, role hijack, guideline disregard, forget instructions
High 5 Prompt extraction, hidden markers ([SYSTEM], <|im_start|>), data exfil, tool invocation
Medium 3 Zero-width chars, ANSI escapes, base64 payload execution
Low 2 Jailbreak keywords (DAN), persona override

Key features

  • Prefilter: fast keyword indexOf check before running regex — O(1) for clean output
  • Code block gating: ignores matches inside ``` fenced blocks (reduces false positives on docs/examples)
  • Configurable maxChars: default 64 KB, prevents unbounded scan time
  • Structured findings: { ruleId, name, severity, evidence, position }
  • hasInjection(text): boolean guard for pipeline use
  • listScanRules(): introspection for documentation/admin UI

Usage

import { scanPluginOutput, hasInjection } from "./output-scanner.js";

// Structured scan
const result = scanPluginOutput(pluginResponse);
if (!result.clean) {
  console.warn(`${result.findings.length} threats (max: ${result.maxSeverity})`);
  // block, sanitize, or flag
}

// Quick guard
if (hasInjection(pluginResponse)) {
  throw new Error("Plugin output contains injection");
}

What this does NOT change

  • No modifications to existing files (skill-scanner.ts, external-content.ts)
  • No breaking changes — purely additive new file + tests
  • Complements existing security infrastructure

Test plan

  • 35 vitest tests in output-scanner.test.ts
  • Clean output (normal text, code, JSON, empty string)
  • All 15 rules tested individually by severity
  • Multiple simultaneous threats + position sorting
  • Code block gating (injection inside ``` blocks ignored)
  • ignoreCodeBlocks: false option
  • maxChars truncation
  • Edge cases: very long input, case insensitivity, evidence truncation
  • hasInjection() helper
  • listScanRules() introspection

AI View (DCCE Protocol v1.0)

Metadata

  • Generator: Claude (Anthropic) via Cursor IDE
  • Methodology: AI-assisted development with human oversight and review

AI Contribution Summary

  • Solution design and implementation
  • Test development (35 test cases)

Verification Steps Performed

  1. Analyzed existing codebase patterns
  2. Implemented feature with comprehensive tests
  3. Ran test suite (35 tests passing)

Human Review Guidance

  • Core changes are in: skill-scanner.ts, external-content.ts, output-scanner.ts
  • Verify test coverage matches the described scenarios

Made with M7 Cursor

Greptile Overview

Greptile Summary

  • Adds a new src/security/output-scanner.ts module that scans untrusted plugin-returned text for OWASP LLM01-aligned prompt injection patterns, with optional code-block gating and a max-length cap.
  • Exposes a structured scan API (scanPluginOutput) plus convenience helpers (hasInjection, listScanRules).
  • Introduces a dedicated vitest suite (src/security/output-scanner.test.ts) covering rule detection, options, and a few edge cases.
  • Fits alongside existing security tooling by targeting plugin output (vs. skill-scanner for plugin code and external-content for boundary-wrapping).

Confidence Score: 3/5

  • This PR is directionally safe, but a few scanner logic edge cases can cause missed or incomplete findings.
  • Core idea and tests are straightforward and additive, but scanPluginOutput currently collects only a single match per rule, relies on shared RegExp objects (future g-flag changes could cause stateful false negatives), and the maxChars cap can be bypassed by passing NaN. These are fixable but should be addressed before relying on the scanner for security gating.
  • src/security/output-scanner.ts

(4/5) You can add custom instructions or style guidelines for the agent here!

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +257 to +259
for (const rule of RULES) {
const match = rule.pattern.exec(scanText);
if (!match) continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only first match scanned

scanPluginOutput uses rule.pattern.exec(scanText) once per rule, so each rule can contribute at most one finding even if the output contains multiple occurrences of the same pattern. This makes findings incomplete and can also mis-order results (later matches of an earlier rule will never be reported). Consider iterating over all matches per rule (e.g., reset lastIndex and loop) or using a global regex strategy so every occurrence is collected.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/output-scanner.ts
Line: 257:259

Comment:
**Only first match scanned**

`scanPluginOutput` uses `rule.pattern.exec(scanText)` once per rule, so each rule can contribute at most one finding even if the output contains multiple occurrences of the same pattern. This makes `findings` incomplete and can also mis-order results (later matches of an earlier rule will never be reported). Consider iterating over all matches per rule (e.g., reset `lastIndex` and loop) or using a global regex strategy so every occurrence is collected.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +257 to +263
for (const rule of RULES) {
const match = rule.pattern.exec(scanText);
if (!match) continue;

// Skip matches inside code blocks
if (ignoreCodeBlocks && isInsideCodeBlock(match.index, codeSpans)) continue;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Global regex state leak

If any rule regex is later changed to include the g flag, RegExp.exec() becomes stateful via lastIndex, and repeated calls across different inputs can silently skip matches unless lastIndex is reset. Since RULES is a module-level constant reused across calls, this can introduce intermittent false negatives. Safer pattern is to either avoid exec() on shared regex instances or explicitly set rule.pattern.lastIndex = 0 before matching.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/output-scanner.ts
Line: 257:263

Comment:
**Global regex state leak**

If any rule regex is later changed to include the `g` flag, `RegExp.exec()` becomes stateful via `lastIndex`, and repeated calls across different inputs can silently skip matches unless `lastIndex` is reset. Since `RULES` is a module-level constant reused across calls, this can introduce intermittent false negatives. Safer pattern is to either avoid `exec()` on shared regex instances or explicitly set `rule.pattern.lastIndex = 0` before matching.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +247 to +252
const scanText = text.length > maxChars ? text.slice(0, maxChars) : text;

// Fast path: no suspicious keywords → clean
if (!hasAnyKeyword(scanText)) {
return { clean: true, findings: [], maxSeverity: undefined, scannedLength: scanText.length };
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded scan with NaN

maxChars is accepted as a number without validation. If a caller passes NaN, the comparison text.length > maxChars is always false, so the scan runs over the full (potentially huge) plugin output and defeats the "bounded scan time" guarantee. Consider normalizing maxChars to a finite integer (or falling back to the default) before slicing.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/output-scanner.ts
Line: 247:252

Comment:
**Unbounded scan with NaN**

`maxChars` is accepted as a number without validation. If a caller passes `NaN`, the comparison `text.length > maxChars` is always false, so the scan runs over the full (potentially huge) plugin output and defeats the "bounded scan time" guarantee. Consider normalizing `maxChars` to a finite integer (or falling back to the default) before slicing.

How can I resolve this? If you propose a fix, please make it concise.

@DukeDeSouth
Copy link
Contributor Author

Addressing all 3 Greptile comments:

  1. Only first match scanned: Already handled — lines 266-286 create a fresh global-flag regex copy and iterate all matches with a while loop (capped at MAX_MATCHES_PER_RULE = 1000).

  2. Global regex state leak: Already handled — line 262 resets rule.pattern.lastIndex = 0, and line 268 creates a new RegExp() copy so the shared instance is never mutated.

  3. Unbounded scan with NaN: Already handled — line 247: Number.isFinite(rawMaxChars) && rawMaxChars > 0 ? rawMaxChars : DEFAULT_MAX_CHARS.

@openclaw-barnacle
Copy link

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added the stale Marked as stale due to inactivity label Feb 21, 2026
DukeDeSouth and others added 3 commits February 21, 2026 08:42
Plugins return untrusted text that gets fed back into the LLM context.
If a plugin response contains injection patterns, the model can be
manipulated into ignoring its guidelines.

`skill-scanner.ts` already scans plugin **code** for dangerous APIs,
and `external-content.ts` wraps untrusted input with boundary markers.
This PR adds `output-scanner.ts` — scanning the **text returned by
plugins** for 15 OWASP LLM01-aligned prompt injection patterns:

  5 critical  (instruction override, role hijack, guideline disregard)
  5 high      (prompt extraction, hidden markers, data exfil, tool invocation)
  3 medium    (zero-width chars, ANSI escapes, base64 payloads)
  2 low       (jailbreak keywords, persona override)

Features:
- Prefilter with fast keyword check — skips regex when text is safe
- Code block gating — ignores matches inside fenced blocks (docs/examples)
- Configurable maxChars (default 64 KB) for bounded scan time
- Structured findings with ruleId, severity, evidence, position
- `hasInjection()` boolean guard for pipeline use

Includes 35 vitest tests covering all severity levels, code block gating,
edge cases, and false positive scenarios.

Co-authored-by: Cursor <cursoragent@cursor.com>
…axChars

Three fixes addressing review feedback:

1. Iterate all occurrences per rule instead of only reporting the first
   match. Uses a fresh global-flag regex copy per rule to collect every
   occurrence in the scanned text.

2. Reset `rule.pattern.lastIndex = 0` before matching to prevent state
   leaks from shared regex instances if flags are later changed to
   include `g`.

3. Validate `maxChars` against NaN/non-finite values — falls back to
   the default (64 KB) to guarantee bounded scan time.

Addresses review feedback from greptile-apps.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Add MAX_MATCHES_PER_RULE (1000) to prevent DoS via crafted input
  with thousands of pattern repetitions
- Protect against infinite loop on zero-width regex matches by
  advancing lastIndex when match.index === globalPattern.lastIndex
- Add 4 hardening tests:
  - Multiple matches of same pattern are all collected
  - NaN/0/negative/Infinity maxChars falls back to default
  - No regex state leaks between consecutive scans
  - Match count capped at 1000 per rule

Addresses Hive expert analysis recommendations.

Co-authored-by: Cursor <cursoragent@cursor.com>
@DukeDeSouth DukeDeSouth force-pushed the feat/plugin-output-scanner branch from e8e12be to 61f27c4 Compare February 21, 2026 13:42
@DukeDeSouth
Copy link
Contributor Author

DukeDeSouth commented Feb 21, 2026

Still actively maintained — rebased onto current main.

All three Greptile points were already handled in the implementation:

  1. Only first match scanned: Lines 266-286 create a fresh global-flag regex copy and iterate all matches with a while loop (capped at MAX_MATCHES_PER_RULE = 1000).
  2. Global regex state leak: Line 262 resets rule.pattern.lastIndex = 0, and line 268 creates a new RegExp() copy so the shared instance is never mutated.
  3. Unbounded scan with NaN: Line 246-247 validates maxChars with Number.isFinite(rawMaxChars) && rawMaxChars > 0, falling back to DEFAULT_MAX_CHARS (65536) otherwise. NaN, Infinity, and negative values all get caught.

@openclaw-barnacle openclaw-barnacle bot added size: L and removed stale Marked as stale due to inactivity labels Feb 21, 2026
@steipete
Copy link
Contributor

did you actually try this with latest-gen models? they aren't as easily fooled. I question if that really helps at all.
Looking forward for real-world results.

@openclaw-barnacle
Copy link

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added the stale Marked as stale due to inactivity label Mar 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size: L stale Marked as stale due to inactivity

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants