feat(security): add plugin output scanner for prompt injection detection by DukeDeSouth · Pull Request #10559 · openclaw/openclaw

DukeDeSouth · 2026-02-06T17:27:18Z

Human View

Summary

Plugins return untrusted text that gets fed back into the LLM context. If a plugin response contains injection patterns, the model can be manipulated into ignoring its guidelines.

OpenClaw already has:

skill-scanner.ts — scans plugin code for dangerous APIs (eval, exec, etc.)
external-content.ts — wraps untrusted input with boundary markers + detectSuspiciousPatterns()

This PR adds the missing piece: output-scanner.ts — scanning the text returned by plugins for prompt injection patterns before it enters the LLM context.

15 OWASP LLM01-aligned patterns

| Severity | Count | Examples |
|---|---|---|
| Critical | 5 | Instruction override, role hijack, guideline disregard, forget instructions |
| High | 5 | Prompt extraction, hidden markers (`[SYSTEM]`, `<\|im_start\|>`), data exfil, tool invocation |
| Medium | 3 | Zero-width chars, ANSI escapes, base64 payload execution |
| Low | 2 | Jailbreak keywords (DAN), persona override |
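
For orientation, a rule entry for a few of these categories might look like the sketch below. The rule IDs, the regexes, and the `SEVERITY_ORDER` ranking are assumptions made for illustration; they are not the patterns shipped in `output-scanner.ts`.

```typescript
type Severity = "low" | "medium" | "high" | "critical";

interface ScanRule {
  ruleId: string;
  name: string;
  severity: Severity;
  pattern: RegExp;
}

// Hypothetical rules; the real RULES table in output-scanner.ts may differ.
const EXAMPLE_RULES: ScanRule[] = [
  {
    ruleId: "example-zero-width",
    name: "Zero-width characters",
    severity: "medium",
    pattern: /[\u200B\u200C\u200D\uFEFF]/,
  },
  {
    ruleId: "example-ansi-escape",
    name: "ANSI escape sequence",
    severity: "medium",
    pattern: /\u001B\[[0-9;]*[A-Za-z]/,
  },
  {
    ruleId: "example-hidden-marker",
    name: "Hidden system marker",
    severity: "high",
    pattern: /\[SYSTEM\]|<\|im_start\|>/i,
  },
];

// Assumed ranking, lowest to highest.
const SEVERITY_ORDER: Severity[] = ["low", "medium", "high", "critical"];

// Returns the highest severity among matching rules, or undefined if none match.
function maxSeverityOf(text: string): Severity | undefined {
  let max: Severity | undefined;
  for (const rule of EXAMPLE_RULES) {
    if (!rule.pattern.test(text)) continue; // patterns carry no g flag, so test() is stateless
    if (max === undefined || SEVERITY_ORDER.indexOf(rule.severity) > SEVERITY_ORDER.indexOf(max)) {
      max = rule.severity;
    }
  }
  return max;
}
```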

Key features

Prefilter: fast keyword indexOf check before running regex; clean output takes a single linear pass and skips the regex rules entirely
Code block gating: ignores matches inside ``` fenced blocks (reduces false positives on docs/examples)
Configurable maxChars: default 64 KB, prevents unbounded scan time
Structured findings: { ruleId, name, severity, evidence, position }
hasInjection(text): boolean guard for pipeline use
listScanRules(): introspection for documentation/admin UI
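
As a rough illustration of the prefilter and code block gating listed above: `hasAnyKeyword` and `isInsideCodeBlock` are names that appear in the diff, but the bodies below, the `KEYWORDS` list, and the `findCodeSpans` helper are assumptions, not the PR's implementation.

```typescript
// Hypothetical keyword list; the real prefilter keywords are not shown on this page.
const KEYWORDS = ["ignore", "system", "instructions"];

// Cheap linear pass: only run the regex rules if at least one keyword is present.
function hasAnyKeyword(text: string): boolean {
  const lower = text.toLowerCase();
  for (const keyword of KEYWORDS) {
    if (lower.indexOf(keyword) !== -1) return true;
  }
  return false;
}

type Span = { start: number; end: number };

// Three backticks, built by concatenation so this example nests cleanly in a fence.
const FENCE = "`" + "`" + "`";

// Collects [start, end) ranges of fenced blocks delimited by FENCE.
function findCodeSpans(text: string): Span[] {
  const spans: Span[] = [];
  let from = 0;
  while (true) {
    const open = text.indexOf(FENCE, from);
    if (open === -1) break;
    const close = text.indexOf(FENCE, open + FENCE.length);
    // An unterminated fence is treated as ordinary text, so a lone fence
    // cannot be used to hide everything that follows it from the scanner.
    if (close === -1) break;
    spans.push({ start: open, end: close + FENCE.length });
    from = close + FENCE.length;
  }
  return spans;
}

// True if a match starting at `index` falls inside any fenced span.
function isInsideCodeBlock(index: number, spans: Span[]): boolean {
  for (const span of spans) {
    if (index >= span.start && index < span.end) return true;
  }
  return false;
}
```

Treating an unterminated fence as plain text is a deliberate choice in this sketch: the gating exists to reduce false positives, and it should not become a way to switch the scanner off.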

Usage

import { scanPluginOutput, hasInjection } from "./output-scanner.js";

// Structured scan
const result = scanPluginOutput(pluginResponse);
if (!result.clean) {
  console.warn(`${result.findings.length} threats (max: ${result.maxSeverity})`);
  // block, sanitize, or flag
}

// Quick guard
if (hasInjection(pluginResponse)) {
  throw new Error("Plugin output contains injection");
}
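
The `// block, sanitize, or flag` step is left to the caller. One possible policy is sketched below with locally declared types that mirror the fields named in this description. The `Action` names and the severity-to-action mapping are assumptions, not part of the PR.

```typescript
type Severity = "low" | "medium" | "high" | "critical";

interface Finding {
  ruleId: string;
  name: string;
  severity: Severity;
  evidence: string;
  position: number;
}

// Shape inferred from the PR text; the exported type may differ.
interface ScanResult {
  clean: boolean;
  findings: Finding[];
  maxSeverity: Severity | undefined;
  scannedLength: number;
}

type Action = "allow" | "flag" | "sanitize" | "block";

// Hypothetical policy: stricter handling as severity rises.
const POLICY: Record<Severity, Action> = {
  low: "flag",
  medium: "sanitize",
  high: "block",
  critical: "block",
};

function decideAction(result: ScanResult): Action {
  if (result.clean || result.maxSeverity === undefined) return "allow";
  return POLICY[result.maxSeverity];
}
```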

What this does NOT change

No modifications to existing files (skill-scanner.ts, external-content.ts)
No breaking changes — purely additive new file + tests
Complements existing security infrastructure

Test plan

AI View (DCCE Protocol v1.0)

Metadata

Generator: Claude (Anthropic) via Cursor IDE
Methodology: AI-assisted development with human oversight and review

AI Contribution Summary

Solution design and implementation
Test development (35 test cases)

Verification Steps Performed

Analyzed existing codebase patterns
Implemented feature with comprehensive tests
Ran test suite (35 tests passing)

Human Review Guidance

Core changes are in: output-scanner.ts (new file) and its test suite; skill-scanner.ts and external-content.ts are not modified
Verify test coverage matches the described scenarios

Made with Cursor

Greptile Overview

Greptile Summary

Adds a new src/security/output-scanner.ts module that scans untrusted plugin-returned text for OWASP LLM01-aligned prompt injection patterns, with optional code-block gating and a max-length cap.
Exposes a structured scan API (scanPluginOutput) plus convenience helpers (hasInjection, listScanRules).
Introduces a dedicated vitest suite (src/security/output-scanner.test.ts) covering rule detection, options, and a few edge cases.
Fits alongside existing security tooling by targeting plugin output (vs. skill-scanner for plugin code and external-content for boundary-wrapping).

Confidence Score: 3/5

This PR is directionally safe, but a few scanner logic edge cases can cause missed or incomplete findings.
Core idea and tests are straightforward and additive, but scanPluginOutput currently collects only a single match per rule, relies on shared RegExp objects (future g-flag changes could cause stateful false negatives), and the maxChars cap can be bypassed by passing NaN. These are fixable but should be addressed before relying on the scanner for security gating.
src/security/output-scanner.ts


greptile-apps

1 file reviewed, 3 comments


greptile-apps · 2026-02-06T17:33:49Z

src/security/output-scanner.ts

+  for (const rule of RULES) {
+    const match = rule.pattern.exec(scanText);
+    if (!match) continue;


Only first match scanned

scanPluginOutput uses rule.pattern.exec(scanText) once per rule, so each rule can contribute at most one finding even if the output contains multiple occurrences of the same pattern. This makes findings incomplete and can also mis-order results (later matches of an earlier rule will never be reported). Consider iterating over all matches per rule (e.g., reset lastIndex and loop) or using a global regex strategy so every occurrence is collected.
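
One way to act on this suggestion is sketched below: copy each rule's pattern with the `g` flag, collect every occurrence, then sort by position so results follow document order rather than rule order. The `DEMO_RULES` entries and the `collectAllFindings` name are hypothetical.

```typescript
interface Finding {
  ruleId: string;
  evidence: string;
  position: number;
}

interface Rule {
  ruleId: string;
  pattern: RegExp;
}

// Hypothetical rules for demonstration only.
const DEMO_RULES: Rule[] = [
  { ruleId: "demo-override", pattern: /ignore (?:all )?previous instructions/i },
  { ruleId: "demo-marker", pattern: /\[SYSTEM\]/ },
];

function collectAllFindings(text: string, rules: Rule[]): Finding[] {
  const findings: Finding[] = [];
  for (const rule of rules) {
    // Work on a global-flag copy so the shared rule pattern is never mutated.
    const flags = rule.pattern.flags.indexOf("g") === -1 ? rule.pattern.flags + "g" : rule.pattern.flags;
    const globalPattern = new RegExp(rule.pattern.source, flags);
    while (true) {
      const match = globalPattern.exec(text);
      if (match === null) break;
      findings.push({ ruleId: rule.ruleId, evidence: match[0], position: match.index });
      // An empty match does not advance lastIndex; step forward manually.
      if (match[0].length === 0) globalPattern.lastIndex++;
    }
  }
  // Report in document order, regardless of which rule produced the finding.
  findings.sort((a, b) => a.position - b.position);
  return findings;
}
```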


greptile-apps · 2026-02-06T17:33:50Z

src/security/output-scanner.ts

+  for (const rule of RULES) {
+    const match = rule.pattern.exec(scanText);
+    if (!match) continue;
+
+    // Skip matches inside code blocks
+    if (ignoreCodeBlocks && isInsideCodeBlock(match.index, codeSpans)) continue;
+


Global regex state leak

If any rule regex is later changed to include the g flag, RegExp.exec() becomes stateful via lastIndex, and repeated calls across different inputs can silently skip matches unless lastIndex is reset. Since RULES is a module-level constant reused across calls, this can introduce intermittent false negatives. Safer pattern is to either avoid exec() on shared regex instances or explicitly set rule.pattern.lastIndex = 0 before matching.
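
The hazard is easy to reproduce in isolation. The sketch below uses a made-up shared pattern and two hypothetical helpers to show how a `g`-flagged regex carries `lastIndex` from one input to the next, and how resetting it avoids the miss.

```typescript
// A shared, module-level pattern with the g flag (the situation described above).
const sharedGlobal = /\[SYSTEM\]/g;

// Unsafe: exec() resumes from wherever the previous call left lastIndex.
function unsafeHasMarker(text: string): boolean {
  return sharedGlobal.exec(text) !== null;
}

// Safe: always start matching from the beginning of the new input.
function safeHasMarker(text: string): boolean {
  sharedGlobal.lastIndex = 0;
  return sharedGlobal.exec(text) !== null;
}

const first = unsafeHasMarker("xxxxxxxx [SYSTEM]"); // true; lastIndex is now 17
const second = unsafeHasMarker("[SYSTEM] hi"); // false; the search started past the marker
```

The second call is a false negative even though the marker sits at index 0, which is the intermittent failure mode the comment warns about.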


greptile-apps · 2026-02-06T17:33:51Z

src/security/output-scanner.ts

+  const scanText = text.length > maxChars ? text.slice(0, maxChars) : text;
+
+  // Fast path: no suspicious keywords → clean
+  if (!hasAnyKeyword(scanText)) {
+    return { clean: true, findings: [], maxSeverity: undefined, scannedLength: scanText.length };
+  }


Unbounded scan with NaN

maxChars is accepted as a number without validation. If a caller passes NaN, the comparison text.length > maxChars is always false, so the scan runs over the full (potentially huge) plugin output and defeats the "bounded scan time" guarantee. Consider normalizing maxChars to a finite integer (or falling back to the default) before slicing.
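
A minimal sketch of the suggested normalization follows. `DEFAULT_MAX_CHARS` and `rawMaxChars` are names used later in this thread; `normalizeMaxChars` and `boundedText` are hypothetical helpers introduced only for this example.

```typescript
const DEFAULT_MAX_CHARS = 64 * 1024; // 65536, the default stated in the PR description

// Falls back to the default for NaN, Infinity, zero, negatives, and non-numbers.
function normalizeMaxChars(rawMaxChars: unknown): number {
  if (typeof rawMaxChars !== "number") return DEFAULT_MAX_CHARS;
  return Number.isFinite(rawMaxChars) && rawMaxChars > 0 ? rawMaxChars : DEFAULT_MAX_CHARS;
}

// Applies the cap only after the limit has been validated.
function boundedText(text: string, rawMaxChars?: number): string {
  const maxChars = normalizeMaxChars(rawMaxChars);
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}
```

Without the validation step, `text.length > NaN` evaluates to `false` for every input, so the slice never happens.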


DukeDeSouth · 2026-02-07T04:11:02Z

Addressing all 3 Greptile comments:

Only first match scanned: Already handled — lines 266-286 create a fresh global-flag regex copy and iterate all matches with a while loop (capped at MAX_MATCHES_PER_RULE = 1000).
Global regex state leak: Already handled — line 262 resets rule.pattern.lastIndex = 0, and line 268 creates a new RegExp() copy so the shared instance is never mutated.
Unbounded scan with NaN: Already handled — line 247: Number.isFinite(rawMaxChars) && rawMaxChars > 0 ? rawMaxChars : DEFAULT_MAX_CHARS.

openclaw-barnacle · 2026-02-21T04:31:28Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

Plugins return untrusted text that gets fed back into the LLM context. If a plugin response contains injection patterns, the model can be manipulated into ignoring its guidelines. `skill-scanner.ts` already scans plugin **code** for dangerous APIs, and `external-content.ts` wraps untrusted input with boundary markers.

This PR adds `output-scanner.ts`, which scans the **text returned by plugins** for 15 OWASP LLM01-aligned prompt injection patterns:

- 5 critical (instruction override, role hijack, guideline disregard)
- 5 high (prompt extraction, hidden markers, data exfil, tool invocation)
- 3 medium (zero-width chars, ANSI escapes, base64 payloads)
- 2 low (jailbreak keywords, persona override)

Features:

- Prefilter with fast keyword check: skips regex when text is safe
- Code block gating: ignores matches inside fenced blocks (docs/examples)
- Configurable maxChars (default 64 KB) for bounded scan time
- Structured findings with ruleId, severity, evidence, position
- `hasInjection()` boolean guard for pipeline use

Includes 35 vitest tests covering all severity levels, code block gating, edge cases, and false positive scenarios.

Co-authored-by: Cursor <cursoragent@cursor.com>

…axChars

Three fixes addressing review feedback:

1. Iterate all occurrences per rule instead of only reporting the first match. Uses a fresh global-flag regex copy per rule to collect every occurrence in the scanned text.
2. Reset `rule.pattern.lastIndex = 0` before matching to prevent state leaks from shared regex instances if flags are later changed to include `g`.
3. Validate `maxChars` against NaN/non-finite values; falls back to the default (64 KB) to guarantee bounded scan time.

Addresses review feedback from greptile-apps.

Co-authored-by: Cursor <cursoragent@cursor.com>

- Add MAX_MATCHES_PER_RULE (1000) to prevent DoS via crafted input with thousands of pattern repetitions
- Protect against infinite loop on zero-width regex matches by advancing lastIndex when match.index === globalPattern.lastIndex
- Add 4 hardening tests:
  - Multiple matches of same pattern are all collected
  - NaN/0/negative/Infinity maxChars falls back to default
  - No regex state leaks between consecutive scans
  - Match count capped at 1000 per rule

Addresses Hive expert analysis recommendations.

Co-authored-by: Cursor <cursoragent@cursor.com>
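
The two hardening measures in this commit message can be sketched as follows. `MAX_MATCHES_PER_RULE` and the `match.index === globalPattern.lastIndex` check come from the text above; the `findCapped` function and its signature are assumptions for illustration.

```typescript
const MAX_MATCHES_PER_RULE = 1000; // value taken from the commit message

// Returns match positions for one pattern, stopping at `cap` matches.
function findCapped(text: string, pattern: RegExp, cap: number = MAX_MATCHES_PER_RULE): number[] {
  // Fresh global-flag copy so the caller's pattern is left untouched.
  const flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  const globalPattern = new RegExp(pattern.source, flags);
  const positions: number[] = [];
  while (positions.length < cap) {
    const match = globalPattern.exec(text);
    if (match === null) break;
    positions.push(match.index);
    // A zero-width match leaves lastIndex equal to match.index;
    // advance by one so the loop cannot spin on the same position.
    if (match.index === globalPattern.lastIndex) globalPattern.lastIndex++;
  }
  return positions;
}
```

The cap bounds the size of the findings array for adversarial input, while the zero-width guard bounds the loop itself.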

DukeDeSouth · 2026-02-21T13:42:37Z

Still actively maintained — rebased onto current main.

All three Greptile points were already handled in the implementation:

Only first match scanned: Lines 266-286 create a fresh global-flag regex copy and iterate all matches with a while loop (capped at MAX_MATCHES_PER_RULE = 1000).
Global regex state leak: Line 262 resets rule.pattern.lastIndex = 0, and line 268 creates a new RegExp() copy so the shared instance is never mutated.
Unbounded scan with NaN: Lines 246-247 validate maxChars with Number.isFinite(rawMaxChars) && rawMaxChars > 0, falling back to DEFAULT_MAX_CHARS (65536) otherwise. NaN, Infinity, and negative values all get caught.

steipete · 2026-02-25T04:58:04Z

Did you actually try this with latest-gen models? They aren't as easily fooled. I question whether this really helps at all.
Looking forward to real-world results.

openclaw-barnacle · 2026-03-08T04:09:28Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

greptile-apps bot reviewed Feb 6, 2026


Reapor-Yurnero mentioned this pull request Feb 9, 2026

feat(gateway): support modular guardrails extensions for securing against indirect prompt injections and other agentic threats #6095

Closed

thewilloftheshadow force-pushed the main branch from bfc1ccb to f92900f Compare February 15, 2026 18:46

openclaw-barnacle bot added the stale Marked as stale due to inactivity label Feb 21, 2026

DukeDeSouth and others added 3 commits February 21, 2026 08:42

DukeDeSouth force-pushed the feat/plugin-output-scanner branch from e8e12be to 61f27c4 Compare February 21, 2026 13:42

openclaw-barnacle bot added size: L and removed stale Marked as stale due to inactivity labels Feb 21, 2026

openclaw-barnacle bot added the stale Marked as stale due to inactivity label Mar 8, 2026

DukeDeSouth wants to merge 3 commits into openclaw:main from DukeDeSouth:feat/plugin-output-scanner
