security: extend skill scanner to detect threats in markdown skill definitions#10705
security: extend skill scanner to detect threats in markdown skill definitions#10705Alex-Alaniz wants to merge 1 commit into
Conversation
…finitions The skill scanner previously only analyzed JS/TS files for malicious patterns, leaving SKILL.md files completely unscanned. This extends coverage to markdown files with rules targeting embedded technical threats: - Hidden Unicode (zero-width, bidi overrides) used to conceal malicious content - Download-and-execute patterns (curl|wget piped to shell) - Large base64 payloads in code fences (obfuscated content) - Hex-encoded payloads in markdown body - Data URIs with executable MIME types Code rules and markdown rules are applied independently — code patterns (eval, child_process) do not false-positive on markdown documentation, and markdown rules do not fire on JS/TS files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| pattern: /(?:curl|wget)\s+[^\n|]*\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/, | ||
| }, | ||
| { | ||
| ruleId: "markdown-encoded-payload", | ||
| severity: "warn", | ||
| message: "Large base64 block detected in markdown (possible obfuscated payload)", | ||
| pattern: /```[^\n]*\n[A-Za-z0-9+/=\s]{400,}\n```/, | ||
| }, |
There was a problem hiding this comment.
Markdown base64 rule misfires
markdown-encoded-payload matches any 400+ chars of [A-Za-z0-9+/=\s] inside a fenced block (src/security/skill-scanner.ts:186-190). That character class also matches lots of non-base64 content (e.g., long ASCII text, repeated = padding, or whitespace-heavy blocks), so this will reliably generate false positives on large documentation/code samples. If the intent is “base64-looking payload”, it needs a stricter heuristic (e.g., require high base64 density + valid padding + line lengths typical of base64, or require a nearby decode usage marker) to avoid flagging benign docs.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/skill-scanner.ts
Line: 183:190
Comment:
**Markdown base64 rule misfires**
`markdown-encoded-payload` matches any 400+ chars of `[A-Za-z0-9+/=\s]` inside a fenced block (`src/security/skill-scanner.ts:186-190`). That character class also matches lots of non-base64 content (e.g., long ASCII text, repeated `=` padding, or whitespace-heavy blocks), so this will reliably generate false positives on large documentation/code samples. If the intent is “base64-looking payload”, it needs a stricter heuristic (e.g., require high base64 density + valid padding + line lengths typical of base64, or require a nearby decode usage marker) to avoid flagging benign docs.
How can I resolve this? If you propose a fix, please make it concise.| message: "Download-and-execute pattern detected in markdown content", | ||
| pattern: /(?:curl|wget)\s+[^\n|]*\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/, | ||
| }, | ||
| { |
There was a problem hiding this comment.
Download/exec misses common forms
markdown-download-exec only matches curl|wget ... | <shell> with no newline and requires the pipe in the same line (src/security/skill-scanner.ts:180-184). This will miss very common patterns that are functionally identical in markdown docs, e.g. curl ... | sudo bash, curl ... | bash -s --, bash <(curl ...), or multi-line pipes inside fenced blocks. Given this PR’s goal (closing markdown bypass), this rule as written will fail to flag a large slice of real-world “download and execute” snippets.
This is a correctness gap (missed detections), not just coverage preference.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/skill-scanner.ts
Line: 182:185
Comment:
**Download/exec misses common forms**
`markdown-download-exec` only matches `curl|wget ... | <shell>` with no newline and requires the pipe in the same line (`src/security/skill-scanner.ts:180-184`). This will miss very common patterns that are functionally identical in markdown docs, e.g. `curl ... | sudo bash`, `curl ... | bash -s --`, `bash <(curl ...)`, or multi-line pipes inside fenced blocks. Given this PR’s goal (closing markdown bypass), this rule as written will fail to flag a large slice of real-world “download and execute” snippets.
This is a correctness gap (missed detections), not just coverage preference.
How can I resolve this? If you propose a fix, please make it concise.
Additional Comments (1)
In Also applies to the earlier line-rule use of Prompt To Fix With AIThis is a comment left during a code review.
Path: src/security/skill-scanner.ts
Line: 275:281
Comment:
**Stateful regex breaks matching**
In `scanSource`, source rules use `rule.pattern.test(lines[i])` inside a loop (`src/security/skill-scanner.ts:275-281`). Several of the new markdown rules (and potentially existing ones) could reasonably be written with `g`/`y` flags later; if that happens, repeated `.test()` calls on the same `RegExp` instance will advance `lastIndex` and produce false negatives (including skipping the first matching line). Safer pattern: use a non-stateful regex for per-line checks (e.g., `new RegExp(rule.pattern.source, rule.pattern.flags.replace(/g|y/g, ""))`) or reset `rule.pattern.lastIndex = 0` before each `.test()`/`.exec()` call.
Also applies to the earlier line-rule use of `rule.pattern.exec(line)` (`src/security/skill-scanner.ts:231`).
How can I resolve this? If you propose a fix, please make it concise. |
bfc1ccb to
f92900f
Compare
|
This pull request has been automatically marked as stale due to inactivity. |
|
@openclaw-bot and @openclaw-barncale @thewilloftheshadow thewilloftheshadow force-pushed the main branch from bfc1ccb to f92900f |
@thewilloftheshadow thewilloftheshadow force-pushed the main branch from bfc1ccb to f92900f |
|
This pull request has been automatically marked as stale due to inactivity. |
|
Closing due to inactivity. |
Summary
Extends the skill scanner to detect security threats in markdown files (
.md), closing a gap where malicious content inSKILL.mdskill definitions could bypass code-only scanning.Motivation: The ClawHavoc advisory revealed 341+ malicious ClawHub skills. While VirusTotal scans code files for known malware signatures, skill metadata lives in markdown (
SKILL.md) — which was previously unscanned. Attackers can embed download-and-execute patterns, obfuscated payloads, hidden Unicode (Trojan Source / CVE-2021-42574), and executable data URIs in skill documentation that gets injected into LLM system prompts.Changes:
SCANNABLE_EXTENSIONSintoCODE_EXTENSIONSandMARKDOWN_EXTENSIONSto enable file-type-specific rule routingisMarkdown()helper and routescanSource()to use markdown-specific rules for.mdfileshidden-unicode(zero-width + bidi override characters),markdown-data-uri(executable MIME types)markdown-download-exec(curl|bashpatterns),markdown-encoded-payload(large base64 in code blocks),markdown-hex-payload(hex-encoded sequences)eval,child_process, etc.) are isolated from markdown files to prevent false positives on documentation examplesRule isolation: Code-specific rules only fire on code files; markdown-specific rules only fire on
.mdfiles. This prevents breaking existing scanner behavior while extending coverage.Test plan
.md, markdown rules don't fire on.tsSKILL.mdproduces zero findingsoxlintpasses with 0 warnings/errorstsgotype checking passes🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
Greptile Overview
Greptile Summary
.mdas scannable and routes markdown files through a new set of markdown-specific rules.data:URIs.SKILL.md, summary counting, and rule isolation between code vs markdown.Confidence Score: 3/5
.mdto markdown-specific rules and expanding scannable extensions) is straightforward and well-tested. Main concerns are (1) regex statefulness inscanSourceif any rule regex acquiresg/yflags, and (2) markdown base64 and download/exec heuristics being either overly broad (false positives) or too narrow (missed common patterns), which undermines scanner correctness.