fix(media): strip control and bidi characters from outbound filenames#72973
fix(media): strip control and bidi characters from outbound filenames#72973masatohoshino wants to merge 1 commit into
Conversation
Outbound media filenames flow from two untrusted sources into channel adapters
(Telegram, Discord, Matrix, WhatsApp): HTTP `Content-Disposition` headers
parsed in `fetchRemoteMedia`, and URL pathnames extracted in
`loadWebMediaInternal`. Both paths previously called `path.basename(...)`
with no further sanitization, leaving C0/C1 control bytes, zero-width joiners,
and bidi formatting characters in the filename that downstream channels
forwarded verbatim.
This adds a `stripFilenameControlChars` helper in `src/media/outbound-filename.ts`
that removes the union of those invisible ranges (U+0000-U+001F, U+007F-U+009F,
U+200B-U+200D, U+202A-U+202E, U+2066-U+2069, U+FEFF) and applies it at the two
root extraction sites. All other Unicode is preserved, so legitimate Japanese,
Cyrillic, Arabic, etc. filenames pass through unchanged. Defense-in-depth: the
operating systems on the receiving end already enforce the real extension via
content sniffing, so this only closes the visual-display gap.
The new file is named to avoid colliding with the three existing
`sanitize{File,Filename}` helpers in `store.ts`, `file-context.ts`, and
`server.ts`, each of which has a different purpose (cross-platform path
safety, LLM prompt attribute escaping, HTTP header attachment escaping).
🔒 Aisle Security AnalysisWe found 1 potential security issue(s) in this PR:
1. 🟡 Incomplete removal of Unicode bidi marks enables filename UI spoofing (LRM/RLM/ALM not stripped)
Description
These characters are in Unicode category Cf and can influence how a filename is displayed in some clients/OS UIs while leaving the underlying byte sequence (and extension) unchanged. Because Vulnerable code: const FILENAME_INVISIBLE_CONTROL_PATTERN =
"[\\u0000-\\u001F\\u007F-\\u009F\\u200B-\\u200D\\u202A-\\u202E\\u2066-\\u2069\\uFEFF]";Notably, the accompanying tests explicitly assert that U+200E/U+200F are preserved, which codifies the bypass. RecommendationStrip all bidi-affecting format marks from untrusted filenames. A minimal targeted fix is to extend the character class to include LRM/RLM/ALM (and optionally other bidi marks): const FILENAME_INVISIBLE_CONTROL_PATTERN =
"[" +
"\\u0000-\\u001F\\u007F-\\u009F" + // C0/C1
"\\u061C" + // ALM
"\\u200B-\\u200F" + // ZWSP..RLM (includes LRM/RLM)
"\\u202A-\\u202E" + // embedding/overrides
"\\u2066-\\u2069" + // isolates
"\\uFEFF" +
"]";Alternatively (if runtime supports it), use Unicode property escapes to remove all control/format chars and then explicitly allow what you need, but be cautious about scope: value.replace(/[\p{Cc}\p{Cf}]/gu, "")Then add tests demonstrating that LRM/RLM/ALM are stripped and cannot be used for extension/UI spoofing. Analyzed PR: #72973 at commit Last updated on: 2026-04-27T18:28:14Z |
Greptile SummaryThis PR introduces Confidence Score: 4/5Safe to merge — pure helper applied at well-scoped call sites with no behavioral change to file content, MIME inference, or delivery paths. Implementation is correct: the regex ranges match the declared strip set, the module-level g-flagged regex is only used with .replace() (safe; no shared lastIndex hazard), and the web-media.ts URL-path case is protected by the WHATWG URL constructor pre-encoding non-ASCII chars before path.basename is called. Test coverage is thorough (every strip range, boundary preserve characters, empty/all-control inputs, and an integration path through parseContentDispositionFileName). No P0 or P1 issues found. No files require special attention. Reviews (1): Last reviewed commit: "fix(media): strip control and bidi chara..." | Re-trigger Greptile |
|
Thank you for the careful analysis. I'd like to keep the current strip set for now and document the tradeoff, unless maintainers prefer the broader policy. This PR targets outbound filename display hardening for visual-deception issues:
LRM (U+200E), RLM (U+200F), and ALM (U+061C) are also bidi-affecting invisible format characters, but they are directional marks rather than override/isolate controls. They are used legitimately in Arabic, Hebrew, and other RTL-script contexts to resolve weak-direction characters such as digits and punctuation in mixed-script text. Because OpenClaw delivers files across multilingual chat surfaces, stripping those marks may cause legitimate mixed-script filenames to render with degraded directionality for RTL-script users. That would be an irreversible display regression at this filename extraction boundary. The current strip set therefore intentionally targets the stronger override/isolate controls used for visual reordering attacks, while preserving directional marks whose security benefit from stripping appears more theoretical and whose i18n cost is concrete. I'm happy to widen the strip set if maintainers prefer a stricter policy for all bidi-affecting invisible characters at this boundary, but I wanted to make the multilingual filename tradeoff explicit before doing so. |
|
Codex review: needs real behavior proof before merge. Summary Reproducibility: yes. Source inspection shows current main returns Content-Disposition and local media basenames without filtering the reported invisible/control characters; a focused fetchRemoteMedia or loadWebMedia harness can reproduce the filename value without live channels. Real behavior proof Next step before merge Security Review findings
Review detailsBest possible solution: Land a rebased media-boundary sanitizer that wraps basenameFromAnyPath/extnameFromAnyPath, records the agreed bidi policy in tests, and proves the actual fetch/local-media path or Telegram label. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main returns Content-Disposition and local media basenames without filtering the reported invisible/control characters; a focused fetchRemoteMedia or loadWebMedia harness can reproduce the filename value without live channels. Is this the best way to solve the issue? No, not as currently mergeable. Centralizing the strip at the media filename boundary is the right shape, but the branch must preserve current basenameFromAnyPath behavior and settle proof plus bidi policy first. Full review comments:
Overall correctness: patch is incorrect Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 421b9e281942. |
|
Closing this PR to reduce maintainer review load. The original shared outbound filename-sanitizer idea is still useful, but this PR is no longer in a strong review state:
The urgent per-channel surfaces are now better represented by narrower PRs:
The remaining shared sanitizer / bidi-marker policy surface is better revisited later if a concrete visible reproduction appears, especially for Telegram or another affected adapter. Closing to reduce shelf noise. I can re-submit a narrower shared-helper PR later if there is maintainer interest or a concrete channel proof. |
Summary
Content-Dispositionand URL pathname can carry C0/C1 control, zero-width, and bidi override/isolate formatting characters throughsrc/media/fetch.tsandsrc/media/web-media.tsto recipient client UI on Telegram / Discord / Matrix / WhatsApp document labels.src/media/outbound-filename.tsstripFilenameControlCharsis applied at the 4 sites where filename strings cross the untrusted boundary (3 infetch.ts, 1 inweb-media.ts). Channel adapters inextensions/are untouched.extensions/discord/src/send.shared.ts,extensions/whatsapp/src/inbound/send-api.ts); text body / caption / sender display name / reply quote / embed structured fields; encoding correctness (RFC 5987 charset/lang, mojibake recovery — covered by media: add shared filename decoder #71517).Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
parseContentDispositionFileName, scope is exclusive — encoding correctness vs character-class strip)Root Cause (if applicable)
parseContentDispositionFileName(src/media/fetch.ts) appliespath.basename+decodeURIComponentonly;loadWebMediaInternal(src/media/web-media.ts) appliespath.basenameonly. Neither path filters invisible / formatting characters before the value reaches the channel-side document UI label.sanitizeFilenameinsrc/media/store.tsfor local path safety,sanitizeFileNameinsrc/media/file-context.tsfor LLM prompt attribute escape,sanitizeAttachmentFilenameinsrc/media/server.tsfor HTTP header escape) target other contexts and do not cover this one.media.fileNamefrom the same upstream extraction sites, so a per-channel fix would duplicate logic; the root sites are the natural single integration point.Regression Test Plan (if applicable)
src/media/outbound-filename.test.ts(new, 27 cases) andsrc/media/fetch.test.ts(new Content-Disposition integration case).U+200E,U+200F,U+202F) round-trip unchanged; the integration path throughparseContentDispositionFileNamereturns a sanitized filename.src/media/.User-visible / Behavior Changes
Outbound media filenames forwarded to channel document UI labels no longer contain C0/C1 control, zero-width, or bidi override/isolate formatting characters. Other characters (including
U+200E/U+200Fbidi marks andU+202Fnarrow no-break space) round-trip unchanged. No defaults or config keys change.Diagram (if applicable)
N/A.
Security Impact (required)
NoNoNoNoNoRepro + Verification
Environment
src/media/fetch.ts(HTTP Content-Disposition) andsrc/media/web-media.ts(URL pathname); 4 channel adapters consume the result unchanged.Steps
pnpm installpnpm test src/media/outbound-filename.test.ts src/media/fetch.test.ts src/media/web-media.test.tspnpm check:changedExpected
outbound-filename.test.ts: 27/27 pass.fetch.test.ts+web-media.test.ts: 42/42 pass (includes the new Content-Disposition integration case).check:changed: lanescore, coreTestsGREEN (typecheck / lint / conflict-markers / runtime import cycles / pairing guards).Actual
outbound-filename.test.ts: 27 passed (114ms).fetch.test.ts+web-media.test.ts: 42 passed (843ms).check:changed: GREEN, lint 0 warnings 0 errors, 0 runtime cycles, all guards pass.Evidence
pnpm test src/media/outbound-filename.test.ts src/media/fetch.test.ts src/media/web-media.test.ts(worktree at654cf20737, rebased ontof20a295782):pnpm check:changed:Unrelated full-suite note: a local full-suite precheck on both the rebased branch and the
f20a295782baseline hit one gateway-laneERR_WORKER_OUT_OF_MEMORYunder concurrent Vitest shards. The affected test passes in isolation and does not touchsrc/media/. Earlier unrelated Windows ACL / unit-fast failures were fixed upstream and are included in the current rebase.Human Verification (required)
parseContentDispositionFileName; preserved characters (U+200E,U+200F,U+202F) round-trip unchanged;pnpm check:changedGREEN on654cf20737(rebased ontof20a295782).U+200ELRM kept,U+200BZWSP stripped), filenames with multibyte characters preserved.media.fileNameunchanged throughextensions/adapters that are untouched by this PR, so the change is transparent to them.Review Conversations
Compatibility / Migration
YesNoNoRisks and Mitigations
U+200E(LRM),U+200F(RLM), andU+202F(NARROW NO-BREAK SPACE) round-trip unchanged, locking in the boundary.parseContentDispositionFileName).decodeURIComponentvalue; both PRs touch the same function but functionally exclusive. Whichever lands second takes a small rebase.U+200E), RLM (U+200F), and ALM (U+061C) remain in extracted filenames.U+202A-U+202E) and isolate (U+2066-U+2069) controls, which are the stronger visual-reordering controls relevant to RTLO-style filename deception and Trojan-Source-style visual reordering. LRM/RLM/ALM are directional marks used legitimately in Arabic, Hebrew, and other RTL-script contexts to resolve weak-direction characters such as digits and punctuation in mixed-script filenames. Stripping them would create a concrete multilingual display regression at this filename extraction boundary to mitigate a more theoretical residual surface. If maintainers prefer a stricter policy for all bidi-affecting invisible characters, this strip set can be widened in a follow-up commit.Real behavior proof
Behavior or issue addressed: With this PR at head
654cf20737,stripFilenameControlChars(src/media/outbound-filename.ts) is invoked at the four documented call sites insrc/media/fetch.ts:73,75,80(Content-Dispositionfilename*/filenamebranches) andsrc/media/web-media.ts:556(URL pathname basename), removing C0/C1 controls (U+0000–U+001F, U+007F–U+009F), zero-width characters (U+200B–U+200D, U+FEFF), and bidi override + isolate formatting (U+202A–U+202E, U+2066–U+2069) from outbound filename strings before they are surfaced to recipient client UI or used as on-disk filenames. This is defense-in-depth for display labels — file bytes, MIME inference, and channel delivery are unchanged.Real environment tested: Local Node v22.22.2 with the
tsxloader (the same TS-on-Node mechanism the repository uses in its own scripts, e.g.node --import tsx scripts/check-import-cycles.ts). Worktree at HEAD654cf20737(this PR's head). The harness importsstripFilenameControlCharsfromsrc/media/outbound-filename.ts— the same exported functionsrc/media/fetch.ts:11andsrc/media/web-media.ts:31import — and invokes it with the samestripFilenameControlChars(path.basename(input))call shape thatsrc/media/fetch.ts:80uses, against adversarial input strings. The harness does not exercise the surrounding HTTP fetch, channel adapter, or runtime delivery code — only the helper itself, as imported.Exact steps or command run after this patch:
Evidence after fix: Verbatim stdout captured 2026-05-17 from running the command above against worktree HEAD
654cf20737:No secrets, tokens, chat identifiers, or user identifiers are involved — the harness operates entirely on hand-constructed adversarial filename strings.
Observed result after fix: For every adversarial input above, the helper's return value contains no invisible/control bytes. Concretely from the hex dumps in the Evidence section: (1) the RLO byte sequence
e2 80 aeis removed; (2) the CRLF bytes0d 0aare removed in both occurrences from the helper's return value (the value its callers insrc/media/fetch.tsandsrc/media/web-media.tsthen propagate asmedia.fileName); (3) the ZWSP / BOM / NEL / LRI+PDI byte sequences are removed; (4) the baseline ASCII filenameinvoice.pdfis preserved byte-for-byte (11 → 11 bytes, identical hex). The7/7 passedline in the captured output indicates every reported verdict is OK. This proof exercises only the helper function — it does not observe an HTTP request, aContent-Dispositionheader on the wire, a runtime delivery, or any channel adapter execution.What was not tested:
Content-Dispositionround-trip. Reason: it would require switching the local runtime symlink to this PR's worktree and a real-time Telegram client interaction, which would interrupt an unrelated active work item. The PR'sextensions/adapters consumemedia.fileNameunchanged from the extraction sites this PR hardens, so the sanitized value reaches the adapter layer transparently. Happy to add a Telegram capture if a maintainer would like it as an additional artifact.U+200E,U+200F,U+061C). Intentional, per the §Risks discussion on multilingual filename display.extensions/discord/src/send.shared.tsandextensions/whatsapp/src/inbound/send-api.ts. Out of scope per §Scope boundary.