fix(media): anchor sanitizeMimeType regex and reject trailing garbage (#9795) by ymaxgit · Pull Request #68225 · openclaw/openclaw

ymaxgit · 2026-04-17T17:38:14Z

Summary

Problem: sanitizeMimeType in src/media-understanding/apply.ts used a regex that was anchored at the start but not at the end (/^(type/subtype)/). A string like text/plain<script>... or text/plain garbage therefore matched on its prefix and returned text/plain, so the whole value was accepted as a valid MIME type even though the input was not a well-formed RFC 7231 media type.
Why it matters: The function is the single input validator for textHint and mime strings coming into applyMediaUnderstanding. Accepting malformed values risks surprising callers that rely on it to reject garbage (e.g. hints propagated from untrusted text). Reported as a security input-validation bug in [Security] sanitizeMimeType regex should be anchored and case-insensitive #9795.
What changed: The regex is now end-anchored with an optional RFC 7231 parameters tail ((?:\s*;.*)?$). Parameters are still dropped from the returned value, but anything else after the subtype (whitespace + garbage, control characters, junk) causes the function to return undefined.
What did NOT change (scope boundary): No change to normalizeMimeType in src/media/mime.ts (which already splits on ;), no change to any caller of sanitizeMimeType, no change to the allowed character class for type/subtype. The function was exported purely so the new regression test can exercise it directly.
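For illustration, the tightened behavior can be sketched as below. This is a minimal sketch, not the code in apply.ts: the name sanitizeMimeTypeSketch is hypothetical, and trim() plus toLowerCase() is assumed as a stand-in for normalizeOptionalLowercaseString. Only the regex is taken from this PR.

```typescript
// End-anchored pattern from this PR: type/subtype, then an optional
// ";..." parameters tail, then end of input.
const ANCHORED = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/;

// Hypothetical stand-in for sanitizeMimeType. The trim/lowercase step is an
// assumption that approximates normalizeOptionalLowercaseString.
function sanitizeMimeTypeSketch(value?: string): string | undefined {
  const trimmed = value?.trim().toLowerCase();
  if (!trimmed) {
    return undefined;
  }
  // Only the captured type/subtype is returned, so parameters are dropped.
  return trimmed.match(ANCHORED)?.[1];
}

const withParams = sanitizeMimeTypeSketch("text/plain; charset=utf-8"); // "text/plain"
const withScript = sanitizeMimeTypeSketch("text/plain<script>"); // undefined
const withJunk = sanitizeMimeTypeSketch("text/plain garbage"); // undefined
```

Parameters are matched only so that they can be discarded; the return value never contains anything beyond the first capture group.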

Change Type (select all)

Bug fix
Security hardening

Scope (select all touched areas)

API / contracts

Linked Issue/PR

Closes [Security] sanitizeMimeType regex should be anchored and case-insensitive #9795
This PR fixes a bug or regression

Root Cause

Root cause: the pattern was /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)/ with no end anchor, so any valid-looking prefix was captured and the rest was ignored.
Missing detection / guardrail: no direct unit test for this helper existed (it was internal and only exercised through the broader applyMediaUnderstanding suite).
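The root cause can be made visible by looking at what the old pattern leaves unmatched. The helper name ignoredTail below is hypothetical and exists only for this illustration; the regex is the pre-fix pattern quoted above.

```typescript
// Pre-fix pattern: start-anchored only, so it matches a valid-looking prefix.
const UNANCHORED = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)/;

// Hypothetical helper: returns the part of the input the old pattern ignored,
// or undefined when the pattern does not match at all.
function ignoredTail(input: string): string | undefined {
  const match = UNANCHORED.exec(input);
  return match ? input.slice(match[0].length) : undefined;
}

const scriptTail = ignoredTail("text/plain<script>"); // "<script>"
const junkTail = ignoredTail("text/plain garbage"); // " garbage"
const cleanTail = ignoredTail("text/plain"); // ""
```

Any non-empty tail here is input that the old sanitizer accepted without ever inspecting.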

Regression Test Plan

Unit test
Target test or file: src/media-understanding/apply.sanitize-mime.test.ts (new)
Scenarios locked in:
- Plain type/subtype accepted.
- RFC parameters (; charset=utf-8, ; boundary=...) stripped to just type/subtype.
- Trailing garbage (<script>, spaces + junk, embedded newlines) rejected (returns undefined).
- Missing type or subtype rejected.
- Empty / whitespace / undefined input returns undefined.
Why this is the smallest reliable guardrail: the helper has a single responsibility; pinning it in isolation catches both the original bug shape and future regex tweaks without needing to stand up the full media-understanding pipeline.
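The scenarios above can be sketched as a table of inputs and expected results. This is not the content of apply.sanitize-mime.test.ts: it avoids any test runner so it stays self-contained, the sample inputs are illustrative, and sanitizeMimeTypeSketch is a hypothetical stand-in whose trim/lowercase step is an assumption.

```typescript
const ANCHORED = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/;

// Hypothetical stand-in for the exported helper.
function sanitizeMimeTypeSketch(value?: string): string | undefined {
  const trimmed = value?.trim().toLowerCase();
  return trimmed ? trimmed.match(ANCHORED)?.[1] : undefined;
}

type Scenario = { input: string | undefined; expected: string | undefined };

const scenarios: Scenario[] = [
  // Plain type/subtype accepted.
  { input: "text/plain", expected: "text/plain" },
  // RFC parameters stripped to just type/subtype.
  { input: "text/plain; charset=utf-8", expected: "text/plain" },
  { input: "multipart/form-data; boundary=abc", expected: "multipart/form-data" },
  // Trailing garbage rejected.
  { input: "text/plain<script>", expected: undefined },
  { input: "text/plain garbage", expected: undefined },
  { input: "text/plain\ngarbage", expected: undefined },
  // Missing type or subtype rejected.
  { input: "/plain", expected: undefined },
  { input: "text/", expected: undefined },
  // Empty, whitespace-only, and undefined input.
  { input: "", expected: undefined },
  { input: "   ", expected: undefined },
  { input: undefined, expected: undefined },
];

// Collect every scenario whose actual result differs from the expectation.
const failures = scenarios.filter(
  (scenario) => sanitizeMimeTypeSketch(scenario.input) !== scenario.expected,
);
```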

User-visible / Behavior Changes

Very narrow: values that previously coerced to a type/subtype despite invalid trailing bytes now cause the caller to see mimeType as undefined, which then falls back to the existing detection paths. No config changes.

Security Impact (required)

New permissions/capabilities? No
Tightens input validation on the MIME normalization surface used for media-understanding file attachments.

Verification

pnpm test src/media-understanding/apply.sanitize-mime.test.ts — 6 passed
pnpm test 'src/media-understanding/**/*.test.ts' — 131 passed (full media-understanding suite, on the pre-cherry-pick branch)
pnpm check — 0 warnings, 0 errors (pre-cherry-pick branch)

Human Verification

Verified scenarios: manually tried each regex input from the issue report against the updated pattern and confirmed that it matches only when the entire string is a valid type/subtype, optionally followed by a ;... parameter tail.
Edge cases checked: CR/LF injection (text/plain\nContent-Type: text/html) now rejected; upper-case input still lowercased before matching via normalizeOptionalLowercaseString.
What I did not verify: real-world corpora of MIME values seen in production (relying on the RFC 7231 shape only).
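One detail behind the CR/LF result: the rejection depends on $ meaning end of input, which in JavaScript holds only while the regex has no m flag. The comparison below is a sketch of that point; the constant names are hypothetical, and the m-flag variant is not part of this PR.

```typescript
// Pattern from this PR: without the m flag, $ matches only at end of input.
const ANCHORED = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/;
// Same pattern with the m flag, shown only as a counter-example:
// here $ also matches just before a line terminator.
const ANCHORED_MULTILINE = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/m;

// Header-injection shaped input, already lowercased.
const injected = "text/plain\ncontent-type: text/html";

const strictResult = ANCHORED.exec(injected)?.[1]; // undefined
const multilineResult = ANCHORED_MULTILINE.exec(injected)?.[1]; // "text/plain"

// Upper-case input is lowercased before matching (toLowerCase is assumed here
// in place of normalizeOptionalLowercaseString).
const upperResult = ANCHORED.exec("TEXT/PLAIN".toLowerCase())?.[1]; // "text/plain"
```

Any future edit that adds the m flag would silently bring back prefix acceptance for multi-line input, which is one more reason to pin this case in the regression test.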

Compatibility / Migration

Backward compatible? Yes for well-formed input.
Config/env changes? No
Migration needed? No

Risks and Mitigations

Risk: a caller that previously relied on lenient behavior for malformed textHint values will now get undefined and fall through to detection.
- Mitigation: that was a latent bug in the caller; the downstream applyMediaUnderstanding suite continues to pass (131/131) on the scoped run, so no regressions were surfaced.

greptile-apps · 2026-04-17T17:41:03Z

Greptile Summary

Anchors the sanitizeMimeType regex with $ and adds an optional RFC 7231 parameters tail ((?:\s*;.*)?), so inputs like text/plain<script> or text/plain garbage now correctly return undefined instead of silently coercing to a valid type/subtype. The function is also exported to support the new targeted regression test (apply.sanitize-mime.test.ts).

Confidence Score: 5/5

Safe to merge — fix is correct, tests are thorough, and the only remaining finding is a minor hardening suggestion with no exploitable impact.

No P0/P1 issues. The single P2 comment (using [ \t]* instead of \s* before the semicolon) is a minor hardening suggestion; the returned value is always a clean type/subtype string regardless, so there is no security or correctness risk.

No files require special attention.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 88

Comment:
**`\s*` before `;` allows newlines in the parameters prefix**

`\s` also matches `\n` and `\r`, so `text/plain\n;charset=utf-8` passes the regex and returns `text/plain`. The returned value is always safe, but accepting a newline before the semicolon is technically not RFC 7231 compliant. Replacing `\s*` with `[ \t]*` (horizontal whitespace only) would reject those edge-case inputs and should not affect well-formed MIME strings.

```suggestion
  const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:[ \t]*;.*)?$/);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(media): anchor sanitizeMimeType rege..."

greptile-apps · 2026-04-17T17:41:06Z

+  // Trailing garbage (e.g. `text/plain<script>` or `text/plain\ngarbage`) is
+  // rejected outright instead of silently truncated. Input is already
+  // lowercased by normalizeOptionalLowercaseString.
+  const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/);


\s* before ; allows newlines in the parameters prefix

\s also matches \n and \r, so text/plain\n;charset=utf-8 passes the regex and returns text/plain. The returned value is always safe, but accepting a newline before the semicolon is technically not RFC 7231 compliant. Replacing \s* with [ \t]* (horizontal whitespace only) would reject those edge-case inputs and should not affect well-formed MIME strings.

Suggested change

-  const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/);
+  const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:[ \t]*;.*)?$/);
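The difference between the two variants only shows up for whitespace placed before the semicolon. The sketch below compares them on a few inputs; firstGroup is a hypothetical helper, and the inputs are assumed to be already trimmed and lowercased.

```typescript
// Variant in this PR: any whitespace (including newlines) before ";".
const WITH_ANY_SPACE = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/;
// Suggested variant: only spaces and tabs before ";".
const WITH_HORIZONTAL_SPACE = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:[ \t]*;.*)?$/;

// Hypothetical helper: returns the captured type/subtype, or undefined.
function firstGroup(pattern: RegExp, input: string): string | undefined {
  return pattern.exec(input)?.[1];
}

const newlineBeforeSemicolon = "text/plain\n;charset=utf-8";
const lenient = firstGroup(WITH_ANY_SPACE, newlineBeforeSemicolon); // "text/plain"
const strict = firstGroup(WITH_HORIZONTAL_SPACE, newlineBeforeSemicolon); // undefined

// Ordinary spacing is accepted by both variants.
const spaced = firstGroup(WITH_HORIZONTAL_SPACE, "text/plain ;charset=utf-8"); // "text/plain"
const tabbed = firstGroup(WITH_HORIZONTAL_SPACE, "text/plain\t; charset=utf-8"); // "text/plain"
```

In both variants the returned string is the bare type/subtype, which matches the review's point that the choice affects which inputs are accepted rather than what is returned.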


prtags · 2026-04-23T09:51:25Z

Related work from PRtags group charming-marten-atlb

Title: Open PR duplicate: sanitizeMimeType regex anchoring (#9795)

Number	Title
#68225*	fix(media): anchor sanitizeMimeType regex and reject trailing garbage (#9795)
#68456	security(media): anchor sanitizeMimeType regex to reject malformed input

* This PR

clawsweeper · 2026-04-26T05:32:10Z

Closing this as duplicate or superseded after Codex automated review.

Close PR #68225 as duplicate/superseded. Current main still has the unanchored sanitizeMimeType behavior, so the underlying #9795 fix is not shipped, but open PR #68456 tracks the same remaining sanitizer anchoring work with the same implementation surface and regression tests. PRtags also explicitly grouped #68225 and #68456 as duplicate work for #9795.

Best possible solution:

Consolidate the #9795 sanitizer hardening on the open canonical duplicate PR #68456, carrying over any useful review detail from #68225 as needed, and keep #9795 open until the selected fix is merged and shipped.

What I checked:

Current main still has the original unanchored sanitizer: sanitizeMimeType lowercases/trims input, then matches only a MIME-looking prefix with /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)/; there is no end anchor or parameter-tail validation in current main. (src/media-understanding/apply.ts:77, d251932fcfd6)
Current call path makes the sanitizer relevant to the reported behavior: applyMediaUnderstanding computes mimeType = sanitizeMimeType(textHint ?? normalizeMimeType(rawMime)), then uses that value for unknown/allowed MIME checks, so a prefix-truncated MIME can still affect file handling. (src/media-understanding/apply.ts:404, d251932fcfd6)
Canonical duplicate PR #68456 tracks the same fix: PR #68456 (security(media): anchor sanitizeMimeType regex to reject malformed input) is open, targets #9795 ([Security] sanitizeMimeType regex should be anchored and case-insensitive), modifies src/media-understanding/apply.ts to export sanitizeMimeType and end-anchor the regex, and adds sanitizer regression coverage in src/media-understanding/apply.test.ts for trailing junk, path-like suffixes, parameters, mixed case, and missing type/subtype cases. (src/media-understanding/apply.test.ts:1420, a6cba758561c)
This PR is the same remaining work: PR #68225 also targets #9795, changes the same sanitizer in src/media-understanding/apply.ts, exports it for tests, and adds focused tests for valid MIME values, parameters, trailing junk, missing type/subtype, and blank input. (src/media-understanding/apply.sanitize-mime.test.ts:1, 7abcecdb1ff6)
PRtags already grouped the pair as duplicate work: the PRtags comment on #68225 groups #68225 and #68456 under "Open PR duplicate: sanitizeMimeType regex anchoring (#9795)", and the reciprocal comment exists on #68456.
Bot review found useful but non-blocking discussion on this PR: Greptile summarized #68225 as a correct anchoring fix with thorough tests and only a minor P2 hardening suggestion about \s* before the semicolon, so the useful review context can be carried into the canonical PR if needed. (src/media-understanding/apply.ts:88, 7abcecdb1ff6)
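The call path described above can be sketched as follows. Only the expression sanitizeMimeType(textHint ?? normalizeMimeType(rawMime)) comes from the review notes; every function body here is a hypothetical stand-in, and the detect callback is an assumption that models "falls back to the existing detection paths".

```typescript
const ANCHORED = /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/;

// Hypothetical stand-in for sanitizeMimeType with the anchored pattern.
function sanitizeMimeTypeSketch(value?: string): string | undefined {
  const trimmed = value?.trim().toLowerCase();
  return trimmed ? trimmed.match(ANCHORED)?.[1] : undefined;
}

// Hypothetical stand-in for normalizeMimeType: the PR only states that the
// real helper splits on ";".
function normalizeMimeTypeSketch(raw?: string): string | undefined {
  const base = raw?.split(";")[0]?.trim().toLowerCase();
  return base || undefined;
}

// Assumed caller shape: a rejected hint yields undefined, and the caller then
// relies on detection instead of a prefix-truncated MIME type.
function resolveMimeTypeSketch(
  textHint: string | undefined,
  rawMime: string | undefined,
  detect: () => string | undefined,
): string | undefined {
  const mimeType = sanitizeMimeTypeSketch(textHint ?? normalizeMimeTypeSketch(rawMime));
  return mimeType ?? detect();
}

const detectStub = (): string | undefined => "application/octet-stream";
const fromBadHint = resolveMimeTypeSketch("text/plain<script>", undefined, detectStub);
const fromRawMime = resolveMimeTypeSketch(undefined, "Text/HTML; charset=utf-8", detectStub);
```

With the old pattern the malformed hint would have resolved to text/plain and skipped detection entirely.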

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against d251932fcfd6.

vincentkoc · 2026-04-28T03:48:40Z

ProjectClownfish could not safely update this branch, so it opened a narrow replacement PR instead.

Replacement PR: #73229
Source PR: #68225
Contributor credit is preserved in the replacement PR body and changelog plan.

fix(media): anchor sanitizeMimeType regex (openclaw#9795)

7abcecd

openclaw-barnacle Bot added the size: XS label Apr 17, 2026

greptile-apps Bot reviewed Apr 17, 2026

View reviewed changes

prtags Bot mentioned this pull request Apr 23, 2026

security(media): anchor sanitizeMimeType regex to reject malformed input #68456

Closed

5 tasks

vincentkoc mentioned this pull request Apr 28, 2026

fix(media): tighten sanitizeMimeType anchoring #73229

Merged

vincentkoc closed this Apr 28, 2026

ymaxgit wants to merge 1 commit into openclaw:main from security-for-agent:claude/fix-sanitize-mime-9795
