Skip to content

fix(media): anchor sanitizeMimeType regex and reject trailing garbage (#9795)#68225

Closed
ymaxgit wants to merge 1 commit into
openclaw:mainfrom
security-for-agent:claude/fix-sanitize-mime-9795
Closed

fix(media): anchor sanitizeMimeType regex and reject trailing garbage (#9795)#68225
ymaxgit wants to merge 1 commit into
openclaw:mainfrom
security-for-agent:claude/fix-sanitize-mime-9795

Conversation

@ymaxgit

@ymaxgit ymaxgit commented Apr 17, 2026

Copy link
Copy Markdown

Summary

  • Problem: sanitizeMimeType in src/media-understanding/apply.ts used an unanchored regex (/^(type/subtype)/). A string like text/plain<script>... or text/plain garbage silently matched and returned text/plain, accepting the overall value as a valid MIME even though the input was not a well-formed RFC 7231 media type.
  • Why it matters: The function is the single input validator for textHint and mime strings coming into applyMediaUnderstanding. Accepting malformed values risks surprising callers that rely on it to reject garbage (e.g. hints propagated from untrusted text). Reported as a security input-validation bug in [Security] sanitizeMimeType regex should be anchored and case-insensitive #9795.
  • What changed: The regex is now end-anchored with an optional RFC 7231 parameters tail ((?:\s*;.*)?$). Parameters are still dropped from the returned value, but anything else after the subtype (whitespace + garbage, control characters, junk) causes the function to return undefined.
  • What did NOT change (scope boundary): No change to normalizeMimeType in src/media/mime.ts (which already splits on ;), no change to any caller of sanitizeMimeType, no change to the allowed character class for type/subtype. The function was exported purely so the new regression test can exercise it directly.

Change Type (select all)

  • Bug fix
  • Security hardening

Scope (select all touched areas)

  • API / contracts

Linked Issue/PR

Root Cause

  • Root cause: the pattern was /^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)/ with no end anchor, so any valid-looking prefix was captured and the rest was ignored.
  • Missing detection / guardrail: no direct unit test for this helper existed (it was internal and only exercised through the broader applyMediaUnderstanding suite).

Regression Test Plan

  • Unit test
  • Target test or file: src/media-understanding/apply.sanitize-mime.test.ts (new)
  • Scenarios locked in:
    • Plain type/subtype accepted.
    • RFC parameters (; charset=utf-8, ; boundary=...) stripped to just type/subtype.
    • Trailing garbage (<script>, spaces + junk, embedded newlines) rejected (returns undefined).
    • Missing type or subtype rejected.
    • Empty / whitespace / undefined input returns undefined.
  • Why this is the smallest reliable guardrail: the helper has a single responsibility; pinning it in isolation catches both the original bug shape and future regex tweaks without needing to stand up the full media-understanding pipeline.

User-visible / Behavior Changes

Very narrow: values that previously coerced to a type/subtype despite invalid trailing bytes now cause the caller to see mimeType as undefined, which then falls back to the existing detection paths. No config changes.

Security Impact (required)

  • New permissions/capabilities? No
  • Tightens input validation on the MIME normalization surface used for media-understanding file attachments.

Verification

  • pnpm test src/media-understanding/apply.sanitize-mime.test.ts — 6 passed
  • pnpm test 'src/media-understanding/**/*.test.ts' — 131 passed (full media-understanding suite, on the pre-cherry-pick branch)
  • pnpm check — 0 warnings, 0 errors (pre-cherry-pick branch)

Human Verification

  • Verified scenarios: manually tried each regex input from the issue report against the updated pattern, confirmed the capture group only returns when the entire string is a valid type/subtype optionally followed by ;....
  • Edge cases checked: CR/LF injection (text/plain\nContent-Type: text/html) now rejected; upper-case input still lowercased before matching via normalizeOptionalLowercaseString.
  • What I did not verify: real-world corpora of MIME values seen in production (relying on the RFC 7231 shape only).

Compatibility / Migration

  • Backward compatible? Yes for well-formed input.
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: a caller that previously relied on lenient behavior for malformed textHint values will now get undefined and fall through to detection.
    • Mitigation: that was a latent bug in the caller; the downstream applyMediaUnderstanding suite continues to pass (131/131) on the scoped run, so no regressions were surfaced.

@greptile-apps

greptile-apps Bot commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Anchors the sanitizeMimeType regex with $ and adds an optional RFC 7231 parameters tail ((?:\\s*;.*)?), so inputs like text/plain<script> or text/plain garbage now correctly return undefined instead of silently coercing to a valid type/subtype. The function is also exported to support the new targeted regression test (apply.sanitize-mime.test.ts).

Confidence Score: 5/5

Safe to merge — fix is correct, tests are thorough, and the only remaining finding is a minor hardening suggestion with no exploitable impact.

No P0/P1 issues. The single P2 comment (using [ ]* instead of \s* before the semicolon) is a cosmetic hardening suggestion; the returned value is always a clean type/subtype string regardless, so there is no security or correctness risk.

No files require special attention.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 88

Comment:
**`\s*` before `;` allows newlines in the parameters prefix**

`\s` includes `\n` and `\r\n`, so `text/plain\n;charset=utf-8` passes the regex and returns `text/plain`. The returned value is always safe, but the acceptance of a newline-before-semicolon input is technically non-RFC-7231-compliant. Replacing `\s*` with `[ \t]*` (horizontal whitespace only) would reject those edge-case inputs without affecting any real-world MIME strings.

```suggestion
  const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:[ \t]*;.*)?$/);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(media): anchor sanitizeMimeType rege..." | Re-trigger Greptile

// Trailing garbage (e.g. `text/plain<script>` or `text/plain\ngarbage`) is
// rejected outright instead of silently truncated. Input is already
// lowercased by normalizeOptionalLowercaseString.
const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 \s* before ; allows newlines in the parameters prefix

\s includes \n and \r\n, so text/plain\n;charset=utf-8 passes the regex and returns text/plain. The returned value is always safe, but the acceptance of a newline-before-semicolon input is technically non-RFC-7231-compliant. Replacing \s* with [ \t]* (horizontal whitespace only) would reject those edge-case inputs without affecting any real-world MIME strings.

Suggested change
const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:\s*;.*)?$/);
const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:[ \t]*;.*)?$/);
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 88

Comment:
**`\s*` before `;` allows newlines in the parameters prefix**

`\s` includes `\n` and `\r\n`, so `text/plain\n;charset=utf-8` passes the regex and returns `text/plain`. The returned value is always safe, but the acceptance of a newline-before-semicolon input is technically non-RFC-7231-compliant. Replacing `\s*` with `[ \t]*` (horizontal whitespace only) would reject those edge-case inputs without affecting any real-world MIME strings.

```suggestion
  const match = trimmed.match(/^([a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+)(?:[ \t]*;.*)?$/);
```

How can I resolve this? If you propose a fix, please make it concise.

@prtags

prtags Bot commented Apr 23, 2026

Copy link
Copy Markdown

Related work from PRtags group charming-marten-atlb

Title: Open PR duplicate: sanitizeMimeType regex anchoring (#9795)

Number Title
#68225* fix(media): anchor sanitizeMimeType regex and reject trailing garbage (#9795)
#68456 security(media): anchor sanitizeMimeType regex to reject malformed input

* This PR

@clawsweeper

clawsweeper Bot commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

Closing this as duplicate or superseded after Codex automated review.

Close PR #68225 as duplicate/superseded. Current main still has the unanchored sanitizeMimeType behavior, so the underlying #9795 fix is not shipped, but open PR #68456 tracks the same remaining sanitizer anchoring work with the same implementation surface and regression tests. PRtags also explicitly grouped #68225 and #68456 as duplicate work for #9795.

Best possible solution:

Consolidate the #9795 sanitizer hardening on the open canonical duplicate PR #68456, carrying over any useful review detail from #68225 as needed, and keep #9795 open until the selected fix is merged and shipped.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against d251932fcfd6.

@vincentkoc

Copy link
Copy Markdown
Member

ProjectClownfish could not safely update this branch, so it opened a narrow replacement PR instead.

Replacement PR: #73229
Source PR: #68225
Contributor credit is preserved in the replacement PR body and changelog plan.

@vincentkoc vincentkoc closed this Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] sanitizeMimeType regex should be anchored and case-insensitive

3 participants