Skip to content

fix(media): treat legacy Word docs as binary attachments#73799

Merged
vincentkoc merged 1 commit into
mainfrom
clownfish/ghcrawl-156640-autonomous-smoke
Apr 30, 2026
Merged

fix(media): treat legacy Word docs as binary attachments#73799
vincentkoc merged 1 commit into
mainfrom
clownfish/ghcrawl-156640-autonomous-smoke

Conversation

@openclaw-clownfish

Copy link
Copy Markdown
Contributor

Summary

  • Repair fix(media): treat legacy .doc containers as binary #54380 on top of current main.
  • Keep the narrow application/msword and application/x-cfb binary MIME guard for legacy .doc/OLE files.
  • Strengthen the regression test so those MIME types are explicitly allowed and the skip behavior depends on isBinaryMediaMime, addressing the Greptile review finding.

Credit

Based on #54380 by @andyliu. Related prior reports/PRs include #54176, #44068, #54190, and #54234; preserve attribution for any reused reproduction or test detail.

Validation

  • pnpm check:changed
  • Codex /review clean before merge

Fixes #54176 when landed.

ProjectClownfish replacement details:

@greptile-apps

greptile-apps Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Adds application/msword and application/x-cfb to the isBinaryMediaMime guard in apply.ts, preventing legacy .doc/OLE files with printable-looking bytes from being embedded into prompts as text. The accompanying it.each test verifies the skip holds even when those MIME types appear in an explicit allowedMimes config.

Confidence Score: 5/5

This PR is safe to merge — the change is narrow, well-tested, and has no bypass paths.

The fix is a two-line addition to an existing guard function, checked before any allowlist or heuristic logic. TEXT_EXT_MIME does not map .doc, so no forced-text override can bypass it. The new parameterised test covers both MIME types under the strongest allowlist condition.

No files require special attention.

Reviews (1): Last reviewed commit: "fix(media): treat legacy Word docs as bi..." | Re-trigger Greptile

@clawsweeper

clawsweeper Bot commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

What this changes:

The PR adds application/msword and application/x-cfb to isBinaryMediaMime, adds a regression test that skips those MIME values even when explicitly allowed, and adds an active changelog fix entry.

Maintainer follow-up before merge:

This is already an open implementation PR with a narrow diff and no blocking automated review finding; the next action is maintainer review, validation, and merge or intentional replacement, not a separate repair PR.

Security review:

Security review cleared: The PR does not touch workflows, dependencies, lockfiles, scripts, permissions, secrets handling, generated/vendor code, or package metadata; the code change narrows binary prompt ingestion.

Review details

Best possible solution:

Land this PR or an equivalent focused media-understanding guard after maintainer validation, then close #54176 as implemented and leave the earlier unmerged attempts closed with credit preserved.

Do we have a high-confidence way to reproduce the issue?

Yes. The current-main source path shows that application/msword and application/x-cfb bypass the binary MIME guard and can reach text heuristics; the PR's new regression is a high-confidence unit reproduction for the intended skip behavior.

Is this the best way to solve the issue?

Yes. Adding these known legacy Word/OLE MIME values to the existing binary guard before text heuristics is the narrowest maintainable fix; dedicated .doc conversion can be separate future work if product wants extraction support.

Acceptance criteria:

  • pnpm test src/media-understanding/apply.test.ts
  • pnpm check:changed

What I checked:

Likely related people:

  • vignesh07: Prior feature-history review in the PR thread identifies 86a156db2627d68fb00b320c9ef7ca508a2a0df9 as changing isBinaryMediaMime into the current application-MIME binary guard where this PR adds the missing legacy Word/OLE values. (role: introduced current guard behavior; confidence: high; commits: 86a156db2627; files: src/media-understanding/apply.ts)
  • frankekn: Prior history ties cb18ce7a8566ecc0571ad6e517f430f9143ea8f6 to the file attachment extraction path, text MIME heuristics, and related misclassification coverage affected by this guard. (role: introduced file extraction and text-heuristic path; confidence: medium; commits: cb18ce7a8566; files: src/media-understanding/apply.ts, src/media-understanding/apply.test.ts)
  • steipete: The existing review thread routes inbound media-understanding introduction and later media-understanding/media helper maintenance to steipete; local release history also shows Peter Steinberger as release maintainer for the current tagged tree. (role: feature introducer and recent maintainer; confidence: medium; commits: 1b973f750621, fcb7c9ff650a, 6545317a2cb9; files: src/media-understanding/apply.ts, src/media-understanding/apply.test.ts, src/media/mime.ts)
  • martinfrancois: Prior history identifies 734bb9c2e73b18777175ae05953f51b4be034d33 as adding +zip binary MIME handling, ZIP-signature skipping, and regression coverage in the same prompt-extraction path. (role: adjacent binary-payload hardening owner; confidence: medium; commits: 734bb9c2e73b; files: src/media-understanding/apply.ts, src/media-understanding/apply.test.ts)
  • gumadeiras: The prior thread notes gumadeiras as connected to the same 734bb9c2e73b18777175ae05953f51b4be034d33 binary prompt-inflation hardening work, making them a secondary routing candidate for this guard family. (role: adjacent reviewer/coauthor; confidence: low; commits: 734bb9c2e73b; files: src/media-understanding/apply.ts, src/media-understanding/apply.test.ts)

Remaining risk / open question:

  • Validation still needs to run on the PR head; this review was read-only and did not execute tests or changed-gate checks.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 6308d2a1dcd4.

@vincentkoc vincentkoc force-pushed the clownfish/ghcrawl-156640-autonomous-smoke branch from f4c4238 to 4d82705 Compare April 30, 2026 05:02
@vincentkoc vincentkoc self-assigned this Apr 30, 2026
@vincentkoc vincentkoc merged commit 3af4575 into main Apr 30, 2026
84 checks passed
@vincentkoc vincentkoc deleted the clownfish/ghcrawl-156640-autonomous-smoke branch April 30, 2026 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clawsweeper Tracked by ClawSweeper automation size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Binary files (.doc) should not be auto-embedded as text content

1 participant