fix: treat application/x-cfb and application/msword as binary MIME types #54190
lndyzwdxhs wants to merge 1 commit into
Greptile Summary
This PR fixes a well-scoped bug:
Confidence Score: 5/5
Reviews (1): Last reviewed commit: "fix: treat application/x-cfb and applica..."
Binary .doc files (OLE2 Compound Binary Format) were not recognized as binary by isBinaryMediaMime(), causing ~70KB of binary garbage to be embedded as text content in conversation prompts. This triggered content filtering rejections from LLM providers, breaking entire conversation sessions.

Add application/x-cfb and application/msword to the binary MIME type allowlist so these files are correctly skipped during text extraction.

Fixes openclaw#54176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d5dd3c7 to 4edf646
Closing this as duplicate or superseded after Codex automated review.

PR #54190 is a valid narrow fix, but it has been superseded by PR #54380, which is still open, on a fresher branch, touches the same media-understanding classifier/tests, and explicitly says it supersedes #54190 and #54234. Current main does not yet ship the MIME additions, so this should close as superseded rather than implemented.

Best possible solution: Close #54190 as superseded and continue review on #54380, preferably strengthening its regression test per the existing review note before landing. After #54380 or an equivalent fix lands, close the original bug #54176 with the shipped commit/release evidence.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 06d409dc2738.
Binary .doc files (OLE2 Compound Binary Format) were not recognized as binary by isBinaryMediaMime(), causing ~70KB of binary garbage to be embedded as text content in conversation prompts. This triggered content filtering rejections from LLM providers, breaking entire conversation sessions.
Add application/x-cfb and application/msword to the binary MIME type allowlist so these files are correctly skipped during text extraction.
Fixes #54176
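
The shape of the fix can be sketched as follows. This is a minimal illustration, not the code in src/media-understanding/apply.ts: the name isBinaryMimeSketch, the contents of the explicit set other than the two added entries, and the order of the checks are all assumptions. Only the two new MIME types and the application/vnd.* rule with its +json/+xml exception are taken from this PR.

```typescript
// Hypothetical illustration only; the real isBinaryMediaMime() may differ.

// Explicit binary types that no prefix rule catches.
// The last two entries are the ones this PR adds.
const EXPLICIT_BINARY_MIMES = new Set<string>([
  "application/octet-stream",
  "application/zip",
  "application/x-cfb", // OLE2 Compound Binary Format
  "application/msword", // Word 97-2003
]);

function isBinaryMimeSketch(mime: string | undefined): boolean {
  if (!mime) return false;
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const normalized = mime.replace(/;.*$/, "").trim().toLowerCase();

  // Media families are binary by prefix.
  if (/^(image|audio|video)\//.test(normalized)) return true;

  // Types that need to be listed one by one.
  if (EXPLICIT_BINARY_MIMES.has(normalized)) return true;

  // Vendor types are binary unless they carry a structured-text suffix.
  if (normalized.startsWith("application/vnd.")) {
    return !/\+(json|xml)$/.test(normalized);
  }

  // Anything else falls through as text-eligible. Before the fix,
  // application/x-cfb and application/msword ended up here.
  return false;
}
```

Without the two added entries, neither type matches a prefix rule, so both would reach the final return and be treated as text-eligible, which matches the failure described above.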
Summary
When a .doc file is sent via Feishu channel, the media pipeline treats it as text-eligible and embeds binary content into the conversation prompt. This triggers "high risk" content filtering rejections from LLM providers.
.doc attachments without corrupting their session.
Added application/x-cfb (OLE2 Compound Binary Format) and application/msword (Word 97-2003) to the binary MIME type check in isBinaryMediaMime() in src/media-understanding/apply.ts. Added 2 corresponding tests.
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Root Cause / Regression History (if applicable)
isBinaryMediaMime() in src/media-understanding/apply.ts (line 286) checks a hardcoded list of binary MIME types. application/x-cfb and application/msword were missing from this list. These MIME types fall through all checks (archive formats, application/vnd.* vendor prefix) and return false, allowing binary content to proceed to text extraction.
application/vnd.openxmlformats-* (ZIP-based Office) but not the older OLE2/CFB-based formats.
The isBinaryMediaMime() function was written to cover common binary formats (images, audio, video, archives, application/vnd.*). OLE2 .doc files use application/x-cfb, which does not fall under any of those categories.
isBinaryMediaMime() was first introduced. Became visible when Feishu users started sending legacy .doc files, which are detected as application/x-cfb by file-type libraries.
Regression Test Plan (if applicable)
src/media-understanding/apply.test.ts
(1) A file with MIME application/x-cfb and CFB magic bytes is NOT applied as text content. (2) A file with MIME application/msword and CFB magic bytes is NOT applied as text content.
applyMediaUnderstanding pipeline with these MIME types and assert the file is skipped; if someone removes the MIME entries, the tests fail immediately.
User-visible / Behavior Changes
.doc files sent via any channel are now correctly treated as binary and skipped during text extraction, instead of having their binary content embedded in the conversation prompt.
.docx or other Office formats (already handled by application/vnd.* check).
Security Impact (required)
No
No
No
No
No
Yes, explain risk + mitigation: N/A
Repro + Verification
Environment
Steps
.doc file (Word 97-2003 format) as an attachment
Expected
.doc file is recognized as binary and skipped during text extraction
Actual
Evidence
2 new tests in src/media-understanding/apply.test.ts: "skips binary OLE2 .doc attachments (application/x-cfb)" and "skips binary .doc attachments (application/msword)". All 34 media understanding tests pass. pnpm check passes cleanly.
Human Verification (required)
isBinaryMediaMime(). Files with these MIME types are skipped by applyMediaUnderstanding. All existing tests remain green.
application/vnd.* office formats still handled (existing test). Vendor +json/+xml payloads still eligible for text extraction (existing test). application/octet-stream still binary (existing logic).
.doc send with real API. Other OLE2-based formats (.xls, .ppt) which also use application/x-cfb; these are covered by the same fix.
Review Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Yes; binary .doc files were never intended to be text-extracted; this restores correct behavior.
No
No
Failure Recovery (if this breaks)
Revert src/media-understanding/apply.ts to prior version; remove new tests from src/media-understanding/apply.test.ts.
.doc files (unlikely, since the extracted content was binary garbage), that feature would stop receiving content for .doc inputs.
Risks and Mitigations
Other OLE2-based formats (.xls, .ppt) also use application/x-cfb and will now be treated as binary.
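
The reason these formats share one MIME type is that legacy .doc, .xls, and .ppt files are all stored in the same OLE2 container, which begins with a fixed 8-byte signature. The sketch below shows that signature check and a way to build a test buffer carrying the CFB magic bytes. The helper names looksLikeCfb and makeCfbFixture are hypothetical and are not taken from the repository or from any file-type library.

```typescript
// The 8-byte signature that starts every OLE2 / Compound File Binary container.
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Hypothetical helper: true when the buffer begins with the CFB signature.
// It cannot tell .doc from .xls or .ppt, because all three share this header.
function looksLikeCfb(bytes: Uint8Array): boolean {
  if (bytes.length < CFB_SIGNATURE.length) return false;
  return CFB_SIGNATURE.every((expected, i) => bytes[i] === expected);
}

// Hypothetical fixture builder: a zero-filled buffer that starts with the
// CFB signature, enough to stand in for a legacy Office file in a test.
function makeCfbFixture(totalLength = 512): Uint8Array {
  const buf = new Uint8Array(Math.max(totalLength, CFB_SIGNATURE.length));
  buf.set(CFB_SIGNATURE, 0);
  return buf;
}
```

A buffer from makeCfbFixture() passes looksLikeCfb(), while a plain-text buffer or one shorter than 8 bytes does not. Because the signature identifies only the container, treating application/x-cfb as binary necessarily covers every format stored in it.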