Skip to content

fix: detect legacy Office binary formats as non-text attachments#54234

Closed
kagura-agent wants to merge 2 commits into
openclaw:mainfrom
kagura-agent:fix/binary-file-detection
Closed

fix: detect legacy Office binary formats as non-text attachments#54234
kagura-agent wants to merge 2 commits into
openclaw:mainfrom
kagura-agent:fix/binary-file-detection

Conversation

@kagura-agent

Copy link
Copy Markdown
Contributor

Summary

Fixes #54176 — .doc files (application/x-cfb, application/msword) are auto-embedded as text, causing 70KB of binary garbage in prompts and LLM 'high risk' rejections.

Root Cause

isBinaryMediaMime() in src/media-understanding/apply.ts catches application/vnd.* (which covers modern Office formats like .docx/.xlsx) and common archives, but misses two legacy Microsoft Office MIME types:

  • application/msword — the standard MIME for .doc files
  • application/x-cfb — the MIME that file-type sniffers detect for OLE Compound Binary Format (legacy Office container)

Fix

Add both MIME types to the binary detection check, grouped with a comment explaining they are legacy OLE/CFB formats.

Testing

  • No behavioral changes to modern Office formats (already caught by application/vnd.*)
  • .doc files will now be treated as binary attachments instead of embedded as text
  • 7 lines changed, 1 file

@greptile-apps

greptile-apps Bot commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug where .doc files (detected with MIME types application/msword or application/x-cfb) were being embedded as raw text into prompts instead of being treated as binary attachments, producing garbage content and triggering LLM safety rejections.

  • Fix — two legacy Microsoft Office MIME types (application/msword, application/x-cfb) are added to isBinaryMediaMime() in src/media-understanding/apply.ts, correctly placed before the existing application/vnd.* guard with an explanatory comment.
  • Scope — the change is minimal and surgical; modern Office formats (.docx, .xlsx, etc.) were already caught by application/vnd.* and are unaffected.
  • Testing — no new test cases are added for the new MIME types, though the existing suite covers the analogous application/vnd.* path at line 1173 of apply.test.ts.

Confidence Score: 5/5

  • Safe to merge — the change is a targeted, low-risk addition that only expands the set of MIME types recognised as binary.
  • The fix is correct: both MIME types are unambiguously binary, the guard pattern matches the existing style, and the placement (before the vnd.* check) is logical. The only gap is the absence of dedicated tests for the new cases, but the overall test suite already validates the surrounding logic and the change itself carries negligible regression risk.
  • No files require special attention.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 306-312

Comment:
**No test added for the new MIME types**

The existing suite has a dedicated test at line 1173 verifying that `application/vnd.*` Office attachments are skipped as binary. It would be consistent to add a parallel test for `application/msword` (and optionally `application/x-cfb`) so regressions on these cases are caught automatically, especially given that this is the exact scenario that caused the bug in #54176.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: detect legacy Office binary formats..." | Re-trigger Greptile

Comment on lines +306 to +312
// Legacy Microsoft Office binary formats (OLE/CFB container)
if (
mime === "application/msword" ||
mime === "application/x-cfb"
) {
return true;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No test added for the new MIME types

The existing suite has a dedicated test at line 1173 verifying that application/vnd.* Office attachments are skipped as binary. It would be consistent to add a parallel test for application/msword (and optionally application/x-cfb) so regressions on these cases are caught automatically, especially given that this is the exact scenario that caused the bug in #54176.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 306-312

Comment:
**No test added for the new MIME types**

The existing suite has a dedicated test at line 1173 verifying that `application/vnd.*` Office attachments are skipped as binary. It would be consistent to add a parallel test for `application/msword` (and optionally `application/x-cfb`) so regressions on these cases are caught automatically, especially given that this is the exact scenario that caused the bug in #54176.

How can I resolve this? If you propose a fix, please make it concise.

@kagura-agent

Copy link
Copy Markdown
Contributor Author

Friendly ping — any feedback on this PR? Happy to address concerns or adjust the approach. Will close in 7 days if no response. 🙏

@kagura-agent

Copy link
Copy Markdown
Contributor Author

Friendly ping — this has been open for a week. Let me know if any changes are needed!

Add application/msword and application/x-cfb to isBinaryMediaMime()
so that .doc files (Microsoft Word 97-2003 format) are not auto-embedded
as text content in prompts.

The application/x-cfb MIME type is what file-type libraries detect for
OLE Compound Binary Format files (legacy Office), and application/msword
is the standard MIME for .doc files. Neither was previously caught by
the binary detection logic, causing 70KB+ of binary garbage to be
embedded in conversation history and triggering LLM 'high risk' rejections.

Fixes openclaw#54176
…ypes

Address CI format check failure and Greptile review feedback:
- Run oxfmt on apply.ts to fix formatting
- Add test for application/msword and application/x-cfb detection
@kagura-agent kagura-agent force-pushed the fix/binary-file-detection branch from 55cbf1d to 254444b Compare April 13, 2026 20:41
@kagura-agent

Copy link
Copy Markdown
Contributor Author

Rebased on latest main — no conflicts. Friendly ping: would love any feedback or review when you get a chance 🙏

@kagura-agent

Copy link
Copy Markdown
Contributor Author

Closing — this has been open for ~3 weeks without review. The fix is still valid; happy to reopen or resubmit if there's interest. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Binary files (.doc) should not be auto-embedded as text content

1 participant