fix: treat application/x-cfb and application/msword as binary MIME types by lndyzwdxhs · Pull Request #54190 · openclaw/openclaw

lndyzwdxhs · 2026-03-25T03:01:25Z

Binary .doc files (OLE2 Compound Binary Format) were not recognized as binary by isBinaryMediaMime(), causing ~70KB of binary garbage to be embedded as text content in conversation prompts. This triggered content filtering rejections from LLM providers, breaking entire conversation sessions.

Add application/x-cfb and application/msword to the binary MIME type allowlist so these files are correctly skipped during text extraction.

Fixes #54176

Summary

Problem: When a .doc file is sent via Feishu channel, the media pipeline treats it as text-eligible and embeds binary content into the conversation prompt. This triggers "high risk" content filtering rejections from LLM providers.
Why it matters: The entire conversation session breaks — not just the single message. Users on Feishu (and potentially other channels) cannot send .doc attachments without corrupting their session.
What changed: Added application/x-cfb (OLE2 Compound Binary Format) and application/msword (Word 97-2003) to the binary MIME type check in isBinaryMediaMime() in src/media-understanding/apply.ts. Added 2 corresponding tests.
What did NOT change (scope boundary): No other MIME detection logic changed. The function signature, call site (line 363), and overall media pipeline flow are unchanged. No config, no new dependencies.

Linked Issue/PR

Closes [Bug]: Binary files (.doc) should not be auto-embedded as text content #54176
Related: N/A
This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

Root cause: isBinaryMediaMime() in src/media-understanding/apply.ts (line 286) checks a hardcoded list of binary MIME types. application/x-cfb and application/msword were missing from this list. These MIME types fall through all checks (archive formats, application/vnd.* vendor prefix) and return false, allowing binary content to proceed to text extraction.
Missing detection / guardrail: No test verified that legacy Office binary formats (pre-OOXML) were treated as binary. Existing tests covered application/vnd.openxmlformats-* (ZIP-based Office) but not the older OLE2/CFB-based formats.
Prior context: The isBinaryMediaMime() function was written to cover common binary formats (images, audio, video, archives, application/vnd.*). OLE2 .doc files use application/x-cfb which does not fall under any of those categories.
Why this regressed now: Likely present since isBinaryMediaMime() was first introduced. Became visible when Feishu users started sending legacy .doc files, which are detected as application/x-cfb by file-type libraries.
If unknown, what was ruled out: N/A — root cause is clear from reading the function.
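
The fall-through described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction based only on this description, not the actual body of isBinaryMediaMime(); the name `isBinaryMediaMimeSketch`, the `BINARY_MIMES` set, and the specific archive entries are assumptions made for illustration.

```typescript
// Hypothetical reconstruction; the real isBinaryMediaMime() in
// src/media-understanding/apply.ts may be structured differently.
const BINARY_MIMES = new Set<string>([
  "application/octet-stream",
  // Illustrative archive entries (assumed, not copied from the source):
  "application/zip",
  "application/gzip",
  "application/x-tar",
  // The two entries this PR adds:
  "application/x-cfb", // OLE2 Compound Binary Format container
  "application/msword", // Word 97-2003
]);

function isBinaryMediaMimeSketch(mime: string | undefined): boolean {
  if (!mime) return false;
  const m = mime.trim().toLowerCase();
  if (m.startsWith("image/") || m.startsWith("audio/") || m.startsWith("video/")) {
    return true;
  }
  if (BINARY_MIMES.has(m)) return true;
  if (m.endsWith("+zip")) return true;
  if (m.startsWith("application/vnd.")) {
    // Vendor +json/+xml payloads stay eligible for text extraction.
    return !(m.endsWith("+json") || m.endsWith("+xml"));
  }
  // Without the two new set entries, application/x-cfb and
  // application/msword would reach this line and be treated as text-eligible.
  return false;
}
```

Neither new type matches a media prefix, the `+zip` suffix, or the `application/vnd.` prefix, so in this sketch the explicit set entries are the only branch that can catch them.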

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/media-understanding/apply.test.ts
Scenario the test should lock in: (1) A file with MIME application/x-cfb and CFB magic bytes is NOT applied as text content. (2) A file with MIME application/msword and CFB magic bytes is NOT applied as text content.
Why this is the smallest reliable guardrail: The tests directly exercise the applyMediaUnderstanding pipeline with these MIME types and assert the file is skipped — if someone removes the MIME entries, the tests fail immediately.
Existing test that already covers this (if any): None — existing "skips binary application/vnd office attachments" test only covers ZIP-based OOXML formats.
If no new test is added, why not: N/A — 2 new tests added.
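
For readers unfamiliar with the "CFB magic bytes" the scenarios refer to: OLE2 compound files begin with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1. The sketch below shows one way a test fixture carrying that signature could be built. The helper names `makeCfbFixture` and `hasCfbMagic` are hypothetical and are not taken from apply.test.ts.

```typescript
// The 8-byte header signature of OLE2 Compound File Binary containers.
const CFB_MAGIC: readonly number[] = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Hypothetical fixture helper: a buffer that starts with the CFB signature
// followed by zero padding, standing in for a legacy .doc upload.
function makeCfbFixture(totalBytes = 512): Uint8Array {
  const size = Math.max(totalBytes, CFB_MAGIC.length);
  const buf = new Uint8Array(size);
  buf.set(CFB_MAGIC, 0);
  return buf;
}

// Hypothetical check that a buffer begins with the CFB signature.
function hasCfbMagic(bytes: Uint8Array): boolean {
  if (bytes.length < CFB_MAGIC.length) return false;
  return CFB_MAGIC.every((b, i) => bytes[i] === b);
}
```

A test following the plan above would pass such a fixture through the pipeline with MIME application/x-cfb (and again with application/msword) and assert that no text content is applied.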

User-visible / Behavior Changes

.doc files sent via any channel are now correctly treated as binary and skipped during text extraction, instead of having their binary content embedded in the conversation prompt.
No change for .docx or other Office formats (already handled by application/vnd.* check).

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No
If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

OS: macOS (also reproducible on Linux)
Runtime/container: Node.js 22+
Model/provider: Any LLM provider (binary content triggers content filtering)
Integration/channel (if any): Feishu (primary repro channel)
Relevant config (redacted): Feishu channel configured with valid credentials

Steps

Configure a Feishu channel
Send a legacy .doc file (Word 97-2003 format) as an attachment
Observe that the binary content is embedded in the conversation prompt and the LLM provider rejects it with a "high risk" content filtering error

Expected

The .doc file is recognized as binary and skipped during text extraction

Actual

Before this PR: ~70KB of binary garbage is embedded as text content, triggering content filtering rejections and breaking the conversation session
After this PR: the file is correctly identified as binary and not embedded

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

2 new tests in src/media-understanding/apply.test.ts: "skips binary OLE2 .doc attachments (application/x-cfb)" and "skips binary .doc attachments (application/msword)". All 34 media understanding tests pass. pnpm check passes cleanly.

Human Verification (required)

Verified scenarios: Both new MIME types are correctly identified as binary by isBinaryMediaMime(). Files with these MIME types are skipped by applyMediaUnderstanding. All existing tests remain green.
Edge cases checked: application/vnd.* office formats still handled (existing test). Vendor +json/+xml payloads still eligible for text extraction (existing test). application/octet-stream still binary (existing logic).
What you did not verify: Manual end-to-end Feishu .doc send with the real API. Other OLE2-based formats (.xls, .ppt), which also use application/x-cfb, were not tested individually; they are covered by the same fix.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

Backward compatible? Yes — binary .doc files were never intended to be text-extracted; this restores correct behavior.
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

How to disable/revert this change quickly: Revert this single commit.
Files/config to restore: src/media-understanding/apply.ts to prior version; remove new tests from src/media-understanding/apply.test.ts.
Known bad symptoms reviewers should watch for: If a downstream feature relied on text-extracting .doc files (unlikely — the extracted content was binary garbage), that feature would stop receiving content for .doc inputs.

Risks and Mitigations

Risk: Other OLE2-based formats (.xls, .ppt) also use application/x-cfb and will now be treated as binary.
- Mitigation: This is intentional and correct. All OLE2 Compound Binary Format files are binary; text extraction produces garbage for all of them.

greptile-apps · 2026-03-25T03:02:31Z

Greptile Summary

This PR fixes a well-scoped bug: application/x-cfb (OLE2 Compound Binary Format) and application/msword (Word 97-2003) were missing from the isBinaryMediaMime() allowlist in src/media-understanding/apply.ts, causing binary .doc file content to be embedded as text in LLM prompts and triggering content-filter rejections.

Adds application/x-cfb and application/msword to the hardcoded binary MIME list, consistent with all existing entries.
Adds two targeted unit tests with real CFB magic bytes that assert the files are skipped by applyMediaUnderstanding, locking in the regression guardrail.
No other logic, function signatures, or call sites are changed; the fix is minimal and backward-compatible.

Confidence Score: 5/5

This PR is safe to merge — the change is minimal, well-tested, and directly addresses a clear production bug with no functional regressions.
The fix is a two-line addition to a straightforward allowlist check, paired with two well-structured unit tests using realistic binary payloads. The root cause is clearly identified, the scope is tightly bounded, and all existing tests remain green. No architectural changes, no new dependencies, no side effects on other MIME types.
No files require special attention.

Reviews (1): Last reviewed commit: "fix: treat application/x-cfb and applica..."

Binary .doc files (OLE2 Compound Binary Format) were not recognized as binary by isBinaryMediaMime(), causing ~70KB of binary garbage to be embedded as text content in conversation prompts. This triggered content filtering rejections from LLM providers, breaking entire conversation sessions.

Add application/x-cfb and application/msword to the binary MIME type allowlist so these files are correctly skipped during text extraction.

Fixes openclaw#54176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clawsweeper · 2026-04-26T08:14:36Z

Closing this as duplicate or superseded after Codex automated review.

PR #54190 is a valid narrow fix, but it has been superseded by PR #54380, which is still open, on a fresher branch, touches the same media-understanding classifier/tests, and explicitly says it supersedes #54190 and #54234. Current main does not yet ship the MIME additions, so this should close as superseded rather than implemented.

Best possible solution:

Close #54190 as superseded and continue review on #54380, preferably strengthening its regression test per the existing review note before landing. After #54380 or an equivalent fix lands, close the original bug #54176 with the shipped commit/release evidence.

What I checked:

Current main still lacks this fix: At main 06d409d, isBinaryMediaMime() handles images/audio/video, octet-stream, common archives, +zip, and application/vnd.* but does not include application/x-cfb or application/msword before returning false. (src/media-understanding/apply.ts:337, 06d409dc2738)
Affected path confirmed: extractFileBlocks() calls isBinaryMediaMime(normalizedRawMime) before text extraction, so this classifier is the relevant product boundary for the reported binary .doc prompt-embedding bug. (src/media-understanding/apply.ts:418, 06d409dc2738)
Canonical superseding PR exists: GitHub API shows PR #54380 ("fix(media): treat legacy .doc containers as binary") is open, unmerged, and authored from head c6f0158. Its body states that it supersedes #54190 ("fix: treat application/x-cfb and application/msword as binary MIME types") and #54234 ("fix: detect legacy Office binary formats as non-text attachments") "with a fresh branch from current main." (c6f01584d0f2)
Superseding PR covers the same files and behavior: The files in PR #54380 add application/x-cfb and application/msword to src/media-understanding/apply.ts and add regression coverage in src/media-understanding/apply.test.ts for skipping legacy Office binary formats. (src/media-understanding/apply.ts:299, c6f01584d0f2)
Original bug remains canonical context: Issue #54176 ("[Bug]: Binary files (.doc) should not be auto-embedded as text content") is still open and describes the same Feishu .doc / application/x-cfb binary embedding failure that both PR #54190 and PR #54380 are intended to fix.
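
The "affected path" point above says the classifier runs on a normalized MIME before any text extraction. A rough sketch of that gate follows. It is an assumption-laden illustration, not the real extractFileBlocks(): the names `AttachmentSketch`, `normalizeMimeSketch`, `isBinarySketch`, and `selectTextEligible` are hypothetical, and the normalization shown (dropping parameters, lowercasing) is assumed rather than confirmed.

```typescript
// Hypothetical attachment shape, reduced to what the gate needs.
interface AttachmentSketch {
  name: string;
  mime?: string;
}

// Reduced stand-in for the classifier: only the two types added by the fix
// plus application/octet-stream.
const LEGACY_BINARY_MIMES = new Set(["application/x-cfb", "application/msword"]);

// Assumed normalization: drop parameters such as "; charset=utf-8" and lowercase.
function normalizeMimeSketch(raw: string | undefined): string {
  const [base = ""] = (raw ?? "").split(";");
  return base.trim().toLowerCase();
}

function isBinarySketch(mime: string): boolean {
  return LEGACY_BINARY_MIMES.has(mime) || mime === "application/octet-stream";
}

// Only attachments that are not classified as binary proceed to text extraction.
function selectTextEligible(attachments: AttachmentSketch[]): AttachmentSketch[] {
  return attachments.filter((a) => !isBinarySketch(normalizeMimeSketch(a.mime)));
}
```

With this gate, a `report.doc` sniffed as application/x-cfb is dropped before extraction, while a `notes.txt` declared as `text/plain; charset=utf-8` passes through.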

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 06d409dc2738.

openclaw-barnacle Bot added the size: XS label Mar 25, 2026

lndyzwdxhs mentioned this pull request Mar 25, 2026

[Bug]: Binary files (.doc) should not be auto-embedded as text content #54176

Closed

andyliu mentioned this pull request Mar 25, 2026

fix(media): treat legacy .doc containers as binary #54380

Closed

lndyzwdxhs force-pushed the fix/binary-doc-mime-type branch from d5dd3c7 to 4edf646 Compare March 25, 2026 13:40

clawsweeper Bot closed this Apr 26, 2026

openclaw-clownfish Bot mentioned this pull request Apr 28, 2026

fix(media): treat legacy Word docs as binary attachments #73799

Merged

lndyzwdxhs wants to merge 1 commit into openclaw:main from lndyzwdxhs:fix/binary-doc-mime-type