fix: treat application/x-cfb and application/msword as binary MIME types #54190

Closed
lndyzwdxhs wants to merge 1 commit into openclaw:main from lndyzwdxhs:fix/binary-doc-mime-type

Conversation

@lndyzwdxhs

Contributor

Binary .doc files (OLE2 Compound Binary Format) were not recognized as binary by isBinaryMediaMime(), causing ~70KB of binary garbage to be embedded as text content in conversation prompts. This triggered content filtering rejections from LLM providers, breaking entire conversation sessions.

Add application/x-cfb and application/msword to the binary MIME type allowlist so these files are correctly skipped during text extraction.

Fixes #54176

Summary

  • Problem: When a .doc file is sent via Feishu channel, the media pipeline treats it as text-eligible and embeds binary content into the conversation prompt. This triggers "high risk" content filtering rejections from LLM providers.
  • Why it matters: The entire conversation session breaks — not just the single message. Users on Feishu (and potentially other channels) cannot send .doc attachments without corrupting their session.
  • What changed: Added application/x-cfb (OLE2 Compound Binary Format) and application/msword (Word 97-2003) to the binary MIME type check in isBinaryMediaMime() in src/media-understanding/apply.ts. Added 2 corresponding tests.
  • What did NOT change (scope boundary): No other MIME detection logic changed. The function signature, call site (line 363), and overall media pipeline flow are unchanged. No config, no new dependencies.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause / Regression History (if applicable)

  • Root cause: isBinaryMediaMime() in src/media-understanding/apply.ts (line 286) checks a hardcoded list of binary MIME types, and application/x-cfb and application/msword were missing from it. Neither type matches any of the other checks (archive formats, the application/vnd.* vendor prefix), so the function returns false and the binary content proceeds to text extraction.
  • Missing detection / guardrail: No test verified that legacy Office binary formats (pre-OOXML) were treated as binary. Existing tests covered application/vnd.openxmlformats-* (ZIP-based Office) but not the older OLE2/CFB-based formats.
  • Prior context: The isBinaryMediaMime() function was written to cover common binary formats (images, audio, video, archives, application/vnd.*). OLE2 .doc files use application/x-cfb which does not fall under any of those categories.
  • Why this regressed now: Probably not a regression. The gap has likely existed since isBinaryMediaMime() was first introduced and only became visible when Feishu users started sending legacy .doc files, which file-type libraries detect as application/x-cfb.
  • If unknown, what was ruled out: N/A — root cause is clear from reading the function.
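
The fall-through described above can be sketched as follows. This is a minimal illustration, not the code in src/media-understanding/apply.ts: the helper name `isBinaryMimeSketch`, the constant names, the list contents other than the two added entries, and the order of the checks are all assumptions based only on the description in this PR.

```typescript
// Hypothetical sketch; the real isBinaryMediaMime() may be structured differently.
const BINARY_PREFIXES: readonly string[] = ["image/", "audio/", "video/"];

// Assumed exact-match entries as they might have looked before this PR.
const BINARY_EXACT_BEFORE: readonly string[] = [
  "application/octet-stream",
  "application/zip",
  "application/gzip",
];

// Same list plus the two entries this PR adds.
const BINARY_EXACT_AFTER: readonly string[] = [
  ...BINARY_EXACT_BEFORE,
  "application/x-cfb", // OLE2 Compound Binary Format
  "application/msword", // Word 97-2003
];

function isBinaryMimeSketch(mime: string, exact: readonly string[]): boolean {
  // Drop parameters such as "; charset=..." and normalize case.
  const normalized = (mime.split(";")[0] ?? "").trim().toLowerCase();
  if (BINARY_PREFIXES.some((prefix) => normalized.startsWith(prefix))) return true;
  if (exact.includes(normalized)) return true;
  if (normalized.startsWith("application/vnd.")) {
    // Vendor +json/+xml payloads stay eligible for text extraction.
    return !(normalized.endsWith("+json") || normalized.endsWith("+xml"));
  }
  // Anything that reaches this point is treated as text-eligible.
  return false;
}

console.log(isBinaryMimeSketch("application/x-cfb", BINARY_EXACT_BEFORE)); // false
console.log(isBinaryMimeSketch("application/x-cfb", BINARY_EXACT_AFTER)); // true
```

In this sketch application/x-cfb matches no prefix, no exact entry, and no vendor rule, so it reaches the final `return false` until the exact list is extended.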

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/media-understanding/apply.test.ts
  • Scenario the test should lock in: (1) A file with MIME application/x-cfb and CFB magic bytes is NOT applied as text content. (2) A file with MIME application/msword and CFB magic bytes is NOT applied as text content.
  • Why this is the smallest reliable guardrail: The tests directly exercise the applyMediaUnderstanding pipeline with these MIME types and assert the file is skipped — if someone removes the MIME entries, the tests fail immediately.
  • Existing test that already covers this (if any): None — existing "skips binary application/vnd office attachments" test only covers ZIP-based OOXML formats.
  • If no new test is added, why not: N/A — 2 new tests added.
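
Both scenarios rely on a payload that begins with the CFB magic bytes. A fixture of that kind might be built as sketched below. The 8-byte signature is the standard OLE2 Compound File header; the helper names `makeCfbFixture` and `hasCfbSignature` and the zero padding are illustrative assumptions, not the helpers used in src/media-understanding/apply.test.ts.

```typescript
// The 8-byte signature that opens every OLE2 Compound File Binary document.
const CFB_MAGIC: readonly number[] = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// Hypothetical fixture helper: a zero-filled buffer that starts with the CFB signature.
function makeCfbFixture(totalBytes: number = 512): Uint8Array {
  // Never return a buffer shorter than the signature itself.
  const size = Math.max(totalBytes, CFB_MAGIC.length);
  const bytes = new Uint8Array(size);
  bytes.set(CFB_MAGIC, 0);
  return bytes;
}

// Hypothetical check: does the buffer start with the CFB signature?
function hasCfbSignature(bytes: Uint8Array): boolean {
  return (
    bytes.length >= CFB_MAGIC.length &&
    CFB_MAGIC.every((expected, index) => bytes[index] === expected)
  );
}

console.log(hasCfbSignature(makeCfbFixture())); // true
```

A real test would presumably attach such a buffer with the MIME type application/x-cfb or application/msword and assert that applyMediaUnderstanding does not embed it; that wiring is left out here because the test harness is not shown in this PR.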

User-visible / Behavior Changes

  • .doc files sent via any channel are now correctly treated as binary and skipped during text extraction, instead of having their binary content embedded in the conversation prompt.
  • No change for .docx or other Office formats (already handled by application/vnd.* check).

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS (also reproducible on Linux)
  • Runtime/container: Node.js 22+
  • Model/provider: Any LLM provider (binary content triggers content filtering)
  • Integration/channel (if any): Feishu (primary repro channel)
  • Relevant config (redacted): Feishu channel configured with valid credentials

Steps

  1. Configure a Feishu channel
  2. Send a legacy .doc file (Word 97-2003 format) as an attachment
  3. Observe that the binary content is embedded in the conversation prompt and the LLM provider rejects it with a "high risk" content filtering error

Expected

  • The .doc file is recognized as binary and skipped during text extraction

Actual

  • Before this PR: ~70KB of binary garbage is embedded as text content, triggering content filtering rejections and breaking the conversation session
  • After this PR: the file is correctly identified as binary and not embedded

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

2 new tests in src/media-understanding/apply.test.ts: "skips binary OLE2 .doc attachments (application/x-cfb)" and "skips binary .doc attachments (application/msword)". All 34 media understanding tests pass. pnpm check passes cleanly.

Human Verification (required)

  • Verified scenarios: Both new MIME types are correctly identified as binary by isBinaryMediaMime(). Files with these MIME types are skipped by applyMediaUnderstanding. All existing tests remain green.
  • Edge cases checked: application/vnd.* office formats still handled (existing test). Vendor +json/+xml payloads still eligible for text extraction (existing test). application/octet-stream still binary (existing logic).
  • What you did not verify: A manual end-to-end Feishu .doc send against the real API. Other OLE2-based formats (.xls, .ppt) were also not tested individually; they are detected as application/x-cfb too, so the same fix should cover them.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? Yes — binary .doc files were never intended to be text-extracted; this restores correct behavior.
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert this single commit.
  • Files/config to restore: src/media-understanding/apply.ts to prior version; remove new tests from src/media-understanding/apply.test.ts.
  • Known bad symptoms reviewers should watch for: If a downstream feature relied on text-extracting .doc files (unlikely — the extracted content was binary garbage), that feature would stop receiving content for .doc inputs.

Risks and Mitigations

  • Risk: Other OLE2-based formats (.xls, .ppt) also use application/x-cfb and will now be treated as binary.
    • Mitigation: This is intentional and correct. All OLE2 Compound Binary Format files are binary; text extraction produces garbage for all of them.
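
To illustrate why text extraction yields garbage for these files, the sketch below measures how much of a buffer falls outside printable ASCII. It is a crude illustration only, not part of this PR and not a proposed detection method: the function name `nonTextByteRatio` is hypothetical, and the heuristic would also count legitimate non-ASCII UTF-8 text as non-text.

```typescript
// Hypothetical helper: fraction of bytes that are neither printable ASCII nor common whitespace.
function nonTextByteRatio(bytes: Uint8Array): number {
  if (bytes.length === 0) return 0;
  const nonText = bytes.reduce((count, byte) => {
    const isWhitespace = byte === 0x09 || byte === 0x0a || byte === 0x0d;
    const isPrintableAscii = byte >= 0x20 && byte <= 0x7e;
    return isWhitespace || isPrintableAscii ? count : count + 1;
  }, 0);
  return nonText / bytes.length;
}

// The CFB signature shared by legacy .doc, .xls, and .ppt files.
const cfbHeader = Uint8Array.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);
// The ASCII bytes of "Hello".
const plainText = Uint8Array.from([0x48, 0x65, 0x6c, 0x6c, 0x6f]);

console.log(nonTextByteRatio(cfbHeader)); // 1
console.log(nonTextByteRatio(plainText)); // 0
```

Every byte of the shared header is either a control character or above 0x7e, which is consistent with treating all CFB containers as binary regardless of the Office application that produced them.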

@greptile-apps

greptile-apps Bot commented Mar 25, 2026

Contributor

Greptile Summary

This PR fixes a well-scoped bug: application/x-cfb (OLE2 Compound Binary Format) and application/msword (Word 97-2003) were missing from the isBinaryMediaMime() allowlist in src/media-understanding/apply.ts, causing binary .doc file content to be embedded as text in LLM prompts and triggering content-filter rejections.

  • Adds application/x-cfb and application/msword to the hardcoded binary MIME list, consistent with all existing entries.
  • Adds two targeted unit tests with real CFB magic bytes that assert the files are skipped by applyMediaUnderstanding, locking in the regression guardrail.
  • No other logic, function signatures, or call sites are changed; the fix is minimal and backward-compatible.

Confidence Score: 5/5

  • This PR is safe to merge — the change is minimal, well-tested, and directly addresses a clear production bug with no functional regressions.
  • The fix is a two-line addition to a straightforward allowlist check, paired with two well-structured unit tests using realistic binary payloads. The root cause is clearly identified, the scope is tightly bounded, and all existing tests remain green. No architectural changes, no new dependencies, no side effects on other MIME types.
  • No files require special attention.

Reviews (1): Last reviewed commit: "fix: treat application/x-cfb and applica..."

Binary .doc files (OLE2 Compound Binary Format) were not recognized as binary
by isBinaryMediaMime(), causing ~70KB of binary garbage to be embedded as text
content in conversation prompts. This triggered content filtering rejections
from LLM providers, breaking entire conversation sessions.

Add application/x-cfb and application/msword to the binary MIME type allowlist
so these files are correctly skipped during text extraction.

Fixes openclaw#54176

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lndyzwdxhs lndyzwdxhs force-pushed the fix/binary-doc-mime-type branch from d5dd3c7 to 4edf646 Compare March 25, 2026 13:40
@clawsweeper

clawsweeper Bot commented Apr 26, 2026

Contributor

Closing this as duplicate or superseded after Codex automated review.

PR #54190 is a valid narrow fix, but it has been superseded by PR #54380, which is still open, on a fresher branch, touches the same media-understanding classifier/tests, and explicitly says it supersedes #54190 and #54234. Current main does not yet ship the MIME additions, so this should close as superseded rather than implemented.

Best possible solution:

Close #54190 as superseded and continue review on #54380, preferably strengthening its regression test per the existing review note before landing. After #54380 or an equivalent fix lands, close the original bug #54176 with the shipped commit/release evidence.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 06d409dc2738.

Development

Successfully merging this pull request may close these issues.

[Bug]: Binary files (.doc) should not be auto-embedded as text content
