fix(media): treat legacy .doc containers as binary by andyliu · Pull Request #54380 · openclaw/openclaw

andyliu · 2026-03-25T09:08:14Z

Summary

treat legacy OLE/CFB-based .doc payloads as binary attachments
mark both application/msword and application/x-cfb as binary media
add coverage so these attachments are skipped instead of auto-embedded as text

Supersedes #54190 and #54234 with a fresh branch from current main.

greptile-apps · 2026-03-25T09:11:20Z

Greptile Summary

This PR correctly classifies application/msword and application/x-cfb as binary MIME types, preventing legacy OLE/CFB-based .doc files from being accidentally auto-embedded as text in the media-understanding pipeline. The one-line addition in isBinaryMediaMime fits cleanly into the existing classifier structure (alongside other archive/binary types) and is applied early in extractFileBlocks, before any text-heuristic evaluation.

apply.ts: adds application/x-cfb and application/msword to the binary MIME list — correct and well-targeted.
apply.test.ts: new test verifies the skip behaviour end-to-end, but the underlying config (createMediaDisabledConfig) doesn't include those MIME types in allowedMimes, so the test would pass even without the fix. Using createMediaDisabledConfigWithAllowedMimes([mime]) would make the test a proper regression guard for the isBinaryMediaMime change.

Confidence Score: 5/5

Safe to merge — the fix is a minimal, correct addition to an existing classifier with no side-effects.
The production change is a two-line addition to a well-understood list of binary MIME types. Both application/msword and application/x-cfb are unambiguously binary formats and belong there. The only issue found is that the new test doesn't strictly exercise the new code path (it would pass even without the fix), but this is a non-blocking concern that can be improved as a follow-up.
No files require special attention.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/media-understanding/apply.test.ts
Line: 1207-1223

Comment:
**Test doesn't exercise the new `isBinaryMediaMime` code path**

The test uses `applyWithDisabledMedia`, which internally calls `createMediaDisabledConfig()`. That config leaves `allowedMimesConfigured` as `false`, so `extractFileBlocks` populates `allowedMimes` from `EXTRA_TEXT_MIMES` only — and neither `application/msword` nor `application/x-cfb` is in that set. The file would therefore be skipped at the later `!allowedMimes.has(mimeType)` check (line ~400) even **without** the new entries in `isBinaryMediaMime`.

To have the test actually target the fix, use a config that explicitly allows those MIME types:

```ts
it("skips legacy Office binary formats carried as msword or x-cfb", async () => {
  const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex");
  for (const mime of ["application/msword", "application/x-cfb"]) {
    const filePath = await createTempMediaFile({
      fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb",
      content: oleMagic,
    });

    // Use a config that permits the MIME type so that the only thing
    // preventing embedding is the isBinaryMediaMime check.
    const cfg = createMediaDisabledConfigWithAllowedMimes([mime]);
    const { ctx, result } = await applyWithDisabledMedia({
      body: "<media:file>",
      mediaPath: filePath,
      mediaType: mime,
      cfg,
    });

    expectFileNotApplied({ ctx, result, body: "<media:file>" });
  }
});
```

As written, reverting the `isBinaryMediaMime` change would leave the test green, so it doesn't guard against regressions in the new binary-classification logic.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix(media): treat legacy .doc containers..." | Re-trigger Greptile}

greptile-apps · 2026-03-25T09:11:24Z

+  it("skips legacy Office binary formats carried as msword or x-cfb", async () => {
+    const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex");
+    for (const mime of ["application/msword", "application/x-cfb"]) {
+      const filePath = await createTempMediaFile({
+        fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb",
+        content: oleMagic,
+      });
+
+      const { ctx, result } = await applyWithDisabledMedia({
+        body: "<media:file>",
+        mediaPath: filePath,
+        mediaType: mime,
+      });
+
+      expectFileNotApplied({ ctx, result, body: "<media:file>" });
+    }
+  });


Test doesn't exercise the new isBinaryMediaMime code path

The test uses applyWithDisabledMedia, which internally calls createMediaDisabledConfig(). That config leaves allowedMimesConfigured as false, so extractFileBlocks populates allowedMimes from EXTRA_TEXT_MIMES only — and neither application/msword nor application/x-cfb is in that set. The file would therefore be skipped at the later !allowedMimes.has(mimeType) check (line ~400) even without the new entries in isBinaryMediaMime.

To have the test actually target the fix, use a config that explicitly allows those MIME types:

it("skips legacy Office binary formats carried as msword or x-cfb", async () => { const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex"); for (const mime of ["application/msword", "application/x-cfb"]) { const filePath = await createTempMediaFile({ fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb", content: oleMagic, }); // Use a config that permits the MIME type so that the only thing // preventing embedding is the isBinaryMediaMime check. const cfg = createMediaDisabledConfigWithAllowedMimes([mime]); const { ctx, result } = await applyWithDisabledMedia({ body: "<media:file>", mediaPath: filePath, mediaType: mime, cfg, }); expectFileNotApplied({ ctx, result, body: "<media:file>" }); } });

As written, reverting the isBinaryMediaMime change would leave the test green, so it doesn't guard against regressions in the new binary-classification logic.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/media-understanding/apply.test.ts Line: 1207-1223 Comment: **Test doesn't exercise the new `isBinaryMediaMime` code path** The test uses `applyWithDisabledMedia`, which internally calls `createMediaDisabledConfig()`. That config leaves `allowedMimesConfigured` as `false`, so `extractFileBlocks` populates `allowedMimes` from `EXTRA_TEXT_MIMES` only — and neither `application/msword` nor `application/x-cfb` is in that set. The file would therefore be skipped at the later `!allowedMimes.has(mimeType)` check (line ~400) even **without** the new entries in `isBinaryMediaMime`. To have the test actually target the fix, use a config that explicitly allows those MIME types: ```ts it("skips legacy Office binary formats carried as msword or x-cfb", async () => { const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex"); for (const mime of ["application/msword", "application/x-cfb"]) { const filePath = await createTempMediaFile({ fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb", content: oleMagic, }); // Use a config that permits the MIME type so that the only thing // preventing embedding is the isBinaryMediaMime check. const cfg = createMediaDisabledConfigWithAllowedMimes([mime]); const { ctx, result } = await applyWithDisabledMedia({ body: "<media:file>", mediaPath: filePath, mediaType: mime, cfg, }); expectFileNotApplied({ ctx, result, body: "<media:file>" }); } }); ``` As written, reverting the `isBinaryMediaMime` change would leave the test green, so it doesn't guard against regressions in the new binary-classification logic. How can I resolve this? If you propose a fix, please make it concise.

andyliu · 2026-03-25T14:16:20Z

Fresh branch from current main, narrow diff (2 files), and all checks are green. The remaining Greptile note is non-blocking test-strength feedback; production change is unchanged and safe to merge if this still looks good.

openclaw-clownfish · 2026-04-28T21:01:53Z

ProjectClownfish could not safely update this branch, so it opened a narrow replacement PR instead.

Replacement PR: #73799
Source PR: #54380
Contributor credit is preserved in the replacement PR body and changelog plan.

fix(media): treat legacy .doc containers as binary

c6f0158

openclaw-barnacle Bot added the size: XS label Mar 25, 2026

greptile-apps Bot reviewed Mar 25, 2026

View reviewed changes

This was referenced Apr 26, 2026

fix: treat application/x-cfb and application/msword as binary MIME types #54190

Closed

[Bug]: Binary files (.doc) should not be auto-embedded as text content #54176

Closed

vincentkoc mentioned this pull request Apr 28, 2026

fix(media): tighten sanitizeMimeType anchoring #73229

Merged

openclaw-clownfish Bot mentioned this pull request Apr 28, 2026

fix(media): treat legacy Word docs as binary attachments #73799

Merged

openclaw-clownfish Bot closed this Apr 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(media): treat legacy .doc containers as binary#54380

fix(media): treat legacy .doc containers as binary#54380
andyliu wants to merge 1 commit into
openclaw:mainfrom
andyliu:codex/fix-54176-binary-doc

andyliu commented Mar 25, 2026

Uh oh!

greptile-apps Bot commented Mar 25, 2026

Uh oh!

greptile-apps Bot Mar 25, 2026

Uh oh!

andyliu commented Mar 25, 2026

Uh oh!

openclaw-clownfish Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

andyliu commented Mar 25, 2026

Summary

Uh oh!

greptile-apps Bot commented Mar 25, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

greptile-apps Bot Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

andyliu commented Mar 25, 2026

Uh oh!

openclaw-clownfish Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant