Skip to content

fix(media): treat legacy .doc containers as binary#54380

Closed
andyliu wants to merge 1 commit into
openclaw:mainfrom
andyliu:codex/fix-54176-binary-doc
Closed

fix(media): treat legacy .doc containers as binary#54380
andyliu wants to merge 1 commit into
openclaw:mainfrom
andyliu:codex/fix-54176-binary-doc

Conversation

@andyliu

@andyliu andyliu commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

Summary

  • treat legacy OLE/CFB-based .doc payloads as binary attachments
  • mark both application/msword and application/x-cfb as binary media
  • add coverage so these attachments are skipped instead of auto-embedded as text

Fixes #54176

Supersedes #54190 and #54234 with a fresh branch from current main.

@greptile-apps

greptile-apps Bot commented Mar 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR correctly classifies application/msword and application/x-cfb as binary MIME types, preventing legacy OLE/CFB-based .doc files from being accidentally auto-embedded as text in the media-understanding pipeline. The one-line addition in isBinaryMediaMime fits cleanly into the existing classifier structure (alongside other archive/binary types) and is applied early in extractFileBlocks, before any text-heuristic evaluation.

  • apply.ts: adds application/x-cfb and application/msword to the binary MIME list — correct and well-targeted.
  • apply.test.ts: new test verifies the skip behaviour end-to-end, but the underlying config (createMediaDisabledConfig) doesn't include those MIME types in allowedMimes, so the test would pass even without the fix. Using createMediaDisabledConfigWithAllowedMimes([mime]) would make the test a proper regression guard for the isBinaryMediaMime change.

Confidence Score: 5/5

  • Safe to merge — the fix is a minimal, correct addition to an existing classifier with no side-effects.
  • The production change is a two-line addition to a well-understood list of binary MIME types. Both application/msword and application/x-cfb are unambiguously binary formats and belong there. The only issue found is that the new test doesn't strictly exercise the new code path (it would pass even without the fix), but this is a non-blocking concern that can be improved as a follow-up.
  • No files require special attention.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/media-understanding/apply.test.ts
Line: 1207-1223

Comment:
**Test doesn't exercise the new `isBinaryMediaMime` code path**

The test uses `applyWithDisabledMedia`, which internally calls `createMediaDisabledConfig()`. That config leaves `allowedMimesConfigured` as `false`, so `extractFileBlocks` populates `allowedMimes` from `EXTRA_TEXT_MIMES` only — and neither `application/msword` nor `application/x-cfb` is in that set. The file would therefore be skipped at the later `!allowedMimes.has(mimeType)` check (line ~400) even **without** the new entries in `isBinaryMediaMime`.

To have the test actually target the fix, use a config that explicitly allows those MIME types:

```ts
it("skips legacy Office binary formats carried as msword or x-cfb", async () => {
  const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex");
  for (const mime of ["application/msword", "application/x-cfb"]) {
    const filePath = await createTempMediaFile({
      fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb",
      content: oleMagic,
    });

    // Use a config that permits the MIME type so that the only thing
    // preventing embedding is the isBinaryMediaMime check.
    const cfg = createMediaDisabledConfigWithAllowedMimes([mime]);
    const { ctx, result } = await applyWithDisabledMedia({
      body: "<media:file>",
      mediaPath: filePath,
      mediaType: mime,
      cfg,
    });

    expectFileNotApplied({ ctx, result, body: "<media:file>" });
  }
});
```

As written, reverting the `isBinaryMediaMime` change would leave the test green, so it doesn't guard against regressions in the new binary-classification logic.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(media): treat legacy .doc containers..." | Re-trigger Greptile

Comment on lines +1207 to +1223
it("skips legacy Office binary formats carried as msword or x-cfb", async () => {
const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex");
for (const mime of ["application/msword", "application/x-cfb"]) {
const filePath = await createTempMediaFile({
fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb",
content: oleMagic,
});

const { ctx, result } = await applyWithDisabledMedia({
body: "<media:file>",
mediaPath: filePath,
mediaType: mime,
});

expectFileNotApplied({ ctx, result, body: "<media:file>" });
}
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test doesn't exercise the new isBinaryMediaMime code path

The test uses applyWithDisabledMedia, which internally calls createMediaDisabledConfig(). That config leaves allowedMimesConfigured as false, so extractFileBlocks populates allowedMimes from EXTRA_TEXT_MIMES only — and neither application/msword nor application/x-cfb is in that set. The file would therefore be skipped at the later !allowedMimes.has(mimeType) check (line ~400) even without the new entries in isBinaryMediaMime.

To have the test actually target the fix, use a config that explicitly allows those MIME types:

it("skips legacy Office binary formats carried as msword or x-cfb", async () => {
  const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex");
  for (const mime of ["application/msword", "application/x-cfb"]) {
    const filePath = await createTempMediaFile({
      fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb",
      content: oleMagic,
    });

    // Use a config that permits the MIME type so that the only thing
    // preventing embedding is the isBinaryMediaMime check.
    const cfg = createMediaDisabledConfigWithAllowedMimes([mime]);
    const { ctx, result } = await applyWithDisabledMedia({
      body: "<media:file>",
      mediaPath: filePath,
      mediaType: mime,
      cfg,
    });

    expectFileNotApplied({ ctx, result, body: "<media:file>" });
  }
});

As written, reverting the isBinaryMediaMime change would leave the test green, so it doesn't guard against regressions in the new binary-classification logic.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/media-understanding/apply.test.ts
Line: 1207-1223

Comment:
**Test doesn't exercise the new `isBinaryMediaMime` code path**

The test uses `applyWithDisabledMedia`, which internally calls `createMediaDisabledConfig()`. That config leaves `allowedMimesConfigured` as `false`, so `extractFileBlocks` populates `allowedMimes` from `EXTRA_TEXT_MIMES` only — and neither `application/msword` nor `application/x-cfb` is in that set. The file would therefore be skipped at the later `!allowedMimes.has(mimeType)` check (line ~400) even **without** the new entries in `isBinaryMediaMime`.

To have the test actually target the fix, use a config that explicitly allows those MIME types:

```ts
it("skips legacy Office binary formats carried as msword or x-cfb", async () => {
  const oleMagic = Buffer.from("d0cf11e0a1b11ae1", "hex");
  for (const mime of ["application/msword", "application/x-cfb"]) {
    const filePath = await createTempMediaFile({
      fileName: mime === "application/msword" ? "legacy.doc" : "legacy.cfb",
      content: oleMagic,
    });

    // Use a config that permits the MIME type so that the only thing
    // preventing embedding is the isBinaryMediaMime check.
    const cfg = createMediaDisabledConfigWithAllowedMimes([mime]);
    const { ctx, result } = await applyWithDisabledMedia({
      body: "<media:file>",
      mediaPath: filePath,
      mediaType: mime,
      cfg,
    });

    expectFileNotApplied({ ctx, result, body: "<media:file>" });
  }
});
```

As written, reverting the `isBinaryMediaMime` change would leave the test green, so it doesn't guard against regressions in the new binary-classification logic.

How can I resolve this? If you propose a fix, please make it concise.

@andyliu

andyliu commented Mar 25, 2026

Copy link
Copy Markdown
Contributor Author

Fresh branch from current main, narrow diff (2 files), and all checks are green. The remaining Greptile note is non-blocking test-strength feedback; production change is unchanged and safe to merge if this still looks good.

@openclaw-clownfish

Copy link
Copy Markdown
Contributor

ProjectClownfish could not safely update this branch, so it opened a narrow replacement PR instead.

Replacement PR: #73799
Source PR: #54380
Contributor credit is preserved in the replacement PR body and changelog plan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Binary files (.doc) should not be auto-embedded as text content

1 participant