Skip to content

fix(media-understanding): skip legacy Word docs from text extraction#44068

Closed
legolaz8451 wants to merge 3 commits into
openclaw:mainfrom
legolaz8451:bugfix-msword-binary-guard
Closed

fix(media-understanding): skip legacy Word docs from text extraction#44068
legolaz8451 wants to merge 3 commits into
openclaw:mainfrom
legolaz8451:bugfix-msword-binary-guard

Conversation

@legolaz8451

@legolaz8451 legolaz8451 commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Describe the problem and fix in 2-5 bullets:

  • Problem: legacy .doc attachments with MIME application/msword could pass through media-understanding text heuristics, be misclassified as text, and get inlined into prompt context.
  • Why it matters: in long-running WhatsApp sessions this can inject binary garbage into model history, trigger context_length_exceeded, and leave the session unusable until reset.
  • What changed: src/media-understanding/apply.ts now treats legacy Word MIME types as binary and skips text-heuristic coercion for already-binary MIME types; src/media-understanding/apply.test.ts adds a regression test for the UTF-16-plus-TSV false-positive path.
  • What did NOT change (scope boundary): this PR does not add .doc extraction support, does not change session compaction/recovery behavior, and does not address the separate Brave Search API 422 country-validation bug.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #

User-visible / Behavior Changes

  • Legacy Word .doc attachments are no longer auto-inlined into media-understanding file blocks through text MIME heuristics.
  • Affected sessions avoid the specific binary-to-text prompt pollution path that could cause context overflow.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local fresh clone on branch bugfix-msword-binary-guard
  • Model/provider: N/A for local regression test
  • Integration/channel (if any): media-understanding attachment preprocessing; production incident observed on WhatsApp
  • Relevant config (redacted): default file MIME allowlist path; no special override required for local regression

Steps

  1. Create or provide an attachment named incident.doc with MIME application/msword and bytes that look UTF-16/text-like near the start.
  2. Pass that attachment through applyMediaUnderstanding with file-block extraction enabled and media features otherwise disabled.
  3. Observe whether the attachment is converted into a <file ...> block and appended to the body.

Expected

  • Legacy .doc attachments are treated as binary and skipped by file-block text extraction.

Actual

  • Before this change, the .doc-like payload could be treated as text-like, relabeled, and inlined.
  • After this change, the regression test confirms no <file> block is added.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)
YYYY-MM-DDTHH:MM:SS.sssZ
error code: context_length_exceeded
message: Your input exceeds the context window of this model. Please adjust your input and try again.
YYYY-MM-DD HH:MM:SS auto-reply sent
Context overflow: prompt too large for the model. Try /reset (or /new) to start a fresh session, or use a larger-context model.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: ran the targeted regression test for legacy .doc MIME misclassification; verified the branch diff is limited to the guard and test; verified pnpm build succeeds in the corrected local environment.
  • Edge cases checked: confirmed the guard is scoped to legacy Word binary MIME types; confirmed the existing PDF special-case remains intact in the same code path.
  • What you did not verify: no live WhatsApp end-to-end retest; no production replay; no validation of .docx or other Office formats beyond the specific MIME guard added here.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this commit to restore the previous MIME heuristic behavior.
  • Files/config to restore: src/media-understanding/apply.ts and src/media-understanding/apply.test.ts
  • Known bad symptoms reviewers should watch for: legacy .doc attachments unexpectedly no longer contribute extracted text; any regressions in non-binary text attachment extraction in the same code path.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: some legitimate inputs mislabeled as application/msword will now be skipped instead of being coerced into text.
    • Mitigation: the changed behavior is limited to known legacy Word binary MIME types, which are unsafe to decode generically; dedicated extraction can be added later if needed.

@greptile-apps

greptile-apps Bot commented Mar 12, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug where legacy .doc attachments with MIME type application/msword could bypass the binary-MIME guard in extractFileBlocks, get misclassified as text via UTF-16 heuristics, and get inlined into prompt context — potentially causing context_length_exceeded errors in long-running sessions.

The fix adds application/msword (and two vnd.* variants) to isBinaryMediaMime and also adds a !isBinaryMediaMime gate to allowTextHeuristic. A regression test with a crafted UTF-16-BOM + TSV-like payload reproduces the specific false-positive path.

Key observations:

  • The core fix — adding application/msword to isBinaryMediaMime — is correct and sufficient for the reported bug.
  • application/vnd.ms-word and application/vnd.ms-office were already implicitly covered by the existing startsWith("application/vnd.") wildcard in isBinaryMediaMime; the new explicit entries are redundant but harmless.
  • The !isBinaryMediaMime(normalizedRawMime) guard added to allowTextHeuristic (line 396) is dead code in all practical execution paths: any binary MIME without a forcedTextMimeResolved override is already eliminated by the early continue on line 388–390, and when forcedTextMimeResolved is set, textHint takes that value regardless of allowTextHeuristic.
  • The regression test is well-constructed and directly validates the described bug scenario.

Confidence Score: 4/5

  • Safe to merge — the bug fix is correct and isolated; only minor redundant code was introduced alongside the real fix.
  • The single necessary change (adding application/msword to isBinaryMediaMime) is correct and well-tested. Two redundant additions were included (application/vnd.ms-word / application/vnd.ms-office already covered by the vnd.* wildcard, and the duplicate !isBinaryMediaMime gate in allowTextHeuristic), but neither introduces a functional regression. The change is tightly scoped and backward compatible.
  • No files require special attention — both changes are confined to src/media-understanding/apply.ts and its test file.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 314-320

Comment:
**Redundant explicit entries for `vnd.ms-word` and `vnd.ms-office`**

`application/vnd.ms-word` and `application/vnd.ms-office` are already handled by the existing `startsWith("application/vnd.")` catch-all block further down in this function (lines 331-338), which returns `true` for any `application/vnd.*` that doesn't end in `+json` or `+xml`. The only MIME type that was genuinely unguarded before this PR is `application/msword` (no `vnd.` prefix).

The two extra entries are harmless dead code that can be simplified:

```suggestion
  if (mime === "application/msword") {
    return true;
  }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 395-396

Comment:
**`!isBinaryMediaMime` check in `allowTextHeuristic` is dead code**

By the time execution reaches this line, the early-exit guard at line 388–390 has already run:

```ts
if (!forcedTextMimeResolved && isBinaryMediaMime(normalizedRawMime)) {
  continue;
}
```

In the common case (`forcedTextMimeResolved` is `null`/`undefined`), any MIME for which `isBinaryMediaMime` returns `true` already triggered `continue`. So at line 396, `isBinaryMediaMime(normalizedRawMime)` is always `false`, making `!isBinaryMediaMime(normalizedRawMime)` always `true` — the condition never actually guards anything.

In the forced-text-MIME case (`forcedTextMimeResolved` is set), the early exit is skipped, but `textHint` (line 400–401) always takes `forcedTextMimeResolved` first, so `textLike`/`guessedDelimited` are never used anyway.

Consider reverting this half of the change to keep the intent clear — only the addition to `isBinaryMediaMime` itself was needed to fix the bug:

```suggestion
    const allowTextHeuristic = normalizedRawMime !== "application/pdf";
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: a4ba15e

Comment thread src/media-understanding/apply.ts Outdated
Comment thread src/media-understanding/apply.ts Outdated
@katoue

This comment was marked as spam.

@legolaz8451 legolaz8451 force-pushed the bugfix-msword-binary-guard branch from d343903 to 9a4954b Compare March 12, 2026 18:07
legolaz8451 and others added 3 commits March 12, 2026 22:33
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@legolaz8451 legolaz8451 force-pushed the bugfix-msword-binary-guard branch from 9a4954b to f7ba427 Compare March 12, 2026 21:33
@openclaw-barnacle

Copy link
Copy Markdown

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-clownfish

Copy link
Copy Markdown
Contributor

Thanks @legolaz8451 for the focused Word-doc guard and reproduction detail. I am closing this as superseded by #73799 because both PRs address the same legacy .doc prompt-embedding failure, but #73799 is the current canonical Clownfish path with the full application/msword + application/x-cfb guard, the strengthened regression test, and current green checks.

Clownfish cannot safely land this branch as-is because it omits the application/x-cfb case covered by the canonical bug and still has Greptile cleanup notes on redundant/dead-code changes. Credit for this source PR is preserved in the replacement context: #44068 -> #73799.

If this covers a distinct MIME path or still reproduces after #73799 lands, please reply and we can reopen or split it back out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size: XS stale Marked as stale due to inactivity

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants