fix(media-understanding): skip legacy Word docs from text extraction by legolaz8451 · Pull Request #44068 · openclaw/openclaw

legolaz8451 · 2026-03-12T13:22:29Z

Summary

Describe the problem and fix in 2-5 bullets:

Problem: legacy .doc attachments with MIME application/msword could pass through media-understanding text heuristics, be misclassified as text, and get inlined into prompt context.
Why it matters: in long-running WhatsApp sessions this can inject binary garbage into model history, trigger context_length_exceeded, and leave the session unusable until reset.
What changed: src/media-understanding/apply.ts now treats legacy Word MIME types as binary and skips text-heuristic coercion for already-binary MIME types; src/media-understanding/apply.test.ts adds a regression test for the UTF-16-plus-TSV false-positive path.
What did NOT change (scope boundary): this PR does not add .doc extraction support, does not change session compaction/recovery behavior, and does not address the separate Brave Search API 422 country-validation bug.

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #
Related #

User-visible / Behavior Changes

Legacy Word .doc attachments are no longer auto-inlined into media-understanding file blocks through text MIME heuristics.
Affected sessions avoid the specific binary-to-text prompt pollution path that could cause context overflow.

Security Impact (required)

New permissions/capabilities? (Yes/No) No
Secrets/tokens handling changed? (Yes/No) No
New/changed network calls? (Yes/No) No
Command/tool execution surface changed? (Yes/No) No
Data access scope changed? (Yes/No) No
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: Linux
Runtime/container: local fresh clone on branch bugfix-msword-binary-guard
Model/provider: N/A for local regression test
Integration/channel (if any): media-understanding attachment preprocessing; production incident observed on WhatsApp
Relevant config (redacted): default file MIME allowlist path; no special override required for local regression

Steps

Create or provide an attachment named incident.doc with MIME application/msword and bytes that look UTF-16/text-like near the start.
Pass that attachment through applyMediaUnderstanding with file-block extraction enabled and media features otherwise disabled.
Observe whether the attachment is converted into a <file ...> block and appended to the body.

Expected

Legacy .doc attachments are treated as binary and skipped by file-block text extraction.

Actual

Before this change, the .doc-like payload could be treated as text-like, relabeled, and inlined.
After this change, the regression test confirms no <file> block is added.

Evidence

Attach at least one:

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

YYYY-MM-DDTHH:MM:SS.sssZ
error code: context_length_exceeded
message: Your input exceeds the context window of this model. Please adjust your input and try again.

YYYY-MM-DD HH:MM:SS auto-reply sent
Context overflow: prompt too large for the model. Try /reset (or /new) to start a fresh session, or use a larger-context model.

Human Verification (required)

What you personally verified (not just CI), and how:

Verified scenarios: ran the targeted regression test for legacy .doc MIME misclassification; verified the branch diff is limited to the guard and test; verified pnpm build succeeds in the corrected local environment.
Edge cases checked: confirmed the guard is scoped to legacy Word binary MIME types; confirmed the existing PDF special-case remains intact in the same code path.
What you did not verify: no live WhatsApp end-to-end retest; no production replay; no validation of .docx or other Office formats beyond the specific MIME guard added here.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

Backward compatible? (Yes/No) Yes
Config/env changes? (Yes/No) No
Migration needed? (Yes/No) No
If yes, exact upgrade steps:

Failure Recovery (if this breaks)

How to disable/revert this change quickly: revert this commit to restore the previous MIME heuristic behavior.
Files/config to restore: src/media-understanding/apply.ts and src/media-understanding/apply.test.ts
Known bad symptoms reviewers should watch for: legacy .doc attachments unexpectedly no longer contribute extracted text; any regressions in non-binary text attachment extraction in the same code path.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

Risk: some legitimate inputs mislabeled as application/msword will now be skipped instead of being coerced into text.
- Mitigation: the changed behavior is limited to known legacy Word binary MIME types, which are unsafe to decode generically; dedicated extraction can be added later if needed.

greptile-apps · 2026-03-12T13:24:30Z

Greptile Summary

This PR fixes a bug where legacy .doc attachments with MIME type application/msword could bypass the binary-MIME guard in extractFileBlocks, get misclassified as text via UTF-16 heuristics, and get inlined into prompt context — potentially causing context_length_exceeded errors in long-running sessions.

The fix adds application/msword (and two vnd.* variants) to isBinaryMediaMime and also adds a !isBinaryMediaMime gate to allowTextHeuristic. A regression test with a crafted UTF-16-BOM + TSV-like payload reproduces the specific false-positive path.

Key observations:

The core fix — adding application/msword to isBinaryMediaMime — is correct and sufficient for the reported bug.
application/vnd.ms-word and application/vnd.ms-office were already implicitly covered by the existing startsWith("application/vnd.") wildcard in isBinaryMediaMime; the new explicit entries are redundant but harmless.
The !isBinaryMediaMime(normalizedRawMime) guard added to allowTextHeuristic (line 396) is dead code in all practical execution paths: any binary MIME without a forcedTextMimeResolved override is already eliminated by the early continue on line 388–390, and when forcedTextMimeResolved is set, textHint takes that value regardless of allowTextHeuristic.
The regression test is well-constructed and directly validates the described bug scenario.

Confidence Score: 4/5

Safe to merge — the bug fix is correct and isolated; only minor redundant code was introduced alongside the real fix.
The single necessary change (adding application/msword to isBinaryMediaMime) is correct and well-tested. Two redundant additions were included (application/vnd.ms-word / application/vnd.ms-office already covered by the vnd.* wildcard, and the duplicate !isBinaryMediaMime gate in allowTextHeuristic), but neither introduces a functional regression. The change is tightly scoped and backward compatible.
No files require special attention — both changes are confined to src/media-understanding/apply.ts and its test file.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 314-320

Comment:
**Redundant explicit entries for `vnd.ms-word` and `vnd.ms-office`**

`application/vnd.ms-word` and `application/vnd.ms-office` are already handled by the existing `startsWith("application/vnd.")` catch-all block further down in this function (lines 331-338), which returns `true` for any `application/vnd.*` that doesn't end in `+json` or `+xml`. The only MIME type that was genuinely unguarded before this PR is `application/msword` (no `vnd.` prefix).

The two extra entries are harmless dead code that can be simplified:

```suggestion
  if (mime === "application/msword") {
    return true;
  }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 395-396

Comment:
**`!isBinaryMediaMime` check in `allowTextHeuristic` is dead code**

By the time execution reaches this line, the early-exit guard at line 388–390 has already run:

```ts
if (!forcedTextMimeResolved && isBinaryMediaMime(normalizedRawMime)) {
  continue;
}
```

In the common case (`forcedTextMimeResolved` is `null`/`undefined`), any MIME for which `isBinaryMediaMime` returns `true` already triggered `continue`. So at line 396, `isBinaryMediaMime(normalizedRawMime)` is always `false`, making `!isBinaryMediaMime(normalizedRawMime)` always `true` — the condition never actually guards anything.

In the forced-text-MIME case (`forcedTextMimeResolved` is set), the early exit is skipped, but `textHint` (line 400–401) always takes `forcedTextMimeResolved` first, so `textLike`/`guessedDelimited` are never used anyway.

Consider reverting this half of the change to keep the intent clear — only the addition to `isBinaryMediaMime` itself was needed to fix the bug:

```suggestion
    const allowTextHeuristic = normalizedRawMime !== "application/pdf";
```

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: a4ba15e}

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

openclaw-barnacle · 2026-04-27T04:43:06Z

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

openclaw-clownfish · 2026-04-29T09:10:04Z

Thanks @legolaz8451 for the focused Word-doc guard and reproduction detail. I am closing this as superseded by #73799 because both PRs address the same legacy .doc prompt-embedding failure, but #73799 is the current canonical Clownfish path with the full application/msword + application/x-cfb guard, the strengthened regression test, and current green checks.

Clownfish cannot safely land this branch as-is because it omits the application/x-cfb case covered by the canonical bug and still has Greptile cleanup notes on redundant/dead-code changes. Credit for this source PR is preserved in the replacement context: #44068 -> #73799.

If this covers a distinct MIME path or still reproduces after #73799 lands, please reply and we can reopen or split it back out.

openclaw-barnacle Bot added the size: XS label Mar 12, 2026

greptile-apps Bot reviewed Mar 12, 2026

View reviewed changes

Comment thread src/media-understanding/apply.ts Outdated

Comment thread src/media-understanding/apply.ts Outdated

This comment was marked as spam.

Sign in to view

legolaz8451 force-pushed the bugfix-msword-binary-guard branch from d343903 to 9a4954b Compare March 12, 2026 18:07

legolaz8451 and others added 3 commits March 12, 2026 22:33

fix(media-understanding): skip legacy Word docs from text extraction

e879159

Update src/media-understanding/apply.ts

acad9fa

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Update src/media-understanding/apply.ts

f7ba427

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

legolaz8451 force-pushed the bugfix-msword-binary-guard branch from 9a4954b to f7ba427 Compare March 12, 2026 21:33

openclaw-barnacle Bot added the stale Marked as stale due to inactivity label Apr 27, 2026

vincentkoc mentioned this pull request Apr 28, 2026

fix(media): tighten sanitizeMimeType anchoring #73229

Merged

openclaw-clownfish Bot mentioned this pull request Apr 28, 2026

fix(media): treat legacy Word docs as binary attachments #73799

Merged

clawsweeper Bot mentioned this pull request Apr 29, 2026

[Feature]: WebChat: Add file upload support for documents (PDF, Word, TXT) #30759

Closed

openclaw-clownfish Bot closed this Apr 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(media-understanding): skip legacy Word docs from text extraction#44068

fix(media-understanding): skip legacy Word docs from text extraction#44068
legolaz8451 wants to merge 3 commits into
openclaw:mainfrom
legolaz8451:bugfix-msword-binary-guard

legolaz8451 commented Mar 12, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Mar 12, 2026

Uh oh!

Uh oh!

Uh oh!

This comment was marked as spam.

openclaw-barnacle Bot commented Apr 27, 2026

Uh oh!

openclaw-clownfish Bot commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

legolaz8451 commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

User-visible / Behavior Changes

Security Impact (required)

Repro + Verification

Environment

Steps

Expected

Actual

Evidence

Human Verification (required)

Review Conversations

Compatibility / Migration

Failure Recovery (if this breaks)

Risks and Mitigations

Uh oh!

greptile-apps Bot commented Mar 12, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

Uh oh!

Uh oh!

This comment was marked as spam.

openclaw-barnacle Bot commented Apr 27, 2026

Uh oh!

openclaw-clownfish Bot commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

legolaz8451 commented Mar 12, 2026 •

edited

Loading