fix(media-understanding): skip legacy Word docs from text extraction#44068
fix(media-understanding): skip legacy Word docs from text extraction#44068legolaz8451 wants to merge 3 commits into
Conversation
Greptile SummaryThis PR fixes a bug where legacy The fix adds Key observations:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 314-320
Comment:
**Redundant explicit entries for `vnd.ms-word` and `vnd.ms-office`**
`application/vnd.ms-word` and `application/vnd.ms-office` are already handled by the existing `startsWith("application/vnd.")` catch-all block further down in this function (lines 331-338), which returns `true` for any `application/vnd.*` that doesn't end in `+json` or `+xml`. The only MIME type that was genuinely unguarded before this PR is `application/msword` (no `vnd.` prefix).
The two extra entries are harmless dead code that can be simplified:
```suggestion
if (mime === "application/msword") {
return true;
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/media-understanding/apply.ts
Line: 395-396
Comment:
**`!isBinaryMediaMime` check in `allowTextHeuristic` is dead code**
By the time execution reaches this line, the early-exit guard at line 388–390 has already run:
```ts
if (!forcedTextMimeResolved && isBinaryMediaMime(normalizedRawMime)) {
continue;
}
```
In the common case (`forcedTextMimeResolved` is `null`/`undefined`), any MIME for which `isBinaryMediaMime` returns `true` already triggered `continue`. So at line 396, `isBinaryMediaMime(normalizedRawMime)` is always `false`, making `!isBinaryMediaMime(normalizedRawMime)` always `true` — the condition never actually guards anything.
In the forced-text-MIME case (`forcedTextMimeResolved` is set), the early exit is skipped, but `textHint` (line 400–401) always takes `forcedTextMimeResolved` first, so `textLike`/`guessedDelimited` are never used anyway.
Consider reverting this half of the change to keep the intent clear — only the addition to `isBinaryMediaMime` itself was needed to fix the bug:
```suggestion
const allowTextHeuristic = normalizedRawMime !== "application/pdf";
```
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: a4ba15e |
This comment was marked as spam.
This comment was marked as spam.
d343903 to
9a4954b
Compare
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
9a4954b to
f7ba427
Compare
|
This pull request has been automatically marked as stale due to inactivity. |
|
Thanks @legolaz8451 for the focused Word-doc guard and reproduction detail. I am closing this as superseded by #73799 because both PRs address the same legacy Clownfish cannot safely land this branch as-is because it omits the If this covers a distinct MIME path or still reproduces after #73799 lands, please reply and we can reopen or split it back out. |
Summary
Describe the problem and fix in 2-5 bullets:
.docattachments with MIMEapplication/mswordcould pass through media-understanding text heuristics, be misclassified as text, and get inlined into prompt context.context_length_exceeded, and leave the session unusable until reset.src/media-understanding/apply.tsnow treats legacy Word MIME types as binary and skips text-heuristic coercion for already-binary MIME types;src/media-understanding/apply.test.tsadds a regression test for the UTF-16-plus-TSV false-positive path..docextraction support, does not change session compaction/recovery behavior, and does not address the separate Brave Search API422country-validation bug.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
.docattachments are no longer auto-inlined into media-understanding file blocks through text MIME heuristics.Security Impact (required)
Yes/No) NoYes/No) NoYes/No) NoYes/No) NoYes/No) NoYes, explain risk + mitigation:Repro + Verification
Environment
bugfix-msword-binary-guardSteps
incident.docwith MIMEapplication/mswordand bytes that look UTF-16/text-like near the start.applyMediaUnderstandingwith file-block extraction enabled and media features otherwise disabled.<file ...>block and appended to the body.Expected
.docattachments are treated as binary and skipped by file-block text extraction.Actual
.doc-like payload could be treated as text-like, relabeled, and inlined.<file>block is added.Evidence
Attach at least one:
Human Verification (required)
What you personally verified (not just CI), and how:
.docMIME misclassification; verified the branch diff is limited to the guard and test; verifiedpnpm buildsucceeds in the corrected local environment..docxor other Office formats beyond the specific MIME guard added here.Review Conversations
If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.
Compatibility / Migration
Yes/No) YesYes/No) NoYes/No) NoFailure Recovery (if this breaks)
src/media-understanding/apply.tsandsrc/media-understanding/apply.test.ts.docattachments unexpectedly no longer contribute extracted text; any regressions in non-binary text attachment extraction in the same code path.Risks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write
None.application/mswordwill now be skipped instead of being coerced into text.