fix(gateway): align MEDIA: regex whitelist with extract_local_files #34518
crazyhulk wants to merge 1 commit into
Conversation
The strict regex in extract_media was missing extensions (.md, .json, .yaml, .xml, .tsv, etc.) that extract_local_files supports. Because line 3709 unconditionally strips all MEDIA: tags from text regardless of whether extract_media matched them, files with unsupported extensions were silently dropped: neither extracted as media nor detected as bare paths. Align the extension whitelist so extract_media captures all formats that the platform can deliver.

Fixes NousResearch#34517

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
I support either of the PRs; we just need to solve the problem. I can close it anytime.
Superseded by #34844, which consolidates this cluster. This PR widens the
Closing as superseded. Thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.
Summary
Align extract_media's extension whitelist with the broader set supported by extract_local_files, fixing silent file drops for .md, .json, .yaml, and other formats.
Problem
PR #28350 introduced a strict extension whitelist in extract_media, but the cleanup regex on line 3709 (MEDIA:\s*\S+) unconditionally strips all MEDIA: tags from text, even those the strict regex didn't match. This creates a black hole where files with unsupported extensions are neither extracted nor detectable by the downstream extract_local_files fallback.
Changes
Added missing extensions to extract_media's regex: md, json, xml, ya?ml, tsv, odt, rtf, bmp, tiff, svg, tar, gz, tgz, bz2, xz, xls, ods, ppt, odp, key.
Testing
MEDIA:/tmp/paid_users_up_analysis.md now matches
test_media_extraction.py: 4 passed
test_platform_base.py: 114 passed, 2 skipped

Fixes #34517
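
The black hole described under Problem can be reproduced with a minimal sketch. This is not the gateway's actual code. The extension lists, the strict pattern, and the helper name extract_media_sketch are hypothetical stand-ins; only the cleanup pattern MEDIA:\s*\S+ is taken from the description above.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified extension lists; the real whitelist is longer.
NARROW_EXTS = r"png|jpe?g|gif|mp3|mp4|pdf"
WIDENED_EXTS = NARROW_EXTS + r"|md|json|xml|ya?ml|tsv"

# Cleanup pattern as quoted in the PR description.
CLEANUP = re.compile(r"MEDIA:\s*\S+")


def extract_media_sketch(text: str, exts: str) -> Tuple[List[str], str]:
    """Return (paths matched by the strict pattern, text with MEDIA: tags stripped)."""
    # Assumed shape of the strict pattern: a path ending in a whitelisted extension.
    strict = re.compile(rf"MEDIA:\s*(\S+\.(?:{exts}))(?=\s|$)", re.IGNORECASE)
    paths = strict.findall(text)
    # The cleanup runs whether or not the strict pattern matched anything,
    # so an unmatched tag disappears before any later fallback can see it.
    cleaned = CLEANUP.sub("", text).strip()
    return paths, cleaned


reply = "Report ready. MEDIA:/tmp/paid_users_up_analysis.md"

before = extract_media_sketch(reply, NARROW_EXTS)
after = extract_media_sketch(reply, WIDENED_EXTS)
print(before)  # no path extracted, and the tag is gone from the text
print(after)   # the .md path is extracted
```

With the narrow list the path is neither returned nor left in the text for a fallback to find. Widening the list, as this PR does, makes the strict pattern capture it before the cleanup runs.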