Related
Introduced by PR #28350 (diagnosable MEDIA rejections + canonical cache roots + null-path guard).
Problem
extract_media uses a strict extension whitelist that does not include .md (nor .json, .yaml, .xml, .tsv, etc.), while the fallback extract_local_files does support them.
However, line 3709 unconditionally strips all MEDIA: tags from the response text with a loose regex (MEDIA:\s*\S+) — even those that extract_media failed to match.
This creates a black hole for unsupported extensions:
1. extract_media (strict regex) → no match for .md
2. Cleanup regex re.sub(r"MEDIA:\s*\S+", "", ...) → removes the path from text
3. extract_local_files (broad extension list) → runs on already-cleaned text, path is gone
Result: The file is neither extracted as media nor detected as a bare path. The user receives nothing.
Reproduction
import re
# extract_media pattern (line 2524); "..." stands for the elided path sub-pattern
media_pattern = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>...\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s\`"',;:)\]}]|$))[`"']?'''
)
# cleanup pattern (line 3709)
cleanup = re.compile(r'MEDIA:\s*\S+')
text = 'Here is your report: MEDIA:/tmp/paid_users_up_analysis.md'
assert media_pattern.search(text) is None # not extracted
cleaned = cleanup.sub('', text).strip()
assert '/tmp/paid_users_up_analysis.md' not in cleaned # path gone
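The snippet above abbreviates the path sub-pattern, so a self-contained variant may be easier to run locally. This is a minimal sketch, not the project's code: the extension lists are shortened, and `local_pattern`, `BOUNDARY`, and `deliver` are hypothetical stand-ins for extract_local_files and the surrounding delivery logic. It walks through all three steps, including the fallback scan that the reproduction above stops short of.

```python
import re

# Hypothetical, abbreviated extension lists standing in for the real ones.
STRICT_EXTS = "png|jpe?g|gif|pdf|txt|csv"
BROAD_EXTS = STRICT_EXTS + "|md|json|ya?ml|xml|tsv"
# Simplified trailing-boundary lookahead, modelled on the one in the issue.
BOUNDARY = r"(?=[\s`\"',;:)\]}]|$)"

# Simplified stand-in for extract_media's strict pattern.
media_pattern = re.compile(r"MEDIA:\s*(?P<path>/\S+?\.(?:" + STRICT_EXTS + "))" + BOUNDARY)
# Cleanup pattern as quoted in the issue.
cleanup = re.compile(r"MEDIA:\s*\S+")
# Hypothetical stand-in for extract_local_files' bare-path detection.
local_pattern = re.compile(r"(?P<path>/\S+?\.(?:" + BROAD_EXTS + "))" + BOUNDARY)


def deliver(text):
    """Mimic the three steps: strict extraction, blanket cleanup, fallback scan."""
    media = [m.group("path") for m in media_pattern.finditer(text)]
    cleaned = cleanup.sub("", text).strip()
    local = [m.group("path") for m in local_pattern.finditer(cleaned)]
    return media, local, cleaned


text = "Here is your report: MEDIA:/tmp/paid_users_up_analysis.md"
media, local, cleaned = deliver(text)
print(media, local, repr(cleaned))  # [] [] 'Here is your report:'

# The fallback would have found the path had the cleanup not run first.
print(local_pattern.search(text).group("path"))  # /tmp/paid_users_up_analysis.md
```

Both lists come back empty for the .md path, while the same fallback pattern applied to the untouched text does find it. That ordering is what the issue describes.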
Fix
Align extract_media's extension whitelist with extract_local_files's supported set. Missing extensions include: md, json, xml, ya?ml, tsv, odt, rtf, bmp, tiff, svg, tar, gz, tgz, bz2, xz, ods, odp, key. (xls and ppt are already matched by the existing xlsx? and pptx? alternatives.)
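One way to keep the two lists from drifting apart again is to build the regex alternation from a single shared constant. The sketch below assumes a simplified path pattern. `SUPPORTED_EXTS` and `EXT_ALTERNATION` are hypothetical names, and the list is abbreviated rather than the full supported set.

```python
import re

# Hypothetical shared constant; the name and the abbreviated contents are assumptions.
SUPPORTED_EXTS = (
    "png", "jpe?g", "gif", "pdf", "txt", "csv",  # already accepted by extract_media
    "md", "json", "xml", "ya?ml", "tsv",         # some of the missing ones
)
EXT_ALTERNATION = "|".join(SUPPORTED_EXTS)

# Simplified stand-in for the strict pattern, now built from the shared list.
media_pattern = re.compile(
    r"MEDIA:\s*(?P<path>/\S+?\.(?:" + EXT_ALTERNATION + r"))(?=[\s`\"',;:)\]}]|$)"
)

match = media_pattern.search("Here is your report: MEDIA:/tmp/paid_users_up_analysis.md")
print(match.group("path"))  # /tmp/paid_users_up_analysis.md
```

If the fallback scan builds its pattern from the same constant, any extension the fallback accepts is also accepted by the strict pass, so the cleanup step no longer discards a path that nothing has extracted.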