fix(gateway): sync MEDIA regex extension allowlist with SUPPORTED_DOCUMENT_TYPES #29609
chenglyes wants to merge 1 commit
…UMENT_TYPES

After commit ea49b38 tightened the MEDIA extraction regex by removing the greedy \S+ fallback, the hardcoded extension allowlist fell out of sync with SUPPORTED_DOCUMENT_TYPES. This caused valid file types (.md, .json, .xml, .yaml, .py, .ts, .sh, .log, .ini, .cfg, .html, etc.) to be silently ignored as MEDIA attachments.

Fix: build the regex extension list dynamically from SUPPORTED_DOCUMENT_TYPES plus known media types, so they stay in sync. Future additions to SUPPORTED_DOCUMENT_TYPES propagate automatically.

Closes NousResearch#29582
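For readers following along, here is a minimal sketch of what "build the regex extension list dynamically" could look like. It is not the code from this PR: the contents of `SUPPORTED_DOCUMENT_TYPES`, the `KNOWN_MEDIA_EXTS` set, the exact `MEDIA:` pattern, and the helper names `build_media_regex` and `extract_media_paths` are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the gateway's SUPPORTED_DOCUMENT_TYPES
# (assumed to map extension -> MIME type; the real contents may differ).
SUPPORTED_DOCUMENT_TYPES = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".json": "application/json",
    ".htm": "text/html",
    ".html": "text/html",
}

# Assumed set of non-document media extensions; illustrative only.
KNOWN_MEDIA_EXTS = {".png", ".jpg", ".mp3", ".mp4", ".zip"}


def build_media_regex(doc_types, media_exts):
    """Compile a MEDIA:<path> pattern whose allowlist is derived, not hardcoded."""
    exts = {e.lstrip(".").lower() for e in doc_types}
    exts |= {e.lstrip(".").lower() for e in media_exts}
    # Longest extensions first for a deterministic pattern; the lookahead
    # below also stops "htm" from matching as a prefix of "html".
    ordered = sorted(exts, key=lambda e: (-len(e), e))
    alternation = "|".join(re.escape(e) for e in ordered)
    return re.compile(
        rf"MEDIA:\s*(?P<path>\S+?\.(?:{alternation}))(?=\s|$)",
        re.IGNORECASE,
    )


MEDIA_RE = build_media_regex(SUPPORTED_DOCUMENT_TYPES, KNOWN_MEDIA_EXTS)


def extract_media_paths(text):
    """Return every allowlisted path that follows a MEDIA: tag in text."""
    return [m.group("path") for m in MEDIA_RE.finditer(text)]
```

Because the alternation is generated from the dictionary keys, adding an entry to the (assumed) `SUPPORTED_DOCUMENT_TYPES` mapping is enough for the tag to start matching that extension; there is no second list to update.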
This change would fix an issue I'm currently running into. After the recent tightening of the MEDIA extraction regex, .html files (and a few other document types) are being silently ignored when using the MEDIA: tag, even when the file exists and the path is correct. My primary communication channel is Telegram. I confirmed the problem persists after a gateway restart, and the gateway logs reflect the issue.
Thanks for the context @alt-glitch! Totally agree that #24384's regex fixes (prefix-glue, space-truncation) are important and should be addressed alongside this. Happy to close this PR or rebase onto #24384. Appreciate the engagement from everyone on this!
…MENT_TYPES

Instead of maintaining separate hardcoded extension lists for the extract_media regex and _LOCAL_MEDIA_EXTS, build a single set at module level derived from SUPPORTED_DOCUMENT_TYPES + SUPPORTED_IMAGE_DOCUMENT_TYPES + known audio/video/archive types.

- extract_media() now uses the precompiled module-level _MEDIA_TAG_RE
- extract_local_files() builds _LOCAL_MEDIA_EXTS from _MEDIA_EXTS_SET
- 60 extensions covered (was ~30 in the original regex)
- Adding new types to SUPPORTED_DOCUMENT_TYPES auto-propagates

Closes NousResearch#29609 (preferred dynamic approach)
Fixes NousResearch#31137
Superseded by #34844, which consolidates this cluster. This PR widens the … Closing as superseded. Thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.
Problem

After commit ea49b3862 tightened the MEDIA extraction regex by removing the greedy \S+ fallback, the hardcoded extension allowlist in extract_media() fell out of sync with SUPPORTED_DOCUMENT_TYPES defined in the same file (line 817).

This caused valid file attachments with extensions like .md, .json, .xml, .yaml, .yml, .toml, .py, .ts, .sh, .log, .ini, .cfg, .html, .htm to be silently ignored when sent via MEDIA:<path>.

Fix

Instead of maintaining two separate extension lists (the regex and SUPPORTED_DOCUMENT_TYPES), the regex now derives its extension set dynamically from SUPPORTED_DOCUMENT_TYPES + known media types (image/video/audio/archive).

Before
After
Benefits

- Future additions to SUPPORTED_DOCUMENT_TYPES automatically work with MEDIA:
- … tests/gateway/test_platform_base.py

Related

- … .html files not sent via WeChat)
- … ea49b3862

Checklist

- … tests/gateway/test_platform_base.py)