fix(gateway): sync MEDIA regex extension allowlist with SUPPORTED_DOCUMENT_TYPES#29609

Closed. chenglyes wants to merge 1 commit into NousResearch:main from chenglyes:fix/media-extension-allowlist-sync

Conversation

@chenglyes

Problem

After commit ea49b3862 tightened the MEDIA extraction regex by removing the greedy \S+ fallback, the hardcoded extension allowlist in extract_media() fell out of sync with SUPPORTED_DOCUMENT_TYPES defined in the same file (line 817).

This caused valid file attachments with extensions like .md, .json, .xml, .yaml, .yml, .toml, .py, .ts, .sh, .log, .ini, .cfg, .html, .htm to be silently ignored when sent via MEDIA:<path>.
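
The desync can be reproduced in isolation. The sketch below is not the gateway's code: the `SUPPORTED_DOCUMENT_TYPES` mapping is a small hypothetical stand-in (its dict shape is an assumption), and `HARDCODED_TAG_RE` is an illustrative hardcoded allowlist, not the actual regex. It only shows how a type can be listed as supported yet never match a `MEDIA:` tag.

```python
import re

# Hypothetical stand-in for the module's SUPPORTED_DOCUMENT_TYPES
# (assumed shape: extension -> MIME type; the real mapping is larger).
SUPPORTED_DOCUMENT_TYPES = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".md": "text/markdown",
    ".html": "text/html",
}

# Illustrative hardcoded allowlist that was never extended for .md / .html.
HARDCODED_TAG_RE = re.compile(
    r"MEDIA:\s*(\S+\.(?:png|jpe?g|gif|pdf|txt|csv))\b", re.IGNORECASE
)


def extract_hardcoded(text):
    """Return the paths of MEDIA: tags that the hardcoded regex recognizes."""
    return HARDCODED_TAG_RE.findall(text)


def unsupported_by_regex():
    """List supported extensions whose MEDIA: tag the regex silently drops."""
    missing = []
    for ext in SUPPORTED_DOCUMENT_TYPES:
        if not extract_hardcoded(f"MEDIA:/tmp/report{ext}"):
            missing.append(ext)
    return sorted(missing)
```

Calling `unsupported_by_regex()` lists every extension that the document-type table accepts but the tag regex drops without any error.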

Fix

Instead of maintaining two separate extension lists (the regex and SUPPORTED_DOCUMENT_TYPES), the regex now derives its extension set dynamically from SUPPORTED_DOCUMENT_TYPES + known media types (image/video/audio/archive).

Before

# Hardcoded extension list — out of sync with SUPPORTED_DOCUMENT_TYPES
media_pattern = re.compile(
    r'''...\.(?:png|jpe?g|gif|...|txt|csv|apk|ipa)...'''
)

After

# Extension list derived from SUPPORTED_DOCUMENT_TYPES automatically
_media_exts = set()
for ext in SUPPORTED_DOCUMENT_TYPES:
    _media_exts.add(ext.lstrip("."))
_media_exts.update({"png", "jpg", ..., "markdown"})
_ext_pattern = "|".join(sorted(_media_exts, key=lambda x: (-len(x), x)))
# regex built from _ext_pattern dynamically
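
For readers who want to try the idea outside the gateway, here is a self-contained sketch of the same derivation. The contents of `SUPPORTED_DOCUMENT_TYPES` and `KNOWN_MEDIA_EXTS`, the helper name `build_media_tag_re`, and the exact tag pattern are all assumptions made for illustration; only the "derive, then sort longest-first" step mirrors the snippet above.

```python
import re

# Hypothetical subset of the real mapping (assumed shape: extension -> MIME type).
SUPPORTED_DOCUMENT_TYPES = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".html": "text/html",
    ".htm": "text/html",
}

# Illustrative subset of the "known media types" mentioned in the description.
KNOWN_MEDIA_EXTS = {"png", "jpg", "jpeg", "mp4", "mp3", "zip"}


def build_media_tag_re(document_types, extra_exts):
    """Compile a MEDIA: tag regex whose allowlist is derived from the inputs."""
    exts = {ext.lstrip(".").lower() for ext in document_types}
    exts.update(extra_exts)
    # Longest first, so "html" is tried before "htm" in the alternation.
    ordered = sorted(exts, key=lambda x: (-len(x), x))
    alternation = "|".join(re.escape(e) for e in ordered)
    # Simplified tag pattern (an assumption): a path ending in an allowed
    # extension, followed by whitespace or end of input.
    return re.compile(rf"MEDIA:\s*(\S+?\.(?:{alternation}))(?=\s|$)", re.IGNORECASE)


MEDIA_TAG_RE = build_media_tag_re(SUPPORTED_DOCUMENT_TYPES, KNOWN_MEDIA_EXTS)
```

With this shape, adding an entry such as `".toml"` to the mapping and rebuilding the regex is enough for `MEDIA:/path/file.toml` to match; no second list needs editing.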

Benefits

  • ✅ Single source of truth — no more desync
  • ✅ Future additions to SUPPORTED_DOCUMENT_TYPES automatically work with MEDIA:
  • ✅ Existing tests pass: 92 passed in tests/gateway/test_platform_base.py
  • ✅ 1 file changed, +17 / -2 lines

Related

Checklist

  • Tests pass (tests/gateway/test_platform_base.py)
  • Code review
  • CI (pending)

…UMENT_TYPES

After commit ea49b38 tightened the MEDIA extraction regex by removing
the greedy \S+ fallback, the hardcoded extension allowlist fell out of
sync with SUPPORTED_DOCUMENT_TYPES.  This caused valid file types (.md,
.json, .xml, .yaml, .py, .ts, .sh, .log, .ini, .cfg, .html, etc.) to be
silently ignored as MEDIA attachments.

Fix: build the regex extension list dynamically from
SUPPORTED_DOCUMENT_TYPES plus known media types, so they stay in sync.
Future additions to SUPPORTED_DOCUMENT_TYPES propagate automatically.

Closes NousResearch#29582

@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) on May 21, 2026
@alt-glitch

Collaborator

Supersedes #29374 (adds only .md) — this PR derives the full extension set dynamically from SUPPORTED_DOCUMENT_TYPES. Overlaps with #24384 (which also fixes prefix-glue and space-truncation regex bugs beyond the allowlist). Closes #29582.

@nathanpt

This change would fix an issue I'm currently running into. After the recent tightening of the MEDIA extraction regex, .html files (and a few other document types) are silently ignored when using the MEDIA: tag, even when the file exists and the path is correct.

My primary communication channel is Telegram. I confirmed the problem persists after a gateway restart and gateway logs reflect the issue.

@chenglyes

Author

Thanks for the context @alt-glitch! Totally agree that #24384's regex fixes (prefix-glue, space-truncation) are important and should be addressed alongside this. Happy to close this PR or rebase onto #24384.

Appreciate the engagement from everyone on this!

mohamedorigami-jpg added a commit to mohamedorigami-jpg/hermes-agent that referenced this pull request on May 23, 2026
…MENT_TYPES

Instead of maintaining separate hardcoded extension lists for
extract_media regex and _LOCAL_MEDIA_EXTS, build a single set at
module level derived from SUPPORTED_DOCUMENT_TYPES +
SUPPORTED_IMAGE_DOCUMENT_TYPES + known audio/video/archive types.

- extract_media() now uses the precompiled module-level _MEDIA_TAG_RE
- extract_local_files() builds _LOCAL_MEDIA_EXTS from _MEDIA_EXTS_SET
- 60 extensions covered (was ~30 in the original regex)
- Adding new types to SUPPORTED_DOCUMENT_TYPES auto-propagates

Closes NousResearch#29609 (preferred dynamic approach)
Fixes NousResearch#31137
@teknium1

Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.
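
The difference between the two strip behaviours can be seen in a small sketch. This is not the code from #34844: the contents of `MEDIA_DELIVERY_EXTS` are an illustrative subset, and both patterns and helper names are assumptions chosen to mirror the description above.

```python
import re

# Illustrative subset; the real shared set lives in the gateway code.
MEDIA_DELIVERY_EXTS = {"png", "pdf", "md", "html"}

_ALT = "|".join(sorted(MEDIA_DELIVERY_EXTS, key=lambda x: (-len(x), x)))

# Loose strip: removes any MEDIA: token, whatever its extension.
LOOSE_STRIP_RE = re.compile(r"MEDIA:\s*\S+")

# Extension-anchored strip: removes only tags ending in a known extension.
ANCHORED_STRIP_RE = re.compile(
    rf"MEDIA:\s*\S+?\.(?:{_ALT})(?=\s|$)", re.IGNORECASE
)


def strip_loose(text):
    """Delete every MEDIA: tag from the body, known extension or not."""
    return LOOSE_STRIP_RE.sub("", text).strip()


def strip_anchored(text):
    """Delete only MEDIA: tags whose extension is in the shared set."""
    return ANCHORED_STRIP_RE.sub("", text).strip()
```

For a body such as `Here you go MEDIA:/tmp/data.parquet`, `strip_loose` removes the path entirely, while `strip_anchored` leaves the text untouched, so a later bare-path pass still has something to find.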

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

Development

Successfully merging this pull request may close these issues.

[Bug]: WeChat gateway fails to send .html files due to MEDIA extension allowlist

4 participants