Skip to content

fix(gateway): derive MEDIA extensions dynamically from SUPPORTED_DOCUMENT_TYPES#31159

Closed
mohamedorigami-jpg wants to merge 2 commits into
NousResearch:mainfrom
mohamedorigami-jpg:fix/31137-media-ext-sync
Closed

fix(gateway): derive MEDIA extensions dynamically from SUPPORTED_DOCUMENT_TYPES#31159
mohamedorigami-jpg wants to merge 2 commits into
NousResearch:mainfrom
mohamedorigami-jpg:fix/31137-media-ext-sync

Conversation

@mohamedorigami-jpg

@mohamedorigami-jpg mohamedorigami-jpg commented May 23, 2026

Copy link
Copy Markdown
Contributor

Instead of maintaining separate hardcoded extension lists for extract_media and extract_local_files that keep drifting out of sync, this builds a single set at module level derived from SUPPORTED_DOCUMENT_TYPES + SUPPORTED_IMAGE_DOCUMENT_TYPES + known audio/video/archive types.

Changes:

  • extract_media() now uses the precompiled module-level _MEDIA_TAG_RE
  • extract_local_files() builds _LOCAL_MEDIA_EXTS from _MEDIA_EXTS_SET
  • 60 extensions covered (was ~30 in the original regex)
  • Adding a new type to SUPPORTED_DOCUMENT_TYPES auto-propagates to both functions

95 tests pass in tests/gateway/test_platform_base.py.

Closes #29609 (preferred dynamic approach per alt-glitch's review)
Fixes #31137

…A_EXTS

extract_media() regex was missing 20+ extensions that extract_local_files()
already accepts: svg, bmp, tiff, md, odt, rtf, ods, tsv, json, xml, yaml,
yml, ppt, odp, key, tar, gz, tgz, bz2, xz, html, htm.

When the model emits MEDIA:/path/to/file.html (or any other missing
extension), the regex silently drops the file while send_message reports
success: true. The path falls between both detectors because extract_media
rejects it and extract_local_files anti-URL guard disqualifies the MEDIA:
prefix.

Fixes NousResearch#31137
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels May 23, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #29609 (preferred approach: dynamically derive extension set from SUPPORTED_DOCUMENT_TYPES). This is the same static-extension fix as #31138, #31150, #30588. See also parent issue #31137.

…MENT_TYPES

Instead of maintaining separate hardcoded extension lists for
extract_media regex and _LOCAL_MEDIA_EXTS, build a single set at
module level derived from SUPPORTED_DOCUMENT_TYPES +
SUPPORTED_IMAGE_DOCUMENT_TYPES + known audio/video/archive types.

- extract_media() now uses the precompiled module-level _MEDIA_TAG_RE
- extract_local_files() builds _LOCAL_MEDIA_EXTS from _MEDIA_EXTS_SET
- 60 extensions covered (was ~30 in the original regex)
- Adding new types to SUPPORTED_DOCUMENT_TYPES auto-propagates

Closes NousResearch#29609 (preferred dynamic approach)
Fixes NousResearch#31137
@mohamedorigami-jpg mohamedorigami-jpg changed the title fix(gateway): sync extract_media extension whitelist with _LOCAL_MEDIA_EXTS fix(gateway): derive MEDIA extensions dynamically from SUPPORTED_DOCUMENT_TYPES May 23, 2026
@mohamedorigami-jpg

Copy link
Copy Markdown
Contributor Author

Fair call. I updated the PR to use the dynamic approach from #29609 - the regex and _LOCAL_MEDIA_EXTS are now both derived from a single set built from SUPPORTED_DOCUMENT_TYPES + SUPPORTED_IMAGE_DOCUMENT_TYPES + known audio/video/archive types. 60 extensions total, 95 tests pass.

@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(gateway): MEDIA: directive silently drops .html and other extensions due to regex/whitelist drift

3 participants