Skip to content

fix(gateway): define missing _MEDIA_EXTS — MEDIA tag and local file extraction#34115

Closed
proactive-assistant wants to merge 3 commits into
NousResearch:mainfrom
proactive-assistant:fix/media-exts-missing-definition
Closed

fix(gateway): define missing _MEDIA_EXTS — MEDIA tag and local file extraction#34115
proactive-assistant wants to merge 3 commits into
NousResearch:mainfrom
proactive-assistant:fix/media-exts-missing-definition

Conversation

@proactive-assistant

@proactive-assistant proactive-assistant commented May 28, 2026

Copy link
Copy Markdown

Summary

Three bugs across two files, all related to MEDIA tag / file attachment handling:

1. base.py_MEDIA_EXTS never defined (crash on use)

extract_media() and extract_local_files() both referenced BasePlatformAdapter._MEDIA_EXTS but the attribute was never defined — would crash with AttributeError on any call. The hardcoded regex in extract_media() also missed .md and 20+ other extensions that extract_local_files() supported.

Fix: Define _MEDIA_EXTS as the union of _AUDIO_EXTS + SUPPORTED_VIDEO_TYPES + SUPPORTED_DOCUMENT_TYPES + SUPPORTED_IMAGE_DOCUMENT_TYPES. Use it in both methods — adding a new doc type automatically enables MEDIA-tag and local-file delivery.

2. run.py — two more hardcoded MEDIA regexes

The TTS/tool-result scanning code had the same hardcoded incomplete regex duplicated at two sites (lines 16982, 17288). Missing the same 20+ extensions.

Fix: Import _MEDIA_EXTS and build the regex dynamically.

3. run.py — non-streaming responses drop MEDIA attachments

The streaming path called _deliver_media_from_response() to extract and deliver files. The non-streaming path returned the raw response as-is — MEDIA tags appeared as literal text in chat and files were never delivered.

Fix: Call _deliver_media_from_response() + extract_media() in the non-streaming path before returning. Files are delivered and tags are stripped from the text.

Test plan

  • All existing tests pass: test_media_extraction.py (4/4), test_platform_base.py (101/101)
  • Manual: .md, .pdf, .png, .mp4, .mp3 all extracted from MEDIA tags and local file paths
  • Manual: inline/fenced code blocks properly ignored
  • Import verification: gateway.run imports cleanly with _MEDIA_EXTS available

Proactive Assistant added 3 commits May 28, 2026 23:32
…on regexes

The MEDIA:<path> tag regex in extract_media() was hardcoded with an
incomplete extension list — missing .md, .json, .yaml, .html, .svg,
and 15 other extensions that extract_local_files() already recognised
via its own _LOCAL_MEDIA_EXTS tuple.

Consolidate both code paths onto a single module-level _MEDIA_EXTS
frozenset built from the union of all four support dicts
(_AUDIO_EXTS, SUPPORTED_VIDEO_TYPES, SUPPORTED_DOCUMENT_TYPES,
SUPPORTED_IMAGE_DOCUMENT_TYPES).  This means:

- Adding a new supported document type automatically picks up
  MEDIA-tag delivery and local-file-path extraction without
  touching the regex.
- extract_media() and extract_local_files() can never drift apart
  again — they share the same source of truth.

The hardcoded extensions already in the regex but missing from the
dicts (.epub, .rar, .7z, .apk, .ipa) are intentionally dropped
because the gateway has no MIME type / delivery support for them.
…ts in non-streaming responses

Three related fixes in run.py:

1. Replace two hardcoded MEDIA regexes (lines 16982, 17288) — both used
   the same incomplete list (png|jpe?g|...|apk|ipa) missing .md and 20+
   other extensions.  Both now build the regex dynamically from _MEDIA_EXTS,
   imported from gateway.platforms.base.

2. The non-streaming response path returned raw response text with MEDIA:
   tags still embedded — the adapter sent them as literal text and the
   files were never extracted or delivered.  Only the streaming path had
   media delivery (via _deliver_media_from_response).

   Fix: call _deliver_media_from_response + extract_media before returning
   the response in the non-streaming path.  Files are delivered and MEDIA
   tags are stripped from the text the adapter receives.

3. Added _MEDIA_EXTS to the top-level import from gateway.platforms.base
   so both regex sites and future consumers can use it.
- Compute _ext_part and _TOOL_MEDIA_RE once before each loop instead
  of recompiling on every iteration.  No behavioral change.
- Fix misleading "directly above" comment on _MEDIA_EXTS — the
  source dicts are spread across the module, not directly above.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels May 29, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Competing with #34345, #30588, #31150, #31159 — all unify extract_media extension lists. #29609 (dynamic derivation from SUPPORTED_DOCUMENT_TYPES) is the preferred approach per prior triage.

@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants