Skip to content

fix(gateway): match extract_media extension list to extract_local_files#31138

Closed
Razultull wants to merge 1 commit into
NousResearch:mainfrom
Razultull:fix/gateway-media-extension-mismatch
Closed

fix(gateway): match extract_media extension list to extract_local_files#31138
Razultull wants to merge 1 commit into
NousResearch:mainfrom
Razultull:fix/gateway-media-extension-mismatch

Conversation

@Razultull

Copy link
Copy Markdown

Summary

gateway/platforms/base.py had two file-extension whitelists that drifted out of sync:

  • extract_media() parses explicit MEDIA:<path> directives.
  • extract_local_files() finds bare local paths in response text.

extract_local_files's _LOCAL_MEDIA_EXTS includes .html, .htm, .md, .svg, .bmp, .tiff, .json, .xml, .yaml, .yml, .tsv, .ods, .odp, .odt, .rtf, .key, .tar, .gz, .tgz, .bz2, .xz. The regex inside extract_media did not. A MEDIA:/abs/path/file.html directive was silently dropped — the parent send_message tool still reported success: true because the text portion shipped fine.

Closes #31137.

Changes

  • gateway/platforms/base.py (+4): extension list in the extract_media regex is now a superset of _LOCAL_MEDIA_EXTS. Added a NOTE comment binding the two together so future drift is caught in review.
  • tests/gateway/test_platform_base.py (+36): TestExtractMediaExtensionCoverage — 43-case parametrized regression test covering every previously-missing extension + every previously-working one, plus the literal user-reported failure path (MEDIA:/root/.hermes/media_cache/00-Visual-Report.html).

Validation

Before After
MEDIA:/tmp/report.html regex no match → dropped silently matched → file delivered
MEDIA:/tmp/notes.md dropped delivered
MEDIA:/tmp/icon.svg dropped delivered
MEDIA:/tmp/data.json dropped delivered
MEDIA:/tmp/foo.mp3, .pdf, .docx, etc. delivered delivered (unchanged)
send_message success reporting on dropped files success: true (misleading) success: true and file actually ships

Targeted suite:

pytest tests/gateway/test_platform_base.py::TestExtractMediaExtensionCoverage -q
............................................                             [100%]
44 passed in 0.50s

Broader media-related suite (no regressions):

pytest tests/gateway/test_platform_base.py tests/gateway/test_media_extraction.py \
       tests/gateway/test_tts_media_routing.py tests/gateway/test_send_image_file.py \
       tests/tools/test_send_message_tool.py -q
294 passed, 2 skipped in 4.56s

Tested manually on Telegram (Ubuntu, Python 3.11, gateway under user systemd unit) — same path that produced the silent drop now ships the document natively.

Why

This is a top-priority class of bug per CONTRIBUTING.md: silent data loss (file attachments dropped without warning) affecting every messaging platform that goes through BasePlatformAdapter.extract_media. The fix surface is tiny and well-tested.

The MEDIA: directive parser and the bare-path detector use independent
extension whitelists. extract_local_files accepts .html/.md/.svg/.json
and many other doc/data/web extensions; extract_media did not. When the
model returned 'MEDIA:/abs/path/file.html', the regex skipped it, the
file was never attached, and the parent send_message tool still reported
success because the text portion shipped fine — silent data loss with
no log line.

The (?<![/:\\w.]) anti-URL guard on extract_local_files also disqualifies
the path (the ':' in 'MEDIA:' defeats it), so the path falls between
both detectors.

Extend the extract_media regex extension list to be a superset of
_LOCAL_MEDIA_EXTS and add a NOTE comment binding the two lists together
so future drift is caught in review.

Regression test: TestExtractMediaExtensionCoverage in
tests/gateway/test_platform_base.py — 43 parametrized extensions plus
the literal user-reported failure path.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists labels May 23, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Competing with #29609 (dynamic allowlist sync from SUPPORTED_DOCUMENT_TYPES — preferred approach) and #22492 (static extension additions). Both are open and cover the same extensions. #29609 is the canonical fix for this class of bugs.

@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(gateway): MEDIA: directive silently drops .html and other extensions due to regex/whitelist drift

3 participants