Skip to content

fix(gateway): align MEDIA: regex whitelist with SUPPORTED_DOCUMENT_TYPES#32995

Closed
echo26max wants to merge 1 commit into
NousResearch:mainfrom
echo26max:fix-extract-media-document-extensions
Closed

fix(gateway): align MEDIA: regex whitelist with SUPPORTED_DOCUMENT_TYPES#32995
echo26max wants to merge 1 commit into
NousResearch:mainfrom
echo26max:fix-extract-media-document-extensions

Conversation

@echo26max

Copy link
Copy Markdown

Summary

The MEDIA:<path> regex in extract_media() (gateway/platforms/base.py) carries its own hard-coded extension whitelist that has drifted out of sync with SUPPORTED_DOCUMENT_TYPES. Extensions registered as deliverable documents but missing from the regex are silently stripped from response text — no WARNING is logged because the path never reaches validate_media_delivery_path().

This PR aligns the two whitelists and adds a regression test.

Reproduction (before the fix)

from gateway.platforms.base import BasePlatformAdapter

content = "Here is your report:\nMEDIA:/tmp/report.md"
media, cleaned = BasePlatformAdapter.extract_media(content)

# Expected: media == [("/tmp/report.md", False)]
# Actual:   media == []  (path silently stripped, never dispatched)

Same failure mode for .json, .yaml, .toml, .ini, .cfg, .log, .ts, .py, .sh, .xml — all listed in SUPPORTED_DOCUMENT_TYPES (line ~1023) but absent from the regex (line ~2416).

Real-world impact

Surfaced when delivering a .md research report on Telegram. The MEDIA: line vanished from the rendered message; no document attachment; gateway.log showed:

response ready: ... response=404 chars
[Telegram] Sending response (198 chars) to ...

…with no Skipping unsafe MEDIA directive path outside allowed roots warning that normally accompanies allowlist rejection. The 206-char delta corresponds to the three stripped MEDIA: lines.

Fix

Add the missing extensions to the regex alternation:
md | log | json | xml | ya?ml | toml | ini | cfg | ts | py | sh

…and a comment block above the pattern documenting the alignment contract so the next person who edits either side knows to update the other.

Tests

  • Parametrized regression (test_media_tag_recognizes_document_extensions) covers every newly-allowed extension.
  • Real-world reproduction (test_media_tag_recognizes_markdown_with_quoted_path) — .md path with a leading non-ASCII context line, mirroring the original failure.

Verification

Reverted the gateway/platforms/base.py hunk and re-ran the new tests against the un-fixed code: 13 failures, exactly as expected (12 parametrized + 1 real-world). Restored the fix → 27 passed in TestExtractMedia. Full tests/gateway/test_platform_base.py suite: 114 passed, 2 skipped.

$ pytest tests/gateway/test_platform_base.py
======================== 114 passed, 2 skipped in 0.67s ========================

Scope

Minimal, additive — only adds extensions already enumerated as deliverable per SUPPORTED_DOCUMENT_TYPES. Does not touch extract_local_files() (which already supports these via its own broader list), validate_media_delivery_path(), or the document dispatch path.

A follow-up PR could refactor the regex to construct its alternation from SUPPORTED_DOCUMENT_TYPES programmatically, eliminating the drift class entirely. That's deliberately out of scope here to keep this fix small and reviewable.

The MEDIA:<path> regex in extract_media() carried its own hard-coded
extension whitelist that drifted out of sync with SUPPORTED_DOCUMENT_TYPES
(line ~1023). Extensions registered as deliverable documents but missing
from the regex were silently stripped from response text, with no WARNING
logged — the gateway just never saw a media directive.

Surfaced when an agent emitted MEDIA:/path/to/report.md on Telegram. The
.md path vanished from the rendered message and no document was attached;
gateway.log showed 'Sending response (M chars)' where M < 'response=N
chars' but no 'Skipping unsafe MEDIA directive' warning, because the path
never reached validate_media_delivery_path().

Add the missing extensions to the regex (md, log, json, xml, yaml/yml,
toml, ini, cfg, ts, py, sh) and a comment documenting the alignment
contract with SUPPORTED_DOCUMENT_TYPES so the next person who edits
either side knows to update the other.

Tests:
  * Parametrize over every newly-allowed extension.
  * Add a regression test for the original bug report (.md path with
    leading non-ASCII context line).
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists labels May 27, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #29609 (preferred approach: dynamically derive extensions from SUPPORTED_DOCUMENT_TYPES). Part of a saturated MEDIA regex cluster with 5+ competing PRs (#22492, #30588, #32398, #32604, #32751). #29609 is the canonical fix.

@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants