Skip to content

fix(gateway): include html in MEDIA: extraction allowlist#29710

Closed
Lef-F wants to merge 1 commit into
NousResearch:mainfrom
Lef-F:fix/media-extract-html-allowlist
Closed

fix(gateway): include html in MEDIA: extraction allowlist#29710
Lef-F wants to merge 1 commit into
NousResearch:mainfrom
Lef-F:fix/media-extract-html-allowlist

Conversation

@Lef-F

@Lef-F Lef-F commented May 21, 2026

Copy link
Copy Markdown

What does this PR do?

After ea49b3862 (fix(gateway): tighten MEDIA extraction regex...) removed the |\S+ fallback, the explicit extension allowlist became the only path for MEDIA: tag delivery. .html and .htm were never on that list — but they are in _LOCAL_MEDIA_EXTS (used by extract_local_files), so the two delivery pathways drifted out of sync.

Visible symptom: an agent emits MEDIA: /tmp/report.html, the tag is silently stripped from user-visible text by the cleanup re.sub(r"MEDIA:\s*\S+", "", ...) at gateway/platforms/base.py:3175, but the file is never delivered. The bare-path detector that does know about .html runs after the strip, on text that no longer contains the path. I hit this on Telegram; #29582 hit it on WeChat. The code is platform-agnostic (lives on BasePlatformAdapter and in the gateway runner), so every adapter is affected.

This patch adds html? to the three regex allowlists so the MEDIA: tag path matches what _LOCAL_MEDIA_EXTS already supports.

Related Issue

Fixes #29582

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/platforms/base.pyhtml? added to extract_media() MEDIA: regex (used by the gateway response loop).
  • gateway/run.py (×2) — html? added to the two _TOOL_MEDIA_RE patterns at lines 16523 and 16819 (used by the agent loop to dedupe and collect MEDIA: paths emitted inside tool/function results, e.g. an MCP tool returning MEDIA:/tmp/foo.html in its JSON output).
  • tests/gateway/test_platform_base.py — regression test asserting .html and .htm paths are extracted by BasePlatformAdapter.extract_media.

Relationship to other open PRs

A couple of friendly in-flight PRs target the same issue. To save reviewer time and avoid duplication, here's how this one differs — happy to defer to either if preferred:

@alt-glitch — flagging you since you've been triaging the duplicate cluster. Happy to close immediately or rebase onto whichever path you prefer; just wanted the run.py gap visible somewhere so it doesn't get lost.

How to Test

  1. Run the regression test:
    scripts/run_tests.sh tests/gateway/test_platform_base.py::TestExtractMedia::test_media_tag_extracts_html_and_htm_paths
    
  2. Verify nothing else regressed:
    scripts/run_tests.sh tests/gateway/test_platform_base.py tests/gateway/test_media_extraction.py tests/gateway/test_extract_local_files.py tests/gateway/test_tts_media_routing.py
    
    Locally: 145 passed / 2 skipped / 0 failures.
  3. End-to-end repro: have an agent emit MEDIA: /tmp/report.html (file exists). Before this patch: tag silently stripped from user-visible text, file never delivered. After: file is delivered as a native attachment.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway):)
  • I searched for existing PRs — see "Relationship to other open PRs" above
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run scripts/run_tests.sh against the affected tests and all pass
  • I've added tests for my changes
  • I've tested on my platform: macOS (Darwin 25.5.0, Python 3.13)

Documentation & Housekeeping

  • N/A — no documentation impact (website/docs/user-guide/features/deliverable-mode.md already lists .html .htm as supported; this patch restores the behavior to match what the docs already promise)
  • N/A — no config keys added
  • N/A — no architecture or workflow changes
  • N/A — no cross-platform impact (regex literals only; behavior identical across OS)
  • N/A — no tool schema changes

After ea49b38 tightened extract_media() and the run.py tool-result
MEDIA: regexes by removing the |\S+ fallback, the explicit extension
list became the only path for MEDIA: tag delivery. .html and .htm
were never on that list, so MEDIA:/tmp/foo.html silently dropped —
base.py:3175's cleanup pass strips the tag, and extract_local_files
(which does list html) never sees the path.

Add html? to all three regex sites:
  - gateway/platforms/base.py (extract_media)
  - gateway/run.py x2 (tool-result MEDIA: dedupe + collection)

Bare-path auto-detect already supports .html via _LOCAL_MEDIA_EXTS;
this aligns the MEDIA: tag path with it.

Fixes NousResearch#29582
@daimon-nous daimon-nous Bot added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels May 21, 2026
@daimon-nous

daimon-nous Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Duplicate of #29592 — same fix adding html/htm to MEDIA: extraction allowlist. Also see #29609 for a broader fix that derives extensions from SUPPORTED_DOCUMENT_TYPES.

@Lef-F

Lef-F commented May 21, 2026

Copy link
Copy Markdown
Author

Duplicate of #29592 — same fix adding html/htm to MEDIA: extraction allowlist. Also see #29609 for a broader fix that derives extensions from SUPPORTED_DOCUMENT_TYPES.

Thanks for triaging @daimon-nous — fair flag at a glance, but this isn't quite the same shape as #29592. That PR only patches gateway/platforms/base.py; the same regression also lives in gateway/run.py:16523 and gateway/run.py:16819 (the _TOOL_MEDIA_RE patterns that extract MEDIA: paths from tool/function return values — TTS tool output, MCP tool results, etc.). All three sites came from ea49b3862 and need the same fix.

@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: WeChat gateway fails to send .html files due to MEDIA extension allowlist

2 participants