fix(gateway): include html in MEDIA: extraction allowlist #29710
After ea49b38 tightened extract_media() and the run.py tool-result MEDIA: regexes by removing the |\S+ fallback, the explicit extension list became the only path for MEDIA: tag delivery. .html and .htm were never on that list, so MEDIA:/tmp/foo.html was silently dropped: base.py:3175's cleanup pass strips the tag, and extract_local_files (which does list html) never sees the path.

Add html? to all three regex sites:
- gateway/platforms/base.py (extract_media)
- gateway/run.py x2 (tool-result MEDIA: dedupe + collection)

Bare-path auto-detect already supports .html via _LOCAL_MEDIA_EXTS; this aligns the MEDIA: tag path with it.

Fixes NousResearch#29582
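The failure mode described above can be reproduced in isolation. What follows is a minimal sketch, not the gateway's actual code: the shortened extension lists and the names MEDIA_RE_OLD, MEDIA_RE_NEW, LOCAL_PATH_RE, and deliver are hypothetical stand-ins. Only the cleanup pattern MEDIA:\s*\S+ is taken from this PR's description.

```python
import re

# Hypothetical, shortened allowlists; the real regexes list more extensions.
MEDIA_RE_OLD = re.compile(r"MEDIA:\s*(\S+\.(?:png|jpe?g|pdf))\b")
MEDIA_RE_NEW = re.compile(r"MEDIA:\s*(\S+\.(?:png|jpe?g|pdf|html?))\b")

# Stand-in for the bare-path detector, which already knows about .html.
LOCAL_PATH_RE = re.compile(r"(/\S+\.(?:png|jpe?g|pdf|html?))\b")


def deliver(text, media_re):
    """Return (files_to_send, user_visible_text) for one response."""
    # Step 1: collect paths from MEDIA: tags whose extension is allowlisted.
    files = media_re.findall(text)
    # Step 2: cleanup pass strips every MEDIA: tag, matched or not.
    text = re.sub(r"MEDIA:\s*\S+", "", text)
    # Step 3: bare-path detection runs after the strip, so a path that was
    # only present inside a stripped tag can no longer be recovered here.
    files += LOCAL_PATH_RE.findall(text)
    return files, text.strip()
```

In this sketch, deliver("Report ready. MEDIA:/tmp/foo.html", MEDIA_RE_OLD) returns no files and the text "Report ready.", which is the silent drop. The same call with MEDIA_RE_NEW returns the path.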
Thanks for triaging @daimon-nous. Fair flag at a glance, but this isn't quite the same shape as #29592. That PR only patches
Superseded by #34844, which consolidates this cluster. This PR widens the

Closing as superseded. Thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.
What does this PR do?
After ea49b3862 (fix(gateway): tighten MEDIA extraction regex...) removed the |\S+ fallback, the explicit extension allowlist became the only path for MEDIA: tag delivery. .html and .htm were never on that list, but they are in _LOCAL_MEDIA_EXTS (used by extract_local_files), so the two delivery pathways drifted out of sync.

Visible symptom: an agent emits MEDIA: /tmp/report.html, the tag is silently stripped from user-visible text by the cleanup re.sub(r"MEDIA:\s*\S+", "", ...) at gateway/platforms/base.py:3175, but the file is never delivered. The bare-path detector that does know about .html runs after the strip, on text that no longer contains the path. I hit this on Telegram; #29582 hit it on WeChat. The code is platform-agnostic (lives on BasePlatformAdapter and in the gateway runner), so every adapter is affected.

This patch adds html? to the three regex allowlists so the MEDIA: tag path matches what _LOCAL_MEDIA_EXTS already supports.

Related Issue
Fixes #29582
Type of Change
Changes Made
- gateway/platforms/base.py: html? added to the extract_media() MEDIA: regex (used by the gateway response loop).
- gateway/run.py (×2): html? added to the two _TOOL_MEDIA_RE patterns at lines 16523 and 16819 (used by the agent loop to dedupe and collect MEDIA: paths emitted inside tool/function results, e.g. an MCP tool returning MEDIA:/tmp/foo.html in its JSON output).
- tests/gateway/test_platform_base.py: regression test asserting .html and .htm paths are extracted by BasePlatformAdapter.extract_media.

Relationship to other open PRs
A couple of friendly in-flight PRs target the same issue. To save reviewer time and avoid duplication, here's how this one differs — happy to defer to either if preferred:
- #29592 only patches gateway/platforms/base.py:2162 (the adapter's response-text extractor). The same regression also lives in gateway/run.py:16523 and gateway/run.py:16819; both were introduced by the same ea49b3862 tightening commit and inherit the same missing .html entry. Those two regexes are used inside the agent loop to dedupe and collect MEDIA: paths from tool/function results (independent of the adapter path), so #29592 (fix(gateway): allow html media attachments) will still leave tool-emitted MEDIA:/tmp/foo.html paths dropped on the floor. This PR patches all three sites.
- #29609 (fix(gateway): sync MEDIA regex extension allowlist with SUPPORTED_DOCUMENT_TYPES) builds the MEDIA: allowlist dynamically from SUPPORTED_DOCUMENT_TYPES so this kind of drift can't recur. If you're happy to merge that, please go with #29609; this PR becomes redundant and I'll close it. This PR is the minimal three-spot patch for the case where a refactor is out of scope right now.
- base.py. The run.py sites are the new substance here.

@alt-glitch: flagging you since you've been triaging the duplicate cluster. Happy to close immediately or rebase onto whichever path you prefer; just wanted the run.py gap visible somewhere so it doesn't get lost.
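For comparison, the drift-proof approach attributed to #29609 above (building the allowlist from one shared set instead of hand-maintaining three regexes) could look roughly like the following. This is a sketch under assumptions, not that PR's code: SUPPORTED_EXTS is a hypothetical stand-in for SUPPORTED_DOCUMENT_TYPES, whose real shape is not shown here, and build_media_re is an invented helper.

```python
import re

# Hypothetical single source of truth for deliverable extensions.
SUPPORTED_EXTS = {".png", ".jpg", ".jpeg", ".pdf", ".html", ".htm"}


def build_media_re(exts):
    """Compile a MEDIA: tag regex from a set of dotted extensions."""
    # Sort longest first (then alphabetically) so the pattern is deterministic
    # and ".html" is tried before ".htm"; escape so each extension is literal.
    ordered = sorted(exts, key=lambda e: (-len(e), e))
    alts = "|".join(re.escape(e.lstrip(".")) for e in ordered)
    return re.compile(rf"MEDIA:\s*(\S+\.(?:{alts}))\b")


# Every call site would share this one compiled pattern.
MEDIA_RE = build_media_re(SUPPORTED_EXTS)
```

Supporting a new extension then means adding one entry to the shared set, for example build_media_re(SUPPORTED_EXTS | {".svg"}), rather than editing each regex by hand.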
How to Test
- Have an agent emit MEDIA: /tmp/report.html (file exists). Before this patch: the tag is silently stripped from user-visible text and the file is never delivered. After: the file is delivered as a native attachment.

Checklist
Code
- Commit messages use the fix(gateway): prefix
- Ran scripts/run_tests.sh against the affected tests and all pass

Documentation & Housekeeping
- website/docs/user-guide/features/deliverable-mode.md already lists .html .htm as supported; this patch restores the behavior to match what the docs already promise.