fix(gateway): deliver MEDIA: tags wrapped in Markdown emphasis#45773
Open
Julientalbot wants to merge 1 commit into
Models routinely present a file to the user with the delivery tag wrapped in
Markdown emphasis — `**MEDIA:/path.pptx**`, `*MEDIA:/path*`, `_MEDIA:/path_`.
MEDIA_TAG_CLEANUP_RE only tolerated a single leading/trailing quote or backtick
(`[`"']?`), and its closing lookahead set excluded `*` and `_`, so an
emphasis-wrapped tag never matched. The file was then silently never delivered
and the literal `MEDIA:/path` text leaked into the chat instead — the user sees
a path, not the attachment.
Allow a short run of emphasis/quote markers (`[`"'*_]{0,3}`) on both sides of
the tag and add `*`/`_` to the closing lookahead. Code-block, inline-code and
blockquote contexts are still neutralised earlier by `_mask_protected_spans`
(NousResearch#35695), so documentation/example tags remain non-deliverable; the
absolute-path anchor still rejects relative paths; `_` inside a filename is
unaffected.
Adds regression coverage in TestExtractMedia for bold/italic/underscore
wrapping, mid-prose bold, emphasis-wrapped .html, underscore-in-filename, and
emphasis-wrapped relative-path rejection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Companion PR opened: #45786 fixes the structured delivery path ( |
What does this PR do?
`extract_media()` fails to deliver a file when the model wraps the delivery tag in Markdown emphasis: `**MEDIA:/path.pptx**`, `*MEDIA:/path*`, `_MEDIA:/path_`. The file is then silently never attached and the literal `MEDIA:/path` text is shown to the user instead (the reported symptom: "the bot sends a path instead of the file").
Root cause is purely in `MEDIA_TAG_CLEANUP_RE` (`gateway/platforms/base.py`):
- the leading/trailing wrapper tolerated only a single quote or backtick (`[`"']?`), so a leading `**`/`*`/`_` prevented the match;
- the closing lookahead `(?=[\s"',;:)]}]|$)` excluded `*` and `_`, so a trailing `**`/`*`/`_` also prevented the match.

This change lets a short run of emphasis/quote markers (`[`"'*_]{0,3}`) appear on either side of the tag and adds `*`/`_` to the closing lookahead. It keeps the existing strict behavior (a recognized extension is still required; no loose `\S+` fallback) and does not touch the `_mask_protected_spans` / `_mask_json_string_media` layer, so the deliberate suppression of example tags inside code blocks, inline code, blockquotes and serialized JSON (#35695, #34375) is preserved. The absolute-path anchor (incl. the Windows drive support from #34632) is unchanged, so relative paths are still rejected and `_` inside a filename is unaffected.

Observed in production across multiple Telegram instances: on instances where the model emitted emphasis-wrapped tags, those file deliveries were lost 1:1; instances that never wrapped the tag had zero loss. Some Grok/GPT variants bold file names by default, so the collision is frequent.
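The two regex changes described above can be illustrated with a minimal sketch. The patterns below are simplified stand-ins, not the project's actual `MEDIA_TAG_CLEANUP_RE`: the extension list, the `PATH` expression and the `first_path` helper are assumptions made for the demo, and the Windows drive anchor is omitted.

```python
import re

# Simplified stand-ins, not the project's actual pattern: the extension list
# and the path expression below are assumptions chosen for the demo.
PATH = r"(/[^\s*]+?\.(?:pptx|pdf|html|png))"

# Before: at most one quote/backtick per side; lookahead lacks * and _.
OLD_RE = re.compile(
    r"""[`"']?MEDIA:""" + PATH + r"""[`"']?(?=[\s"',;:)\]}]|$)"""
)

# After: up to three quote/emphasis markers per side; * and _ may follow.
NEW_RE = re.compile(
    r"""[`"'*_]{0,3}MEDIA:""" + PATH + r"""[`"'*_]{0,3}(?=[\s"',;:)\]}*_]|$)"""
)


def first_path(pattern, text):
    """Return the captured path of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = (
        "**MEDIA:/tmp/r.pptx**",      # bold wrapping
        "_MEDIA:/tmp/my_file.pdf_",   # underscore wrapping, underscore in name
        "*MEDIA:docs/r.pptx*",        # relative path, must stay rejected
    )
    for sample in samples:
        print(sample, first_path(OLD_RE, sample), first_path(NEW_RE, sample))
```

In this sketch the old pattern finds nothing in any of the three samples, while the new one captures the absolute paths and still rejects the relative one, because the path group must start with `/`.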
Related Issue
Fixes #23759.
Supersedes #23765 (open since 2026-05-11, no reviews): that patch predates the refactor that moved the inline regex into the module-level `MEDIA_TAG_CLEANUP_RE`, switched the hardcoded extension list to `_MEDIA_EXT_ALTERNATION`, added the Windows anchor (#34632) and the masking layer (#35695). It also reintroduced the loose `|\S+` path fallback that the refactor had removed. This PR applies cleanly on current `main`, is more general (handles mixed quote+emphasis and `__`/`***`), preserves strict extension matching, and adds masking-invariant regression coverage. Credit to @LeonSGP43 for the original report and fix.
Type of Change
Changes Made
- `gateway/platforms/base.py`, `MEDIA_TAG_CLEANUP_RE`: tolerate `[`"'*_]{0,3}` on both sides of the tag and add `*`/`_` to the closing lookahead. Both the non-streaming dispatch path and the streaming consumer reference this constant, so extraction and visible-text stripping are fixed on both paths at once.
- `tests/gateway/test_platform_base.py`: added `TestExtractMedia` cases for bold/italic(`*`)/italic(`_`) wrapping, mid-prose bold, emphasis-wrapped `.html`, underscore-in-filename (must be unaffected), and emphasis-wrapped relative path (must stay rejected).
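Why one shared constant fixes both extraction and visible-text stripping can be sketched as follows. This is a toy function, not the project's `BasePlatformAdapter.extract_media`: `TAG_RE` is a simplified, hypothetical pattern, and the second tuple element simply mirrors the `False` in the PR's example output, whose meaning is not stated here.

```python
import re

# Hypothetical, simplified pattern: the real MEDIA_TAG_CLEANUP_RE has its own
# extension alternation and a stricter absolute-path anchor.
TAG_RE = re.compile(
    r"""[`"'*_]{0,3}MEDIA:(/[^\s*]+?\.(?:pptx|pdf|html|png))[`"'*_]{0,3}"""
    r"""(?=[\s"',;:)\]}*_]|$)"""
)


def extract_media(text):
    """Return ([(path, flag)], visible_text) using one shared pattern."""
    # Extraction: collect every captured path. The flag is a placeholder
    # copied from the PR's example output; its semantics are an assumption.
    media = [(m.group(1), False) for m in TAG_RE.finditer(text)]
    # Stripping: the same pattern removes the tag and its emphasis markers.
    visible = TAG_RE.sub("", text)
    # Collapse the whitespace left behind where a tag was removed.
    visible = re.sub(r"[ \t]{2,}", " ", visible).strip()
    return media, visible


if __name__ == "__main__":
    print(extract_media("**MEDIA:/tmp/r.pptx**"))
    print(extract_media("Here is **MEDIA:/tmp/r.pptx** for you"))
    print(extract_media("See *MEDIA:notes/r.pptx*"))
```

Because the match includes the surrounding markers, removing it leaves no stray `**` in the visible text, and a tag that does not match (the relative path in the last sample) is left untouched rather than half-stripped.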
How to Test
1. Before this change: `BasePlatformAdapter.extract_media("**MEDIA:/tmp/r.pptx**")` → `([], ...)` (file dropped, literal text leaks).
2. With this change: the same call → `([("/tmp/r.pptx", False)], "")` (delivered, text stripped).
3. `pytest tests/gateway/test_platform_base.py -q` → all pass (incl. the pre-existing code-block / inline-code / blockquote / JSON suppression tests, confirming no regression of #35695 (Bug: extract_media() false-positives on example paths in quoted text / code blocks) or #34375 (Gateway MEDIA extraction can attach stale files from serialized tool/search-result text)).
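The suppression invariant that those pre-existing tests guard can be pictured with a small sketch. The `mask_protected_spans` function below is a hypothetical stand-in for the project's `_mask_protected_spans`, whose real implementation is not shown in this PR; `TAG_RE` is again a simplified pattern, and JSON-string masking is left out.

```python
import re

# Hypothetical, simplified tag pattern (same shape as the widened regex).
TAG_RE = re.compile(
    r"""[`"'*_]{0,3}MEDIA:(/[^\s*]+?\.(?:pptx|pdf|html|png))[`"'*_]{0,3}"""
    r"""(?=[\s"',;:)\]}*_]|$)"""
)

# Assumed definition of "protected" spans for this sketch only.
PROTECTED_RE = re.compile(
    r"```.*?```"          # fenced code block (may span lines)
    r"|`[^`\n]+`"         # inline code
    r"|^[ \t]*>[^\n]*",   # blockquote line
    re.DOTALL | re.MULTILINE,
)


def mask_protected_spans(text):
    """Blank out protected spans with spaces, keeping the text length."""
    return PROTECTED_RE.sub(lambda m: " " * len(m.group(0)), text)


def deliverable_paths(text):
    """Run the tag pattern only on text that survives masking."""
    return [m.group(1) for m in TAG_RE.finditer(mask_protected_spans(text))]


if __name__ == "__main__":
    print(deliverable_paths("**MEDIA:/tmp/r.pptx**"))
    print(deliverable_paths("Use `MEDIA:/tmp/example.pdf` in replies"))
    print(deliverable_paths("> **MEDIA:/tmp/quoted.pdf**"))
    print(deliverable_paths("```\nMEDIA:/tmp/x.pdf\n```"))
```

Since the widened wrapper class still includes the backtick, an inline-code tag would match the pattern on its own; in this sketch it is the masking pass that keeps documentation and example tags non-deliverable.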
Checklist
Code
fix(gateway):)
`pytest tests/ -q` and all tests pass: ran the relevant file `tests/gateway/test_platform_base.py` (165 passed, 2 skipped); change is an isolated regex edit
Documentation & Housekeeping