fix(gateway): deliver MEDIA: tags wrapped in Markdown emphasis #45773

Open
Julientalbot wants to merge 1 commit into NousResearch:main from Julientalbot:fix/media-emphasis-extraction

@Julientalbot

Contributor

What does this PR do?

extract_media() fails to deliver a file when the model wraps the delivery tag
in Markdown emphasis: **MEDIA:/path.pptx**, *MEDIA:/path*, or
_MEDIA:/path_. The file is then silently never attached and the literal
MEDIA:/path text is shown to the user instead (the reported symptom: "the bot
sends a path instead of the file").

Root cause is purely in MEDIA_TAG_CLEANUP_RE (gateway/platforms/base.py):

  • the leading anchor only tolerated a single quote/backtick ([`"']?), so
    a leading **/*/_ prevented the match;
  • the closing lookahead set (?=[\s"',;:)]}]|$) excluded * and _, so a
    trailing **/*/_ also prevented the match.

This change lets a short run of emphasis/quote markers ([`"'*_]{0,3})
appear on either side of the tag and adds */_ to the closing lookahead. It
keeps the existing strict behavior (a recognized extension is still
required — no loose \S+ fallback) and does not touch the
_mask_protected_spans / _mask_json_string_media layer, so the deliberate
suppression of example tags inside code blocks, inline code, blockquotes and
serialized JSON (#35695, #34375) is preserved. The absolute-path anchor
(incl. the Windows drive support from #34632) is unchanged, so relative paths
are still rejected and _ inside a filename is unaffected.
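
A minimal sketch of the before/after behavior described above. The real MEDIA_TAG_CLEANUP_RE is not reproduced in this description, so everything here is a stand-in: the extension list, the leading `(?<!\S)` anchor, the path character class, and the `build`/`extract` helpers are assumptions chosen only to illustrate the wrapper and lookahead change. Unlike extract_media(), this helper returns bare paths rather than `(path, flag)` tuples.

```python
import re

# Hypothetical, trimmed-down stand-in for _MEDIA_EXT_ALTERNATION.
EXT = r"(?:pptx|pdf|png|html)"
# Absolute POSIX path or Windows drive path ending in a recognized extension.
# The character class is an assumption; it allows "_" so filenames are unaffected.
PATH = r"""(?P<path>(?:/|[A-Za-z]:[\\/])[^\s`"'*]*?\.""" + EXT + r")"


def build(wrap, end):
    """Assemble a tag pattern from a wrapper class and a closing lookahead."""
    # (?<!\S) is an assumed leading anchor: the tag must start a token.
    return re.compile(r"(?<!\S)" + wrap + r"MEDIA:\s*" + PATH + wrap + end)


# Before: at most one quote/backtick on each side; lookahead lacks * and _.
OLD = build(r"""[`"']?""", r"""(?=[\s`"',;:)\]}]|$)""")
# After: a short run of quote/emphasis markers; lookahead also accepts * and _.
NEW = build(r"""[`"'*_]{0,3}""", r"""(?=[\s`"',;:)\]}*_]|$)""")


def extract(pattern, text):
    """Return (paths, remaining_text), loosely mirroring extract_media()."""
    paths = [m.group("path") for m in pattern.finditer(text)]
    return paths, pattern.sub("", text).strip()
```

With these stand-ins, `extract(OLD, "**MEDIA:/tmp/r.pptx**")` finds nothing and leaves the literal text in place, while `extract(NEW, ...)` on the same input yields `/tmp/r.pptx` and an empty remainder. A relative path such as `**MEDIA:r.pptx**` is still rejected because the path group must start with `/` or a drive prefix.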

Observed in production across multiple Telegram instances: on instances where
the model emitted emphasis-wrapped tags, those file deliveries were lost 1:1;
instances that never wrapped the tag had zero loss. Some Grok/GPT variants bold
file names by default, so the collision is frequent.

Related Issue

Fixes #23759.

Supersedes #23765 (open since 2026-05-11, no reviews): that patch predates the
refactor that moved the inline regex into the module-level
MEDIA_TAG_CLEANUP_RE, switched the hardcoded extension list to
_MEDIA_EXT_ALTERNATION, added the Windows anchor (#34632) and the masking
layer (#35695). It also reintroduced the loose |\S+ path fallback that the
refactor had removed. This PR applies cleanly on current main, is more
general (handles mixed quote+emphasis and __/***), preserves strict
extension matching, and adds masking-invariant regression coverage. Credit to
@LeonSGP43 for the original report and fix.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/platforms/base.py, MEDIA_TAG_CLEANUP_RE: tolerate
    [`"'*_]{0,3} on both sides of the tag and add */_ to the closing
    lookahead. Both the non-streaming dispatch path and the streaming consumer
    reference this constant, so extraction and visible-text stripping are fixed
    on both paths at once.
  • tests/gateway/test_platform_base.py — added TestExtractMedia cases for
    bold/italic(*)/italic(_) wrapping, mid-prose bold, emphasis-wrapped
    .html, underscore-in-filename (must be unaffected), and
    emphasis-wrapped relative path (must stay rejected).

How to Test

  1. Before: BasePlatformAdapter.extract_media("**MEDIA:/tmp/r.pptx**") → ([], ...) (file dropped, literal text leaks).
  2. After: same input → ([("/tmp/r.pptx", False)], "") (delivered, text stripped).
  3. pytest tests/gateway/test_platform_base.py -q → all pass (incl. the
    pre-existing code-block / inline-code / blockquote / JSON suppression tests,
    confirming no regression of #35695, "extract_media() false-positives on
    example paths in quoted text / code blocks", or #34375, "Gateway MEDIA
    extraction can attach stale files from serialized tool/search-result text").
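
The suppression tests in step 3 keep passing because masking runs before the tag regex, so widening the wrapper set cannot reach into protected spans. A rough sketch of that ordering follows, under loud assumptions: `mask_protected`, `deliverable_paths`, and all four patterns are hypothetical stand-ins, not the real _mask_protected_spans or MEDIA_TAG_CLEANUP_RE, and the JSON-string masking layer is not modeled.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example fence-safe

# Hypothetical patterns for the protected contexts named in this PR.
FENCE_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
INLINE_RE = re.compile(r"`[^`\n]+`")           # inline code spans
QUOTE_RE = re.compile(r"^>.*$", re.MULTILINE)  # blockquote lines
# Simplified emphasis-tolerant tag pattern (assumption, not the real regex).
TAG_RE = re.compile(r"[*_]{0,3}MEDIA:(/\S+?\.(?:pptx|pdf|png))[*_]{0,3}(?=\s|$)")


def mask_protected(text):
    """Blank out protected spans with spaces so the string length is preserved."""
    for rx in (FENCE_RE, INLINE_RE, QUOTE_RE):
        text = rx.sub(lambda m: " " * len(m.group(0)), text)
    return text


def deliverable_paths(text):
    """Match tags only in what survives masking."""
    return TAG_RE.findall(mask_protected(text))
```

In this sketch a tag inside a fenced block, an inline code span, or a blockquote line is blanked before matching, so only the emphasis-wrapped tag in ordinary prose is returned.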

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway):)
  • I searched for existing PRs (found #23765, "fix(gateway): strip markdown wrappers from media paths"; see Related Issue; this supersedes it)
  • My PR contains only changes related to this fix
  • I've run pytest tests/ -q and all tests pass — ran the relevant file tests/gateway/test_platform_base.py (165 passed, 2 skipped); change is an isolated regex edit
  • I've added tests for my changes
  • I've tested on my platform: Debian 13 (and the regex is platform-agnostic)

Documentation & Housekeeping

  • I've updated relevant documentation (docstring above the regex) — code comment added
  • N/A — no config keys changed
  • N/A — no architecture/workflow change
  • I've considered cross-platform impact: the Windows drive anchor (#34632, "MEDIA directive silently fails on Windows") is preserved; emphasis markers are ASCII
  • N/A — no tool behavior/schema change

Models routinely present a file to the user with the delivery tag wrapped in
Markdown emphasis — `**MEDIA:/path.pptx**`, `*MEDIA:/path*`, `_MEDIA:/path_`.
MEDIA_TAG_CLEANUP_RE only tolerated a single leading/trailing quote or backtick
(`[`"']?`), and its closing lookahead set excluded `*` and `_`, so an
emphasis-wrapped tag never matched. The file was then silently never delivered
and the literal `MEDIA:/path` text leaked into the chat instead — the user sees
a path, not the attachment.

Allow a short run of emphasis/quote markers (`[`"'*_]{0,3}`) on both sides of
the tag and add `*`/`_` to the closing lookahead. Code-block, inline-code and
blockquote contexts are still neutralised earlier by `_mask_protected_spans`
(NousResearch#35695), so documentation/example tags remain non-deliverable; the
absolute-path anchor still rejects relative paths; `_` inside a filename is
unaffected.

Adds regression coverage in TestExtractMedia for bold/italic/underscore
wrapping, mid-prose bold, emphasis-wrapped .html, underscore-in-filename, and
emphasis-wrapped relative-path rejection.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Julientalbot

Contributor Author

Companion PR opened: #45786 fixes the structured delivery path (video_generate results were never auto-appended in gateway/run.py), complementing this PR which fixes the prose-parse path (extract_media regex in base.py). Disjoint files, independent — can land in either order.

@alt-glitch added the type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) labels on Jun 13, 2026
Development

Successfully merging this pull request may close these issues.

[Bug] extract_media: Markdown bold ** in MEDIA paths causes file-not-found on all platforms