Skip to content

fix(gateway): accept text/config extensions in MEDIA tag regex#32751

Closed
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/platforms-media-tag-text-types-32601
Closed

fix(gateway): accept text/config extensions in MEDIA tag regex#32751
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/platforms-media-tag-text-types-32601

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

What does this PR do?

BasePlatformAdapter.extract_media() recognizes MEDIA:<path> tags via a regex extension whitelist. The whitelist covers media/archive/office formats but omits common text and config types (.md, .json, .yaml, .yml, .toml, .log). Today, when a model emits MEDIA:/tmp/notes.md, the regex misses it and the raw MEDIA: text survives into the cleaned body, so WeChat/Feishu/Telegram/etc. all render the tag literally instead of routing the file through their document-upload paths.

This PR adds md|json|ya?ml|toml|log to the regex alternation. All other parts of the pattern (path wrapping, quote/backtick stripping, trailing-delimiter lookahead, [[audio_as_voice]] / [[as_document]] directives) are unchanged.

Related Issue

Fixes #32601 (Bug 1 only — see Positioning below).

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/base.py — extend the MEDIA tag regex alternation with md|json|ya?ml|toml|log (single-line change).
  • tests/gateway/test_platform_base.py — add a parametrized test (`TestExtractMedia.test_media_tag_accepts_text_config_extensions`) covering each new extension.

How to Test

  1. Without the fix, BasePlatformAdapter.extract_media('MEDIA:/tmp/notes.md') returns ([], 'MEDIA:/tmp/notes.md') — the tag leaks as text.

  2. With the fix, it returns ([('/tmp/notes.md', False)], '').

  3. ```bash
    uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/gateway/test_platform_base.py -v -k TestExtractMedia
    ```

    20 passed (6 new parametrized cases for md/json/yaml/yml/toml/log + 14 existing).

Adjacent suites (`tests/gateway/test_media_extraction.py`, `test_weixin.py`, `test_wecom.py`, `test_feishu.py`, `test_dingtalk.py`) also pass: 326 passed, 45 skipped.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (`fix(gateway): ...`)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run focused tests for the touched code and all pass
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.x

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (regex-internal change)
  • I've updated `cli-config.yaml.example` if I added/changed config keys — N/A
  • I've updated `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — regex matches absolute (`/...`) and tilde (`~/...`) paths consistently with the existing pattern; no Windows-path coupling
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Related / Positioning

Issue #32601 lists two bugs:

Bug Scope This PR?
1 MEDIA tag leaks for text/config file types `gateway/platforms/base.py::extract_media` — global (all platforms) ✅ Covered here
2 `send_weixin_direct()` returns immediately on `ret=-2`/stale-token `gateway/platforms/weixin.py` — WeChat-only ❌ Distinct file, distinct fix, distinct test surface — left for a separate PR

Keeping the two as separate PRs lets the regex change land independently of any WeChat session/retry semantics review. Happy to widen this PR to cover Bug 2 if preferred, or leave Bug 2 to another contributor.

Audited siblings: `BasePlatformAdapter.extract_local_files` (same file, line ~2458) maintains an independent `_LOCAL_MEDIA_EXTS` tuple that already accepts `.md`, `.json`, `.yaml`, `.yml` for bare-path auto-delivery but is missing `.toml` and `.log`. Intentionally left out of this PR's scope to keep the diff to the MEDIA-tag path the issue specifically calls out. Happy to widen `_LOCAL_MEDIA_EXTS` to mirror the same set if useful.

For New Skills

N/A.

`BasePlatformAdapter.extract_media()` uses a regex whitelist of file
extensions to recognize `MEDIA:<path>` tags emitted by the model. The
whitelist covers images, video, audio, archives and office documents
but omits common text/config file types (`.md`, `.json`, `.yaml`,
`.yml`, `.toml`, `.log`). For those, the regex does not match so the
MEDIA tag survives into the cleaned message body and is rendered as
raw text on every platform (WeChat, Feishu, Slack, Telegram, ...)
instead of being routed to the file-attachment dispatch path.

Add `md|json|ya?ml|toml|log` to the extension alternation. The
existing wrapping (`(?:~/|/)\S+...`, quote/backtick stripping,
trailing-delimiter lookahead) all still apply. Parametrized tests
cover the six new extensions, including the standard wrapping that
already worked for `.png` etc.

Fixes Bug 1 of NousResearch#32601. Bug 2 (session-expiry retry path inside
`send_weixin_direct()`) lives in `gateway/platforms/weixin.py` and is
a distinct fix; left for a follow-up.
Copilot AI review requested due to automatic review settings May 26, 2026 18:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for attaching additional text/config file types via MEDIA: tags in the platform adapter.

Changes:

  • Extend BasePlatformAdapter.extract_media MEDIA: regex to include additional extensions (md, json, yaml/yml, toml, log).
  • Add a parametrized test to ensure these extensions are parsed into media attachments and removed from the cleaned message body.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
tests/gateway/test_platform_base.py Adds coverage for extracting MEDIA: paths with new text/config extensions.
gateway/platforms/base.py Expands the MEDIA: extraction regex to recognize additional text/config extensions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread gateway/platforms/base.py
Comment on lines 2415 to 2417
media_pattern = re.compile(
r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$))[`"']?'''
r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|md|json|ya?ml|toml|log|apk|ipa)(?=[\s`"',;:)\]}]|$))[`"']?'''
)
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists labels May 26, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #32604 which applies the same MEDIA regex extension whitelist fix to base.py (md/json/yaml/yml/toml/log) and additionally includes WeChat session retry logic. Part of the MEDIA regex cluster (#29609, #31560, #30186).

@briandevans

Copy link
Copy Markdown
Contributor Author

Closing — superseded by @hanhan-tg's #32604, which is the better fix here.

  • Root cause coverage: same MEDIA regex extension whitelist (md/json/yaml/yml/toml/log) in , identical scope.
  • Additional coverage: fix(gateway,weixin): extend MEDIA regex whitelist + add retry on session errors #32604 also adds session/rate-limit retry logic in gateway/platforms/weixin.py (send_weixin_direct), which is orthogonal to my PR.
  • Layering: keeping the regex fix consolidated with the related WeCom retry hardening in a single review unit is cleaner than splitting across two PRs.

Thanks @hanhan-tg — your PR is the right one to land. Closing this to keep the queue clean.

hanhan-tg pushed a commit to hanhan-tg/hermes-agent that referenced this pull request May 27, 2026
…sions

Cover each new extension (md, json, yaml, yml, toml, log) with a
parametrized test verifying extract_media correctly parses the
MEDIA: tag and strips it from the cleaned content.

Refs: NousResearch#32601, credit to @briandevans for the test template from NousResearch#32751.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(weixin): MEDIA tag leaks for .md/.json/.yaml/.toml/.log files + no retry on session expiry

3 participants