fix(gateway): extend MEDIA: regex whitelist to include common document/code extensions (#37318) #37395

Closed
alaamohanad169-ship-it wants to merge 2 commits into NousResearch:main from alaamohanad169-ship-it:fix/37318-media-md-extensions

@alaamohanad169-ship-it

Summary

Fixes the silent failure described in #37318. When an agent emitted MEDIA:<path> for a common document, data, or source-code file (e.g. .md, .json, .yaml, .py, .sh, .tar.gz), the loose cleanup regex stripped the tag from the body, the file was never delivered, and the agent received no signal that the tag had been lost.

Root cause

Two whitelists covered only a narrow set of extensions (png, jpg, pdf, txt, csv, apk, ipa, etc.) and silently dropped everything else:

  1. The MEDIA_DELIVERY_EXTS tuple in gateway/platforms/base.py, the single source of truth that drives MEDIA_TAG_CLEANUP_RE and extract_local_files.
  2. The _TOOL_MEDIA_RE regex in gateway/run.py, which exists in two copies: the module-level constant at line 684 and the inline copy in GatewayRunner at line 17769 used for history deduplication.
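
For reviewers who want to see the failure mode in isolation, here is a minimal sketch. It assumes a strict extraction pattern with an extension whitelist, paired with a loose cleanup pattern that strips any MEDIA: tag. The names `OLD_EXTS`, `STRICT_RE`, `LOOSE_CLEANUP_RE`, and `extract_media_sketch` are hypothetical stand-ins, not the actual gateway code, and the real patterns are likely more involved.

```python
import re

# Hypothetical, simplified stand-ins for the gateway patterns (assumption:
# extraction is whitelist-based while cleanup is extension-agnostic).
OLD_EXTS = ("png", "jpg", "pdf", "txt", "csv", "apk", "ipa")

# Strict pattern: only captures paths ending in a whitelisted extension.
STRICT_RE = re.compile(r"MEDIA:(\S+\.(?:%s))(?=\s|$)" % "|".join(OLD_EXTS))
# Loose pattern: strips anything that looks like a MEDIA: tag.
LOOSE_CLEANUP_RE = re.compile(r"MEDIA:\S+")


def extract_media_sketch(text: str) -> tuple[list[str], str]:
    """Return (captured paths, cleaned text) for a message body."""
    paths = STRICT_RE.findall(text)
    cleaned = LOOSE_CLEANUP_RE.sub("", text).strip()
    return paths, cleaned


# Whitelisted extension: the path is captured and the tag is removed.
print(extract_media_sketch("Here is the chart MEDIA:/tmp/chart.png"))
# Non-whitelisted extension: nothing is captured, yet the tag is still removed.
print(extract_media_sketch("Here are the notes MEDIA:/tmp/notes.md"))
```

In the second call the tag disappears from the text but no path is returned, which is the silent loss described in the issue.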

Fix

Add the missing common extensions to all three call sites:

- txt|csv|apk|ipa
+ txt|csv|apk|ipa|md|markdown|json|yaml|yml|toml|py|js|sh|tar\.gz|tgz

And extend MEDIA_DELIVERY_EXTS:

  • Add .markdown and .toml to the data row
  • Add .py, .js, .sh in a new "Source / script files" group
  • Add .tar.gz to the archives row
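
As a rough illustration of why a single source-of-truth tuple helps, the sketch below derives the regex alternation from an extension tuple, so that multi-part suffixes such as .tar.gz are escaped consistently. `EXTS_SKETCH`, `build_tag_re`, and `find_paths` are hypothetical names used only for this example. The tuple is an illustrative subset and the pattern shape is an assumption, not the production regex.

```python
import re

# Illustrative subset only; the real MEDIA_DELIVERY_EXTS tuple is longer
# and its exact grouping may differ.
EXTS_SKETCH = (
    ".txt", ".csv",                                          # previously supported
    ".md", ".markdown", ".json", ".yaml", ".yml", ".toml",  # documents / data
    ".py", ".js", ".sh",                                     # source / script files
    ".tar.gz", ".tgz",                                       # archives
)


def build_tag_re(exts):
    """Build a MEDIA: tag pattern from a tuple of dotted extensions."""
    # re.escape turns "tar.gz" into "tar\.gz"; longest-first ordering is a
    # defensive choice so a short suffix is never tried before a longer one.
    alts = sorted((re.escape(e.lstrip(".")) for e in exts), key=len, reverse=True)
    return re.compile(r"MEDIA:(\S+\.(?:%s))(?=\s|$)" % "|".join(alts), re.IGNORECASE)


TAG_RE = build_tag_re(EXTS_SKETCH)


def find_paths(text):
    """Return every whitelisted MEDIA: path found in the text."""
    return TAG_RE.findall(text)


print(find_paths("MEDIA:/tmp/backup.tar.gz and MEDIA:/tmp/run.sh"))
print(find_paths("MEDIA:/tmp/blob.xyz"))  # unknown extension is still ignored
```

Generating the alternation this way would avoid hand-editing several regex copies, though whether the production code can share one pattern across base.py and run.py is not something this sketch settles.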

Also updates tests/gateway/test_run_tool_media_re.py to keep its inline test regexes in sync with the production patterns.

Test coverage

New test file tests/gateway/test_extract_media_extensions.py (246 lines) pins the new behaviour:

  • TestMediaDeliveryExtsTuple: parametrized over 11 new extensions and 7 existing ones, verifies membership in the source-of-truth tuple
  • TestExtractMediaRecognizesNewExtensions: verifies BasePlatformAdapter.extract_media captures MEDIA:<path> tags for each new extension, strips them from the cleaned text, and still ignores unknown extensions
  • TestBasePlatformAdapterExtractMediaRegression: end-to-end regression on the exact failure cases from the issue (.md and .json paths)

$ python -m pytest tests/gateway/test_extract_media_extensions.py tests/gateway/test_run_tool_media_re.py -q
67 passed in 2.60s

Verification

Local Termux: BasePlatformAdapter.extract_media('MEDIA:/tmp/notes.md\n') now returns [('/tmp/notes.md', False)] (previously returned []).

Closes #37318

Out of scope

Siblings #37315 (QQ Bot) and #37364 (WeCom) report the same root cause via separate platform adapters; this PR fixes the central whitelist but does not touch the per-platform _send_to_platform branches. They can be addressed in a follow-up.

@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels on Jun 2, 2026
@alaamohanad169-ship-it marked this pull request as ready for review on June 2, 2026 23:58
@alaamohanad169-ship-it force-pushed the fix/37318-media-md-extensions branch from 8b4cd7d to 614b94b on June 3, 2026 00:50
Development

Successfully merging this pull request may close these issues.

MEDIA: tag fails silently for .md files — extract_media() regex whitelist missing common extensions