fix(gateway): deliver .json and .md files sent via MEDIA: tags #33089

Closed
ZMGID wants to merge 3 commits into NousResearch:main from ZMGID:fix/media-tag-json-md

ZMGID commented May 27, 2026

Summary

MEDIA:<path> tags emitted by the agent are only turned into native file
attachments if the path ends in a whitelisted extension. The whitelist
listed common document/data types (txt, csv, docx, xlsx, pdf,
zip, …) but was missing .json and .md. As a result, when the agent
produced a JSON or Markdown file and sent MEDIA:/path/file.json /
MEDIA:/path/file.md, the tag was never matched, so the literal text was
delivered to the chat instead of the file. Every other document type worked,
which made this look type-specific.

The extension list appears in three places (one extract_media regex plus
two GatewayRunner tool-result MEDIA regexes); json|md is added to all
three so behavior is consistent across the CLI/tool and platform-reply paths.
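
The failure mode described above can be reproduced in isolation. The sketch below uses a deliberately simplified pattern, not the project's actual regex, and the names `OLD_EXTS`, `NEW_EXTS`, and `build_pattern` are hypothetical. It only illustrates how a path with an unlisted extension fails to match, which leaves the literal tag in the message text.

```python
import re

# Simplified stand-ins for the extension whitelist before and after the fix.
# These are illustrative subsets, NOT the lists used in the repository.
OLD_EXTS = "txt|csv|pdf|zip"
NEW_EXTS = "txt|csv|json|md|pdf|zip"


def build_pattern(exts: str) -> re.Pattern:
    """Build a reduced MEDIA: matcher for absolute or ~/ paths (hypothetical helper)."""
    # The lookahead requires the extension to end the token, so ".md" cannot
    # match as a prefix of a longer extension such as ".mdx".
    return re.compile(rf"MEDIA:\s*((?:~/|/)\S+\.(?:{exts}))(?=\s|$)")


old_pattern = build_pattern(OLD_EXTS)
new_pattern = build_pattern(NEW_EXTS)

text = "Here is the report MEDIA:/tmp/report.json"

# Without json in the list there is no match, so the tag stays in the text.
old_match = old_pattern.search(text)

# With json added, the path is captured and can be sent as an attachment.
new_match = new_pattern.search(text)
```

Because the extension list is the only thing that differs between the two patterns, any copy of the list that is not updated keeps the old behavior on its code path.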

Changes

  • gateway/platforms/base.py — add json|md to the extract_media extension whitelist
  • gateway/run.py — add json|md to the two GatewayRunner MEDIA tool-result regexes
  • tests/gateway/test_platform_base.py — regression test asserting MEDIA:/…/.json and /…/.md are extracted

Reproduced live on Telegram (personal WeChat affected identically): with the
fix, freshly-produced .json/.md files are now delivered as native
attachments. Files outside the cache allowlist and older than the recency
window are still gated by validate_media_delivery_path, unchanged.

Test plan

  • pytest tests/gateway/test_platform_base.py::TestExtractMedia -q → 15 passed
  • Manual: sending MEDIA:/tmp/x.json and MEDIA:/tmp/x.md via send_message to a Telegram chat delivers both as document attachments

ZMGID added 2 commits May 27, 2026 15:16
extract_media regex (base.py) and the two GatewayRunner tool-result MEDIA
regexes (run.py) listed common document extensions but were missing .json
and .md, so MEDIA:/path/x.{json,md} emitted by the agent was never
extracted and got delivered as raw text instead of as a native attachment.

Whitelist already contained txt/csv/docx/pdf/zip/etc., so other document
types worked; only json and md were affected.
Regression test for extract_media now accepting MEDIA:/path.{json,md}.
Copilot AI review requested due to automatic review settings May 27, 2026 07:23

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for treating .json and .md file paths as extractable MEDIA: attachments, preventing them from being forwarded as raw text.

Changes:

  • Extend the MEDIA: extension whitelist to include json and md
  • Update two tool-related media path regexes to recognize MEDIA: .json/.md paths
  • Add a regression test validating extraction/cleaning for .json and .md

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/gateway/test_platform_base.py | Adds regression coverage for .json/.md extraction via BasePlatformAdapter.extract_media. |
| gateway/run.py | Updates two regex patterns used to detect MEDIA: file paths to allow .json/.md. |
| gateway/platforms/base.py | Extends extract_media() regex whitelist to recognize .json/.md attachments. |

Comment thread gateway/platforms/base.py Outdated
  # and quoted/backticked paths for LLM-formatted outputs.
  media_pattern = re.compile(
- r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$))[`"']?'''
+ r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|json|md|apk|ipa)(?=[\s`"',;:)\]}]|$))[`"']?'''
Comment thread gateway/run.py Outdated
Comment on lines +16853 to +16856
  r'MEDIA:((?:/|~\/)\S+\.(?:png|jpe?g|gif|webp|'
  r'mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|'
  r'flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|'
- r'txt|csv|apk|ipa))',
+ r'txt|csv|json|md|apk|ipa))',
alt-glitch added the labels type/bug (Something isn't working), P2 (Medium, degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) on May 27, 2026
@alt-glitch

Collaborator

Related: #30588 adds the same missing extensions and is itself a duplicate of #29609, which takes the preferred approach of deriving the extensions dynamically from SUPPORTED_DOCUMENT_TYPES rather than adding them manually to each regex. This PR hardcodes json|md in three places, whereas #29609 would prevent future omissions.
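
The idea of deriving the extension list rather than hardcoding it can be sketched as follows. This is only an illustration: the shape of `SUPPORTED_DOCUMENT_TYPES` shown here (a mapping from dotted extension to MIME type) is an assumption, `build_media_pattern` is a hypothetical helper, and the pattern is a reduced version of the real one.

```python
import re

# Assumed shape: dotted extension -> MIME type. The real
# SUPPORTED_DOCUMENT_TYPES in the repository may be structured differently.
SUPPORTED_DOCUMENT_TYPES = {
    ".pdf": "application/pdf",
    ".txt": "text/plain",
    ".json": "application/json",
    ".md": "text/markdown",
}


def build_media_pattern(types: dict) -> re.Pattern:
    """Compile a reduced MEDIA: matcher from a shared extension table (hypothetical)."""
    # Strip the leading dot, sort longest first, and escape each extension
    # so that any regex metacharacters in the table are treated literally.
    exts = sorted((ext.lstrip(".") for ext in types), key=len, reverse=True)
    alternation = "|".join(re.escape(ext) for ext in exts)
    return re.compile(rf"MEDIA:\s*((?:~/|/)\S+\.(?:{alternation}))(?=\s|$)")


# Every regex that needs the list is built from the same table, so adding
# an entry to the table updates all of them at once.
MEDIA_PATTERN = build_media_pattern(SUPPORTED_DOCUMENT_TYPES)

paths = MEDIA_PATTERN.findall("MEDIA:/tmp/a.json and MEDIA:/tmp/b.md")
```

With this arrangement, supporting a new type is a one-line change to the table instead of an edit to each regex.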

Same gap as json/md: .html and .htm were missing from the extract_media
regex (base.py) and the two GatewayRunner tool-result MEDIA regexes
(run.py), so MEDIA:/path/file.html was delivered as raw text instead of a
native attachment. extract_local_files already accepted .html/.htm for the
platform-reply path, so this aligns the MEDIA-tag path with it.
@teknium1

Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction. On its own, though, it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag whose extension is still outside the (now wider) list is deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) and replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.
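
The difference between the loose strip and an extension-anchored strip can be illustrated with a small sketch. The loose pattern `MEDIA:\s*\S+` is quoted from the comment above; everything else is an assumption: the contents of `MEDIA_DELIVERY_EXTS` here are a made-up subset, and `ANCHORED_STRIP` is one possible way to anchor on the extension, not necessarily what #34844 does.

```python
import re

# Hypothetical subset of the shared extension set named in the comment.
MEDIA_DELIVERY_EXTS = ("json", "md", "pdf", "txt")

# Loose strip: removes any MEDIA: token regardless of its extension.
LOOSE_STRIP = re.compile(r"MEDIA:\s*\S+")

# Anchored strip (illustrative): removes the token only when it ends in a
# known extension, so unknown-extension paths are left in the text.
ANCHORED_STRIP = re.compile(
    r"MEDIA:\s*\S+\.(?:" + "|".join(MEDIA_DELIVERY_EXTS) + r")(?=\s|$)"
)

text = "Saved to MEDIA:/tmp/notes.xyz for later"

# The loose strip deletes the path even though nothing extracted it.
loose = LOOSE_STRIP.sub("", text)

# The anchored strip leaves the text untouched, so a later pass that looks
# for bare local paths still has a chance to see /tmp/notes.xyz.
anchored = ANCHORED_STRIP.sub("", text)
```

A tag with a listed extension, such as `MEDIA:/tmp/a.md`, is still removed by the anchored pattern, so the cleanup for delivered files is unchanged.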

teknium1 closed this May 29, 2026