Skip to content

fix(#30186): re-include .md in MEDIA extraction regex#30193

Closed
xxxigm wants to merge 2 commits into
NousResearch:mainfrom
xxxigm:fix/30186-media-regex-add-md
Closed

fix(#30186): re-include .md in MEDIA extraction regex#30193
xxxigm wants to merge 2 commits into
NousResearch:mainfrom
xxxigm:fix/30186-media-regex-add-md

Conversation

@xxxigm

@xxxigm xxxigm commented May 22, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Restores .md (Markdown) to the MEDIA extraction regex in the three places commit ea49b38 dropped it. Agents can again deliver Markdown files (skills, READMEs, run reports, plans) to users via the MEDIA: tag — not just .txt and .csv.

Before:

  • Agent emits MEDIA:/tmp/report.md in its final response.
  • BasePlatformAdapter.extract_media runs the tightened regex, doesn't match because md isn't in the allowlist, and returns media=[].
  • The literal MEDIA:/tmp/report.md string leaks into the user-visible text and no file is attached.
  • The history-side _TOOL_MEDIA_RE in gateway/run.py has the same gap, so the streaming-history fallback can't recover the path either.

After:

  • The three regex sites include md between csv and apk in the text/document group.
  • Same MEDIA:/tmp/report.md extracts cleanly, the tag is scrubbed from the visible text, and the Markdown file is delivered as a document attachment exactly like .txt / .csv already are.
  • The tightening that ea49b38 introduced (no more permissive \S+ tail) is preserved — .xyz and similar still fail to match.

Related Issue

Fixes #30186.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/platforms/base.py (~L2162) — BasePlatformAdapter.extract_media. Inserted md into the alternation: ...|txt|csv|md|apk|ipa).
  • gateway/run.py (~L16522) — first _TOOL_MEDIA_RE, the dedupe scan that builds _history_media_paths so the current turn doesn't re-emit paths already delivered in earlier turns. Same insertion.
  • gateway/run.py (~L16818) — second _TOOL_MEDIA_RE, the streaming-history fallback that re-extracts MEDIA tags from tool messages when the final response is missing them. Same insertion.
  • tests/gateway/test_media_md_extension_30186.py — new file with 16 regression tests across 4 classes (+255 lines):
    • TestExtractMediaSupportsMd — 7 cases driving extract_media directly: plain .md path is extracted and tag scrubbed (the literal repro from MEDIA extraction regex missing .md extension after ea49b3862 #30186); tilde paths are expanded via os.path.expanduser (assertion is $HOME-hermetic); quoted paths with spaces; [[audio_as_voice]] flag is preserved per-tuple; mixed-document response extracts every tag; .xyz still rejected; bare /tmp/md/ directory not falsely matched.
    • TestToolMediaRegexSupportsMd — 6 cases against a reconstructed copy of _TOOL_MEDIA_RE (the two inline copies in gateway/run.py aren't exported). Covers single .md, uppercase .MD (this copy uses re.IGNORECASE), tilde, multiple .md paths via finditer, mixed extensions, and unrelated-extension rejection.
    • TestGatewayRunSourceContainsMdInBothRegexes — belt-and-braces source-level guard: greps gateway/run.py for the exact txt|csv|md|apk|ipa allowlist tail and asserts exactly two occurrences (one per _TOOL_MEDIA_RE site). A second test asserts the pre-MEDIA extraction regex missing .md extension after ea49b3862 #30186 tail txt|csv|apk|ipa is gone so a revert of either inline copy fires immediately.
    • TestPlatformsBaseSourceContainsMd — same source-level guard for gateway/platforms/base.py.

No other code touched. All other extensions in the alternation, the surrounding extraction loops, the re.IGNORECASE flag on the _TOOL_MEDIA_RE copies, and the dispatch routing are byte-identical.

How to Test

  1. Check out this branch and ensure .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
  2. Run the new regression tests on their own:
    scripts/run_tests.sh tests/gateway/test_media_md_extension_30186.py -v
    
    Expected: 16 passed.
  3. Run the surrounding media/extract sweep to confirm no cross-file regressions:
    scripts/run_tests.sh tests/gateway/test_platform_base.py tests/gateway/test_media_md_extension_30186.py tests/gateway/test_send_image_file.py tests/gateway/test_tts_media_routing.py
    
    Expected: 135 passed, 2 skipped.
  4. Manual repro (matches the issue body):
    • Ask an agent to write a Markdown file (e.g. "summarize X into /tmp/spec.md and deliver it").
    • The agent emits MEDIA:/tmp/spec.md in its final response.
    • Expected before the fix: literal MEDIA:/tmp/spec.md text appears in the chat, no file attachment.
    • Expected after the fix: file is delivered as a document attachment exactly like a .txt or .csv would be.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway): ... + test(gateway): ...)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run scripts/run_tests.sh tests/gateway/test_media_md_extension_30186.py and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: macOS 15.2 (Darwin 24.6.0), Python 3.12

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — N/A (no behaviour change to document beyond ".md works again"; the issue itself is the changelog entry)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A (no new config key)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — regex change; identical behaviour across platforms. Tests are hermetic (no real filesystem; tilde-path test asserts the .md suffix instead of the expanded $HOME prefix).
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A (tools that emit MEDIA: tags are unchanged; only the gateway-side extractor was missing .md).

Screenshots / Logs

$ scripts/run_tests.sh tests/gateway/test_media_md_extension_30186.py -v
collected 16 items

tests/gateway/test_media_md_extension_30186.py::TestExtractMediaSupportsMd::test_plain_md_path_is_extracted PASSED
tests/gateway/test_media_md_extension_30186.py::TestExtractMediaSupportsMd::test_md_in_tilde_path_is_extracted_and_expanded PASSED
tests/gateway/test_media_md_extension_30186.py::TestExtractMediaSupportsMd::test_md_in_quoted_path_with_spaces_is_extracted PASSED
tests/gateway/test_media_md_extension_30186.py::TestExtractMediaSupportsMd::test_md_with_audio_as_voice_directive_preserves_voice_flag PASSED
tests/gateway/test_media_md_extension_30186.py::TestExtractMediaSupportsMd::test_md_alongside_other_documents_in_same_response PASSED
tests/gateway/test_media_md_extension_30186.py::TestExtractMediaSupportsMd::test_unrelated_extension_still_rejected PASSED
tests/gateway/test_media_md_extension_30186.py::TestExtractMediaSupportsMd::test_md_in_path_segment_does_not_match_without_extension PASSED
tests/gateway/test_media_md_extension_30186.py::TestToolMediaRegexSupportsMd::test_md_path_is_matched PASSED
tests/gateway/test_media_md_extension_30186.py::TestToolMediaRegexSupportsMd::test_uppercase_md_path_is_matched PASSED
tests/gateway/test_media_md_extension_30186.py::TestToolMediaRegexSupportsMd::test_tilde_md_path_is_matched PASSED
tests/gateway/test_media_md_extension_30186.py::TestToolMediaRegexSupportsMd::test_multiple_md_paths_in_tool_message_all_matched PASSED
tests/gateway/test_media_md_extension_30186.py::TestToolMediaRegexSupportsMd::test_md_extracted_alongside_other_doc_extensions PASSED
tests/gateway/test_media_md_extension_30186.py::TestToolMediaRegexSupportsMd::test_unrelated_extension_still_rejected PASSED
tests/gateway/test_media_md_extension_30186.py::TestGatewayRunSourceContainsMdInBothRegexes::test_both_tool_media_regex_copies_include_md PASSED
tests/gateway/test_media_md_extension_30186.py::TestGatewayRunSourceContainsMdInBothRegexes::test_no_pre_30186_allowlist_tail_remains PASSED
tests/gateway/test_media_md_extension_30186.py::TestPlatformsBaseSourceContainsMd::test_extract_media_pattern_includes_md PASSED

============================== 16 passed in 0.15s ==============================

$ scripts/run_tests.sh tests/gateway/test_platform_base.py tests/gateway/test_media_md_extension_30186.py tests/gateway/test_send_image_file.py tests/gateway/test_tts_media_routing.py
.................................................................................................................................     [100%]
135 passed, 2 skipped in 0.56s

xxxigm added 2 commits May 22, 2026 10:05
…30186)

Commit ea49b38 ("tighten MEDIA extraction regex + silent skip on
file-not-found") replaced the permissive MEDIA:\S+ pattern with an
explicit extension allowlist in three places. The .md entry was
dropped from the text/document group, so agents could no longer
deliver Markdown files to users via the MEDIA: tag — only .txt and
.csv worked from that group. Markdown is a standard document format
routinely exchanged between agents and users (skills, READMEs, run
reports, plans) and belongs alongside txt/csv.

Restore md in all three regex sites the issue cites:

  * gateway/platforms/base.py:2162 — BasePlatformAdapter.extract_media,
    the user-facing extractor that runs over every final response.
  * gateway/run.py:16522 — first _TOOL_MEDIA_RE, builds the history
    dedupe set so the current turn doesn't re-emit paths already
    delivered.
  * gateway/run.py:16818 — second _TOOL_MEDIA_RE, the streaming-history
    fallback that re-extracts MEDIA tags from tool messages when the
    final response is missing them.

All three patches are a literal addition of "|md" to the existing
allowlist between "csv" and "apk"; the rest of the alternation,
flags, and surrounding extraction logic are byte-identical. The
tightening that ea49b38 introduced (no more permissive \S+ tail) is
preserved — an unrelated extension like .xyz still fails to match.

Refs NousResearch#30186.
…action

16 new tests in tests/gateway/test_media_md_extension_30186.py pin
.md into all three regex sites the fix touches:

  * TestExtractMediaSupportsMd — 7 cases driving
    BasePlatformAdapter.extract_media directly:
      - plain MEDIA:/tmp/report.md extracts the path and scrubs the
        tag from the visible text (the literal repro from NousResearch#30186);
      - tilde MEDIA:~/Documents/spec.md is extracted and expanded via
        os.path.expanduser (assertion is HOME-hermetic — checks the
        .md suffix, not the prefix);
      - quoted path with spaces ('/tmp/my notes/spec final.md') works
        — guards against regressing the quoted-path branch when a new
        extension is added to the alternation;
      - [[audio_as_voice]] voice flag is preserved per-tuple even
        when the file is .md (the dispatch layer decides, not the
        extractor);
      - mixed-document response (md + csv + txt + pdf in the same
        block) extracts every tag — proves the new entry doesn't
        short-circuit the alternation;
      - unrelated extension like .xyz still fails to match — the
        tightening from ea49b38 is preserved;
      - bare /tmp/md/ directory path (no .md extension) does NOT
        match — guards against false-positive on path components
        that merely contain "md".

  * TestToolMediaRegexSupportsMd — 6 cases against a reconstructed
    copy of the _TOOL_MEDIA_RE pattern that appears twice in
    gateway/run.py. Covers single .md, uppercase .MD (this copy uses
    re.IGNORECASE), tilde, multiple .md paths via finditer (the
    dedupe loop must catch every one or the streaming fallback
    re-emits stale paths next turn), mixed extensions, and unrelated
    extension rejection.

  * TestGatewayRunSourceContainsMdInBothRegexes — belt-and-braces
    source-level guard: greps gateway/run.py for the exact
    "txt|csv|md|apk|ipa" allowlist tail and asserts exactly two
    occurrences (one per _TOOL_MEDIA_RE site). A second test asserts
    the pre-NousResearch#30186 tail "txt|csv|apk|ipa" is gone — so a revert of
    either inline copy fires immediately.

  * TestPlatformsBaseSourceContainsMd — same source-level guard for
    gateway/platforms/base.py.

Refs NousResearch#30186.
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists labels May 22, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #29609 which dynamically derives the MEDIA extension allowlist from SUPPORTED_DOCUMENT_TYPES — the preferred approach per #30106 triage. Also duplicates #30192 (identical .md-only fix). See also #29710 (.html variant).

@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MEDIA extraction regex missing .md extension after ea49b3862

3 participants