fix(#30186): re-include .md in MEDIA extraction regex #30193
Closed
xxxigm wants to merge 2 commits into
…30186)
Commit ea49b38 ("tighten MEDIA extraction regex + silent skip on file-not-found") replaced the permissive MEDIA:\S+ pattern with an explicit extension allowlist in three places. The .md entry was dropped from the text/document group, so agents could no longer deliver Markdown files to users via the MEDIA: tag; only .txt and .csv worked from that group. Markdown is a standard document format routinely exchanged between agents and users (skills, READMEs, run reports, plans) and belongs alongside txt/csv.
Restore md in all three regex sites the issue cites:
* gateway/platforms/base.py:2162 (BasePlatformAdapter.extract_media): the user-facing extractor that runs over every final response.
* gateway/run.py:16522 (first _TOOL_MEDIA_RE): builds the history dedupe set so the current turn doesn't re-emit paths already delivered.
* gateway/run.py:16818 (second _TOOL_MEDIA_RE): the streaming-history fallback that re-extracts MEDIA tags from tool messages when the final response is missing them.
All three patches are a literal addition of "|md" to the existing allowlist between "csv" and "apk"; the rest of the alternation, flags, and surrounding extraction logic are byte-identical. The tightening that ea49b38 introduced (no more permissive \S+ tail) is preserved: an unrelated extension like .xyz still fails to match.
Refs NousResearch#30186.
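The effect of the allowlist change can be illustrated with a minimal, standalone sketch. This is not the code in gateway/platforms/base.py: the pattern below is a shortened, hypothetical allowlist that handles only unquoted paths, and the names MEDIA_RE and extract_media_sketch are invented for illustration. It only shows why adding md to the alternation lets a .md path match while .xyz and a bare /tmp/md/ directory still do not.

```python
import os
import re

# Hypothetical, shortened allowlist. The real alternation has more entries;
# only the "txt|csv|md|apk|ipa" tail is taken from the PR description.
_EXTS = r"(?:png|jpg|pdf|txt|csv|md|apk|ipa)"

# MEDIA:<path>, where the unquoted path must end in ".<allowlisted ext>"
# and be followed by whitespace or the end of the text.
MEDIA_RE = re.compile(r"MEDIA:(?P<path>\S+\." + _EXTS + r")(?=\s|$)")


def extract_media_sketch(text: str) -> tuple[list[str], str]:
    """Return (extracted paths, text with the matched MEDIA tags removed)."""
    paths = [os.path.expanduser(m.group("path")) for m in MEDIA_RE.finditer(text)]
    cleaned = MEDIA_RE.sub("", text).strip()
    return paths, cleaned


if __name__ == "__main__":
    # A .md path is extracted and the tag is scrubbed from the visible text.
    print(extract_media_sketch("Here is the report. MEDIA:/tmp/report.md"))
    # An extension outside the allowlist is left in the text untouched.
    print(extract_media_sketch("MEDIA:/tmp/data.xyz"))
    # A directory named "md" has no ".md" suffix, so it does not match.
    print(extract_media_sketch("MEDIA:/tmp/md/"))
```

Requiring a literal dot before the extension is what keeps a path component that merely contains "md" from matching in this sketch.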
…action
16 new tests in tests/gateway/test_media_md_extension_30186.py pin
.md into all three regex sites the fix touches:
* TestExtractMediaSupportsMd — 7 cases driving
BasePlatformAdapter.extract_media directly:
- plain MEDIA:/tmp/report.md extracts the path and scrubs the
tag from the visible text (the literal repro from NousResearch#30186);
- tilde MEDIA:~/Documents/spec.md is extracted and expanded via
os.path.expanduser (assertion is HOME-hermetic — checks the
.md suffix, not the prefix);
- quoted path with spaces ('/tmp/my notes/spec final.md') works
— guards against regressing the quoted-path branch when a new
extension is added to the alternation;
- [[audio_as_voice]] voice flag is preserved per-tuple even
when the file is .md (the dispatch layer decides, not the
extractor);
- mixed-document response (md + csv + txt + pdf in the same
block) extracts every tag — proves the new entry doesn't
short-circuit the alternation;
- unrelated extension like .xyz still fails to match — the
tightening from ea49b38 is preserved;
- bare /tmp/md/ directory path (no .md extension) does NOT
match — guards against false-positive on path components
that merely contain "md".
* TestToolMediaRegexSupportsMd — 6 cases against a reconstructed
copy of the _TOOL_MEDIA_RE pattern that appears twice in
gateway/run.py. Covers single .md, uppercase .MD (this copy uses
re.IGNORECASE), tilde, multiple .md paths via finditer (the
dedupe loop must catch every one or the streaming fallback
re-emits stale paths next turn), mixed extensions, and unrelated
extension rejection.
* TestGatewayRunSourceContainsMdInBothRegexes — belt-and-braces
source-level guard: greps gateway/run.py for the exact
"txt|csv|md|apk|ipa" allowlist tail and asserts exactly two
occurrences (one per _TOOL_MEDIA_RE site). A second test asserts
the pre-NousResearch#30186 tail "txt|csv|apk|ipa" is gone — so a revert of
either inline copy fires immediately.
* TestPlatformsBaseSourceContainsMd — same source-level guard for
gateway/platforms/base.py.
Refs NousResearch#30186.
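The tool-message regex cases and the source-level guard described above could be sketched roughly as follows. This is an assumption-laden illustration, not the contents of tests/gateway/test_media_md_extension_30186.py: TOOL_MEDIA_RE is a hypothetical stand-in for the inline _TOOL_MEDIA_RE copies (only the allowlist tail and the re.IGNORECASE flag come from the commit message), and history_media_paths and source_guard_ok are invented helper names.

```python
import re

NEW_TAIL = "txt|csv|md|apk|ipa"  # allowlist tail after the fix
OLD_TAIL = "txt|csv|apk|ipa"     # allowlist tail before the fix

# Hypothetical stand-in for the two inline _TOOL_MEDIA_RE copies.
TOOL_MEDIA_RE = re.compile(
    r"MEDIA:(\S+\.(?:pdf|" + NEW_TAIL + r"))(?=\s|$)",
    re.IGNORECASE,
)


def history_media_paths(tool_messages: list[str]) -> set[str]:
    """Collect every MEDIA path in earlier tool messages (dedupe set)."""
    return {
        m.group(1)
        for msg in tool_messages
        for m in TOOL_MEDIA_RE.finditer(msg)  # finditer catches every tag
    }


def source_guard_ok(source: str, expected_sites: int) -> bool:
    """True if the new tail appears exactly expected_sites times and the
    old tail is gone, so reverting any single copy makes this fail."""
    return source.count(NEW_TAIL) == expected_sites and OLD_TAIL not in source


if __name__ == "__main__":
    msgs = ["wrote MEDIA:/tmp/a.md and MEDIA:/tmp/B.MD", "MEDIA:/tmp/d.xyz"]
    print(history_media_paths(msgs))
    fixed = f'A = r"...|{NEW_TAIL})"\nB = r"...|{NEW_TAIL})"'
    print(source_guard_ok(fixed, expected_sites=2))
```

Because the old tail is not a substring of the new one, checking both the occurrence count and the absence of the old tail detects a revert of either copy independently.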
This was referenced May 22, 2026
Superseded by #34844, which consolidates this cluster. This PR widens the …
Closing as superseded. Thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.
What does this PR do?
Restores .md (Markdown) to the MEDIA extraction regex in the three places commit ea49b38 dropped it. Agents can again deliver Markdown files (skills, READMEs, run reports, plans) to users via the MEDIA: tag, not just .txt and .csv.
Before:
* The agent emits MEDIA:/tmp/report.md in its final response.
* BasePlatformAdapter.extract_media runs the tightened regex, doesn't match because md isn't in the allowlist, and returns media=[].
* The MEDIA:/tmp/report.md string leaks into the user-visible text and no file is attached.
* _TOOL_MEDIA_RE in gateway/run.py has the same gap, so the streaming-history fallback can't recover the path either.
After:
* All three regexes include md between csv and apk in the text/document group.
* MEDIA:/tmp/report.md extracts cleanly, the tag is scrubbed from the visible text, and the Markdown file is delivered as a document attachment exactly like .txt/.csv already are.
* The tightening that ea49b38 introduced (no more permissive \S+ tail) is preserved: .xyz and similar still fail to match.
Related Issue
Fixes #30186.
Type of Change
Changes Made
* gateway/platforms/base.py (~L2162): BasePlatformAdapter.extract_media. Inserted md into the alternation: ...|txt|csv|md|apk|ipa).
* gateway/run.py (~L16522): first _TOOL_MEDIA_RE, the dedupe scan that builds _history_media_paths so the current turn doesn't re-emit paths already delivered in earlier turns. Same insertion.
* gateway/run.py (~L16818): second _TOOL_MEDIA_RE, the streaming-history fallback that re-extracts MEDIA tags from tool messages when the final response is missing them. Same insertion.
* tests/gateway/test_media_md_extension_30186.py: new file with 16 regression tests across 4 classes (+255 lines):
  - TestExtractMediaSupportsMd: 7 cases driving extract_media directly: plain .md path is extracted and tag scrubbed (the literal repro from #30186); tilde paths are expanded via os.path.expanduser (assertion is $HOME-hermetic); quoted paths with spaces; [[audio_as_voice]] flag is preserved per-tuple; mixed-document response extracts every tag; .xyz still rejected; bare /tmp/md/ directory not falsely matched.
  - TestToolMediaRegexSupportsMd: 6 cases against a reconstructed copy of _TOOL_MEDIA_RE (the two inline copies in gateway/run.py aren't exported). Covers single .md, uppercase .MD (this copy uses re.IGNORECASE), tilde, multiple .md paths via finditer, mixed extensions, and unrelated-extension rejection.
  - TestGatewayRunSourceContainsMdInBothRegexes: belt-and-braces source-level guard: greps gateway/run.py for the exact txt|csv|md|apk|ipa allowlist tail and asserts exactly two occurrences (one per _TOOL_MEDIA_RE site). A second test asserts the pre-#30186 tail txt|csv|apk|ipa is gone, so a revert of either inline copy fires immediately.
  - TestPlatformsBaseSourceContainsMd: same source-level guard for gateway/platforms/base.py.
No other code touched. All other extensions in the alternation, the surrounding extraction loops, the re.IGNORECASE flag on the _TOOL_MEDIA_RE copies, and the dispatch routing are byte-identical.
How to Test
* … .venv is set up: python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"
* … /tmp/spec.md and deliver it").
* … MEDIA:/tmp/spec.md in its final response.
* … MEDIA:/tmp/spec.md text appears in the chat, no file attachment.
* … .txt or .csv would be.
Checklist
Code
* … fix(gateway): ... + test(gateway): ...)
* … scripts/run_tests.sh tests/gateway/test_media_md_extension_30186.py and all tests pass
Documentation & Housekeeping
* … docs/, docstrings): N/A (no behaviour change to document beyond ".md works again"; the issue itself is the changelog entry)
* … cli-config.yaml.example if I added/changed config keys: N/A (no new config key)
* … CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows: N/A
* … .md suffix instead of the expanded $HOME prefix).
* … MEDIA: tags are unchanged; only the gateway-side extractor was missing .md).
Screenshots / Logs