fix(gateway): extend MEDIA: regex whitelist to include common document/code extensions (#37318) #37395
Closed
alaamohanad169-ship-it wants to merge 2 commits into
Summary
Fixes the silent failure described in #37318: when an agent emitted MEDIA:<path> for a common document, data, or source-code file (e.g. .md, .json, .yaml, .py, .sh, .tar.gz), the loose cleanup regex stripped the tag from the body, the file was never delivered, and the agent received no signal that the tag had been lost.
Root cause
Two whitelists covered only a narrow set of extensions (png, jpg, pdf, txt, csv, apk, ipa, etc.) and silently dropped everything else:
- MEDIA_DELIVERY_EXTS tuple in gateway/platforms/base.py: the single source of truth that drives MEDIA_TAG_CLEANUP_RE and extract_local_files.
- _TOOL_MEDIA_RE regex in gateway/run.py (both the module-level constant at line 684 and the inline copy in GatewayRunner at line 17769 used for history deduplication).
Fix
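The arrangement described above, where one extension tuple drives the matching regex, can be sketched roughly as follows. This is an illustration only, not the code in gateway/platforms/base.py: OLD_EXTS, NEW_EXTS, and build_media_re are hypothetical names, the tuples are abbreviated, and the pattern is an assumed shape rather than the production one.

```python
import re

# Hypothetical stand-ins: the real whitelist is MEDIA_DELIVERY_EXTS, whose
# full contents are not reproduced in this PR description.
OLD_EXTS = ("png", "jpg", "pdf", "txt", "csv")
NEW_EXTS = OLD_EXTS + ("md", "json", "yaml", "py", "sh", "tar.gz")


def build_media_re(exts):
    """Compile a MEDIA:<path> pattern that accepts only the given extensions."""
    # Sort longest first so a multi-part extension such as "tar.gz" is tried
    # before any shorter alternative.
    alternation = "|".join(
        re.escape(ext) for ext in sorted(exts, key=len, reverse=True)
    )
    # Assumed pattern shape: a non-whitespace path ending in a whitelisted
    # extension, followed by whitespace or end of input.
    return re.compile(r"MEDIA:(\S+?\.(?:%s))(?=\s|$)" % alternation)


text = "Here you go MEDIA:/tmp/notes.md and MEDIA:/tmp/plot.png"
print(build_media_re(OLD_EXTS).findall(text))  # ['/tmp/plot.png']
print(build_media_re(NEW_EXTS).findall(text))  # ['/tmp/notes.md', '/tmp/plot.png']
```

With the narrow tuple the .md tag simply fails to match, which is the silent part of the failure: nothing raises, the path is just absent from the result.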
Add the missing common extensions to all three call sites:
And extend MEDIA_DELIVERY_EXTS:
- .markdown and .toml to the data row
- .py, .js, .sh in a new "Source / script files" group
- .tar.gz to the archives row
Also updates tests/gateway/test_run_tool_media_re.py to keep its inline test regexes in sync with the production patterns.
Test coverage
New test file tests/gateway/test_extract_media_extensions.py (246 lines) pins the new behaviour:
- TestMediaDeliveryExtsTuple: parametrized over 11 new extensions and 7 existing ones, verifies membership in the source-of-truth tuple
- TestExtractMediaRecognizesNewExtensions: verifies BasePlatformAdapter.extract_media captures MEDIA:<path> tags for each new extension, strips them from the cleaned text, and still ignores unknown extensions
- TestBasePlatformAdapterExtractMediaRegression: end-to-end regression on the exact failure cases from the issue (.md and .json paths)
Verification
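The three behaviours those test classes pin down (capture the tag, strip it from the cleaned text, ignore unknown extensions) can be exercised against a small stand-in extractor. The sketch below is not BasePlatformAdapter.extract_media: ALLOWED_EXTS is a hypothetical, abbreviated whitelist, the return shape (a list plus the cleaned text) is an assumption, and the second tuple element only mirrors the (path, False) shape quoted in this PR without knowing what the flag means.

```python
import re

# Hypothetical, abbreviated whitelist standing in for MEDIA_DELIVERY_EXTS.
ALLOWED_EXTS = ("png", "pdf", "md", "json", "tar.gz")

_EXT_ALT = "|".join(
    re.escape(ext) for ext in sorted(ALLOWED_EXTS, key=len, reverse=True)
)
_MEDIA_RE = re.compile(r"MEDIA:(\S+?\.(?:%s))(?=\s|$)" % _EXT_ALT)


def extract_media(text):
    """Return ([(path, flag), ...], cleaned_text) for whitelisted MEDIA: tags."""
    # The flag's meaning is not stated in the PR, so it is fixed to False here.
    found = [(m.group(1), False) for m in _MEDIA_RE.finditer(text)]
    # Only whitelisted tags are removed; unknown ones stay visible in the text.
    cleaned = _MEDIA_RE.sub("", text).strip()
    return found, cleaned


# New extensions are captured and stripped from the cleaned text.
for path in ("/tmp/notes.md", "/tmp/data.json", "/tmp/src.tar.gz"):
    media, cleaned = extract_media("here: MEDIA:%s\n" % path)
    assert media == [(path, False)]
    assert "MEDIA:" not in cleaned

# Unknown extensions are still ignored by the extractor.
media, cleaned = extract_media("MEDIA:/tmp/blob.xyz")
assert media == []
```

In this stand-in an unrecognised tag is left in the text rather than removed, so a missed delivery stays visible; whether the production cleanup behaves the same way is not something this sketch claims.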
Local Termux: BasePlatformAdapter.extract_media('MEDIA:/tmp/notes.md\n') now returns [('/tmp/notes.md', False)] (previously returned []).
Closes #37318
Out of scope
Siblings #37315 (QQ Bot) and #37364 (WeCom) report the same root cause via separate platform adapters; this PR fixes the central whitelist but does not touch the per-platform _send_to_platform branches. They can be addressed in a follow-up.