fix(gateway): unify MEDIA: extraction extension set + close the unknown-ext black hole (#34517) #34844
Merged
…wn-ext black hole (#34517)

MEDIA:<path> tags for .md/.json/.yaml/.xml/.html and other document extensions were silently dropped. extract_media() carried a narrow extension allowlist that omitted them, while extract_local_files() had a broad one. The dispatch sites then ran an unconditional re.sub(r'MEDIA:\s*\S+', '') that stripped the tag from the body even when extract_media had not matched it, so extract_local_files (broad list) ran on text where the path was already gone, and the file was delivered by neither path.

- Add MEDIA_DELIVERY_EXTS in gateway/platforms/base.py as the single source of truth; extract_media and extract_local_files both derive their extension set from it (no more drift).
- Replace the loose MEDIA cleanup at the non-streaming dispatch site (base.py) and the streaming consumer (stream_consumer.py) with the shared, extension-anchored MEDIA_TAG_CLEANUP_RE. A MEDIA: tag with an unknown extension is left in the body so the bare-path detector can still pick it up instead of being black-holed.
- Chain cleaned text through extract_media -> extract_images -> extract_local_files in run.py's post-stream media delivery (it was dropping the cleaned text and rescanning raw text with MEDIA: tags).
- Regression tests covering both halves: previously-dropped extensions now extract, and unknown-ext paths survive the cleanup.

Consolidates the MEDIA extension-allowlist PR cluster.

Co-authored-by: Bartok9 <259807879+Bartok9@users.noreply.github.com>
Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com>
Co-authored-by: Kyzcreig <9063726+Kyzcreig@users.noreply.github.com>
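The failure mode described in the commit message can be reproduced in a few lines. This is a minimal sketch, not the gateway's actual code: the allowlist contents, the `old_*` function names, and the bare-path regex are all hypothetical stand-ins. Only the unconditional `re.sub(r"MEDIA:\s*\S+", "")` is taken from the text above.

```python
import re

# Hypothetical stand-ins for the two pre-fix allowlists (the real sets are not shown in this PR).
NARROW_EXTS = {"png", "jpg", "mp3", "mp4", "pdf"}
BROAD_EXTS = NARROW_EXTS | {"md", "json", "yaml", "xml", "html"}

MEDIA_TAG_RE = re.compile(r"MEDIA:\s*(\S+)")
# Assumed bare-path detector: an absolute path ending in ".<ext>".
BARE_PATH_RE = re.compile(r"(?<![\w/])(/\S+\.(\w+))")


def ext_of(path):
    """Lower-cased extension without the dot, or '' if there is none."""
    return path.rsplit(".", 1)[-1].lower() if "." in path else ""


def old_extract_media(text):
    """MEDIA: paths whose extension is in the narrow allowlist."""
    return [p for p in MEDIA_TAG_RE.findall(text) if ext_of(p) in NARROW_EXTS]


def old_extract_local_files(text):
    """Bare absolute paths whose extension is in the broad allowlist."""
    return [m.group(1) for m in BARE_PATH_RE.finditer(text)
            if m.group(2).lower() in BROAD_EXTS]


def old_dispatch(text):
    media = old_extract_media(text)
    # The unconditional strip: removes every MEDIA: tag, matched or not.
    body = re.sub(r"MEDIA:\s*\S+", "", text)
    # The broad detector now scans text where the path is already gone.
    files = old_extract_local_files(body)
    return media, files, body.strip()


print(old_dispatch("Report ready MEDIA:/tmp/report.md"))
# ([], [], 'Report ready')
```

In this sketch the `.md` path is found by neither extractor, even though the broad detector would have matched it on the unstripped text.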
Summary
MEDIA:<path> tags for .md/.json/.yaml/.xml/.html and other document extensions now deliver instead of being silently dropped (issue #34517).

Root cause: extract_media() carried a narrow extension allowlist that omitted those types, while extract_local_files() had a broad one. The dispatch sites then ran an unconditional re.sub(r"MEDIA:\s*\S+", "") that stripped the tag from the body even when extract_media had not matched it, so extract_local_files ran on text where the path was already gone, and the file was delivered by neither path. The result was a black hole for any extension not in the narrow list.

Changes
- gateway/platforms/base.py: add MEDIA_DELIVERY_EXTS as the single source of truth for deliverable extensions; extract_media() and extract_local_files() both derive their extension set from it (no more drift between the two).
- gateway/platforms/base.py: replace the loose MEDIA:\s*\S+ cleanup at the non-streaming dispatch site with the shared, extension-anchored MEDIA_TAG_CLEANUP_RE. A MEDIA: tag whose path has an unknown extension is now left in the body so the bare-path detector (extract_local_files) can still pick it up, instead of being deleted.
- gateway/stream_consumer.py: the streaming consumer uses the same shared anchored regex, so streaming and non-streaming delivery behave identically.
- gateway/run.py: chain cleaned text through extract_media → extract_images → extract_local_files in the post-stream media-delivery path (it was dropping the cleaned text and re-scanning raw text that still contained MEDIA: tags, producing false-positive bare-path matches).
- tests/gateway/test_platform_base.py: regression coverage for both halves: previously-dropped extensions now extract, and an unknown-extension path survives the cleanup instead of vanishing.
- scripts/release.py: AUTHOR_MAP entries for the credited contributors.

Validation
- MEDIA:/x.md (and .json/.yaml/.xml/.html/.tsv/.svg)
- MEDIA:/x.<unknown-ext>
- MEDIA_DELIVERY_EXTS

E2E-verified with real imports from the worktree: known extensions match extract_media directly, unknown-extension paths are preserved through the anchored cleanup, and the streaming consumer uses the identical regex.

Cluster consolidation
Consolidates the large open MEDIA: extension-allowlist PR cluster. Most of those PRs widen the allowlist but leave the unconditional strip in place, so bare unknown-extension paths stay black-holed; this PR fixes both halves (shared extension set + anchored strip) in one place. Adopts the shared-constant structure from #34345 (@Bartok9), the strip-gating idea from #34656 (@banditburai), and the run.py chaining fix from #24384 (@Kyzcreig). Their contributions are credited via Co-authored-by: trailers. The superseded allowlist-only PRs will be closed with credit and a pointer here.

Closes #34517
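The two halves of the fix (a shared extension set and an extension-anchored cleanup) plus the chaining change can be illustrated together. This is a sketch under stated assumptions, not the code merged in this PR. The names MEDIA_DELIVERY_EXTS and MEDIA_TAG_CLEANUP_RE come from the description, but the set contents, both regex patterns, the `(paths, cleaned_text)` return shape, and the `deliver` helper are assumptions. The extract_images step is omitted for brevity.

```python
import re

# Assumed subset; the real MEDIA_DELIVERY_EXTS in gateway/platforms/base.py is not shown here.
MEDIA_DELIVERY_EXTS = frozenset(
    {"png", "jpg", "mp3", "pdf", "md", "json", "yaml", "xml", "html", "tsv", "svg"}
)
_EXT_ALT = "|".join(sorted(re.escape(e) for e in MEDIA_DELIVERY_EXTS))

# Extension-anchored: matches a MEDIA: tag only if its path ends in a known extension.
MEDIA_TAG_CLEANUP_RE = re.compile(rf"MEDIA:\s*(\S+\.(?:{_EXT_ALT}))(?!\S)")
# Assumed bare-path detector built from the same extension set.
BARE_PATH_RE = re.compile(rf"(?<![\w/])(/\S+\.(?:{_EXT_ALT}))(?!\S)")


def extract_media(text):
    """Return (paths, cleaned_text); only known-extension tags are removed."""
    paths = [m.group(1) for m in MEDIA_TAG_CLEANUP_RE.finditer(text)]
    return paths, MEDIA_TAG_CLEANUP_RE.sub("", text).strip()


def extract_local_files(text):
    """Return (paths, cleaned_text) for bare absolute paths with a known extension."""
    paths = [m.group(1) for m in BARE_PATH_RE.finditer(text)]
    return paths, BARE_PATH_RE.sub("", text).strip()


def deliver(text):
    """Hypothetical helper: each extractor scans the previous one's cleaned text."""
    media, cleaned = extract_media(text)
    files, cleaned = extract_local_files(cleaned)  # cleaned text, not the raw input
    return media, files, cleaned


print(deliver("Report ready MEDIA:/tmp/report.md"))
# (['/tmp/report.md'], [], 'Report ready')
print(deliver("See MEDIA:/tmp/data.xyz"))
# ([], [], 'See MEDIA:/tmp/data.xyz')
```

In this sketch, the known-extension tag is extracted exactly once because the bare-path scan sees the cleaned text. Calling `extract_local_files` on the raw input instead would match `/tmp/report.md` a second time, which is the false-positive the run.py change avoids. The unknown-extension tag is left in the body rather than deleted.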