fix(gateway): unify MEDIA: extraction extension set + close the unknown-ext black hole (#34517) #34844

Merged: teknium1 merged 1 commit into main from hermes/hermes-1739d2e6 on May 29, 2026
@teknium1

Contributor

Summary

Files referenced by MEDIA:<path> tags with .md/.json/.yaml/.xml/.html and other document extensions are now delivered instead of being silently dropped (issue #34517).

Root cause: extract_media() carried a narrow extension allowlist that omitted those types while extract_local_files() had a broad one. The dispatch sites then ran an unconditional re.sub(r"MEDIA:\s*\S+", "") that stripped the tag from the body even when extract_media had not matched it — so extract_local_files ran on text where the path was already gone, and the file was delivered by neither path. A black hole for any extension not in the narrow list.
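
The failure mode described above can be reproduced with a minimal sketch. Everything here is a simplified stand-in: the extension tuples, the two regexes, and `old_dispatch` are hypothetical, since the PR text does not show the real allowlists or the actual dispatch code.

```python
import re

# Hypothetical allowlists; the real ones in gateway/platforms/base.py are not shown here.
NARROW_EXTS = ("png", "jpg", "mp3", "mp4", "pdf")   # what the MEDIA: matcher knew about
BROAD_EXTS = NARROW_EXTS + ("md", "json", "yaml")   # what the bare-path detector knew about

# Stand-in for the narrow MEDIA: tag matcher.
NARROW_MEDIA_RE = re.compile(
    r"MEDIA:\s*(\S+\.(?:" + "|".join(NARROW_EXTS) + r"))(?!\w)"
)
# Stand-in for the broad bare-path detector.
BARE_PATH_RE = re.compile(
    r"(?<!\w)(/\S+\.(?:" + "|".join(BROAD_EXTS) + r"))(?!\w)"
)


def old_dispatch(text):
    """Simplified model of the pre-fix flow: narrow match, loose strip, broad scan."""
    media = NARROW_MEDIA_RE.findall(text)
    # The unconditional strip removes the tag even when nothing matched above.
    cleaned = re.sub(r"MEDIA:\s*\S+", "", text)
    # The broad detector now scans text from which the path is already gone.
    local = BARE_PATH_RE.findall(cleaned)
    return media, local, cleaned.strip()


media, local, body = old_dispatch("Report ready. MEDIA:/tmp/report.md")
print(media, local, body)
```

In this sketch the `.md` path ends up in neither list, even though `BARE_PATH_RE` would have matched it had the path still been in the text.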

Changes

  • gateway/platforms/base.py: add MEDIA_DELIVERY_EXTS as the single source of truth for deliverable extensions; extract_media() and extract_local_files() both derive their extension set from it (no more drift between the two).
  • gateway/platforms/base.py: replace the loose MEDIA:\s*\S+ cleanup at the non-streaming dispatch site with the shared, extension-anchored MEDIA_TAG_CLEANUP_RE. A MEDIA: tag whose path has an unknown extension is now left in the body so the bare-path detector (extract_local_files) can still pick it up, instead of being deleted.
  • gateway/stream_consumer.py: the streaming consumer uses the same shared anchored regex, so streaming and non-streaming delivery behave identically.
  • gateway/run.py: chain cleaned text through extract_media → extract_images → extract_local_files in the post-stream media-delivery path (it was dropping the cleaned text and re-scanning raw text that still contained MEDIA: tags, producing false-positive bare-path matches).
  • tests/gateway/test_platform_base.py: regression coverage for both halves — previously-dropped extensions now extract, and an unknown-extension path survives the cleanup instead of vanishing.
  • scripts/release.py: AUTHOR_MAP entries for the credited contributors.
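
The first two changes can be illustrated with a rough sketch. The names `MEDIA_DELIVERY_EXTS`, `MEDIA_TAG_CLEANUP_RE`, `extract_media`, and `extract_local_files` come from this PR, but the set members, the regex bodies, and the return shapes below are assumptions made for illustration, not the code in gateway/platforms/base.py.

```python
import re

# Assumed members: the PR names this constant but does not list its contents.
MEDIA_DELIVERY_EXTS = frozenset(
    {"png", "jpg", "mp3", "mp4", "pdf", "md", "json", "yaml", "xml", "html"}
)

# One alternation derived from the shared set, reused by every pattern below.
_EXT_ALT = "|".join(sorted(MEDIA_DELIVERY_EXTS))

# Extension-anchored cleanup: only strips tags whose path ends in a known extension.
MEDIA_TAG_CLEANUP_RE = re.compile(
    r"MEDIA:\s*\S+\.(?:" + _EXT_ALT + r")(?!\w)", re.IGNORECASE
)
_MEDIA_RE = re.compile(
    r"MEDIA:\s*(\S+\.(?:" + _EXT_ALT + r"))(?!\w)", re.IGNORECASE
)
_BARE_PATH_RE = re.compile(
    r"(?<!\w)(/\S+\.(?:" + _EXT_ALT + r"))(?!\w)", re.IGNORECASE
)


def extract_media(text):
    """Return (paths, cleaned_text); simplified stand-in for the real function."""
    paths = _MEDIA_RE.findall(text)
    cleaned = MEDIA_TAG_CLEANUP_RE.sub("", text).strip()
    return paths, cleaned


def extract_local_files(text):
    """Return bare absolute paths with a deliverable extension (stand-in)."""
    return _BARE_PATH_RE.findall(text)


print(extract_media("Here you go MEDIA:/tmp/report.md"))
print(extract_media("See MEDIA:/tmp/data.xyz"))
print(extract_local_files("Saved to /tmp/out.json"))
```

Because both the matcher and the cleanup are built from the same alternation, a tag that the matcher does not recognise is also not stripped, so the text stays available to later detectors.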

Validation

| | Before | After |
|---|---|---|
| MEDIA:/x.md (and .json/.yaml/.xml/.html/.tsv/.svg) | dropped, no delivery | extracted and delivered |
| MEDIA:/x.<unknown-ext> | stripped from body, lost | left in text, falls through to bare-path detector |
| extract_media vs extract_local_files extension set | two independent lists (drift) | one shared MEDIA_DELIVERY_EXTS |
| streaming vs non-streaming strip | two loose regexes | one shared anchored regex |
| gateway media tests | | 223 pass (incl. 4 new regression tests) |

E2E-verified with real imports from the worktree: known extensions match extract_media directly, unknown-extension paths are preserved through the anchored cleanup, and the streaming consumer uses the identical regex.

Cluster consolidation

Consolidates the large open MEDIA: extension-allowlist PR cluster. Most of those PRs widen the allowlist but leave the unconditional strip in place, so bare unknown-extension paths stay black-holed; this PR fixes both halves (shared extension set + anchored strip) in one place. Adopts the shared-constant structure from #34345 (@Bartok9), the strip-gating idea from #34656 (@banditburai), and the run.py chaining fix from #24384 (@Kyzcreig). Their contributions are credited via Co-authored-by: trailers. The superseded allowlist-only PRs will be closed with credit and a pointer here.
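
The run.py chaining fix mentioned above can be sketched as follows. This is a simplified model under stated assumptions: the three extractors are stand-ins that each return `(items, cleaned_text)`, the extension subset is hypothetical, and `deliver_rescanning_raw` / `deliver_chained` are illustrative names, not functions from the repository.

```python
import re

EXTS = "md|json|png"  # hypothetical subset of the deliverable extensions

MEDIA_RE = re.compile(r"MEDIA:\s*(\S+\.(?:" + EXTS + r"))(?!\w)")
IMAGE_RE = re.compile(r"!\[[^\]]*\]\((https?://\S+?)\)")
BARE_RE = re.compile(r"(?<!\w)(/\S+\.(?:" + EXTS + r"))(?!\w)")


def _extract(pattern, text):
    # Shared shape for the stand-ins: collect matches, then remove them from the text.
    return pattern.findall(text), pattern.sub("", text).strip()


def extract_media(text):
    return _extract(MEDIA_RE, text)


def extract_images(text):
    return _extract(IMAGE_RE, text)


def extract_local_files(text):
    return _extract(BARE_RE, text)


def deliver_rescanning_raw(raw):
    """Pre-fix behaviour as described: cleaned text is dropped, raw text is rescanned."""
    media, _ = extract_media(raw)
    images, _ = extract_images(raw)
    local, _ = extract_local_files(raw)  # raw still contains the MEDIA: tag
    return media + images + local


def deliver_chained(raw):
    """Post-fix behaviour: each stage scans the text the previous stage cleaned."""
    media, text = extract_media(raw)
    images, text = extract_images(text)
    local, text = extract_local_files(text)
    return media + images + local


raw = "Done. MEDIA:/tmp/a.md and see /tmp/b.json"
print(deliver_rescanning_raw(raw))
print(deliver_chained(raw))
```

In the rescanning variant the path inside the MEDIA: tag is picked up a second time by the bare-path stage; chaining the cleaned text removes that false positive.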

Closes #34517

Infographic

media-extension-black-hole-closed

…wn-ext black hole (#34517)

MEDIA:<path> tags for .md/.json/.yaml/.xml/.html and other document
extensions were silently dropped. extract_media() carried a narrow
extension allowlist that omitted them, while extract_local_files()
had a broad one. The dispatch sites then ran an unconditional
re.sub(r'MEDIA:\s*\S+', '') that stripped the tag from the body even
when extract_media had not matched it — so extract_local_files (broad
list) ran on text where the path was already gone, and the file was
delivered by neither path.

- Add MEDIA_DELIVERY_EXTS in gateway/platforms/base.py as the single
  source of truth; extract_media and extract_local_files both derive
  their extension set from it (no more drift).
- Replace the loose MEDIA cleanup at the non-streaming dispatch site
  (base.py) and the streaming consumer (stream_consumer.py) with the
  shared, extension-anchored MEDIA_TAG_CLEANUP_RE. A MEDIA: tag with an
  unknown extension is left in the body so the bare-path detector can
  still pick it up instead of being black-holed.
- Chain cleaned text through extract_media -> extract_images ->
  extract_local_files in run.py's post-stream media delivery (it was
  dropping the cleaned text and rescanning raw text with MEDIA: tags).
- Regression tests covering both halves: previously-dropped extensions
  now extract, and unknown-ext paths survive the cleanup.

Consolidates the MEDIA extension-allowlist PR cluster.

Co-authored-by: Bartok9 <259807879+Bartok9@users.noreply.github.com>
Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com>
Co-authored-by: Kyzcreig <9063726+Kyzcreig@users.noreply.github.com>
@github-actions

Contributor

🔎 Lint report: hermes/hermes-1739d2e6 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9453 on HEAD, 9453 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4907 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) on May 29, 2026
@alt-glitch

Collaborator

Supersedes #34518, #34656, #34345, #31150, #31138. Part of the saturated MEDIA extension unification cluster; the canonical dynamic approach is #29609. This PR (by teknium1) adds the black-hole fix (unknown-ext tags are no longer stripped) on top of the extension unification.

@teknium1 merged commit 781604c into main on May 29, 2026
22 checks passed
@teknium1 deleted the hermes/hermes-1739d2e6 branch on May 29, 2026 20:24
sradetzky pushed a commit to sradetzky/hermes-agent that referenced this pull request May 30, 2026
…wn-ext black hole (NousResearch#34517) (NousResearch#34844)

KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
…wn-ext black hole (NousResearch#34517) (NousResearch#34844)

Labels

comp/gateway: Gateway runner, session dispatch, delivery
P2: Medium, degraded but workaround exists
type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

MEDIA: tag silently drops .md (and other) files due to regex whitelist mismatch
