Skip to content

fix(gateway): MEDIA: extraction regex — prefix-glue, space-truncation, narrow allowlist#24384

Closed
Kyzcreig wants to merge 1 commit into
NousResearch:mainfrom
Kyzcreig:fix/gateway-media-extract-regex
Closed

fix(gateway): MEDIA: extraction regex — prefix-glue, space-truncation, narrow allowlist#24384
Kyzcreig wants to merge 1 commit into
NousResearch:mainfrom
Kyzcreig:fix/gateway-media-extract-regex

Conversation

@Kyzcreig

Copy link
Copy Markdown
Contributor

Summary

Three compounding bugs in extract_media (gateway/platforms/base.py) caused MEDIA:/path tags to silently fail on Discord and likely other platforms. Discovered while trying to attach an HTML file from chat.

Bugs Fixed

1. \s* after MEDIA: consumes newlines, bridges to later MEDIA: tag

When the agent's response contains MEDIA: as conversational text on one line AND a real MEDIA:/path tag on a later line, the old regex bridges across the gap and the path slot falls through to the |\S+ fallback — capturing the literal MEDIA: prefix as part of the path.

Repro:

response = "Test 3 PNG via MEDIA:\n\nMEDIA:/tmp/foo.png"
BasePlatformAdapter.extract_media(response)
# Old: [('MEDIA:/tmp/foo.png', False)]   ← prefix glued on
# New: [('/tmp/foo.png', False)]

Downstream send_multiple_images then checks os.path.exists("MEDIA:/tmp/foo.png") → False → logs Skipping missing image.

Fix: change \s* to [^\S\n]* (horizontal whitespace only).

2. |\S+ final fallback truncates paths with spaces

The |\S+ catch-all in the path alternatives truncated paths containing spaces at the first whitespace character. For example, an Obsidian vault path:

response = 'MEDIA:/Users/alex/Obsidian/Ace Place/Engineering/foo.html'
# Old: matched '/Users/alex/Obsidian/Ace'  ← truncated, then File not found
# Old gateway log: '[Discord] Failed to send media (): File not found: /Users/alex/Obsidian/Ace'

It also enabled bug #1 by allowing pathological captures of MEDIA:/... text.

Fix: drop |\S+ entirely. Unrecognized extensions now require quoting (MEDIA:"/path/anything"). Recognized extensions still support unquoted paths with internal spaces via the existing (?:[^\S\n]+\S+)*? continuation.

3. Extension allowlist excluded common developer text formats

.html, .svg, .md, .json, .yaml, .xml, .log, source code (.py/.js/.ts/.jsx/.tsx), shell (.sh/.bash/.zsh), and config (.toml/.ini/.conf/.cfg/.env) were all rejected — MEDIA: tags for these file types appeared as plain text in chat instead of being uploaded as attachments.

Fix: expand the allowlist to cover these formats.

Bonus fix: streaming dispatch path discards cleaned text

gateway/run.py:~10015 was dropping the cleaned-text return value from extract_media/extract_images on the floor:

# BEFORE
media_files, _ = adapter.extract_media(response)        # discards cleaned
_, cleaned = adapter.extract_images(response)            # uses RAW response
local_files, _ = adapter.extract_local_files(cleaned)   # MEDIA: never stripped

This caused extract_local_files to scan raw MEDIA: text in the streaming dispatch path, producing false-positive bare-path matches.

Fix: chain cleaned text through the extractors in order, matching the non-streaming path in base.py:~2983.

Testing

Added regression tests in tests/gateway/test_platform_base.py:

  • test_prose_media_word_does_not_bridge_to_later_tag — bug Terminal tool #1
  • test_html_extension_is_allowlisted — bug Architecture planning #3 (parameterized over 23 extensions)
  • test_unquoted_path_with_unrecognized_extension_is_left_alone — bug Support passing morph snapshot id #2 + new behavior
  • test_quoted_path_with_unrecognized_extension_still_works — verifies the quoted-path escape hatch still works for unusual extensions
  • test_path_with_spaces_in_allowlisted_extension_still_works — confirms existing unquoted-spaces support didn't regress

All 95 tests in test_platform_base.py pass. Three pre-existing failures in test_tts_media_routing.py are unrelated (_thread_metadata_for_source attribute missing on a mock) — they fail identically on clean origin/main.

Backwards compatibility

  • All existing tests pass unchanged.
  • The one behavior change for callers: unquoted MEDIA:/path.unknownext (unrecognized extension) now leaves the tag in user-visible text instead of silently extracting a truncated/corrupted path. Quoted paths (MEDIA:"/path.unknownext") continue to work for the escape-hatch case. Net result: a clearer failure mode for unsupported extensions.

Discovery

This was found while trying to attach an HTML spec file to a Discord chat. The agent emitted MEDIA:/path/spec.html and the gateway:

  1. Failed silently (.html not in allowlist → bug Architecture planning #3)
  2. When the workaround used an obsidian path with spaces, truncated it (bug Support passing morph snapshot id #2)
  3. When the agent had said MEDIA: in prose earlier in the response, the second MEDIA:/path.png got the prefix glued on (bug Terminal tool #1)

Log evidence (from ~/.hermes/logs/gateway.log):

[Discord] Failed to send media (.html): File not found: /path/to/foo.html
[Discord] Failed to send media (): File not found: /Users/alexgierczyk/Obsidian/Ace
[Discord] Skipping missing image: MEDIA:/Users/alexgierczyk/.hermes/cache/images/openai_gpt-image-2-medium_20260512_063250_f4610f91.png

…, narrow allowlist

Three compounding bugs in extract_media (gateway/platforms/base.py) caused
MEDIA:/path tags to silently fail on Discord (and likely other platforms):

1. The \s* after MEDIA: greedily consumed newlines. When the agent response
   contained MEDIA: as conversational text on one line and a real
   MEDIA:/path tag on a later line, the regex bridged across the gap and
   the path slot fell through to the |\S+ fallback — producing tuples
   like ("MEDIA:/Users/.../foo.png", False). Downstream send_multiple_images
   then checked os.path.exists() on the prefixed path and logged
   "Skipping missing image". Fix: change \s* to [^\S\n]* (horizontal only).

2. The |\S+ final fallback in the path alternatives truncated paths with
   spaces at the first whitespace character. /Users/alex/Obsidian/Ace Place/
   foo.html became /Users/alex/Obsidian/Ace, then logged as File not found.
   It also enabled bug #1 by capturing MEDIA:/... in pathological cases.
   Fix: drop |\S+. Unrecognized extensions now require quoting
   (MEDIA:"/path/anything"); recognized ones still support unquoted paths
   with internal spaces via the existing (?:[^\S\n]+\S+)*? continuation.

3. The extension allowlist excluded common developer text formats. .html,
   .svg, .md, .json, .yaml, .xml, .log, .py/.js/.ts/.sh/.toml/.ini/.conf/
   .env etc. were rejected, leaving raw MEDIA: tags as plain text in chat.
   Fix: expand allowlist to cover these formats.

Also fixes the streaming dispatch path in gateway/run.py (~10015), which
dropped the cleaned-text return value from extract_media/extract_images
on the floor, causing extract_local_files to scan raw MEDIA: text and
produce false-positive bare-path matches. Chains cleaned text through
the extractors in order, matching the non-streaming path in base.py.

Includes regression tests in tests/gateway/test_platform_base.py covering
all three regex fixes and the unquoted-spaces preservation case.

Discovered debugging an attempt to attach an HTML file in Discord chat.
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists labels May 12, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related: #24049 (Windows drive-letter paths + GIS extensions in MEDIA regex), #24032 (issue for that). This PR addresses a broader set of regex bugs (prefix-glue, space-truncation, narrow allowlist) beyond just path prefix matching.

@cristianmgm7

Copy link
Copy Markdown

+1 from another angle — hit the same allowlist gap from a Carbon Voice plugin (PhononX/hermes-plugin-carbonvoice) where the agent reliably generated .md reports that never reached the user. The failure was confusingly silent because the secondary cleanup regex (base.py:~3476, re.sub(r"MEDIA:\s*\S+", "", text_content)) still stripped the line from visible text, so the recipient saw a reply with no MEDIA line AND no attachment.

Closing my own duplicate (#32358) in favor of this PR — yours is strictly broader (catches the prefix-glue and space-truncation bugs I didn't notice, plus the streaming-dispatch cleanup-discard in run.py). My regression tests in tests/gateway/test_extract_media.py overlap with your test_html_extension_is_allowlisted parametrization, so probably nothing to port — but happy to contribute the negative cases (xyz/exe/dmg/iso/deb/rpm rejection) if useful.

Looking forward to seeing this merge.

@Bartok9

Bartok9 commented May 29, 2026

Copy link
Copy Markdown
Contributor

Rechecked this against current origin/main (a87f0a82a) — the narrow allowlist at base.py:2524 and the discarded-_ chaining in run.py are both still present, so this PR is still needed and applies cleanly to the symptom.

New issue #34517 reports the same .md black-hole independently (with the line-3709 cleanup-regex coupling spelled out). This PR is the broader fix for it — the quoted-path escape hatch in particular is a stronger mitigation than #34517's "just enumerate more extensions" suggestion. Cross-referenced #34517 → here for consolidation. +1.

@teknium1

Copy link
Copy Markdown
Contributor

Merged via #34844, which consolidates the MEDIA: extension-allowlist cluster.

Your contribution was used directly and your authorship is preserved via a Co-authored-by: trailer on the merge commit (781604c on main). #34844 fixes both halves of issue #34517: it unifies extract_media and extract_local_files onto one shared extension set AND gates the cleanup strip so unknown-extension paths are no longer silently dropped.

Thanks for the work here.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants