fix(gateway): MEDIA: extraction regex — prefix-glue, space-truncation, narrow allowlist#24384
fix(gateway): MEDIA: extraction regex — prefix-glue, space-truncation, narrow allowlist#24384Kyzcreig wants to merge 1 commit into
Conversation
…, narrow allowlist
Three compounding bugs in extract_media (gateway/platforms/base.py) caused
MEDIA:/path tags to silently fail on Discord (and likely other platforms):
1. The \s* after MEDIA: greedily consumed newlines. When the agent response
contained MEDIA: as conversational text on one line and a real
MEDIA:/path tag on a later line, the regex bridged across the gap and
the path slot fell through to the |\S+ fallback — producing tuples
like ("MEDIA:/Users/.../foo.png", False). Downstream send_multiple_images
then checked os.path.exists() on the prefixed path and logged
"Skipping missing image". Fix: change \s* to [^\S\n]* (horizontal only).
2. The |\S+ final fallback in the path alternatives truncated paths with
spaces at the first whitespace character. /Users/alex/Obsidian/Ace Place/
foo.html became /Users/alex/Obsidian/Ace, then logged as File not found.
It also enabled bug #1 by capturing MEDIA:/... in pathological cases.
Fix: drop |\S+. Unrecognized extensions now require quoting
(MEDIA:"/path/anything"); recognized ones still support unquoted paths
with internal spaces via the existing (?:[^\S\n]+\S+)*? continuation.
3. The extension allowlist excluded common developer text formats. .html,
.svg, .md, .json, .yaml, .xml, .log, .py/.js/.ts/.sh/.toml/.ini/.conf/
.env etc. were rejected, leaving raw MEDIA: tags as plain text in chat.
Fix: expand allowlist to cover these formats.
Also fixes the streaming dispatch path in gateway/run.py (~10015), which
dropped the cleaned-text return value from extract_media/extract_images
on the floor, causing extract_local_files to scan raw MEDIA: text and
produce false-positive bare-path matches. Chains cleaned text through
the extractors in order, matching the non-streaming path in base.py.
Includes regression tests in tests/gateway/test_platform_base.py covering
all three regex fixes and the unquoted-spaces preservation case.
Discovered debugging an attempt to attach an HTML file in Discord chat.
|
+1 from another angle — hit the same allowlist gap from a Carbon Voice plugin (PhononX/hermes-plugin-carbonvoice) where the agent reliably generated Closing my own duplicate (#32358) in favor of this PR — yours is strictly broader (catches the prefix-glue and space-truncation bugs I didn't notice, plus the streaming-dispatch cleanup-discard in Looking forward to seeing this merge. |
|
Rechecked this against current New issue #34517 reports the same |
|
Merged via #34844, which consolidates the MEDIA: extension-allowlist cluster. Your contribution was used directly and your authorship is preserved via a Thanks for the work here. |
Summary
Three compounding bugs in
extract_media(gateway/platforms/base.py) causedMEDIA:/pathtags to silently fail on Discord and likely other platforms. Discovered while trying to attach an HTML file from chat.Bugs Fixed
1.
\s*afterMEDIA:consumes newlines, bridges to laterMEDIA:tagWhen the agent's response contains
MEDIA:as conversational text on one line AND a realMEDIA:/pathtag on a later line, the old regex bridges across the gap and the path slot falls through to the|\S+fallback — capturing the literalMEDIA:prefix as part of the path.Repro:
Downstream
send_multiple_imagesthen checksos.path.exists("MEDIA:/tmp/foo.png")→ False → logsSkipping missing image.Fix: change
\s*to[^\S\n]*(horizontal whitespace only).2.
|\S+final fallback truncates paths with spacesThe
|\S+catch-all in the path alternatives truncated paths containing spaces at the first whitespace character. For example, an Obsidian vault path:It also enabled bug #1 by allowing pathological captures of
MEDIA:/...text.Fix: drop
|\S+entirely. Unrecognized extensions now require quoting (MEDIA:"/path/anything"). Recognized extensions still support unquoted paths with internal spaces via the existing(?:[^\S\n]+\S+)*?continuation.3. Extension allowlist excluded common developer text formats
.html,.svg,.md,.json,.yaml,.xml,.log, source code (.py/.js/.ts/.jsx/.tsx), shell (.sh/.bash/.zsh), and config (.toml/.ini/.conf/.cfg/.env) were all rejected —MEDIA:tags for these file types appeared as plain text in chat instead of being uploaded as attachments.Fix: expand the allowlist to cover these formats.
Bonus fix: streaming dispatch path discards cleaned text
gateway/run.py:~10015was dropping the cleaned-text return value fromextract_media/extract_imageson the floor:This caused
extract_local_filesto scan rawMEDIA:text in the streaming dispatch path, producing false-positive bare-path matches.Fix: chain cleaned text through the extractors in order, matching the non-streaming path in
base.py:~2983.Testing
Added regression tests in
tests/gateway/test_platform_base.py:test_prose_media_word_does_not_bridge_to_later_tag— bug Terminal tool #1test_html_extension_is_allowlisted— bug Architecture planning #3 (parameterized over 23 extensions)test_unquoted_path_with_unrecognized_extension_is_left_alone— bug Support passing morph snapshot id #2 + new behaviortest_quoted_path_with_unrecognized_extension_still_works— verifies the quoted-path escape hatch still works for unusual extensionstest_path_with_spaces_in_allowlisted_extension_still_works— confirms existing unquoted-spaces support didn't regressAll 95 tests in
test_platform_base.pypass. Three pre-existing failures intest_tts_media_routing.pyare unrelated (_thread_metadata_for_sourceattribute missing on a mock) — they fail identically on cleanorigin/main.Backwards compatibility
MEDIA:/path.unknownext(unrecognized extension) now leaves the tag in user-visible text instead of silently extracting a truncated/corrupted path. Quoted paths (MEDIA:"/path.unknownext") continue to work for the escape-hatch case. Net result: a clearer failure mode for unsupported extensions.Discovery
This was found while trying to attach an HTML spec file to a Discord chat. The agent emitted
MEDIA:/path/spec.htmland the gateway:.htmlnot in allowlist → bug Architecture planning #3)MEDIA:in prose earlier in the response, the secondMEDIA:/path.pnggot the prefix glued on (bug Terminal tool #1)Log evidence (from
~/.hermes/logs/gateway.log):