[BUG] extract_media regex truncates Windows spaced paths and rejects GIS extensions (.kmz/.kml/.geojson/.gpx) #24032

@ancxlol

Summary

The MEDIA: tag extractor in gateway/platforms/base.py (extract_media) fails on Windows absolute paths that contain spaces (e.g. C:\Users\Foo\OneDrive\My Folder\file.pdf). The path is silently truncated at the first whitespace, so the file is never attached. Additionally, several common GIS / structured-data extensions (kmz, kml, geojson, gpx, json, xml, html) are absent from the spaced-path allowlist, so even POSIX-style spaced paths fail for those types.

Repro

On Windows (any gateway platform, though Telegram makes it most obvious), have the agent emit:

MEDIA:C:\Users\Confera\OneDrive\Nusa Alam Kreasindo\Project\Foo\report.pdf

Expected: file delivered as attachment.
Actual: the path is truncated to C:\Users\Confera\OneDrive\Nusa. The gateway logs "file not found" (or silently drops the attachment), and the rest of the path (Alam Kreasindo\Project\Foo\report.pdf) leaks into the user-visible text.

Also reproducible with:

MEDIA:/home/user/My Folder/coords.kmz

Even with the existing spaced-path branch, .kmz is not in the extension allowlist, so the regex falls through to the \S+ branch and truncates at the first space.
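Both repro cases can be checked in isolation with a short script. This is a minimal sketch that copies the pattern quoted under Root cause; `extract_path` is a hypothetical helper written for illustration (not the real extract_media), and the third input is an added control case showing that the spaced-path branch does work for an allowlisted extension on a POSIX path.

```python
import re

# Pattern copied verbatim from the issue (current main at time of filing).
CURRENT_MEDIA_PATTERN = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
)


def extract_path(text):
    """Hypothetical helper: return the captured path of the first MEDIA: tag."""
    match = CURRENT_MEDIA_PATTERN.search(text)
    return match.group("path") if match else None


windows_pdf = r"MEDIA:C:\Users\Confera\OneDrive\Nusa Alam Kreasindo\Project\Foo\report.pdf"
posix_kmz = "MEDIA:/home/user/My Folder/coords.kmz"
posix_pdf = "MEDIA:/home/user/My Folder/report.pdf"  # control case, not from the issue

print(extract_path(windows_pdf))  # C:\Users\Confera\OneDrive\Nusa
print(extract_path(posix_kmz))    # /home/user/My
print(extract_path(posix_pdf))    # /home/user/My Folder/report.pdf
```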

Root cause

In gateway/platforms/base.py around line 2067, the current pattern is:

media_pattern = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
)

Two problems:

  1. The spaced-path branch is anchored to (?:~/|/), so it only accepts POSIX paths starting with ~/ or /. Windows drive paths (C:\…, D:\…) and UNC paths (\\server\share\…) skip this branch and fall into the final \S+, which stops at the first whitespace.

  2. GIS/structured extensions are missing from the allowlist (kmz, kml, geojson, gpx, json, xml, html?). Any user delivering coordinate exports, OpenAPI specs, sitemaps, etc. from a spaced path hits this even on Linux/macOS.
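The two causes can be separated by testing the anchor and the allowlist on their own. The sketch below lifts both fragments out of the current pattern; the helper names and the sample paths (including the UNC server name) are made up for illustration.

```python
import re

# Prefix required by the spaced-path branch, as quoted in the issue.
POSIX_ANCHOR = re.compile(r"(?:~/|/)")

# Extension allowlist copied from the current pattern, anchored to end of string.
CURRENT_EXTENSIONS = re.compile(
    r"\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac"
    r"|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)$"
)


def enters_spaced_branch(path):
    """True if the path starts with a prefix the spaced-path branch accepts."""
    return POSIX_ANCHOR.match(path) is not None


def extension_allowed(path):
    """True if the path ends in an extension from the current allowlist."""
    return CURRENT_EXTENSIONS.search(path) is not None


# Problem 1: only POSIX-style prefixes reach the spaced-path branch.
for prefix in ["/home/user", "~/docs", r"C:\Users\Foo", r"\\server\share"]:
    print(prefix, enters_spaced_branch(prefix))

# Problem 2: GIS/structured extensions are rejected by the allowlist.
for name in ["report.pdf", "coords.kmz", "track.gpx", "map.geojson", "spec.json"]:
    print(name, extension_allowed(name))
```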

Suggested fix

Drop the (?:~/|/) anchor (since MEDIA: is itself the start-of-token marker, and the regex is already terminated by an extension + lookahead) and extend the allowlist. Diff against current main:

-r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|(?:~/|/)\S+(?:[^\S\n]+\S+)*?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
+r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|[^\n]+?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa|kmz|kml|json|xml|html?|geojson|gpx)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''

[^\n]+? (non-greedy, line-bounded) handles Windows drive paths, UNC paths, and POSIX paths uniformly. The trailing extension + lookahead ((?=[\s`"',;:)\]}]|$)) still terminates the match cleanly so it doesn't swallow following sentences.
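The proposed pattern can be exercised the same way. This is a sketch only: the pattern is the `+` line of the diff, while `extract_path_proposed`, the UNC server name, and the trailing-text input are hypothetical additions used to check that the lookahead stops the match before following prose.

```python
import re

# Proposed pattern, copied from the "+" line of the diff in the issue.
PROPOSED_MEDIA_PATTERN = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>`[^`\n]+`|"[^"\n]+"|'[^'\n]+'|[^\n]+?\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa|kmz|kml|json|xml|html?|geojson|gpx)(?=[\s`"',;:)\]}]|$)|\S+)[`"']?'''
)


def extract_path_proposed(text):
    """Hypothetical helper: return the captured path of the first MEDIA: tag."""
    match = PROPOSED_MEDIA_PATTERN.search(text)
    return match.group("path") if match else None


windows_pdf = r"MEDIA:C:\Users\Confera\OneDrive\Nusa Alam Kreasindo\Project\Foo\report.pdf"
unc_xlsx = r"MEDIA:\\server\share\My Docs\plan.xlsx"  # made-up UNC path
posix_kmz_with_text = "MEDIA:/home/user/My Folder/coords.kmz, attached as requested."

print(extract_path_proposed(windows_pdf))
# C:\Users\Confera\OneDrive\Nusa Alam Kreasindo\Project\Foo\report.pdf
print(extract_path_proposed(unc_xlsx))
# \\server\share\My Docs\plan.xlsx
print(extract_path_proposed(posix_kmz_with_text))
# /home/user/My Folder/coords.kmz
```

In the last case the lazy quantifier stops at the first allowlisted extension that is followed by one of the lookahead characters (here a comma), so the rest of the sentence stays out of the captured path.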

The same fix needs to be mirrored at the other call sites that match MEDIA: tags for cleanup, history, and UI rendering, wherever they still use MEDIA:\S+:

  • gateway/platforms/base.py cleanup re.sub(r"MEDIA:[^\n]+", …) (line ~2993): already correct, no change needed
  • gateway/platforms/stream_consumer.py — cleanup regex
  • gateway/run.py — history dedup (2 occurrences)
  • gateway/mcp_serve.py — MCP attachments
  • ui-tui/src/components/markdown.tsx — UI renderer
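To illustrate why the cleanup sites matter, the sketch below contrasts MEDIA:\S+ with the line-bounded MEDIA:[^\n]+ form on a message containing a spaced path. The function names, the sample message, and the empty replacement string are assumptions; the actual code at these call sites is not shown in this issue.

```python
import re


def strip_media_naive(text):
    """Cleanup using MEDIA:\\S+, which stops at the first whitespace."""
    return re.sub(r"MEDIA:\S+", "", text)


def strip_media_line_bounded(text):
    """Cleanup using MEDIA:[^\\n]+, which removes the tag through end of line."""
    return re.sub(r"MEDIA:[^\n]+", "", text)


message = (
    "Here is the file.\n"
    + r"MEDIA:C:\Users\Foo\OneDrive\My Folder\file.pdf"
    + "\nDone."
)

print(repr(strip_media_naive(message)))
# 'Here is the file.\n Folder\\file.pdf\nDone.'
print(repr(strip_media_line_bounded(message)))
# 'Here is the file.\n\nDone.'
```

With the \S+ form, the tail of the path stays in the text, which matches the leak described in the repro.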

Tests

tests/gateway/test_media_extraction.py passes after the patch (4/4), including a new fixture for Windows spaced paths. Happy to open a PR if maintainers want.

Environment

  • Hermes Agent main @ 271883447 (May 12 2026)
  • Windows 11 Pro, Python 3.x, Telegram gateway
  • Real-world trigger: OneDrive-rooted project paths (C:\Users\<user>\OneDrive\<Org With Spaces>\…)

Labels: P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), type/bug (Something isn't working)
