fix(gateway): support Windows drive-letter paths and GIS extensions in MEDIA: regex #24049

Open
liuhao1024 wants to merge 1 commit into NousResearch:main from liuhao1024:fix/issue-24032-extract-media-windows-spaced-paths

liuhao1024 (Contributor) commented May 11, 2026

What does this PR do?

The MEDIA: tag extractor in gateway/platforms/base.py (extract_media) fails on Windows absolute paths containing spaces (e.g. C:\Users\Foo\OneDrive\My Folder\file.pdf). The path is silently truncated at the first whitespace because the spaced-path branch only matches POSIX paths starting with ~/ or /.

Additionally, common GIS/structured-data extensions (kmz, kml, geojson, gpx, json, xml, html) are absent from the spaced-path extension allowlist, so even POSIX-style spaced paths fail for those file types.

Root Cause

The media_pattern regex at line 2066 has a spaced-path branch anchored to (?:~/|/), which only matches POSIX-style paths. Windows drive-letter paths (C:\...) and UNC paths (\\server\share\...) fall through to the \S+ fallback branch, which stops at the first whitespace.

The extension allowlist also omits GIS and structured-data formats, causing .kmz, .kml, .geojson, .gpx, .json, .xml, and .html files to fail the spaced-path match even on POSIX systems.
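
To make the failure mode concrete, here is a minimal sketch of the two-branch structure described above. It is a simplified stand-in, not the actual media_pattern from gateway/platforms/base.py: the names OLD_PATTERN and extract_old, the `path` group, and the shortened extension list are assumptions made for illustration only.

```python
import re

# Hypothetical, simplified stand-in for media_pattern. The real regex in
# gateway/platforms/base.py is not reproduced here.
OLD_EXTS = r"(?:pdf|png|jpe?g|mp4)"  # assumed subset of the allowlist
OLD_PATTERN = re.compile(
    r"MEDIA:\s*(?P<path>"
    rf"(?:~/|/)[^\n]*?\.{OLD_EXTS}(?=\s|$)"  # spaced-path branch, POSIX prefixes only
    r"|\S+"                                   # fallback branch, stops at whitespace
    r")"
)


def extract_old(text: str) -> list[str]:
    """Return every path captured after a MEDIA: tag."""
    return [m.group("path") for m in OLD_PATTERN.finditer(text)]


posix = "MEDIA: /home/foo/My Folder/file.pdf"
windows = r"MEDIA: C:\Users\Foo\OneDrive\My Folder\file.pdf"
gis = "MEDIA: /home/foo/My Maps/route.kmz"

print(extract_old(posix))    # ['/home/foo/My Folder/file.pdf']
print(extract_old(windows))  # ['C:\\Users\\Foo\\OneDrive\\My'] (truncated)
print(extract_old(gis))      # ['/home/foo/My'] (extension not in allowlist)
```

In this sketch the Windows path never enters the spaced-path branch because it starts with neither ~/ nor /, and the .kmz path enters it but finds no allowed extension, so both end up in the \S+ fallback.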

Related Issue

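Fixes NousResearch#24032 ([bug] extract_media regex truncates Windows spaced paths and rejects GIS extensions)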

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

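  • Add [A-Za-z]: to the spaced-path prefix group so Windows drive-letter paths match
  • Add kmz, kml, geojson, gpx, json, xml, html to the spaced-path extension allowlist
  • Add 9 regression tests covering Windows paths, GIS extensions, and combinations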

How to Test

  1. Run pytest tests/ -q — all tests should pass
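  2. Verify that a MEDIA: tag with a spaced Windows path (e.g. C:\Users\Foo\OneDrive\My Folder\file.pdf) or a spaced path ending in a GIS extension (e.g. .kmz) is extracted in full instead of being truncated at the first whitespace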
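
The second step could be checked with a small table of cases like the one below. This is only a sketch: media_paths is a hypothetical stand-in so the example runs on its own, and the cases are illustrative, not the regression tests added in this PR. In the repository the test would call extract_media from gateway/platforms/base.py instead, whose exact signature and return type are not shown in this description.

```python
import re

# Hypothetical stand-in for the patched extractor (assumption, see above).
_EXTS = "pdf|png|kmz|kml|geojson|gpx|json|xml|html"
_PATTERN = re.compile(
    rf"MEDIA:\s*((?:~/|/|[A-Za-z]:)[^\n]*?\.(?:{_EXTS})(?=\s|$)|\S+)"
)


def media_paths(text):
    """Return the path captured after each MEDIA: tag."""
    return _PATTERN.findall(text)


# (input text, expected extracted path)
CASES = [
    (r"MEDIA: C:\Users\Foo\OneDrive\My Folder\file.pdf",
     r"C:\Users\Foo\OneDrive\My Folder\file.pdf"),
    ("MEDIA: /home/foo/My Maps/route.kmz",
     "/home/foo/My Maps/route.kmz"),
    (r"MEDIA: D:\GIS Data\area.geojson trailing text",
     r"D:\GIS Data\area.geojson"),
    ("MEDIA: ~/My Docs/page.html",
     "~/My Docs/page.html"),
]


def test_spaced_media_paths():
    # Each case must yield exactly one path, with the spaces preserved.
    for text, expected in CASES:
        assert media_paths(text) == [expected]
```

pytest collects the test_ function without extra imports, so a file like this can be run with the same pytest -q invocation as in step 1.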

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 26.4.1

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture and workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A

…n MEDIA: regex

The MEDIA: tag extractor in gateway/platforms/base.py fails on Windows
absolute paths containing spaces (e.g. C:\Users\Foo\OneDrive\My Folder\file.pdf).
The path is silently truncated at the first whitespace because the spaced-path
branch only matches POSIX paths starting with ~/ or /.

Additionally, common GIS/structured-data extensions (kmz, kml, geojson, gpx,
json, xml, html) are absent from the spaced-path extension allowlist, so even
POSIX-style spaced paths fail for those types.

Changes:
- Add [A-Za-z]: to the spaced-path prefix group to match Windows drive-letter paths
- Add kmz, kml, geojson, gpx, json, xml, html to the extension allowlist
- Add 9 regression tests covering Windows paths, GIS extensions, and combinations
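
The first two changes can be sketched as follows. As before, this is a simplified illustration under stated assumptions: NEW_PATTERN, extract_new, the `path` group, and the non-GIS part of the extension list are hypothetical, and the real media_pattern may differ in its details.

```python
import re

# Assumed allowlist: a few pre-existing extensions plus the seven added here.
EXTS = r"(?:pdf|png|jpe?g|mp4|kmz|kml|geojson|gpx|json|xml|html)"
NEW_PATTERN = re.compile(
    r"MEDIA:\s*(?P<path>"
    # Spaced-path branch: [A-Za-z]: now sits next to the POSIX prefixes.
    rf"(?:~/|/|[A-Za-z]:)[^\n]*?\.{EXTS}(?=\s|$)"
    r"|\S+"  # fallback branch, unchanged
    r")"
)


def extract_new(text: str) -> list[str]:
    """Return every path captured after a MEDIA: tag."""
    return [m.group("path") for m in NEW_PATTERN.finditer(text)]


print(extract_new(r"MEDIA: C:\Users\Foo\OneDrive\My Folder\file.pdf"))
# ['C:\\Users\\Foo\\OneDrive\\My Folder\\file.pdf']
print(extract_new("MEDIA: /home/foo/My Maps/route.kmz"))
# ['/home/foo/My Maps/route.kmz']
print(extract_new(r"MEDIA: C:\Data\My Maps\route.kmz here is your map"))
# ['C:\\Data\\My Maps\\route.kmz']
```

The lazy [^\n]*? combined with the (?=\s|$) lookahead ends the match at the first allowed extension that is followed by whitespace or the end of the text, so trailing prose is not swallowed. In this sketch a UNC path such as \\server\share\... would still take the fallback branch, since only the drive-letter prefix is added.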

Fixes [bug] extract_media regex truncates Windows spaced paths and rejects GIS extensions NousResearch#24032
alt-glitch added labels on May 11, 2026: type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), P2 (Medium: degraded but workaround exists)
briandevans added a commit to briandevans/hermes-agent that referenced this pull request May 29, 2026
Revert the extract_media() regex change and its tests after
@alt-glitch flagged partial overlap with NousResearch#24049 (which covers the
extract_media() half plus GIS extensions). This PR now narrows to
the parallel-but-distinct bug in extract_local_files(), where the
same Unix-only path anchor (?:~/|/) silently drops Windows
drive-letter paths from bare-path uploads.

NousResearch#24049 remains the canonical fix for the MEDIA: tag regex.