fix: parse MEDIA paths with spaced filenames #31035

Open: williamumu wants to merge 2 commits into NousResearch:main from williamumu:fix/media-path-spaces

Conversation

@williamumu

Summary

  • Handle unquoted MEDIA: file paths that contain spaces before the file extension, e.g. V1.2 .docx
  • Ensure streamed gateway display cleanup removes the full MEDIA: tag for those paths
  • Add regression coverage for both media extraction and stream display cleanup

Test Plan

  • python -m pytest -o 'addopts=' tests/gateway/test_platform_base.py -q

@jsboige

jsboige commented May 23, 2026

Thanks for this fix. The core change from \S+(?:[^\S\n]+\S+)*?\.<ext> to [^\n]+?\.<ext>|\S+ is a clear improvement — the original pattern couldn't handle spaces immediately before the dot extension (e.g. V1.2 .docx), and the lazy quantifier with the extension-specific lookahead keeps the match well-constrained.
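
To make the difference concrete, here is a minimal sketch comparing the two extension-bearing branches quoted above. The extension list (EXT), the terminating lookahead (END), and the helper name first_path are assumptions for illustration, and the \S+ fallback is left out so that only the spaced-path branch is exercised. The real patterns in base.py may differ.

```python
import re

# Assumed for illustration: a short extension list and a simple terminator.
EXT = r"(?:docx|pdf|png)"
END = r"(?=\s|$)"

# Extension-bearing branch before and after the change (fallback omitted).
OLD = re.compile(rf"MEDIA:(\S+(?:[^\S\n]+\S+)*?\.{EXT}{END})")
NEW = re.compile(rf"MEDIA:([^\n]+?\.{EXT}{END})")


def first_path(pattern, text):
    """Return the captured path, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


spaced = "MEDIA:/tmp/report V1.2 .docx"
plain = "MEDIA:/tmp/a.png done"

print(first_path(OLD, spaced))  # None: the dot must directly follow a non-space character
print(first_path(NEW, spaced))  # /tmp/report V1.2 .docx
print(first_path(OLD, plain))   # /tmp/a.png
print(first_path(NEW, plain))   # /tmp/a.png
```

Both branches agree on the ordinary path. Only the space-before-dot case changes.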

A few observations:

  1. Regex duplication: The extraction pattern in base.py and the cleanup pattern in stream_consumer.py are now near-identical. If the extension list diverges between the two files, media tags could be extracted but not cleaned (or vice versa). Consider extracting the pattern or extension list into a shared constant.

  2. Over-matching with [^\n]+?: While the lazy quantifier + lookahead is correct here, an input like MEDIA:/a/b c d e f g h i j k.pdf unwanted text here matches lazily up to the first .pdf that satisfies the lookahead, so every space-separated segment before it becomes part of the path. The lookahead prevents capturing past the extension, so this is acceptable in practice.

  3. Tests cover the reported case well (CJK characters + space-before-dot). Consider also adding a test for a path with multiple internal spaces (e.g. /tmp/my file name.docx) to confirm the pattern handles repeated whitespace segments.
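
Observations 2 and 3 can be checked with a small sketch like the one below. The extension list, the (?=\s|$) terminator, and the helper name media_path are assumptions made for illustration. The pattern in the PR may use a longer extension list and a different lookahead.

```python
import re

# Hypothetical extension list and terminator, for illustration only.
EXT = r"(?:docx|pdf)"
PATH = re.compile(rf"MEDIA:([^\n]+?\.{EXT}(?=\s|$)|\S+)")


def media_path(text):
    """Return the path captured after MEDIA:, or None if there is no tag."""
    match = PATH.search(text)
    return match.group(1) if match else None


# Observation 2: prose after the extension is not captured.
print(media_path("MEDIA:/a/b c d e f g h i j k.pdf unwanted text here"))
# /a/b c d e f g h i j k.pdf

# Observation 3: several internal spaces are kept as part of the path.
print(media_path("MEDIA:/tmp/my file name.docx"))
# /tmp/my file name.docx
```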

The fix is correct and targeted. Ship-ready with the minor maintainability note above.
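
The maintainability note could be addressed along these lines. This is only a sketch: the names MEDIA_EXTENSIONS, MEDIA_PATH_FRAGMENT, extract_media, and strip_media, the extension list, and the simplified MEDIA: prefix handling are all hypothetical and not taken from base.py or stream_consumer.py.

```python
import re

# Hypothetical shared definitions; in practice these would live in one module
# imported by both the extraction code and the stream cleanup code.
MEDIA_EXTENSIONS = ("png", "jpg", "pdf", "docx")
_EXT_ALT = "|".join(re.escape(ext) for ext in MEDIA_EXTENSIONS)
MEDIA_PATH_FRAGMENT = rf"[^\n]+?\.(?:{_EXT_ALT})(?=\s|$)|\S+"

# Both patterns are built from the same fragment, so they cannot drift apart.
EXTRACT_RE = re.compile(rf"MEDIA:(?P<path>{MEDIA_PATH_FRAGMENT})")
CLEANUP_RE = re.compile(rf"MEDIA:(?:{MEDIA_PATH_FRAGMENT})")


def extract_media(text):
    """Return every path that follows a MEDIA: tag."""
    return [m.group("path") for m in EXTRACT_RE.finditer(text)]


def strip_media(text):
    """Remove MEDIA: tags (including the path) from display text."""
    return CLEANUP_RE.sub("", text).strip()


message = "Here is the file MEDIA:/tmp/报告 V1.2 .docx"
print(extract_media(message))  # ['/tmp/报告 V1.2 .docx']
print(strip_media(message))    # Here is the file
```

With a single source for the extension list, adding a new extension changes extraction and cleanup together.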

alt-glitch added the labels type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) on May 23, 2026
@alt-glitch

Collaborator

Part of the MEDIA path parsing cluster: related to #26407 (spaced unicode paths from tool output), #26368 (Windows paths with spaces), #24132 (spaced file paths). This PR specifically handles the edge case of spaces before the file extension (e.g. V1.2 .docx).
