fix(gateway): validate MEDIA: paths to reject regex patterns and placeholders#21593

Closed
liuhao1024 wants to merge 3 commits into
NousResearch:mainfrom
liuhao1024:fix/issue-21527-media-regex-path-validation

Conversation

@liuhao1024 commented May 8, 2026

Contributor

What does this PR do?

The extract_media() regex in gateway/platforms/base.py has a \S+ fallback that matches non-path content (regex patterns, debug placeholders, bare words). These invalid strings are then passed to os.path.exists() in platform adapters, causing "file not found" errors in gateway logs.

Root Cause

The MEDIA tag regex (line 1354) uses a \S+ fallback alternative that matches any non-whitespace. When an LLM outputs malformed MEDIA tags — containing regex patterns like (?P<path>...), debug placeholders like <path>, or bare filenames like filename.ogg — the fallback captures them as paths. No validation exists between extraction and downstream os.path.exists() calls.
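To make the failure mode concrete, here is a minimal sketch of how a `\S+` capture accepts anything that follows the tag. `LOOSE_MEDIA_RE` and `loose_extract` are hypothetical names, and the pattern is a deliberately simplified stand-in, not the actual regex from gateway/platforms/base.py.

```python
import re

# Hypothetical, simplified stand-in for a MEDIA tag pattern whose only
# alternative is the loose \S+ capture. This is NOT the project's real regex.
LOOSE_MEDIA_RE = re.compile(r"MEDIA:\s*(\S+)")


def loose_extract(text: str) -> list[str]:
    """Return every non-whitespace run that follows a MEDIA: tag."""
    return LOOSE_MEDIA_RE.findall(text)


if __name__ == "__main__":
    samples = [
        "MEDIA:/tmp/audio.ogg",   # a real-looking absolute path
        "MEDIA:(?P<path>...)",    # regex pattern echoed by the model
        "MEDIA:<path>",           # debug placeholder
        "MEDIA:filename.ogg",     # bare filename with no directory
    ]
    for sample in samples:
        # Every sample yields a "path", including the three malformed ones.
        print(sample, "->", loose_extract(sample))
```

All four samples produce a captured string, so nothing distinguishes the valid path from the malformed ones before a downstream os.path.exists() call.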

Related Issue

N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • See commit messages for detailed changes

How to Test

  1. Run pytest tests/ -q — all tests should pass
  2. Verify that malformed tags such as MEDIA:(?P<path>...), MEDIA:<path>, and MEDIA:filename.ogg are no longer extracted as media paths

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 26.4.1

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture and workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A

liuhao1024 and others added 3 commits April 24, 2026 22:14
…INSTALL_TIMEOUT

Increase the default npm install timeout for WhatsApp bridge from 60s
to 300s (5 minutes) to accommodate slower systems like Unraid NAS.
Make it configurable via WHATSAPP_NPM_INSTALL_TIMEOUT environment variable
for users who need even longer timeouts.

Closes NousResearch#14980

- Add 'path', 'old_string', 'new_string', and 'patch' to required list
- Update description to clarify mode-specific parameter requirements
- This addresses issue where LLMs would omit these parameters because
  they were not marked as required in the schema, even though they
  are required depending on the mode

Fixes NousResearch#15524

…eholders

The \S+ fallback in extract_media() regex matches non-path content
(regex patterns, debug placeholders, bare words) which then get passed
to os.path.exists() causing 'file not found' errors.

Add path format validation after extraction: require paths to start
with /, ~, or a Windows drive letter before accepting as media.

Fixes NousResearch#21527
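
The validation step this commit message describes could look roughly like the sketch below. `looks_like_media_path` and `filter_media_paths` are hypothetical names, not functions from the PR, and requiring a slash or backslash after the drive letter is an assumption about what "a Windows drive letter" means here.

```python
import re

# Assumption: a Windows drive path is a letter, a colon, then / or \.
_WINDOWS_DRIVE_RE = re.compile(r"^[A-Za-z]:[\\/]")


def looks_like_media_path(candidate: str) -> bool:
    """Accept only strings that start with /, ~, or a Windows drive prefix."""
    if candidate.startswith(("/", "~")):
        return True
    return bool(_WINDOWS_DRIVE_RE.match(candidate))


def filter_media_paths(candidates: list[str]) -> list[str]:
    """Drop extracted strings that do not look like filesystem paths."""
    return [c for c in candidates if looks_like_media_path(c)]


if __name__ == "__main__":
    extracted = [
        "/tmp/audio.ogg",
        "~/audio.ogg",
        "C:\\Users\\me\\audio.ogg",
        "(?P<path>...)",
        "<path>",
        "filename.ogg",
    ]
    # Only the first three entries survive the filter.
    print(filter_media_paths(extracted))
```

This check only inspects the prefix, so it does not confirm that the file exists or that its extension is deliverable.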
@alt-glitch added the type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) labels May 8, 2026
@teknium1

Contributor

Closing as implemented on main by automated hermes-sweeper review.

Evidence:

  • Current main no longer has the loose MEDIA:\s*\S+ extraction behavior described here. gateway/platforms/base.py:1222 defines MEDIA_TAG_CLEANUP_RE so MEDIA: paths must be anchored as /, ~/, or a Windows drive path and end in a known deliverable extension.
  • gateway/platforms/base.py:2956 uses that shared regex in extract_media(), so malformed tags like MEDIA:(?P<path>...), MEDIA:<path>, and MEDIA:filename.ogg are not appended as media paths.
  • I verified the PR's examples directly on current main: regex pattern, <path>, and bare filename all returned [], while /tmp/audio.ogg and ~/audio.ogg still extracted as valid media paths.
  • The main-line implementation landed in 781604ce4 (fix(gateway): unify MEDIA: extraction extension set + close the unknown-ext black hole (#34517) (#34844)).
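
For readers without the repository at hand, the anchored behavior described in the evidence above can be approximated with the sketch below. It is not the actual MEDIA_TAG_CLEANUP_RE from main: `ANCHORED_MEDIA_RE`, `extract_media_paths`, and the extension set are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical extension set; the real list of deliverable extensions on
# main is not shown in this thread.
_EXTENSIONS = ("ogg", "mp3", "wav", "png", "jpg", "jpeg", "mp4", "pdf")
_EXT_ALTERNATION = "|".join(_EXTENSIONS)

# The captured path must start with /, ~/, or a Windows drive prefix and
# end in one of the extensions above, followed by whitespace or end of text.
ANCHORED_MEDIA_RE = re.compile(
    r"MEDIA:\s*((?:/|~/|[A-Za-z]:[\\/])\S*?\.(?:" + _EXT_ALTERNATION + r"))(?=\s|$)",
    re.IGNORECASE,
)


def extract_media_paths(text: str) -> list[str]:
    """Return only MEDIA: values that are anchored paths with a known extension."""
    return ANCHORED_MEDIA_RE.findall(text)


if __name__ == "__main__":
    for sample in (
        "MEDIA:/tmp/audio.ogg",
        "MEDIA:~/audio.ogg",
        "MEDIA:(?P<path>...)",
        "MEDIA:<path>",
        "MEDIA:filename.ogg",
    ):
        print(sample, "->", extract_media_paths(sample))
```

Anchoring inside the extraction pattern means malformed tags never reach a later validation step, at the cost of silently ignoring any path whose extension is not in the set.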

The branch also carries older unrelated commits; those have main-line equivalents or safer superseding behavior, so there is no remaining change here to salvage as-is.

@teknium1 closed this Jun 11, 2026
@teknium1 added the sweeper:implemented-on-main (Sweeper: behavior already present on current main) label Jun 11, 2026
