Skip to content

fix(extract_media): restrict \S+ fallback to absolute paths only#24672

Closed
wesleysimplicio wants to merge 1 commit into
NousResearch:mainfrom
wesleysimplicio:fix/cx16-issue-24575-extract-media-greedy-fallback
Closed

fix(extract_media): restrict \S+ fallback to absolute paths only#24672
wesleysimplicio wants to merge 1 commit into
NousResearch:mainfrom
wesleysimplicio:fix/cx16-issue-24575-extract-media-greedy-fallback

Conversation

@wesleysimplicio

@wesleysimplicio wesleysimplicio commented May 13, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

BasePlatformAdapter.extract_media() logs spurious File not found warnings whenever the agent explains the MEDIA: syntax in a response. The regex's last fallback alternative — a bare \S+ — matches any non-whitespace token, so plain words, relative tokens, and template placeholders like <filename> are all captured as media paths. send_document() then fails with os.path.exists() == False and the gateway emits:

Root cause

BasePlatformAdapter.extract_media() logs spurious File not found warnings
whenever the agent explains the MEDIA: syntax in a response. The regex's
last fallback alternative — a bare \S+ — matches any non-whitespace token,
so plain words, relative tokens, and template placeholders like <filename>
are all captured as media paths. send_document() then fails with
os.path.exists() == False and the gateway emits:

WARNING: File file not found: /home/hermes/outbox/<filename>
WARNING: File file not found: /home/hermes/outbox/file
WARNING: File file not found: /outbox/

Fix

Change the fallback from \S+ to (?:~/|/)\S+ so only strings beginning
with an absolute-path prefix are captured. Relative tokens and plain words no
longer trigger false-positive media sends.

# After (restricted to absolute paths)
r'''[`"']?MEDIA:\s*(?P<path>...|(?:~/|/)\S+...extension-list...|(?:~/|/)\S+)[`"']?'''
#                                                                  ^^^^^^^^^^

Why this shape

This shape mirrors #29640 so reviewers can quickly compare scope, root cause, fix, tests, and related context without having to decode a custom PR description.

Tests

  • Veja a descrição original preservada abaixo para detalhes de validação, testes e notas de verificação.
Original body

Related PRs / issues

Closes #24575

Original body

Summary

BasePlatformAdapter.extract_media() logs spurious File not found warnings whenever the agent explains the MEDIA: syntax in a response. The regex's last fallback alternative — a bare \S+ — matches any non-whitespace token, so plain words, relative tokens, and template placeholders like <filename> are all captured as media paths. send_document() then fails with os.path.exists() == False and the gateway emits:

What Changed

  • Standardized this PR body to the current Hermes Turbo template.
  • Preserved the original detailed description below for reference.

Fluxo

A mudança continua seguindo o fluxo original descrito na seção preservada abaixo, sem ampliar o escopo funcional deste PR.

Visão

A padronização melhora a revisão, reduz ruído e evita deriva de formatação entre PRs abertos.

Test Plan

  • Veja a descrição original preservada abaixo para detalhes de validação, testes e notas de verificação.
Original body

What does this PR do?

Problem

BasePlatformAdapter.extract_media() logs spurious File not found warnings
whenever the agent explains the MEDIA: syntax in a response. The regex's
last fallback alternative — a bare \S+ — matches any non-whitespace token,
so plain words, relative tokens, and template placeholders like <filename>
are all captured as media paths. send_document() then fails with
os.path.exists() == False and the gateway emits:

WARNING: File file not found: /home/hermes/outbox/<filename>
WARNING: File file not found: /home/hermes/outbox/file
WARNING: File file not found: /outbox/

Root cause

gateway/platforms/base.py — the named path group in the extract_media
regex ends with |\S+). This makes the fallback match anything:

# Before (too broad)
r'''[`"']?MEDIA:\s*(?P<path>...|(?:~/|/)\S+...extension-list...|\S+)[`"']?'''
#                                                                   ^^^^

Fix

Change the fallback from \S+ to (?:~/|/)\S+ so only strings beginning
with an absolute-path prefix are captured. Relative tokens and plain words no
longer trigger false-positive media sends.

# After (restricted to absolute paths)
r'''[`"']?MEDIA:\s*(?P<path>...|(?:~/|/)\S+...extension-list...|(?:~/|/)\S+)[`"']?'''
#                                                                  ^^^^^^^^^^

Tests

Added test_extract_media_ignores_relative_word_fallback in
tests/gateway/test_signal.py:

  • MEDIA:description → empty (plain word, no leading /)
  • MEDIA:relative/path.png → empty (relative path, no leading /)
  • MEDIA:/tmp/file.ogg → extracted correctly (absolute path still works)

All 3 new assertions pass. Existing test_extract_media_finds_image_tag and
test_extract_media_finds_audio_tag continue to pass.

Visual

flowchart LR
    BUG["❌ Bug: MEDIA:description → captured as media path"] --> WARN["WARNING: File not found"]
    FIX["✅ Fix: MEDIA:description → ignored (no leading /)"] --> OK["No spurious warning"]
Loading

Closes #24575

Solution Sketch

  • fix the root cause in the touched subsystem instead of layering a broad workaround around the symptom
  • keep surrounding behavior stable and avoid unrelated refactors while the area is under review
  • prove the change with focused checks on the exact path that regressed

Related Issue

Closes #24575

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • preserved the existing technical rationale and validation notes inside the template body
  • scoped this PR description to the implementation already present on the branch
  • aligned the delivery format with .github/PULL_REQUEST_TEMPLATE.md

How to Test

  1. Review the existing validation notes preserved in this PR body.
  2. Run the focused checks for the touched area.
  3. Confirm the scoped change still behaves as described above.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform:

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • N/A.

Generated by Hermes Turbo


Generated by Hermes Turbo

## Problem
extract_media() logs spurious "File not found" warnings when the agent
includes instructional MEDIA: examples in its response text. The last
alternative in the named-path regex group — a bare `\S+` — matches any
non-whitespace token, including template placeholders and plain words.

## Root cause
gateway/platforms/base.py regex at the `\S+` fallback inside
`(?P<path>...)` would match `MEDIA:description`, `MEDIA:filename`,
or any word-like token after `MEDIA:`, triggering a failed send_document()
call and a WARNING log entry.

## Fix
Changed the fallback alternative from `\S+` to `(?:~/|/)\S+` so only
strings that begin with an absolute path character (`/` or `~/`) are
captured. Relative tokens and plain words no longer match.

## Tests
Added test_extract_media_ignores_relative_word_fallback in
tests/gateway/test_signal.py — asserts that plain words and relative
paths are not extracted, and that absolute paths still are.
Copilot AI review requested due to automatic review settings May 13, 2026 00:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/gateway Gateway runner, session dispatch, delivery labels May 13, 2026
@teknium1

Copy link
Copy Markdown
Contributor

Automated hermes-sweeper review: this looks implemented on current main.

Evidence:

  • gateway/platforms/base.py:1222 defines the current shared MEDIA_TAG_CLEANUP_RE; it is anchored to quoted paths or absolute/home/Windows path prefixes ending in known deliverable extensions, with no old bare |\S+ fallback.
  • gateway/platforms/base.py:2956 shows BasePlatformAdapter.extract_media() now uses that shared regex for explicit MEDIA: extraction.
  • I exercised the PR's examples on current main: MEDIA:description and MEDIA:relative/path.png return no media, while MEDIA:/tmp/file.ogg still extracts as expected.
  • Regression coverage exists in tests/gateway/test_platform_base.py:396 for ignored relative paths and tests/gateway/test_platform_base.py:412 for inline-code MEDIA examples being preserved as text.
  • The direct fallback removal landed in ea49b38625b4c8ee9f8aedf6c7c3e9ec36acfe69; the later shared-regex consolidation landed in 781604ce4c826ec06b69a0bde703ab935308893d.

Thanks for the focused fix and clear write-up; the current implementation covers the same behavior.

@teknium1 teknium1 closed this Jun 11, 2026
@teknium1 teknium1 added the sweeper:implemented-on-main Sweeper: behavior already present on current main label Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have sweeper:implemented-on-main Sweeper: behavior already present on current main type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

extract_media() greedy \S+ fallback matches non-file text -> spurious 'File not found' warnings

4 participants