Skip to content

fix(gateway): unify extract_media + extract_local_files extension lists (#34321)#34345

Closed
Bartok9 wants to merge 1 commit into
NousResearch:mainfrom
Bartok9:fix/34321-extract-media-shared-extensions
Closed

fix(gateway): unify extract_media + extract_local_files extension lists (#34321)#34345
Bartok9 wants to merge 1 commit into
NousResearch:mainfrom
Bartok9:fix/34321-extract-media-shared-extensions

Conversation

@Bartok9

@Bartok9 Bartok9 commented May 29, 2026

Copy link
Copy Markdown
Contributor

Closes #34321

Problem

BasePlatformAdapter.extract_media() rejects .md (and 22 other text-based extensions) that extract_local_files() correctly accepts. MEDIA: tags pointing to those files fall through to plain text on every platform, while a bare path in the same response delivers as a native attachment.

Commit ea49b3862 narrowed the extract_media regex to fix Mattermost false positives, but left out:

.md .doc .odt .rtf .ods .tsv .json .xml .yaml .yml .ppt .odp
.key .tar .gz .tgz .bz2 .xz .html .htm .tiff .svg .bmp

Reproduction (from issue)

send_message(message='MEDIA:~/.hermes/cache/documents/report.md',
             target='weixin')
  • Before: literal MEDIA:~/.hermes/... delivered as text
  • After: report.md uploaded as native attachment ✅

Fix

Extract the extension list into a class-level constant BasePlatformAdapter._MEDIA_DELIVERY_EXTS and use it in both extract_media() and extract_local_files(). Single source of truth eliminates the drift class.

Implementation notes:

  • _MEDIA_DELIVERY_EXTS is a class attribute (Tuple[str, ...]) so subclass overrides work naturally if a platform ever needs to narrow it.
  • Regex alternation sorted by length descending so .tgz wins against .gz (would short-circuit otherwise).
  • Added re.IGNORECASE on extract_media to match the case-insensitivity already in extract_local_files. Real responses include MEDIA:/path/foo.PDF.
  • Union of both prior lists: .epub/.opus/.apk/.ipa were in extract_media but missing from extract_local_files. Both sets now in the shared constant.

Tests

10 new tests in TestExtractMediaExtensionParity:

  • Shared constant exists, non-empty, contains .md
  • Parametrized across all 23 previously-missing extensions: each is in the constant AND matches via extract_media()
  • Real-world extract_media() regex missing .md and other text-based extensions, mismatch with extract_local_files() #34321 reproduction case (MEDIA:~/.hermes/cache/documents/report.md)
  • Cross-method probe: both methods accept the same set on real files at /tmp
  • [[audio_as_voice]] still applies to .md (behavior preserved)
  • Unknown extensions like .exe still rejected (this fix expands the allow-list, doesn't open the gate)
$ python -m pytest tests/gateway/test_platform_base.py tests/gateway/test_extract_local_files.py tests/gateway/test_media_extraction.py
=== 209 passed, 2 skipped in 1.58s ===

Files changed

File Change
gateway/platforms/base.py +49 / -19 (shared constant + both methods use it)
tests/gateway/test_platform_base.py +99 / 0 (10 new tests)

🎻 Co-authored-by: Cursor cursoragent@cursor.com

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/gateway Gateway runner, session dispatch, delivery labels May 29, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Part of the saturated MEDIA extension list cluster: #29609 (dynamic derivation, preferred), #31138, #31150 (closed), #22492, #30588. Fixes #34321.

…ts (NousResearch#34321)

BasePlatformAdapter.extract_media() rejects .md (and 22 other text-based
extensions) that extract_local_files() correctly accepts. This causes
MEDIA:<path> tags pointing to those files to fall through to plain
text on every platform, while a bare path in the same response would
deliver as a native attachment.

Commit ea49b38 narrowed the extract_media regex (to fix Mattermost
false positives) but left out:
  .md .doc .odt .rtf .ods .tsv .json .xml .yaml .yml .ppt .odp
  .key .tar .gz .tgz .bz2 .xz .html .htm .tiff .svg .bmp

Reproduction from issue:
  send_message(message='MEDIA:~/.hermes/cache/documents/report.md',
               target='weixin')
  → before: literal 'MEDIA:~/.hermes/...' delivered as text
  → after:  report.md uploaded as native attachment

Fix:

Extract the extension list into a class-level constant
`BasePlatformAdapter._MEDIA_DELIVERY_EXTS` and use it in BOTH
extract_media() and extract_local_files(). The single source of
truth eliminates the drift class.

Implementation notes:

- _MEDIA_DELIVERY_EXTS is a class attribute (Tuple[str, ...]) so it
  participates in subclass overrides naturally if a platform ever
  needs to narrow it (Mattermost / WhatsApp text-only fallback, etc.)
  without forking extract_media() itself.

- Regex alternation is sorted by length descending so longer
  extensions like `.tgz` win against shorter aliases — necessary
  when the same suffix could otherwise short-circuit a longer one.

- re.IGNORECASE added on extract_media's regex to match the case
  insensitivity already present on extract_local_files. Real
  responses include `MEDIA:/path/foo.PDF` from skill authors who
  uppercase extensions.

- .epub and .opus were already in extract_media but not
  extract_local_files; .apk/.ipa were in extract_media but not
  extract_local_files. Both sets are now in the shared constant,
  which is the union of both prior lists.

Tests (10 new in TestExtractMediaExtensionParity):

- Shared constant exists, is non-empty, contains .md (the issue's
  example).
- Parametrized across all 23 previously-missing extensions: each is
  now in the shared constant AND now matches via extract_media().
- Real-world NousResearch#34321 reproduction case (`MEDIA:~/.hermes/cache/
  documents/report.md`) extracts correctly.
- Cross-method probe: both extract_media and extract_local_files
  accept the same set on real files at /tmp.
- [[audio_as_voice]] still applies to .md (behavior preserved).
- Unknown extensions like .exe still rejected (guardrail: this fix
  expands the allow-list, it does not open the gate to anything).

All 209 tests in test_platform_base.py +
test_extract_local_files.py + test_media_extraction.py pass.

Refs: NousResearch#34321
Closes: NousResearch#34321

Co-authored-by: Cursor <cursoragent@cursor.com>
@teknium1

Copy link
Copy Markdown
Contributor

Merged via #34844, which consolidates the MEDIA: extension-allowlist cluster.

Your contribution was used directly and your authorship is preserved via a Co-authored-by: trailer on the merge commit (781604c on main). #34844 fixes both halves of issue #34517: it unifies extract_media and extract_local_files onto one shared extension set AND gates the cleanup strip so unknown-extension paths are no longer silently dropped.

Thanks for the work here.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

extract_media() regex missing .md and other text-based extensions, mismatch with extract_local_files()

3 participants