fix(gateway): align MEDIA: regex whitelist with extract_local_files#34518

Closed
crazyhulk wants to merge 1 commit into NousResearch:main from crazyhulk:fix/media-regex-whitelist-alignment

@crazyhulk

Summary

Align extract_media's extension whitelist with the broader set supported by extract_local_files, fixing silent file drops for .md, .json, .yaml, and other formats.

Problem

PR #28350 introduced a strict extension whitelist in extract_media, but the cleanup regex on line 3709 (MEDIA:\s*\S+) unconditionally strips all MEDIA: tags from the text, including those the strict regex did not match. Files with unsupported extensions therefore fall into a black hole: extract_media does not extract them, and because the tag has already been removed, the downstream extract_local_files fallback cannot detect them either.
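
The interaction can be reproduced with a minimal sketch. Only the MEDIA:\s*\S+ cleanup regex is taken from the description above; the two-extension whitelist, the pattern shape, and the function name `extract_media_sketch` are simplified stand-ins, not the actual gateway code.

```python
import re

# Hypothetical, heavily reduced whitelist standing in for the strict regex.
STRICT_MEDIA_RE = re.compile(r"MEDIA:\s*(\S+\.(?:png|pdf))\b")
# Cleanup regex quoted in this PR: removes every MEDIA: tag, matched or not.
CLEANUP_RE = re.compile(r"MEDIA:\s*\S+")


def extract_media_sketch(text: str) -> tuple[list[str], str]:
    """Return (extracted paths, text with all MEDIA: tags stripped)."""
    paths = STRICT_MEDIA_RE.findall(text)
    cleaned = CLEANUP_RE.sub("", text).strip()
    return paths, cleaned


paths, cleaned = extract_media_sketch(
    "Report ready MEDIA:/tmp/paid_users_up_analysis.md"
)
print(paths)    # nothing extracted, because .md is not in the whitelist
print(cleaned)  # the path has also been removed from the text
```

Because the path no longer appears in `cleaned`, any later pass that scans the text for bare file paths has nothing left to find.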

Changes

Added missing extensions to extract_media's regex: md, json, xml, ya?ml, tsv, odt, rtf, bmp, tiff, svg, tar, gz, tgz, bz2, xz, xls, ods, ppt, odp, key.
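
As a rough illustration of the widened alternation, the sketch below builds a pattern from only the extensions listed above. The real regex also contains the previously supported extensions, which are not listed in this PR description, and the pattern shape and the names `ADDED_EXTS` and `ADDED_MEDIA_RE` are assumptions.

```python
import re

# Only the extensions this PR adds; the pre-existing ones are omitted.
ADDED_EXTS = [
    "md", "json", "xml", "ya?ml", "tsv", "odt", "rtf", "bmp", "tiff", "svg",
    "tar", "gz", "tgz", "bz2", "xz", "xls", "ods", "ppt", "odp", "key",
]

# Hypothetical shape: MEDIA:, optional whitespace, then a path that ends in
# one of the listed extensions (entries are regex fragments, e.g. ya?ml).
ADDED_MEDIA_RE = re.compile(
    r"MEDIA:\s*(\S+\.(?:" + "|".join(ADDED_EXTS) + r"))(?!\w)",
    re.IGNORECASE,
)

match = ADDED_MEDIA_RE.search("MEDIA:/tmp/paid_users_up_analysis.md")
print(match.group(1) if match else None)
```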

Testing

  • Verified MEDIA:/tmp/paid_users_up_analysis.md now matches
  • test_media_extraction.py: 4 passed
  • test_platform_base.py: 114 passed, 2 skipped

Fixes #34517

The strict regex in extract_media was missing extensions (.md, .json,
.yaml, .xml, .tsv, etc.) that extract_local_files supports. Because
line 3709 unconditionally strips all MEDIA: tags from text regardless
of whether extract_media matched them, files with unsupported
extensions were silently dropped — neither extracted as media nor
detected as bare paths.

Align the extension whitelist so extract_media captures all formats
that the platform can deliver.

Fixes NousResearch#34517

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels on May 29, 2026
@alt-glitch

Collaborator

Duplicate of #29609 (canonical fix: dynamic derivation from SUPPORTED_DOCUMENT_TYPES). Competes with #30588, #33089, #32995, #34016, and others in the same cluster.

@crazyhulk

Author

Duplicate of #29609 (canonical fix: dynamic derivation from SUPPORTED_DOCUMENT_TYPES). Competes with #30588, #33089, #32995, #34016, and others in the same cluster.

I'm fine with whichever of these PRs lands; we just need the problem solved. I can close this one at any time.

@teknium1

Contributor

Superseded by #34844, which consolidates this cluster.

This PR widens the extract_media extension allowlist, which is the right direction — but on its own it leaves the unconditional MEDIA:\s*\S+ strip in place, so a MEDIA: tag with any extension still outside the (now wider) list keeps getting deleted from the body before extract_local_files can pick up the bare path. #34844 fixes both halves: it unifies the two extractors onto a single shared extension set (MEDIA_DELIVERY_EXTS) AND replaces the loose strip with an extension-anchored one, so an unknown-extension path survives in the text instead of vanishing.
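
The two-part fix described here can be sketched roughly as follows. The name MEDIA_DELIVERY_EXTS comes from the comment above, but its contents below are placeholders, and the pattern shape and the function name `extract_and_strip` are assumptions; the actual code in #34844 is not shown in this thread.

```python
import re

# Placeholder contents; the real MEDIA_DELIVERY_EXTS is not shown here.
MEDIA_DELIVERY_EXTS = ("png", "pdf", "md", "json")

_EXT_ALT = "|".join(re.escape(ext) for ext in MEDIA_DELIVERY_EXTS)
# One extension-anchored pattern serves both extraction and stripping,
# so the two steps cannot drift apart.
MEDIA_TAG_RE = re.compile(
    r"MEDIA:\s*(\S+\.(?:" + _EXT_ALT + r"))(?!\w)",
    re.IGNORECASE,
)


def extract_and_strip(text: str) -> tuple[list[str], str]:
    """Extract known-extension MEDIA: paths and strip only those tags."""
    paths = MEDIA_TAG_RE.findall(text)
    cleaned = MEDIA_TAG_RE.sub("", text).strip()
    return paths, cleaned


known = extract_and_strip("Done MEDIA:/tmp/report.md")
unknown = extract_and_strip("Done MEDIA:/tmp/data.parquet")
print(known)    # path extracted, tag removed
print(unknown)  # nothing extracted, text left untouched
```

In this sketch the whole tag, including the MEDIA: prefix, stays in the text when the extension is unknown. Whether #34844 keeps or drops the prefix in that case is not stated in this thread.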

Closing as superseded — thanks for surfacing and helping pin down this bug; it was part of getting the full fix right. See #34844.

@teknium1 closed this on May 29, 2026

Development

Successfully merging this pull request may close these issues.

MEDIA: tag silently drops .md (and other) files due to regex whitelist mismatch