Skip to content

fix(gateway): unify MEDIA extraction onto one curated extension set#34656

Closed
banditburai wants to merge 16 commits into
NousResearch:mainfrom
banditburai:fix/gateway-media-extensions
Closed

fix(gateway): unify MEDIA extraction onto one curated extension set#34656
banditburai wants to merge 16 commits into
NousResearch:mainfrom
banditburai:fix/gateway-media-extensions

Conversation

@banditburai

Copy link
Copy Markdown
Contributor

Collapse ~5 drifted MEDIA-extraction regex/extension copies onto one curated _DELIVERY_MEDIA_EXTS frozenset + shared compiled regexes in gateway/platforms/base.py (used by extract_media, extract_local_files, the two run.py tool-dedup sites via a new tool_media_paths(), stream_consumer, and the dispatch strip). Broadens deliverable extensions (docs incl. .md/.html, uppercase, bold/quoted/spaced paths, dedup, GIS .kml/.kmz/.geojson/.gpx). The curated set is the union of what extract_media and extract_local_files already delivered, so source/script/config exts (.py/.sh/.ts/.toml/.ini/.cfg/.log) — never in either list — stay excluded (delivering them outbound would be new behavior, and scripts/config often carry secrets).

⚠️ Behavior changes

Change Before After
MEDIA: literal matched loosely case-sensitive (MEDIA: only); extensions stay case-insensitive
Unknown-extension MEDIA: tags in streaming/dispatch display blindly hidden by a MEDIA:\s*\S+ strip left visible, identical to the non-streaming final text
Windows (C:\, C:/) / UNC (\\server\share) MEDIA: paths leaked verbatim into the message as plain text recognized + stripped from text; not delivered (fail-closed)

Security

  • validate_media_delivery_path's safe-roots / denylist / recency logic is unchanged except one additive fail-closed guard that rejects Windows/UNC paths — it only narrows, and closes a pre-existing native-Windows fail-open (the POSIX denylist matches nothing there).
  • Strictly narrowing-or-neutral on the prompt-injection exfiltration boundary; no ReDoS.

Scope — fixed here vs left for later

Issue Fixed here Out of scope (future work)
#24032 GIS extensions + spaced-path extraction native Windows-path delivery
#28989 raw-path text leak (recognized + stripped) native Windows delivery
#32644 silent-success surfaced via media_dropped safe-roots / archival-path delivery

The out-of-scope items all change what paths can be delivered, which touches the path-validation exfiltration boundary (a Windows-aware allowlist; honoring additional safe roots). They're intentionally left for a separate, security-reviewed change — these issues stay open for that delivery half.

Tests

New behavior-focused suites: tests/gateway/test_extract_media_extensions.py (+ stream-strip equivalence, test_platform_base partition/Windows-guard), CI-safe tests/tools/test_send_message_media_dropped.py. grep-verify confirms no drifted copies remain.

Related open PRs (29)

Many open PRs fix subsets of the same drift; this unifies them onto one source of truth. Listed for dedup:

Alternative approaches taken elsewhere (this PR differs):

Out of scope: source/config delivery (5)

#32294 #32398 #32604 #32611 #19926 request auto-delivering source/config/script exts (.toml/.log/.py/.sh/...). This PR keeps the curated set to what was already deliverable and doesn't add them — whether to deliver those outbound is a separate product decision.

Closes #34321
Closes #34517
Closes #31137
Closes #30518
Closes #29582
Closes #30526
Closes #23759
Closes #24575
Closes #31560

The non-streaming dispatch path used a blind re.sub(r"MEDIA:\s*\S+") to
scrub residual MEDIA: tags from outbound text. extract_media already strips
recognized tags upstream, so the blind pass only erased unknown-extension
tags — silently hiding the model's intent to deliver a (non-deliverable)
file (NousResearch#32644) and diverging from the streaming display path, which was moved
to the shared _MEDIA_TAG_RE. Route this 5th site through the same object:
defensive no-op for recognized tags, unknown-ext tags now left visible and
consistent across streaming/non-streaming. Closes the last blind MEDIA regex.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels May 29, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Part of the saturated MEDIA extension unification cluster. Canonical dynamic approach: #29609. Also overlaps with open #34345 (shared _MEDIA_DELIVERY_EXTS constant). Root issue: #31137.

…ry set

Closes the extension half of NousResearch#24032 — these are legitimate deliverable
artifacts (not source/config), route to send_document via the existing
non-image/video/audio fallback. The Windows-path half of NousResearch#24032 is handled
separately.
…xt leak (NousResearch#28989, NousResearch#24032)

extract_media/_TOOL_MEDIA_RE now recognize Windows drive-letter (C:\...,
C:/...) and UNC (\\server\share\...) paths — incl. spaced and bold forms —
so the MEDIA: tag is stripped from the user-visible message instead of leaking
verbatim as plain text (the headline symptom of NousResearch#28989 and NousResearch#24032).

Safe native-Windows *delivery* is deliberately NOT enabled here: a Windows
credential denylist is unwinnable (\\?\ prefixes, 8.3 names, ADS, UNC all
defeat prefix matching), so it requires the allowlist-model path-validation
security PR (the L0 follow-up, GRILME Q7a). validate_media_delivery_path
fail-closes drive-letter/UNC paths (\w:[\\/]|\\\\ — \w catches Unicode
homoglyph drives) on every host, so they are dropped + surfaced via
media_dropped rather than delivered. This also closes a latent fail-open for
quoted Windows paths on native Windows hosts, where the POSIX credential
denylist matches nothing.

_LOCAL_PATH_RE (bare-path auto-detect) is intentionally left POSIX-only to
avoid false positives on 'C:' in prose.
…osed guard

- Move media_dropped success + send-error-guard assertions out of
  test_send_message_tool.py (gated by importorskip('telegram'), which neither
  [dev] nor [all] installs, so they were silently skipped in CI) into a new
  telegram-free test_send_message_media_dropped.py. Adds an end-to-end
  Windows-tag strip+drop test through the send_message tool path.
- Strengthen the Windows-not-deliverable test: force is_absolute()=True (the
  native-Windows condition) so the fail-closed guard ALONE is what rejects
  drive-letter/UNC/homoglyph paths, instead of the incidental POSIX reject
  masking a deleted guard.
Address code-review test-quality findings (behavior over implementation):
- Extract tool_media_paths() so run.py's two MEDIA-salvage sites share one
  tested helper instead of duplicating group(1).strip().rstrip('",}'); replace
  the impl-coupled _TOOL_MEDIA_RE direct tests (which had drifted from the
  caller's salvage) with a behavioral test of the helper.
- Replace the stream-consumer _MEDIA_RE-is-_MEDIA_TAG_RE identity assert (a
  change-detector) with a behavioral equivalence test: streaming display-strip
  output must equal extract_media's cleaning across inputs — the real zero-drift
  contract, surviving a refactor that doesn't literally share one object.
…ation, trim what-comments

Concision pass: inline the one-use _media_ext_alternation() into the
_MEDIA_EXT_ALT constant; drop the redundant non-capturing group in the
Windows guard; condense the frozenset header + guard comments to the
load-bearing why (single-source + exclusion rationale; fail-closed +
homoglyph rationale), dropping the what-enumeration and an opaque
internal cross-reference. No behavior change.
test_windows_media_path_partitions_to_dropped asserted the composition of two
already-covered behaviors (partition split + validate Windows->None) and is
also covered end-to-end by test_windows_media_tag_stripped_and_dropped_end_to_end.
…ensions

# Conflicts:
#	gateway/platforms/base.py
@teknium1

Copy link
Copy Markdown
Contributor

Merged via #34844, which consolidates the MEDIA: extension-allowlist cluster.

Your contribution was used directly and your authorship is preserved via a Co-authored-by: trailer on the merge commit (781604c on main). #34844 fixes both halves of issue #34517: it unifies extract_media and extract_local_files onto one shared extension set AND gates the cleanup strip so unknown-extension paths are no longer silently dropped.

Thanks for the work here.

@teknium1 teknium1 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment