fix(gateway): unify MEDIA extraction onto one curated extension set#34656
Closed
banditburai wants to merge 16 commits into
Closed
fix(gateway): unify MEDIA extraction onto one curated extension set#34656banditburai wants to merge 16 commits into
banditburai wants to merge 16 commits into
Conversation
The non-streaming dispatch path used a blind re.sub(r"MEDIA:\s*\S+") to scrub residual MEDIA: tags from outbound text. extract_media already strips recognized tags upstream, so the blind pass only erased unknown-extension tags — silently hiding the model's intent to deliver a (non-deliverable) file (NousResearch#32644) and diverging from the streaming display path, which was moved to the shared _MEDIA_TAG_RE. Route this 5th site through the same object: defensive no-op for recognized tags, unknown-ext tags now left visible and consistent across streaming/non-streaming. Closes the last blind MEDIA regex.
Collaborator
…ry set Closes the extension half of NousResearch#24032 — these are legitimate deliverable artifacts (not source/config), route to send_document via the existing non-image/video/audio fallback. The Windows-path half of NousResearch#24032 is handled separately.
…xt leak (NousResearch#28989, NousResearch#24032) extract_media/_TOOL_MEDIA_RE now recognize Windows drive-letter (C:\..., C:/...) and UNC (\\server\share\...) paths — incl. spaced and bold forms — so the MEDIA: tag is stripped from the user-visible message instead of leaking verbatim as plain text (the headline symptom of NousResearch#28989 and NousResearch#24032). Safe native-Windows *delivery* is deliberately NOT enabled here: a Windows credential denylist is unwinnable (\\?\ prefixes, 8.3 names, ADS, UNC all defeat prefix matching), so it requires the allowlist-model path-validation security PR (the L0 follow-up, GRILME Q7a). validate_media_delivery_path fail-closes drive-letter/UNC paths (\w:[\\/]|\\\\ — \w catches Unicode homoglyph drives) on every host, so they are dropped + surfaced via media_dropped rather than delivered. This also closes a latent fail-open for quoted Windows paths on native Windows hosts, where the POSIX credential denylist matches nothing. _LOCAL_PATH_RE (bare-path auto-detect) is intentionally left POSIX-only to avoid false positives on 'C:' in prose.
…osed guard
- Move media_dropped success + send-error-guard assertions out of
test_send_message_tool.py (gated by importorskip('telegram'), which neither
[dev] nor [all] installs, so they were silently skipped in CI) into a new
telegram-free test_send_message_media_dropped.py. Adds an end-to-end
Windows-tag strip+drop test through the send_message tool path.
- Strengthen the Windows-not-deliverable test: force is_absolute()=True (the
native-Windows condition) so the fail-closed guard ALONE is what rejects
drive-letter/UNC/homoglyph paths, instead of the incidental POSIX reject
masking a deleted guard.
Address code-review test-quality findings (behavior over implementation):
- Extract tool_media_paths() so run.py's two MEDIA-salvage sites share one
tested helper instead of duplicating group(1).strip().rstrip('",}'); replace
the impl-coupled _TOOL_MEDIA_RE direct tests (which had drifted from the
caller's salvage) with a behavioral test of the helper.
- Replace the stream-consumer _MEDIA_RE-is-_MEDIA_TAG_RE identity assert (a
change-detector) with a behavioral equivalence test: streaming display-strip
output must equal extract_media's cleaning across inputs — the real zero-drift
contract, surviving a refactor that doesn't literally share one object.
…ation, trim what-comments Concision pass: inline the one-use _media_ext_alternation() into the _MEDIA_EXT_ALT constant; drop the redundant non-capturing group in the Windows guard; condense the frozenset header + guard comments to the load-bearing why (single-source + exclusion rationale; fail-closed + homoglyph rationale), dropping the what-enumeration and an opaque internal cross-reference. No behavior change.
test_windows_media_path_partitions_to_dropped asserted the composition of two already-covered behaviors (partition split + validate Windows->None) and is also covered end-to-end by test_windows_media_tag_stripped_and_dropped_end_to_end.
…havior-preservation
…ensions # Conflicts: # gateway/platforms/base.py
d3e9e8f to
ee43410
Compare
Contributor
|
Merged via #34844, which consolidates the MEDIA: extension-allowlist cluster. Your contribution was used directly and your authorship is preserved via a Thanks for the work here. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Collapse ~5 drifted MEDIA-extraction regex/extension copies onto one curated
_DELIVERY_MEDIA_EXTSfrozenset + shared compiled regexes ingateway/platforms/base.py(used byextract_media,extract_local_files, the tworun.pytool-dedup sites via a newtool_media_paths(),stream_consumer, and the dispatch strip). Broadens deliverable extensions (docs incl..md/.html, uppercase, bold/quoted/spaced paths, dedup, GIS.kml/.kmz/.geojson/.gpx). The curated set is the union of whatextract_mediaandextract_local_filesalready delivered, so source/script/config exts (.py/.sh/.ts/.toml/.ini/.cfg/.log) — never in either list — stay excluded (delivering them outbound would be new behavior, and scripts/config often carry secrets).MEDIA:literalMEDIA:only); extensions stay case-insensitiveMEDIA:tags in streaming/dispatch displayMEDIA:\s*\S+stripC:\,C:/) / UNC (\\server\share)MEDIA:pathsSecurity
validate_media_delivery_path's safe-roots / denylist / recency logic is unchanged except one additive fail-closed guard that rejects Windows/UNC paths — it only narrows, and closes a pre-existing native-Windows fail-open (the POSIX denylist matches nothing there).Scope — fixed here vs left for later
media_droppedThe out-of-scope items all change what paths can be delivered, which touches the path-validation exfiltration boundary (a Windows-aware allowlist; honoring additional safe roots). They're intentionally left for a separate, security-reviewed change — these issues stay open for that delivery half.
Tests
New behavior-focused suites:
tests/gateway/test_extract_media_extensions.py(+ stream-strip equivalence,test_platform_basepartition/Windows-guard), CI-safetests/tools/test_send_message_media_dropped.py.grep-verify confirms no drifted copies remain.Related open PRs (29)
Many open PRs fix subsets of the same drift; this unifies them onto one source of truth. Listed for dedup:
\S+fallback (1): fix(extract_media): restrict \S+ fallback to absolute paths only #24672Alternative approaches taken elsewhere (this PR differs):
SUPPORTED_DOCUMENT_TYPES(5): fix(gateway): derive MEDIA extensions dynamically from SUPPORTED_DOCUMENT_TYPES #31159 fix(gateway): derive MEDIA attachment extensions #32472 fix(gateway): sync MEDIA regex extension allowlist with SUPPORTED_DOCUMENT_TYPES #29609 fix(gateway): align MEDIA: regex whitelist with SUPPORTED_DOCUMENT_TYPES #32995 fix(gateway): define missing _MEDIA_EXTS — MEDIA tag and local file extraction #34115 — this uses a hand-curated set instead, to keep source/config exts out of outbound deliveryextract_local_filesbare paths (1): fix(gateway): recognize Windows drive-letter paths in extract_local_files() bare-path uploads #28991 — this recognizes Windows only in explicitMEDIA:tags (bare-path stays POSIX-only, fewer false positives)Out of scope: source/config delivery (5)
#32294 #32398 #32604 #32611 #19926 request auto-delivering source/config/script exts (
.toml/.log/.py/.sh/...). This PR keeps the curated set to what was already deliverable and doesn't add them — whether to deliver those outbound is a separate product decision.Closes #34321
Closes #34517
Closes #31137
Closes #30518
Closes #29582
Closes #30526
Closes #23759
Closes #24575
Closes #31560