fix(gateway): expand extract_media regex to common text/data/source formats#32358
Closed
cristianmgm7 wants to merge 1 commit into
…ormats
The MEDIA:<path> directive is the canonical way for an agent's text
reply to ship a file as a native platform attachment. The regex in
``BasePlatformAdapter.extract_media`` (gateway/platforms/base.py)
gated extraction on a hard-coded extension allowlist that omitted
several formats agents commonly emit — most notably ``.md``.
The failure mode was silent and confusing:
1. Agent generates ``/tmp/report.md`` and replies with
``Here you go.\nMEDIA:/tmp/report.md``.
2. ``extract_media()`` runs the regex; ``.md`` doesn't match the
allowlist → ``media_files`` stays empty → ``send_document`` is
never invoked.
3. The text-only cleanup regex on the same call path (~line 3476)
also strips ``MEDIA:\s*\S+`` from the visible text.
4. Net effect: the user sees a reply without the MEDIA line AND
without the attachment, and the agent thinks it succeeded. No
log, no warning — just a quietly missing file.
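The four steps above can be reproduced with a minimal sketch. The extraction pattern and the `OLD_EXTS` subset here are assumptions made for illustration, not the actual regex in ``gateway/platforms/base.py``; only the cleanup pattern ``MEDIA:\s*\S+`` is taken from the description above.

```python
import re

# Hypothetical stand-in for the pre-fix allowlist (a small subset, for illustration).
OLD_EXTS = ("txt", "csv", "pdf", "png", "ogg")

# Assumed shape of the extraction regex: capture a path only if it ends in an
# allowlisted extension. The real pattern in base.py may differ.
EXTRACT_RE = re.compile(
    r"MEDIA:\s*(\S+\.(?:%s))(?!\S)" % "|".join(OLD_EXTS), re.IGNORECASE
)

# Cleanup pattern as quoted in the commit message: strips any MEDIA token.
CLEANUP_RE = re.compile(r"MEDIA:\s*\S+")


def extract_media(text):
    """Return (media_files, visible_text) the way the buggy path behaves."""
    media_files = EXTRACT_RE.findall(text)
    visible = CLEANUP_RE.sub("", text).strip()
    return media_files, visible


files, visible = extract_media("Here you go.\nMEDIA:/tmp/report.md")
print(files)    # [] -> no attachment is sent
print(visible)  # Here you go. -> the MEDIA line is gone as well
```

With a ``.pdf`` path the same call returns the path in ``media_files``, which matches the observation that ``send_document`` worked for PDFs.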
Add the formats that agents producing reports, configs, code, and
logs realistically emit:
- text: md, markdown
- data: json, yaml, yml, toml, tsv
- markup: html, htm, xml
- logs: log
- source: sh, py, js, ts
(``.txt``, ``.csv``, ``.pdf``, all images / audio / video, office
formats, and mobile installers were already in the allowlist; this
PR only adds gaps.)
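As a rough illustration of how the added groups could feed an allowlist regex, the sketch below builds a pattern from the extensions listed above. The helper `build_media_re` and the pattern shape are hypothetical, and the sketch covers only the additions, not the extensions that were already supported.

```python
import re

# Extension groups added by this PR, as listed in the commit message.
NEW_EXTS = {
    "text": ["md", "markdown"],
    "data": ["json", "yaml", "yml", "toml", "tsv"],
    "markup": ["html", "htm", "xml"],
    "logs": ["log"],
    "source": ["sh", "py", "js", "ts"],
}


def build_media_re(extensions):
    """Hypothetical helper: compile a MEDIA regex for the given extensions."""
    alternation = "|".join(re.escape(ext) for ext in extensions)
    # (?!\S) requires the extension to end the token, so "htm" cannot
    # match a prefix of "html" and leave a trailing "l" behind.
    return re.compile(r"MEDIA:\s*(\S+\.(?:%s))(?!\S)" % alternation, re.IGNORECASE)


flat = [ext for group in NEW_EXTS.values() for ext in group]
MEDIA_RE = build_media_re(flat)

print(MEDIA_RE.findall("MEDIA:/tmp/report.md"))   # ['/tmp/report.md']
print(MEDIA_RE.findall("MEDIA:/tmp/index.html"))  # ['/tmp/index.html']
print(MEDIA_RE.findall("MEDIA:/tmp/setup.exe"))   # []
```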
Companion fixes:
- Comment on the regex calls out the dependency with the
secondary cleanup regex further down the file, so future
contributors don't add one extension to one regex without the
other.
- New ``tests/gateway/test_extract_media.py`` pins the broadened
allowlist (16 newly-supported extensions), regression-tests every
previously supported extension (26 of them), and asserts that
unknown extensions like ``.exe`` / ``.dmg`` / ``.iso`` are
still rejected so the regex doesn't overshoot. 52 tests total,
all passing locally.
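The shape of such a test file might look like the sketch below. It is not the contents of ``tests/gateway/test_extract_media.py``: the extension lists are small illustrative subsets, and `extract_paths` is a hypothetical stand-in for ``BasePlatformAdapter.extract_media``.

```python
import re

# Illustrative subsets only; the real test file covers the full lists.
NEWLY_SUPPORTED = ["md", "json", "yaml", "html", "log", "py"]
PREVIOUSLY_SUPPORTED = ["txt", "csv", "pdf", "png", "ogg"]
REJECTED = ["exe", "dmg", "iso"]

_PATTERN = re.compile(
    r"MEDIA:\s*(\S+\.(?:%s))(?!\S)" % "|".join(NEWLY_SUPPORTED + PREVIOUSLY_SUPPORTED),
    re.IGNORECASE,
)


def extract_paths(text):
    """Hypothetical stand-in for BasePlatformAdapter.extract_media."""
    return _PATTERN.findall(text)


def test_allowed_extensions_are_extracted():
    # Covers both the new additions and the regression set.
    for ext in NEWLY_SUPPORTED + PREVIOUSLY_SUPPORTED:
        path = f"/tmp/file.{ext}"
        assert extract_paths(f"Done.\nMEDIA:{path}") == [path], ext


def test_unknown_extensions_are_rejected():
    # Guards against the regex overshooting into arbitrary file types.
    for ext in REJECTED:
        assert extract_paths(f"Done.\nMEDIA:/tmp/file.{ext}") == [], ext
```

Both functions use plain asserts, so they can be collected by pytest or called directly.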
Discovered while building a Carbon Voice plugin
(hermes-plugin-carbonvoice) whose agent reliably generated
``.md`` reports that never reached the user. Once the plugin's
``send_document`` was confirmed working with ``.pdf``, the gap
narrowed to this regex.
Collaborator
This was referenced May 26, 2026
Author
Thanks for the heads-up @alt-glitch — closing in favor of #24384, which is strictly broader (also fixes prefix-glue from |
Summary
The MEDIA: directive lets an agent's text reply ship a file as a
native platform attachment. `BasePlatformAdapter.extract_media`
(`gateway/platforms/base.py:~2415`) gates extraction on a hard-coded
extension allowlist that was missing several formats agents commonly
emit — most notably `.md`.
This PR widens the allowlist and adds dedicated test coverage.
The bug, in one paragraph
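The failure mode is silent and confusing: when an agent replies with a `MEDIA:` line whose extension is not in the allowlist, `extract_media()` captures nothing and `send_document` is never invoked, yet the cleanup regex on the same call path still strips the `MEDIA:` line from the visible text. The user sees neither the line nor the attachment, the agent thinks it succeeded, and nothing is logged.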
I hit this dogfooding a Carbon Voice plugin where the agent reliably generated `.md` reports that never reached the user. Once `send_document` was confirmed working with `.pdf` (which was already in the allowlist), the gap narrowed to this regex.
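What this PR adds to the allowlist
- text: `md`, `markdown`
- data: `json`, `yaml`, `yml`, `toml`, `tsv`
- markup: `html`, `htm`, `xml`
- logs: `log`
- source: `sh`, `py`, `js`, `ts`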
Already supported (untouched here): `txt`, `csv`, `pdf`, all images / audio / video, office formats, `apk` / `ipa`.
Test plan
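New file `tests/gateway/test_extract_media.py` (52 cases) covers the newly supported extensions, every previously supported extension, and the rejection of unknown extensions such as `.exe`, `.dmg`, and `.iso`.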
All 52 pass locally. Existing tests in `tests/gateway/test_signal.py` that exercise `extract_media` for `.png`/`.ogg` continue to pass.
Companion comment in the code
Added a comment on the regex calling out the dependency with the cleanup regex on `base.py:~3476` — they share an implicit allowlist, and the silent-drop bug happens whenever they drift. Future contributors who add an extension to one regex will see the note pointing at the other.
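To make the drift concrete, here is a minimal sketch of a check that reports paths the cleanup regex would strip but the extraction regex would not capture. The helper `silently_dropped` and the two-extension extraction pattern are hypothetical; only the cleanup pattern `MEDIA:\s*\S+` comes from the description above.

```python
import re


def silently_dropped(extract_re, cleanup_re, paths):
    """Return paths whose MEDIA line is stripped from the text but never extracted."""
    dropped = []
    for path in paths:
        line = f"MEDIA:{path}"
        stripped = cleanup_re.search(line) is not None
        extracted = extract_re.search(line) is not None
        if stripped and not extracted:
            dropped.append(path)
    return dropped


# Assumed narrow allowlist for the extraction side (illustration only).
extract_re = re.compile(r"MEDIA:\s*(\S+\.(?:pdf|txt))(?!\S)")
# Cleanup pattern as quoted in the commit message.
cleanup_re = re.compile(r"MEDIA:\s*\S+")

print(silently_dropped(extract_re, cleanup_re, ["/tmp/a.pdf", "/tmp/report.md"]))
# ['/tmp/report.md']
```

Any path this returns is one that would vanish from the reply without producing an attachment.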
🤖 Generated with Claude Code