Skip to content

fix(gateway): expand extract_media regex to common text/data/source formats#32358

Closed
cristianmgm7 wants to merge 1 commit into
NousResearch:mainfrom
cristianmgm7:fix/extract-media-text-formats
Closed

fix(gateway): expand extract_media regex to common text/data/source formats#32358
cristianmgm7 wants to merge 1 commit into
NousResearch:mainfrom
cristianmgm7:fix/extract-media-text-formats

Conversation

@cristianmgm7

Copy link
Copy Markdown

Summary

The MEDIA: directive lets an agent's text reply ship a file as a
native platform attachment. `BasePlatformAdapter.extract_media`
(`gateway/platforms/base.py:~2415`) gates extraction on a hard-coded
extension allowlist that was missing several formats agents commonly
emit — most notably `.md`.

This PR widens the allowlist and adds dedicated test coverage.

The bug, in one paragraph

The failure mode is silent and confusing:

  1. Agent generates `/tmp/report.md` and replies with `Here you go.\nMEDIA:/tmp/report.md`.
  2. `extract_media()` runs the regex; `.md` doesn't match the allowlist → `media_files` stays empty → `send_document` is never invoked.
  3. The cleanup regex further down the same code path (`base.py:~3476`, `re.sub(r"MEDIA:\s*\S+", "", text_content)`) still strips the `MEDIA:...` line from visible text.
  4. Net effect: the recipient sees a reply without the MEDIA line and without the attachment, and the agent's logic thinks it succeeded. No log, no warning — just a quietly missing file.

I hit this dogfooding a Carbon Voice plugin where the agent reliably generated `.md` reports that never reached the user. Once `send_document` was confirmed working with `.pdf` (which was already in the allowlist), the gap narrowed to this regex.

What this PR adds to the allowlist

Group New extensions Rationale
text `md`, `markdown` reports, READMEs, agent-generated docs
data `json`, `yaml`, `yml`, `toml`, `tsv` config dumps, API payloads, structured exports
markup `html`, `htm`, `xml` rendered HTML, OPML, SVG-less XML
logs `log` log slices the agent pulls for the user
source `sh`, `py`, `js`, `ts` code review files, scripts the agent wrote

Already supported (untouched here): `txt`, `csv`, `pdf`, all images / audio / video, office formats, `apk` / `ipa`.

Test plan

New file `tests/gateway/test_extract_media.py` (52 cases):

  • ✅ All 16 newly-supported extensions extract cleanly via a parametrized test
  • ✅ All 26 previously-supported extensions still extract (regression guard)
  • ✅ Unknown extensions (`xyz`, `exe`, `dmg`, `iso`, `deb`, `rpm`) still rejected
  • ✅ Real-world variants: `.md` with caption preserves the caption text
  • ✅ Multiple mixed-extension MEDIA: directives in one reply (`.md` + `.json` + `.png`)
  • ✅ `[[audio_as_voice]]` flag still flips `is_voice=True` on audio paths

All 52 pass locally. Existing tests in `tests/gateway/test_signal.py` that exercise `extract_media` for `.png`/`.ogg` continue to pass.

Companion comment in the code

Added a comment on the regex calling out the dependency with the cleanup regex on `base.py:~3476` — they share an implicit allowlist, and the silent-drop bug happens whenever they drift. Future contributors who add an extension to one regex will see the note pointing at the other.

Notes

  • No public API change — same function signature, same return shape, only the regex is broader.
  • No behavior change for any extension that already worked.
  • The widened regex doesn't overshoot into platform-incompatible types (negative tests cover `.exe`, `.dmg`, etc.).

🤖 Generated with Claude Code

…ormats

The MEDIA:<path> directive is the canonical way for an agent's text
reply to ship a file as a native platform attachment. The regex in
``BasePlatformAdapter.extract_media`` (gateway/platforms/base.py)
gated extraction on a hard-coded extension allowlist that omitted
several formats agents commonly emit — most notably ``.md``.

The failure mode was silent and confusing:

  1. Agent generates ``/tmp/report.md`` and replies with
     ``Here you go.\nMEDIA:/tmp/report.md``.
  2. ``extract_media()`` runs the regex; ``.md`` doesn't match the
     allowlist → ``media_files`` stays empty → ``send_document`` is
     never invoked.
  3. The text-only cleanup regex on the same call path (~line 3476)
     also strips ``MEDIA:\s*\S+`` from the visible text.
  4. Net effect: the user sees a reply without the MEDIA line AND
     without the attachment, and the agent thinks it succeeded. No
     log, no warning — just a quietly missing file.

Add the formats that agents producing reports, configs, code, and
logs realistically emit:

  - text:   md, markdown
  - data:   json, yaml, yml, toml, tsv
  - markup: html, htm, xml
  - logs:   log
  - source: sh, py, js, ts

(``.txt``, ``.csv``, ``.pdf``, all images / audio / video, office
formats, and mobile installers were already in the allowlist; this
PR only adds gaps.)

Companion fixes:

  - Comment on the regex calls out the dependency with the
    secondary cleanup regex further down the file, so future
    contributors don't add one extension to one regex without the
    other.
  - New ``tests/gateway/test_extract_media.py`` pins the broadened
    allowlist (16 newly-supported extensions), regresses every
    previously-supported extension (26 of them), and asserts that
    unknown extensions like ``.exe`` / ``.dmg`` / ``.iso`` are
    still rejected so the regex doesn't overshoot. 52 tests total,
    all passing locally.

Discovered while building a Carbon Voice plugin
(hermes-plugin-carbonvoice) whose agent reliably generated
``.md`` reports that never reached the user. Once the plugin's
``send_document`` was confirmed working with ``.pdf``, the gap
narrowed to this regex.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels May 26, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Competing with #22492, #30588, and #24384 for the same extract_media() extension allowlist fix. #24384 is the broadest (also fixes prefix-glue and space-truncation bugs). Recommend consolidating.

@cristianmgm7

Copy link
Copy Markdown
Author

Thanks for the heads-up @alt-glitch — closing in favor of #24384, which is strictly broader (also fixes prefix-glue from \s* consuming newlines, the \|\S+ space-truncation, and the streaming dispatch extract_media cleanup-discard in run.py). Left a +1 there with the discovery context from Carbon Voice in case the failure-mode pattern helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants