feat(gateway): support [[as_document]] directive for skill media routing #19069

Closed
leon7609 wants to merge 1 commit into NousResearch:main from leon7609:feat/as-document-directive

leon7609 commented May 3, 2026

Why

Skills that produce large or lossless images (e.g. info-graph, where a rendered JPG is 1-2 MB at 3840×2160) currently lose quality when delivered through Telegram. Because the file's extension is in Hermes' _IMAGE_EXTS, it is routed through send_multiple_images → sendMediaGroup, and Telegram's server re-encodes it to JPEG at ~1280px max edge. Small-glyph legibility is destroyed (Chinese body text set in a 28-32px font collapses to ~9px after downscaling).
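
As a rough check of the legibility claim, the downscale arithmetic can be sketched as below. The helper name is hypothetical, and the 1280px cap is the approximate figure quoted above rather than a documented constant:

```python
def scaled_glyph_px(glyph_px: float, source_long_edge: int, target_long_edge: int) -> float:
    """Estimate a glyph's height after an image is downscaled to a max long edge.

    Hypothetical helper for illustration; assumes a uniform resize and
    no upscaling when the source is already small enough.
    """
    if source_long_edge <= target_long_edge:
        return float(glyph_px)
    return glyph_px * target_long_edge / source_long_edge


# 3840px-wide render, capped at roughly 1280px on the long edge (a 3x reduction).
for px in (28, 32):
    print(px, "->", round(scaled_glyph_px(px, 3840, 1280), 1))
```

Under these assumptions a 28px glyph lands at about 9.3px and a 32px glyph at about 10.7px, which is consistent with the ~9px figure for the smaller body text.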

The original bytes only survive when the file goes through send_document. The dispatch tables in three places (_process_message_background, _deliver_media_from_response, and the send_message tool's telegram path) only reach send_document for files whose extension is NOT in _IMAGE_EXTS. Skills can't currently override this from the response side.
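
The routing rule described above can be sketched roughly as follows. This is not the Hermes source: the contents of the extension set and the `pick_sender` helper are assumptions for illustration, and only the names `_IMAGE_EXTS`, `send_multiple_images`, and `send_document` come from this PR.

```python
from pathlib import Path

# Assumed contents; the real _IMAGE_EXTS set in Hermes may differ.
_IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def pick_sender(path: str) -> str:
    """Return the name of the adapter method a file would be routed to.

    Hypothetical helper: image extensions take the photo path,
    everything else falls through to the document path.
    """
    if Path(path).suffix.lower() in _IMAGE_EXTS:
        return "send_multiple_images"
    return "send_document"


print(pick_sender("/tmp/info-graph-x/infographic.jpg"))
print(pick_sender("/tmp/report.pdf"))
```

With extension as the only input, a skill has no way to ask for the document path for a `.jpg`, which is the gap this PR addresses.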

This is independent of #15728 / #15837 / #18444 (which handle Telegram-server-side rejections via Photo_invalid_dimensions); those fix the case where Telegram gives up on a photo. This PR fixes the case where the photo path succeeds but compresses too aggressively for the use case.

What

Adds an [[as_document]] directive that mirrors the existing [[audio_as_voice]] shape:

Infographic generated (high-density modules × lab pop × landscape)
[[as_document]]
MEDIA:/private/tmp/info-graph-x/infographic.jpg

When the directive is present in the agent's response, every image-extension MEDIA: file in that response routes through send_document instead of send_multiple_images / sendPhoto. Telegram's sendDocument doesn't recompress, so the original 1MB JPEG arrives intact.

The directive is detected at the dispatch sites (which see the raw response) and the directive string is stripped from user-visible cleaned text in extract_media, so it never leaks into the user's chat.
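
A minimal sketch of that split between detection and stripping, assuming a line-oriented `MEDIA:` syntax. The function names, the regex, and the return shape are illustrative assumptions and not the actual `extract_media` implementation:

```python
import re

AS_DOCUMENT = "[[as_document]]"
# Assumed pattern: one MEDIA:<path> tag per line.
_MEDIA_LINE = re.compile(r"^MEDIA:(\S+)[ \t]*$", re.MULTILINE)


def wants_document(raw_response: str) -> bool:
    """Dispatch-site check: runs on the raw response, before any cleaning."""
    return AS_DOCUMENT in raw_response


def extract_media_sketch(raw_response: str):
    """Return (media_paths, cleaned_text) with MEDIA: tags and the directive removed."""
    paths = _MEDIA_LINE.findall(raw_response)
    cleaned = _MEDIA_LINE.sub("", raw_response).replace(AS_DOCUMENT, "")
    # Drop the blank lines left behind by the removed tags.
    cleaned = "\n".join(line for line in cleaned.splitlines() if line.strip())
    return paths, cleaned


raw = "Infographic ready\n[[as_document]]\nMEDIA:/tmp/x/infographic.jpg"
print(wants_document(raw))
print(extract_media_sketch(raw))
```

Checking the flag on the raw text matters because the cleaned text no longer contains the directive.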

Scope

  • All-or-nothing per response (matches [[audio_as_voice]]'s scope). Skills that need fine control can split into two responses.
  • Patches three dispatch paths: BasePlatformAdapter._process_message_background, GatewayRunner._deliver_media_from_response, and the send_message tool's _send_telegram / _send_to_platform. The first two fix the agent's normal response path; the third fixes the case where an agent calls send_message(target='telegram', message='[[as_document]]\n...') explicitly.
  • Telegram is the immediate beneficiary. The Discord/Slack/Mattermost/Email send_multiple_images implementations don't recompress on upload, so they're unaffected; the directive is harmlessly honored on those platforms via the same _image_paths → _non_image_media swap.
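
The all-or-nothing swap mentioned in the list above might look roughly like this. The function and parameter names are hypothetical, and the ordering of the merged list is an assumption rather than something the PR states:

```python
def apply_as_document(image_paths, non_image_media, as_document):
    """Move every image into the document bucket when the directive is set.

    Hypothetical helper: returns (images_to_send_as_photos, files_to_send_as_documents).
    """
    if not as_document:
        return list(image_paths), list(non_image_media)
    # All-or-nothing: no image stays on the photo path.
    return [], list(non_image_media) + list(image_paths)


photos, documents = apply_as_document(
    ["/tmp/x/infographic.jpg"], ["/tmp/x/notes.pdf"], as_document=True
)
print(photos, documents)
```

On platforms whose photo path does not recompress, the same swap only changes how the file is presented, not its bytes.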

Tests

+3 cases in TestExtractMedia:

  • directive stripped from cleaned text
  • directive doesn't entangle with [[audio_as_voice]]
  • both directives can coexist in one response

All 113 pre-existing media/extract/send tests still pass.
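
For readers without the diff at hand, the three cases could be approximated as below. The stand-in `extract_media_stub`, including its `([(path, is_voice)], cleaned_text)` return shape, is an assumption made so the sketch runs on its own; the real tests exercise Hermes' `extract_media`.

```python
import re

_MEDIA_LINE = re.compile(r"^MEDIA:(\S+)[ \t]*$", re.MULTILINE)
_DIRECTIVES = ("[[as_document]]", "[[audio_as_voice]]")


def extract_media_stub(text):
    """Stand-in for extract_media (assumed shape, not the Hermes implementation)."""
    is_voice = "[[audio_as_voice]]" in text
    media = [(path, is_voice) for path in _MEDIA_LINE.findall(text)]
    cleaned = _MEDIA_LINE.sub("", text)
    for directive in _DIRECTIVES:
        cleaned = cleaned.replace(directive, "")
    cleaned = "\n".join(line for line in cleaned.splitlines() if line.strip())
    return media, cleaned


def test_directive_stripped():
    media, cleaned = extract_media_stub("done\n[[as_document]]\nMEDIA:/tmp/a.jpg")
    assert "[[as_document]]" not in cleaned
    assert media == [("/tmp/a.jpg", False)]


def test_no_entanglement_with_voice():
    # [[as_document]] alone must not flip the voice flag.
    media, _cleaned = extract_media_stub("[[as_document]]\nMEDIA:/tmp/a.jpg")
    assert all(not is_voice for _path, is_voice in media)


def test_directives_coexist():
    media, cleaned = extract_media_stub(
        "[[as_document]]\n[[audio_as_voice]]\nMEDIA:/tmp/a.ogg"
    )
    assert media == [("/tmp/a.ogg", True)]
    assert "[[" not in cleaned
```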

Verification

Tested locally with info-graph (https://github.com/leon7609/info-graph; v0.2.2 emits the directive on every successful render). Before the patch: a 3840×2160 JPG → Telegram delivers a compressed photo at ~1280px. After the patch: same source → Telegram delivers the original 1MB JPEG via sendDocument, fully legible.

Skills that produce large/lossless images (e.g. info-graph, where a
rendered JPG is 1-2 MB) currently lose quality in Telegram delivery
because `_IMAGE_EXTS` membership routes the file through
`send_multiple_images` → `sendMediaGroup`, which Telegram's server
re-encodes to JPEG @ 1280px max edge. The original bytes only survive
when the file goes through `send_document`, which the dispatch tables
in three places (`_process_message_background`, `_deliver_media_from_response`,
and the `send_message` tool's telegram path) only reach for files
whose extension is NOT in `_IMAGE_EXTS`.

This commit adds an `[[as_document]]` directive that mirrors the
existing `[[audio_as_voice]]` shape: a skill emits the directive once
in its response, and every image-extension MEDIA: file in that response
is delivered via `send_document` instead of `send_multiple_images` /
`sendPhoto`. The directive is detected at the dispatch sites (which see
the raw response) and the directive string is stripped from the
user-visible cleaned text in `extract_media` so it never leaks.

Granularity is intentionally all-or-nothing per response, matching
[[audio_as_voice]]'s scope. Skills that need fine control can split into
two responses.

Verified the targeted use case: info-graph emits

    Infographic generated (...)
    [[as_document]]
    MEDIA:/tmp/info-graph-x/infographic.jpg

→ Telegram receives `infographic.jpg` via sendDocument, original 1MB
JPEG bytes preserved, no recompression. Forwarding and download
filenames stay clean (`infographic.jpg`).

Tests: +3 cases in TestExtractMedia covering directive strip, isolation
from voice flag, and coexistence with [[audio_as_voice]]. All
113 pre-existing media/extract/send tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P3: Low (cosmetic, nice to have)
  • platform/telegram: Telegram bot adapter
  • type/feature: New feature or request
