Summary
When an agent response combines prose text with MEDIA: tags on Telegram
(in streaming mode), three issues surface in the delivery pipeline. I've
prepared a patch and will open a PR shortly.
Repro
Trigger an agent response shaped like this (via the Telegram gateway,
streaming mode enabled):
Here is a summary of the catalog.
MEDIA:/path/item_a_view1.png
Item A — view 1
MEDIA:/path/item_a_view2.png
Item A — view 2
MEDIA:/path/item_b_view1.png
Item B — view 1
MEDIA:/path/item_b_view2.png
Item B — view 2
MEDIA:/path/featured.png
Featured item
Expected structure after parsing:
1. TextBlock: "Here is a summary of the catalog."
2. MediaGroupBlock: 4 items, each with its own caption
3. ImageBlock: "Featured item"
Observed
- Album doesn't form. The four images arrive as individual photos
instead of a single Telegram album. Only the first image carries a
caption; the rest are bare.
- Caption text leaks into the streamed prose. The initial streamed
message still contains the caption lines ("Item A — view 1", etc.)
even after media delivery runs. The user sees the captions twice —
once in the prose, once on the images (if at all).
- No way to caption each image individually within an album. The
parser assigns at most one caption to an entire MediaGroupBlock,
and placing a caption between two MEDIA: lines splits them into
separate groups — so there is no response shape that produces a
single album with per-image captions, even though the Telegram
sendMediaGroup API supports it natively.
Expected
- One album containing all four images, each with its own caption.
- One single captioned image.
- The streamed prose message contains only the intro paragraph — no
leaked caption text.
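The last expectation can be sketched as a small helper that decides what the streamed message should contain once media delivery has run. This is a minimal illustration, not the patch itself: the dataclasses and the function name streamed_text_after_media are hypothetical stand-ins, and only the block names TextBlock and ImageBlock come from this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBlock:  # hypothetical shape; only the name comes from the issue
    text: str


@dataclass
class ImageBlock:  # hypothetical shape; only the name comes from the issue
    path: str
    caption: Optional[str] = None


def streamed_text_after_media(blocks: List[object]) -> Optional[str]:
    """Return the text the streamed message should be edited down to.

    None means the response contained no prose, so the caller would
    delete the streamed message instead of editing it.
    """
    prose = [b.text for b in blocks if isinstance(b, TextBlock)]
    return "\n\n".join(prose) if prose else None


mixed = [
    TextBlock("Here is a summary of the catalog."),
    ImageBlock("/path/featured.png", "Featured item"),
]
pure_media = [ImageBlock("/path/featured.png", "Featured item")]

edited = streamed_text_after_media(mixed)        # prose only, captions dropped
deleted = streamed_text_after_media(pure_media)  # no prose, so delete instead
```

Captions never enter the returned text because they live on the media blocks, which is what keeps them from appearing twice.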
Root Causes (from investigation)
- TelegramAdapter.send_media_group() opens file handles via
open(fp, "rb"). When the MarkdownV2 attempt fails, the retry
path can't reuse the handles cleanly and the call falls through to
super().send_media_group() (individual sends) while still
returning success=True.
- The delivery path knows to avoid deleting the streamed message when
TextBlocks coexist with media, but nothing edits the message to
strip the caption lines after the fact.
- _parse_content_blocks() treats caption text as a single
group-level caption and uses caption-between-MEDIA as a group
separator, so per-item captions within one album are
unrepresentable.
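One plausible mechanism behind the first root cause is that a file object from open(fp, "rb") is a stateful stream. Once a send attempt has read it, a retry finds the handle exhausted unless something seeks back to the start. A bytes object has no read position and can be handed to any number of attempts. The standalone sketch below shows only that general Python behavior: fake_send is a hypothetical stand-in for an upload call and is not part of hermes-agent or any Telegram library.

```python
import tempfile
from pathlib import Path


def fake_send(payload) -> int:
    """Hypothetical upload stand-in: consume the payload, return its size."""
    data = payload.read() if hasattr(payload, "read") else payload
    return len(data)


with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "item_a_view1.png"
    file_path.write_bytes(b"\x89PNG fake image data")

    # Open handle: the first attempt drains the stream, so the retry
    # reads nothing unless the handle is rewound with seek(0).
    with open(file_path, "rb") as handle:
        handle_first = fake_send(handle)
        handle_retry = fake_send(handle)

    # Bytes: every attempt sees the full payload.
    payload = file_path.read_bytes()
    bytes_first = fake_send(payload)
    bytes_retry = fake_send(payload)

print(handle_first, handle_retry, bytes_first, bytes_retry)
```

The retry on the open handle reports zero bytes, while both attempts on the bytes payload report the full size.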
Proposed Fix
- Switch local-file media group items to file_path.read_bytes() so
the bytes are safely reusable across retry attempts; log a warning
when the individual-send fallback is actually reached.
- When a response contains both TextBlocks and captioned media, edit
the streamed message down to TextBlock content only (falling back
to delete for pure-media responses).
- Rework the parser so a caption line immediately following a MEDIA:
line attaches to that item's caption, and a blank line after caption
text ends the current group. Trailing caption attaches to the last
item.
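The parser semantics in the last item can be sketched as follows. This is an illustration under stated assumptions, not the actual _parse_content_blocks() implementation. The dataclass shapes, MediaItem, and parse_blocks_sketch are hypothetical. The sketch also assumes that a blank line after a bare MEDIA: line (one with no caption yet) does not end the group, and that the repro response carries a blank line before the featured image, which is what separates it from the album.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

MEDIA_PREFIX = "MEDIA:"


@dataclass
class MediaItem:  # hypothetical per-item record
    path: str
    caption: Optional[str] = None


@dataclass
class TextBlock:
    text: str


@dataclass
class ImageBlock:
    path: str
    caption: Optional[str] = None


@dataclass
class MediaGroupBlock:
    items: List[MediaItem] = field(default_factory=list)


Block = Union[TextBlock, ImageBlock, MediaGroupBlock]


def parse_blocks_sketch(response: str) -> List[Block]:
    blocks: List[Block] = []
    text_lines: List[str] = []
    group: List[MediaItem] = []

    def flush_text() -> None:
        text = "\n".join(text_lines).strip()
        if text:
            blocks.append(TextBlock(text))
        text_lines.clear()

    def flush_group() -> None:
        # A lone item becomes a single image; two or more form an album.
        if len(group) == 1:
            blocks.append(ImageBlock(group[0].path, group[0].caption))
        elif group:
            blocks.append(MediaGroupBlock(list(group)))
        group.clear()

    for raw in response.splitlines():
        line = raw.strip()
        if line.startswith(MEDIA_PREFIX):
            flush_text()
            group.append(MediaItem(line[len(MEDIA_PREFIX):].strip()))
        elif not line:
            if group and group[-1].caption is not None:
                flush_group()  # blank line after caption text ends the group
            elif not group:
                text_lines.append("")  # paragraph break inside prose
        elif group:
            # Caption text attaches to the most recent item in the group.
            last = group[-1]
            last.caption = line if last.caption is None else f"{last.caption}\n{line}"
        else:
            text_lines.append(line)

    flush_text()
    flush_group()
    return blocks


response = """Here is a summary of the catalog.
MEDIA:/path/item_a_view1.png
Item A — view 1
MEDIA:/path/item_a_view2.png
Item A — view 2
MEDIA:/path/item_b_view1.png
Item B — view 1
MEDIA:/path/item_b_view2.png
Item B — view 2

MEDIA:/path/featured.png
Featured item
"""
blocks = parse_blocks_sketch(response)
```

Under these rules the repro response yields the three expected blocks in order. Without the blank line before the featured image, the sketch would instead fold it into the album as a fifth item, which is one of the semantics worth settling before the PR lands.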
Happy to discuss the API shape and parser semantics before the PR
lands — PR incoming shortly with tests.
Environment
- hermes-agent main @ b909a9e
- Telegram gateway, streaming mode